HoG Extractors: CPU vs. GPU
I tested my own implementation of the pyramidal HoG descriptor (pHoG), my serial implementation of a vanilla HoG (vHoG), my GPU implementation of a patch-based HoG descriptor (patch-GPU), and my GPU implementation of a keypoint-based HoG descriptor (keypoints-GPU). I created 5 test images ranging in size from 64x64 to 1024x1024, with each dimension doubling between images. The plot below shows image size along the x-axis and the time required to compute a HoG feature vector from the entire image along the y-axis (expressed in log base 10 of seconds).
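For reference, here is a minimal sketch of the kind of timing harness behind these numbers. The `extract` argument stands in for any of the four implementations (none of which are reproduced in this post), and the random test images are an assumption; this is the general recipe, not the actual benchmark code.

```python
import time
import numpy as np

def time_extractor(extract, sizes=(64, 128, 256, 512, 1024)):
    """Time a HoG extractor on square test images; return log10 seconds."""
    results = []
    for n in sizes:
        image = np.random.rand(n, n).astype(np.float32)  # synthetic test image
        start = time.perf_counter()
        extract(image)
        # log base 10 of seconds, matching the y-axis of the plot
        results.append(np.log10(time.perf_counter() - start))
    return results
```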
The top two curves are CPU implementations and the bottom two curves are GPU implementations. Despite having to compute features for multiple pyramid levels, the pHoG implementation (red curve) is slightly faster than my vanilla HoG implementation. This is likely because pHoG relies on vectorized operations and NumPy/scikits.image routines for steps like Canny edge detection and logical indexing, while my vanilla HoG descriptor only uses NumPy for basic operations like computing the gradient. Both implementations require around 1 second to compute the HoG feature vector for a modest-sized test case (256x256).
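To make the comparison concrete, here is a minimal sketch of what a vanilla HoG extractor does: compute gradients, then bin gradient orientations (weighted by magnitude) into per-cell histograms. This is the general technique only; the actual vHoG and pHoG code (Canny edges, pyramid levels, block normalization) is not shown here, and the cell and bin sizes below are conventional defaults, not necessarily the ones I used.

```python
import numpy as np

def vanilla_hog(image, cell=8, bins=9):
    """Minimal serial-style HoG: orientation histograms over fixed cells."""
    gy, gx = np.gradient(image.astype(np.float64))       # image gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned gradients

    h, w = image.shape
    features = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            ori = orientation[i:i + cell, j:j + cell].ravel()
            mag = magnitude[i:i + cell, j:j + cell].ravel()
            # magnitude-weighted orientation histogram for this cell
            hist, _ = np.histogram(ori, bins=bins, range=(0, 180), weights=mag)
            features.append(hist)
    return np.concatenate(features)
```

A loop like the two nested `for` loops above is exactly the kind of per-cell serial work that the GPU implementations parallelize away.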
The Bottom Line
The bottom two curves are from my GPU implementations. They are roughly 2-3 orders of magnitude faster than the CPU implementations, with the patch-based version the fastest of all. For the 256x256 case, the patch-based GPU implementation requires about 0.003 seconds to compute the HoG feature vector. That improvement may not sound like much, so consider a couple of cases that will matter when working with Poselets later on. A 256x256 image contains 65536 pixels. Running a single Poselet classifier over every pixel of such an image, the CPU implementations would require around 328 minutes (roughly 5.5 hours!) just to extract feature vectors for all of those pixels, and that does not even include the per-pixel time needed to actually execute the classifier and store the results. The GPU implementations, by contrast, would require only about 65 seconds to extract features for every pixel. Scale this up to a large set of images, or to images larger than 256x256, and you start to see why the GPU becomes absolutely necessary for even basic experiments.
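As a back-of-envelope check of that extrapolation: the per-feature times below are assumptions chosen to reproduce the quoted totals (the post directly reports only whole-image timings), but they show how the estimates scale with pixel count.

```python
# Assumed per-feature extraction times; not measured values from the post.
pixels = 256 * 256                       # 65536 candidate locations
cpu_per_feature = 0.3                    # seconds per feature (assumed)
gpu_per_feature = 0.001                  # seconds per feature (assumed)

print(pixels * cpu_per_feature / 60.0)   # ~328 minutes, about 5.5 hours
print(pixels * gpu_per_feature)          # ~65 seconds
```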