LFPW-inspired Fiducial Classifiers
The most basic kind of support vector machine classifier we can compute from the annotated LFPW data is one that recognizes only a single fiducial type (e.g. only the tip of the nose, or only the corner of an eye). In fact, this is the approach taken in , where these low-level classifiers are then coordinated with a conditional probability model. So it is natural to begin by creating similar single-fiducial classifiers. The procedure is to fix a fiducial type and then:
- Load in an image from the training data.
- Read the annotation data and locate the (x,y) location of the fiducial within the current image.
- Use the patch-GPU implementation to extract the HoG feature vector from a region around the fiducial.
- Add this feature vector as a row in a data matrix and set the corresponding entry of the labels array to +1.
- Repeat this process until all of the training images have been exhausted, yielding around 3600 positive training samples.
- Randomly choose images and sub-patch locations within them to extract negative training examples. I used as many negative samples as positive samples, but this ratio can be tuned for performance.
- Append the negative training samples to your data matrix and set the corresponding labels to 0.
- Pass your data matrix and labels to the scikits.learn model of your choice (e.g. with cross-validation, auto-weighting).
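The steps above can be sketched as follows. The feature vectors here are random stand-ins for the real GPU-extracted HoG features, and `LinearSVC` is just one possible choice of scikits.learn (now scikit-learn) model:

```python
import numpy as np
from sklearn.svm import LinearSVC  # scikits.learn is the old package name

def build_training_set(pos_feats, neg_feats):
    """Stack positive (fiducial) and negative (background) HoG feature
    vectors into a data matrix, with labels +1 and 0 respectively."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    return X, y

# Stand-in features: in the real pipeline each row would come from the
# patch-GPU HoG extractor run on a window around an annotated fiducial.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.3, size=(200, 36))  # hypothetical 36-dim HoG vectors
neg = rng.normal(0.0, 0.3, size=(200, 36))

X, y = build_training_set(pos, neg)
clf = LinearSVC().fit(X, y)
```

In practice one would wrap the fit in a cross-validation loop to choose the SVM's regularization parameter.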
I have written a script that performs the above steps to train a single-fiducial classifier. As a test case, I trained a classifier to recognize the left corner of the left eyebrow, one of the features annotated in the data set from . After training, I ran the classifier on every pixel of one of the test images. Below is a plot visualizing the results.
The plot on the left is the binary classification output from running the scikits.learn predict() function on the HoG features extracted at every pixel of the image. As you can see, there are a number of false positives, but the classifier produces a good cluster of detections in and around the left corner of each eyebrow, as expected. The far-right image is the non-normalized probability that a given pixel belongs to the left-corner-of-left-eyebrow class; normalizing this map would yield one of the conditional probability images used in the analysis performed in .
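Both maps can be produced from a single trained classifier: predict() thresholds the SVM decision function at zero, while the raw decision values serve as the non-normalized score map. A minimal sketch, assuming the per-pixel HoG features have already been stacked into a hypothetical H x W x D array:

```python
import numpy as np
from sklearn.svm import LinearSVC

def classification_maps(pixel_feats, clf):
    """pixel_feats: (H, W, D) array holding one HoG vector per pixel.
    Returns the binary detection map and the raw SVM score map."""
    h, w, d = pixel_feats.shape
    scores = clf.decision_function(pixel_feats.reshape(-1, d))
    return (scores > 0).reshape(h, w), scores.reshape(h, w)
```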
Because the programming took such a long time, I was unable to quantify classifier performance with the usual metrics (such as a confusion matrix, ROC curve, or precision-recall). It does appear, however, that simple bootstrapping would eliminate many of the false positives; I discuss this idea further in the Future Work section.
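One round of such bootstrapping amounts to running the current classifier over background patches, collecting its false positives, and folding them back into the negative set before refitting. The helper below is illustrative, not part of the original scripts:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negatives(clf, background_feats):
    """Return the background feature vectors the current classifier
    mislabels as positive. Appending these 'hard negatives' to the
    negative training set and refitting is one bootstrapping round."""
    return background_feats[clf.predict(background_feats) == 1]
```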
Poselets are support vector machine classifiers that detect the co-occurrence of many keypoints rather than a single keypoint, as in the result discussed above. In the case of faces, one poselet might capture the combination of left-ear, left-eye, and left-cheek-bone; another might capture chin, mouth, and tip-of-the-nose. The hope is that a poselet detector gives you more bang for your buck than a single keypoint detector, and that poselets can yield more conditional information about the locations of keypoints far away from the pixels where a positive is detected.
To train a poselet, one uses the following steps:
- Randomly select an image file from the training data set and randomly select a patch within that file; this will be the seed patch.
- Treat the keypoint annotations within the seed patch as a purely geometrical constellation of (x,y) pairs. For every other image in the training set, compute the Procrustes distance between the seed-patch keypoints and the same keypoints as they appear in the training image (with a slight penalty term for visibility differences). This gives a list of distances between the seed patch and every other training image.
- Keep the top 250 closest image patches from the training data. These, along with the seed patch, will be considered positive training samples. Use the HoG extractor code to compute feature vectors for all of these patches and add them as rows in a data matrix with corresponding entries of 1 in the array of labels.
- Randomly select patches from the training data and extract HoG features to use as negative examples. Repeat until you have 500 such examples, append them to the data matrix, and set the corresponding entries of the label array to 0.
- Train a support vector machine classifier using scikits.learn with these data and label arrays. This classifier object is a poselet.
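The geometric matching step above can be sketched with SciPy's `procrustes`, which factors out translation, scale, and rotation before measuring shape disparity. The visibility penalty weight here is a hypothetical stand-in for the actual penalty term:

```python
import numpy as np
from scipy.spatial import procrustes

def rank_candidates(seed_pts, seed_vis, candidates, vis_penalty=0.1):
    """Rank training images by Procrustes distance between the seed
    patch's keypoint constellation and each candidate's, plus a small
    penalty per keypoint whose visibility differs from the seed's.

    seed_pts: (n, 2) array of (x, y) keypoints in the seed patch.
    candidates: list of (pts, vis) pairs with pts shaped like seed_pts.
    Returns candidate indices sorted from closest to farthest."""
    dists = []
    for pts, vis in candidates:
        # disparity is invariant to translation, scale, and rotation
        _, _, disparity = procrustes(seed_pts, pts)
        dists.append(disparity + vis_penalty * np.sum(vis != seed_vis))
    return np.argsort(dists)
```

In the real pipeline, the 250 lowest-distance patches under this ranking would become the poselet's positive training set.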
One of the main goals of this project was to create Python code for producing poselet classifiers. I am proud to report that I achieved this goal and, with the help of my GPU HoG implementations, the per-poselet training time is a reasonable 5-10 minutes on my personal desktop computer (2.9GHz AMD processor and 12GB RAM; nVidia GeForce 9500 GT with compute capability 1.1). Because the programming took a long time, I have not yet extensively tested the poselet classifiers, but I have done some visualization examples, explained in more detail below.
The top row is an example poselet seed patch. The bottom row is the top 5 closest patches from the training set, ranked by Procrustes / visibility distance.
Below I show a similar test case as the single-fiducial plot near the top of the page. I am using the poselet classifier that was trained to find patches similar to those shown above.
The left plot is the binary classification. It shows a streak of false positives through the image, but in general the detector fires near the facial region. The far-right plot is the classification probability for this poselet; notice that it is highest in a region centered on the actual face. Overall this is a successful detector output: when combined with conditional information from other poselets, it could be used to predict the locations of the facial keypoints with high accuracy.
In summary, my main contribution was to create Python programs that replicate the results of  and allow me to construct poselets as in , in a reasonable amount of time on my local desktop machine. Both types of classifiers appear to be working, if crudely, and give intuitively expected results. Each has a high false alarm rate, but this should be reduced with further effort on cross-validation and bootstrapping. By using GPU kernels for the feature extraction code, the process of assembling the data into a matrix presentable to scikits.learn was accelerated tremendously, which will make the performance-improving tasks of cross-validation and bootstrapping much easier to carry out in future work.