Multi-Attribute Spaces: Calibration for Attribute Fusion and Similarity Search

Keywords: face search, attributes, calibration, normalization, extreme value theory, w-scores
Fall 2011 - Summer 2012


Recent work has shown that visual attributes are a powerful approach for applications such as recognition, image description, and retrieval. However, fusing multiple attribute scores -- as required during multi-attribute queries or similarity searches -- presents a significant challenge. Scores from different attribute classifiers cannot be combined in a simple way: the same score from different attributes can mean different things. In this work, we show how to construct normalized "multi-attribute spaces" from raw classifier outputs, using techniques based on statistical Extreme Value Theory. Our method calibrates each raw score to a probability that the given attribute is present in the image. We describe how these probabilities can be fused in a simple way to perform more accurate multi-attribute searches, as well as to enable attribute-based similarity searches. A significant advantage of our approach is that the normalization is done after the fact, requiring neither modification to the attribute classification system nor ground-truth attribute annotations. We demonstrate results on a large data set of nearly 2 million face images and show significant improvements over prior work. We also show that the perceptual similarity of search results increases when contextual attributes are used.

This research was supported in part by ONR SBIR Award N00014-11-C-0243 and ONR MURI Award N00014-08-1-0638.


  • "Multi-Attribute Spaces: Calibration for Attribute Fusion and Similarity Search,"
    Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
    June 2012.


Attribute Calibration Software:

C/C++/Python code to calibrate attribute classifier outputs using the w-score formulation, based on statistical Extreme Value Theory. For non-commercial use only.
MugHunt Online Face Search Engine:

An online face search engine built using this research. This search engine is a collaboration with Securics, Inc.


An overview of the score calibration algorithm:

SVM decision scores are normalized by fitting a Weibull distribution (the red curve) to the tail of the score distribution on the opposite side of the classifier's decision boundary (marked as 1), where instances of the attribute of interest ("male" in this case) are outliers (circled, and marked as 2). The CDF of this distribution (the blue curve) produces the normalized attribute w-scores. Note that no assumptions are made about the entire score distribution (which can vary greatly); our model is applied only to the tail of the distribution, which is much better behaved.
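The calibration step above can be sketched in a few lines of Python. This is a minimal illustration, not the released software: the function names, the fixed tail size, and the shift used to make the tail non-negative before fitting are all assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import weibull_min


def fit_wscore_model(decision_scores, tail_size=100):
    """Fit a Weibull to the tail of the opposite (negative) side of an
    SVM's decision scores -- the region where attribute-positive images
    appear as outliers. Returns the fitted (shape, shift, scale)."""
    scores = np.asarray(decision_scores, dtype=float)
    neg = scores[scores < 0]                  # opposite side of the boundary
    tail = np.sort(neg)[-tail_size:]          # tail nearest the boundary
    shift = tail.min()                        # translate tail to be >= 0
    shape, _, scale = weibull_min.fit(tail - shift + 1e-9, floc=0)
    return shape, shift, scale


def wscore(raw_score, model):
    """Map a raw SVM decision score to a calibrated probability
    (w-score) via the CDF of the fitted Weibull."""
    shape, shift, scale = model
    return weibull_min.cdf(raw_score - shift, shape, loc=0, scale=scale)
```

Because the mapping is a CDF, it is monotone in the raw score and bounded in [0, 1], which is what makes scores from different attribute classifiers comparable.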

Multi-attribute search results:

Comparisons between the weighted SVM decision score fusion approach of Kumar et al. (left) and our multi-attribute space fusion approach (right) for the top five results on a selection of queries, made over nearly 2 million face images from the web. Without proper normalization (left), certain attributes can dominate a query, e.g., gender in the first query.
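Once every attribute score is a calibrated probability on the same [0, 1] scale, fusing them for a multi-attribute query becomes simple. The sketch below ranks images by the product of their per-attribute w-scores; treating the probabilities as independent and using a product rule is an illustrative assumption, not necessarily the paper's exact fusion formula.

```python
import numpy as np


def multi_attribute_rank(wscores, top_k=5):
    """Rank images for a multi-attribute query.

    wscores: (n_images, n_query_attributes) array of calibrated
    per-attribute probabilities. Fusing by product keeps every
    attribute on an equal footing, so no single attribute's raw-score
    scale can dominate the query. Returns indices of the top_k images."""
    fused = np.prod(np.asarray(wscores, dtype=float), axis=1)
    return np.argsort(fused)[::-1][:top_k]
```

Note how an image that is extreme on one attribute but weak on another (e.g., gender score 0.99 but smiling score 0.1) fuses to a low product, whereas uncalibrated decision scores would let the extreme attribute dominate.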

Target-attribute similarity search results:

Results for similarity searches using a set of target attributes and a query image. By calibrating attribute distances in a local neighborhood around the normalized target attribute values, we can compute similarity in a consistent manner for any set of attributes and query images, despite the fact that perceptual similarity changes quite drastically with the attribute values in question.
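In the normalized multi-attribute space, a target-attribute similarity search reduces to a nearest-neighbor query over the selected attribute axes. The sketch below uses an L1 distance on w-scores; the specific metric and the absence of the local-neighborhood distance calibration described above are simplifying assumptions for illustration.

```python
import numpy as np


def similarity_search(database_wscores, target_wscores, top_k=5):
    """database_wscores: (n_images, n_attributes) calibrated scores;
    target_wscores: (n_attributes,) values taken from the query image.

    Because every axis is a probability in [0, 1], distances are
    comparable regardless of which attributes are selected.
    Returns indices of the top_k most similar images."""
    db = np.asarray(database_wscores, dtype=float)
    target = np.asarray(target_wscores, dtype=float)
    dists = np.abs(db - target).sum(axis=1)   # L1 distance per image
    return np.argsort(dists)[:top_k]
```

Adding contextual attributes, as studied below, simply means including extra columns beyond the query attributes when computing the distance.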

Contextual attributes for improving target-attribute similarity searches:

Using the task template shown in (a), workers were asked to rate the similarity search results for the target image with respect to attributes "blonde hair" and "rosy cheeks." The results of our algorithm are shown in (b) using only the query attributes, and in (c) using additional contextual attributes as well. The latter looks better (e.g., see the relative order of the highlighted faces). Our quantitative experiments used hundreds of thousands of pairwise comparisons to assess hypotheses H2: (b) is similar to human rankings, H3: (c) is similar to human rankings, and H4: (c) is better than (b).

Statistical significance of user-study results:

Summary of quantitative experiments on target-based similarity search, using hypotheses H2, H3, and H4. For each set of query attributes (on the outside of the circles), the hypothesis was evaluated for each cumulative set of 5 ranks (moving from the center circle outwards: top-5, top-10, etc.). A ** means the results were statistically significant at p=0.01, * means p=0.05, and - means not statistically significant. Both algorithmic rankings (query-only and query+context) match human rankings well. Adding contextual attributes improves results for some queries but not others.