Automatic Discovery of Query-Class-Dependent Models for Multimodal Search




In many search applications, cues are available from a variety of information sources. In broadcast news video databases, this information might be conveyed through the visual content of the video image, the timing and pitch of the news anchor's speech, and the closed caption transcript of the words being spoken. The search task, then, is to appropriately utilize all of these cues to best answer the query. In typical systems, separate methods are developed to search each cue (such as a query-by-example image search for the visual modality and a keyword text search over the text transcripts) and the results are then combined. It quickly becomes apparent that different queries see different levels of success from the various single-modality searches. A search for a named person might fare better using only text keywords, while sports-related queries might benefit most from an example query image. Some example queries and the likelihood of success for various search methods are shown in the figure below. Clearly, we need a mechanism for adapting the search approach according to the query. One approach that has been proposed is to define "classes" of queries, where queries within each class benefit from similar search approaches. Incoming queries can then be classified into one of these query classes and the corresponding strategy can be applied. The problem with this approach is that it is difficult for humans to manually define such classes. We therefore propose to use collections of queries (with known relevance labels in the search set) to automatically learn classes of queries and the optimal search model for each class.
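The combination step described above can be pictured as a weighted linear fusion of the per-modality result lists, with the weights depending on the query's class. The sketch below is illustrative only: the modality names, weight values, and scores are invented for the example, not taken from the project.

```python
# Hypothetical sketch of query-class-dependent fusion: each modality
# produces a relevance score per document, and a class-specific weight
# vector combines them. All names and numbers here are illustrative.

def fuse_scores(modality_scores, weights):
    """Combine per-modality document scores with a weighted linear sum
    and return document ids ranked by the fused score."""
    fused = {}
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 0.0)
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)

scores = {
    "text":  {"doc1": 0.8, "doc2": 0.2},
    "image": {"doc1": 0.1, "doc2": 0.9},
}
# A named-person query might weight the text modality heavily...
print(fuse_scores(scores, {"text": 0.9, "image": 0.1}))  # doc1 first
# ...while a sports query leans on the example image.
print(fuse_scores(scores, {"text": 0.3, "image": 0.7}))  # doc2 first
```

The point of making the weights class-dependent, rather than fixed, is exactly the observation above: no single weighting serves both the named-person query and the sports query well.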

Performance and Semantic Space

What criteria should we use when forming classes of queries? There are two constraints to consider. On the one hand, it is intuitive that queries within a class should rely on similar search methods: we expect each query to see strong performance from some individual search methods and weak performance from others, and this pattern should be consistent among queries within any given class. On the other hand, when a query arrives, all we have are the few keywords or example images that the searcher has provided. We do not know the relative performance of each of the search methods, so we cannot directly find the class of queries with similar performance; we can only find queries with similar keywords or examples.
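The query-time constraint can be made concrete with a small sketch: an incoming query is assigned to the class whose centroid is nearest in the semantic features alone, since performance features are unavailable before the search is run. The feature meanings and centroid values below are hypothetical.

```python
# Sketch of query-time class assignment using only semantic features.
# The two features (say, person-entity count and noun count) and the
# class centroids are invented for illustration.

def nearest_class(query_sem, class_centroids):
    """Assign a query to the class whose semantic centroid is closest
    in squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(class_centroids, key=lambda c: dist(query_sem, class_centroids[c]))

centroids = {
    "named-person": [1.0, 0.0],
    "sports":       [0.0, 1.0],
}
print(nearest_class([0.9, 0.2], centroids))  # -> "named-person"
```

This is why, as argued above, classes that are scattered in semantic space are problematic: the nearest-centroid assignment has nothing reliable to latch onto.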

We can view these two measures of the query as two separate spaces in which the query can reside. We call the space of performances of the various search methods the "performance space" and the space describing the content of the query the "semantic space." We aim to discover classes of queries by placing a set of example queries in these spaces and clustering them, taking the resulting clusters to be the query classes. In the figure below, we show nine hypothetical queries mapped into both performance and semantic spaces. In figure (a), we show the results of clustering in performance space only. The resulting clusters are coherent in terms of performance, but split in terms of semantics, which makes it hard to map new queries into the correct class at query time. Similarly, in figure (b), we show the results of clustering in semantic space alone. The resulting clusters are well-formed in semantic space, so new queries can easily be mapped into them at search time, but they are ill-formed in performance space, so the resulting query classes might not even share appropriate search models. Our solution is shown in figure (c): we cluster in both performance and semantic space simultaneously and end up with clusters that are well-formed in both spaces, meaning that the classes share similar search models and that new queries can easily be placed into the correct class.
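One simple way to realize joint clustering is to concatenate each query's performance and semantic feature vectors, with a weight trading off the two spaces, and cluster the joint vectors. The sketch below uses a plain k-means as a stand-in for whatever clustering algorithm is actually used, and the alpha weight and toy query vectors are assumptions for illustration.

```python
# Minimal sketch of joint performance/semantic clustering.
# alpha and the toy feature vectors are illustrative assumptions.
import random

def joint_vector(perf, sem, alpha=0.5):
    """Concatenate performance-space and semantic-space features,
    with alpha trading off the influence of the two spaces."""
    return [alpha * p for p in perf] + [(1.0 - alpha) * s for s in sem]

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over the joint vectors; returns a label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy queries: (performance over two search methods, two semantic features).
queries = [
    ([0.9, 0.1], [1.0, 0.0]),  # text-heavy, person-like
    ([0.8, 0.2], [0.9, 0.1]),
    ([0.1, 0.9], [0.0, 1.0]),  # image-heavy, sports-like
    ([0.2, 0.8], [0.1, 0.9]),
]
points = [joint_vector(p, s) for p, s in queries]
labels = kmeans(points, k=2)
```

Because both spaces contribute to the distance, clusters that are tight in performance but scattered in semantics (or vice versa) are penalized, which is exactly the behavior figure (c) aims for.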


Experiments and Results

We apply the model to the TRECVID automatic search task. We construct a performance space for the queries by developing several different search methods, including a text-based keyword search, a query-by-example image search, and various query expansion approaches. The performance space is explicitly the performance of each of these methods on the given query, measured in average precision. We construct the semantic space by extracting various pieces of information about the query, such as the occurrence of various types of named entities (such as people or locations) and the counts of various parts of speech, like nouns and verbs. We use a set of roughly 100 queries with known relevance labels to construct the classes and then apply the method to the TRECVID 2004 automatic search task. We compare the method against a query-independent approach (with no classes) and against query-class-dependent approaches with human-defined query classes. We confirm that query-class dependency can significantly improve the performance of multimodal search, and we further find that automatically discovered query classes can outperform human-defined ones.
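Since each coordinate of the performance space is an average precision score, it may help to spell out that measure. The sketch below uses the standard TREC-style definition (mean of the precision values at the ranks where relevant documents are retrieved, divided by the number of relevant documents); the document ids are invented.

```python
# Sketch of average precision, the measure used to populate the
# performance space for each search method on a given query.

def average_precision(ranked_ids, relevant):
    """Sum precision@k at each rank k where a relevant document
    appears, then divide by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant docs d1 and d3 retrieved at ranks 1 and 3:
# precision@1 = 1/1, precision@3 = 2/3, so AP = (1 + 2/3) / 2.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

A query's performance-space vector is then just this score computed once per search method.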

We further investigate the resulting classes by looking at some of the queries that the various classes contain. In the figure below, we show some examples from six of the discovered classes. On the left, we show the relative performance of the queries under each of the individual search methods (Speech transcript, Pseudo-relevance feedback, WordNet expansion, Google expansion, Image, and Person-X). Within each class, we see a tendency to rely on similar search methods. We also see that some of the classes are easily identified semantically: the first is clearly named persons, the second is named locations, and the third contains logos. Some of the classes are strange combinations of semantically distinct topics that nonetheless share similar search strategies. The fourth class, for example, seems to merge vehicles and sports. The fifth class is mostly animals, but contains other topics as well. Finally, some classes are formed very strongly around the performance space, such as the sixth class, which shows strong performance from the image search approach, but it is unclear what the common semantics are.




  1. Lyndon Kennedy, Paul Natsev, Shih-Fu Chang. Automatic Discovery of Query Class Dependent Models for Multimodal Search. In ACM Multimedia, Singapore, November 2005.


Last updated: Sept. 30th, 2008.