NO.024 Dimensionality and Scalability
May 20 - 23, 2013 (Check-in: May 19, 2013 )
- Michael E. Houle
- National Institute of Informatics, Japan
- Vincent Oria
- New Jersey Institute of Technology, USA
- Arthur Zimek
- Ludwig-Maximilians-Universität München, Germany
For many fundamental operations in the areas of search and retrieval, data mining, machine learning, multimedia, recommendation systems, and bioinformatics, the efficiency and effectiveness of implementations depends crucially on the interplay between measures of data similarity and the features by which data objects are represented.
When the number of features (the data dimensionality) is high, similarity values tend to concentrate strongly about their means, a phenomenon commonly referred to as the curse of dimensionality. As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where methods that depend on them lose their effectiveness.
One fundamental task, arising in applications of multimedia, data mining and machine learning, and other disciplines, is that of content-based similarity search. For such applications, features are often sought so as to provide the best possible coverage across a range of anticipated queries. However, for any given query, only a relatively small number of features may turn out to be relevant. When the dimensionality is high, the errors introduced into similarity measurements by the many irrelevant feature attributes can completely overwhelm the contributions of the relevant features.
The way in which the curse of dimensionality manifests itself varies from discipline to discipline. For example:
- In multimedia applications, direct representations of images and video frames typically involve hundreds or thousands of features. The visual vocabularies currently being sought for large image corpuses or video archives can have millions of visual words. These huge feature set sizes can lead to severe problems for clustering, classification, and any other applications that require content-based similarity search.
- Data mining and analysis applications often require the identification of relatively small clusters of mutually-similary data objects (nuggets). Identification of nuggets usually entails the generation of lists of objects similar to query objects, to serve as candidates for cluster membership. Traditional clustering methods may be inadequate in identifying nuggets; even subspace clustering methods that specifically target groupings based on subsets of the feature set may be defeated by the sheer number of possible feature combinations.
To support operations in such areas, feature selection and other dimensional reduction techniques from machine learning are often considered in an attempt to improve the discriminability of similarity measures, and the scalability of methods that depend upon them. Yet even here, the complexity of searching through combinations of features can be prohibitive.
Dimensionality and data modeling
Many researchers and practitioners from different areas who are specifically working on problems involving the selection of features tend to be aware of the difficulties involved with high-dimensional data settings. Researchers in other areas are generally aware that the performance of their solutions depends on the dimensionality of their data sets, but are often not clear as to why.
In general, researchers tend to evaluate the inherent difficulty of tasks in terms of characteristics of the data set, such as the number of records, the number of feature dimensions, or the size of the data in bytes. However, the usual characterizations are not always indicative of the true difficulty involved in the processing of tasks for that set. For any given task performed on two different sets with the same numbers of records, features, and bytes, the performances could vary tremendously. On the other hand, data sets that are difficult to process for tasks in one domain tend to be difficult to process for other tasks in other domains.
Over the past decade or so, new characterizations of data sets have been proposed so as to assess the performance of particular methods. Such characterizations include estimations of distribution, estimation of local subspace dimension, and measures of intrinsic dimensionality of data. Although the applications affected by the curse of dimensionality vary widely across research disciplines, the characterizations and models of data that can be applied to analyze the performance of solutions are very general. Researchers from different disciplines, even though they work with different types of data for very different purposes, can nevertheless make use of a common set of data models and data characterizations. With a common framework in place for the assessment of the complexity of data sets, general techniques developed within one discipline for common subtasks such as search, clustering, classification, matching, and feature selection could conceivably be assessed for their applicability to tasks from other disciplines. Unfortunately, across the various disciplines, most such characterizations are as yet either unknown or underused, and techniques developed within one discipline for coping with the curse of dimensionality are constantly being reinvented by researchers in other disciplines.
The goal of this meeting is to bring together researchers and students active in the areas of databases, data mining, pattern recognition, machine learning, statistics, multimedia, bioinformatics, visualization, and algorithmics who are currently searching for effective and scalable solutions to problems affected by the curse of dimensionality. In particular, the objectives are:
- To survey the existing approaches used in dealing with the curse of dimensionality in these various disciplines, identifying their commonalities, strengths and limitations.
- To identify promising general data characterizations useful for the design and analysis of important subtasks common to these disciplines that are strongly affected by the curse of dimensionality, including (but not limited to) search, classification, clustering, and feature selection.
- To promote the adoption of theoretical analysis by researchers within these various disciplines.
Participants will not be expected to give presentations of fully completed research. Instead, emphasis will be placed on group input into the development of data characterizations, and the identification of future directions for research on dimensionality and scalability.