Data Dependent Dissimilarity Measures

NII Shonan Meeting:

@ Shonan Village Center, October 15-18, 2018


Kai Ming Ting, Federation University, Australia

Takashi Washio, Osaka University, Japan

Ata Kaban, University of Birmingham, UK


Description of the meeting

The aim of this meeting is to provide a forum to

  • Discuss recent developments in data dependent dissimilarity measures,
  • Plan future research directions for the next 2-5 years, and
  • Establish research collaborations towards the planned research.

The conventional data independent distance metric has been the primary means of measuring the dissimilarity of any two points in a given space. However, research in different fields has provided evidence that data dependent dissimilarity, where the data distribution has the primary influence on the dissimilarity, is a better measure for finding the closest match neighbourhood of a query. This is a core computation in automated tasks such as classification, clustering, anomaly detection and information retrieval.

Advocates of data dependent dissimilarity include psychologists and computer scientists. Researchers in machine learning have advocated metric learning—a method which learns a mapping such that the mapped points are in the Euclidean space. In the supervised learning context, the mapping amounts to reducing the distance between points of the same class and increasing the distance between points of different classes in the mapped Euclidean space. It is also viewed as a way to learn a generalised (or parameterised) Mahalanobis distance, subject to some optimality constraint, from a dataset.
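For reference, the generalised Mahalanobis distance mentioned above is commonly written as below. The symbols M and L are notation introduced here for illustration and do not come from the meeting description.

```latex
\[
  d_M(x, y) = \sqrt{(x - y)^\top M \,(x - y)}, \qquad M \succeq 0 .
\]
% Factorising M = L^\top L shows the link to a linear mapping:
\[
  d_M(x, y) = \lVert L x - L y \rVert_2 .
\]
```

Under this notation, learning M amounts to learning a linear map L after which the ordinary Euclidean distance is used. Choosing M as the inverse of the data covariance matrix recovers the classical Mahalanobis distance.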

Some data dependent dissimilarity measures that require no learning have also been proposed, for instance the Mahalanobis distance, the term-weighted cosine distance, cdf and rank transformations, and information theoretic definitions of similarity.
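As a minimal sketch of one such learning-free measure, the code below applies an empirical cdf transformation to one-dimensional data and measures dissimilarity as the absolute difference of cdf values. The function names and the sample are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cdf F of a 1-D sample: F(v) = fraction of sample <= v."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda v: bisect_right(ordered, v) / n


def cdf_dissimilarity(x, y, F):
    """Dissimilarity of x and y after the cdf transformation."""
    return abs(F(x) - F(y))


# Hypothetical sample: dense between 0 and 1, sparse beyond that.
sample = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0, 10.0]
F = ecdf(sample)

# Two pairs with the same Euclidean gap of 0.5.
dense_pair = cdf_dissimilarity(0.2, 0.7, F)   # 5 of 12 points lie in (0.2, 0.7]
sparse_pair = cdf_dissimilarity(5.0, 5.5, F)  # no points lie in (5.0, 5.5]
```

Although both pairs are 0.5 apart in Euclidean terms, the pair in the dense region receives the larger dissimilarity, because the measure depends on how much data lies between the two points.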

The need for data dependent dissimilarities has come up in various forms, implicitly or explicitly, in different subfields of machine learning and data mining. Examples include kernel methods, new definitions of similarity or dissimilarity for structured types of data, and the use of side information or ‘privileged information’, i.e., additional data available only at training time, to inform the choice of metric.

It is interesting to note that many existing data dependent dissimilarity measures are either metrics or pseudo-metrics. This is due to the following assumption: a necessary condition for the above-mentioned automated tasks is that the dissimilarity measure must be a metric.

Psychological tests conducted in the 1970s have shown that the dissimilarity between two instances, as judged by humans, is influenced by the context of measurement and by other instances in proximity. They suggest that a dissimilarity measure akin to human-judged dissimilarity is one that interprets two instances in a dense region to be less similar than two instances of equal interpoint distance located in a less dense region. In addition, the judged dissimilarity does not satisfy the metric constraints.
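The behaviour described above can be illustrated with a toy one-dimensional measure that scores two points by the fraction of data lying in the closed interval between them. This is a hypothetical sketch written for this description, not a measure taken from a specific paper, and the function name and sample are assumptions.

```python
def interval_mass(x, y, sample):
    """Fraction of sample points in the closed interval spanned by x and y."""
    lo, hi = min(x, y), max(x, y)
    return sum(lo <= v <= hi for v in sample) / len(sample)


# Hypothetical sample: dense between 0 and 1, sparse beyond that.
sample = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0, 10.0]

# Equal interpoint distance (0.5), different local density.
dense_pair = interval_mass(0.2, 0.7, sample)   # 6 of 12 points in [0.2, 0.7]
sparse_pair = interval_mass(5.0, 5.5, sample)  # 1 of 12 points in [5.0, 5.5]

# Self-dissimilarity depends on location, so d(x, x) = 0 does not always hold.
self_on_data = interval_mass(0.2, 0.2, sample)  # 1 of 12 points equals 0.2
self_in_gap = interval_mass(3.0, 3.0, sample)   # no point equals 3.0
```

The pair in the dense region is judged less similar than the equally spaced pair in the sparse region. Because the self-dissimilarity of a point can be non-zero, this toy measure is not a metric.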

Recent research has provided more concrete evidence that nonmetric data dependent dissimilarity measures can be an effective alternative to distance metrics, overcoming weaknesses of existing distance-based neighbourhood algorithms.

This meeting will facilitate a unified understanding by exchange of experiences and by sparking discussion around some of the fundamental questions/issues on this topic:

  • Recent developments in data dependent dissimilarity (metric or nonmetric).
  • Are metric constraints a necessary condition for automated tasks?
  • What are the conditions under which a distance metric fails to perform?
  • Would data dependent dissimilarity measures “succeed” under those conditions?
  • What kinds of nonmetric dissimilarity measures are good alternatives to distance metrics?
  • Current optimisation tools have been largely based on metrics. What optimisation tools exist for nonmetric measures? What optimisation tools need to be developed for them?

As dissimilarity is a core computation in automated tasks, this forum will be of interest to researchers in both academia and industry across different fields.
