NO.020 Whole-Session Evaluation of Interactive Information Retrieval Systems
October 9 - 12, 2012 (Check-in: October 8, 2012)
- Nicholas Belkin
- Rutgers University, USA
- Susan Dumais
- Microsoft Research, USA
- Noriko Kando
- National Institute of Informatics, Japan
- Mark Sanderson
- RMIT University, Australia
Information retrieval (IR) has a long and proud history of evaluating IR system performance, from the beginnings of such research in the early 1960s at the Cranfield Institute of Technology, UK, through the most recent (the 20th) Text REtrieval Conference (TREC), held in 2011 at the National Institute of Standards and Technology, Gaithersburg, MD, USA. This strong emphasis on rigorous evaluation has been a hallmark of this area of computer and information science. Over the years it has led to ever more substantive models of IR in general and of IR techniques in particular, and to a substantial increase in the performance of IR systems.
However, the specific evaluation paradigm that has been almost universally applied in IR research is, in some respects, quite limited. It may no longer be applicable to the evaluation of contemporary and future IR systems, nor to the development of better theories of IR and more effective IR systems.
The problem that we and others have noted (see, e.g., the 2009 Dagstuhl Seminar on Interactive Information Retrieval, the 2010 ACM SIGIR Workshop on Simulation of Interaction: Automated Evaluation of Interactive IR, and the TREC 2009 and 2010 Session Tracks) is that the standard mode of evaluation measures how well the IR system responds to a single query put to the system. “How well” explicitly means some measure of the extent to which items “relevant” to the query are retrieved in response to it (e.g., in a ranked list, how high in the list they appear), and the extent to which “non-relevant” items are not retrieved. This type of evaluation is based on a so-called “test collection”, which consists of a static database of information objects, a static set of “topics” representing information needs, and a static set of judgments of the relevance of the information objects to each topic.
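To make the single-query paradigm concrete, the following is a minimal sketch of how a ranked list is typically scored against static relevance judgments. Precision at k and average precision are standard measures of this kind, but the function names, the toy document identifiers, and the judgments below are illustrative assumptions, not part of any particular test collection.

```python
from typing import List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears,
    normalised by the total number of relevant items for the topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical topic: three items judged relevant, one ranked list for one query.
judged_relevant = {"d1", "d3", "d7"}
ranked_list = ["d3", "d2", "d1", "d5"]

print(precision_at_k(ranked_list, judged_relevant, 2))           # 0.5
print(round(average_precision(ranked_list, judged_relevant), 3))  # 0.556
```

Note that both measures take exactly one ranked list and one fixed set of judgments as input; nothing in them refers to earlier or later queries by the same searcher.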
Why do we see this model of evaluation as problematic? Because people’s interactions with IR systems are not, in the general case, limited to one query and one response, but rather take place over what we call a “search session”: a sequence of interactions between the IR system and the information seeker. These interactions involve not just issuing queries and identifying relevant objects, but also serve other goals, such as learning about the database within which the person hopes to find information, determining how best to query the system, learning about a topic of interest, and so on. We will call such a system an Interactive IR (IIR) system, to distinguish it from the type of non-interactive system whose performance the traditional paradigm evaluates. An effective IIR system is one which supports the searcher throughout the search session, and indeed with respect to the search session as a whole.
The problem we have identified is reasonably clear: for further progress in theoretical understanding and increased effectiveness of IIR systems, their performance should be evaluated in terms of how well they support searchers with respect to whole search sessions. How actually to evaluate such interactions, however, is not at all clear. We believe that achieving this goal will require a quite new evaluation paradigm, based on new and more reasoned understandings of IIR itself, with new measures of performance and new methods for applying such measures. Thus, the goal of the meeting we propose is to bring together researchers in IIR theories, techniques and methods; in evaluation of IR systems in general; and in particular in session-based evaluation of IIR systems, in order to form a framework within which a new paradigm of evaluation of information retrieval systems can be developed.
There are many issues which arise as problematic when considering session-based evaluation of IIR systems. Here, we mention a few such issues as examples of topics for discussion at the proposed meeting.
A fundamental problem in such an evaluation paradigm is how to identify the goal of a search session, in order to evaluate the performance of the IIR system in helping the searcher to achieve that goal. Although there has been some work on characterizing the tasks that lead people to engage with IIR systems, this is still a thorny and unsolved problem. In this vein, it will also be necessary to evaluate the IIR system not only with respect to overall performance, but also with respect to how well it supports the various activities that the searcher engages in during the search session. Again, although there has been some research on sequences of activities within a search session, how to characterize the goal of each activity, and how to evaluate with respect to an optimal sequence of activities, are open questions.
Furthermore, it has been shown that for many tasks which lead to information seeking behavior, people engage in multiple search sessions over time. The current state of the art in IR research is only at the very beginning of understanding such behavior, with quite limited ideas about how to evaluate system support for tasks which extend and evolve over time.
Beyond these fundamental issues associated with understanding the nature of the IIR situation, there are quite substantive methodological problems associated with the evaluation of whole search sessions. A fundamental tenet of IR evaluation has been replicability, together with the ability to compare the performance of different theories, as implemented in IR techniques and approaches, with one another. But search sessions are inherently dynamic, and differ from one another even when people search on the same topic. How to measure and compare performance between different IIR systems in such circumstances is quite unclear. Furthermore, in order to make such comparisons, it might be necessary to have some standard collection of search sessions, to which different IIR techniques could be applied and their effects measured. There are at least three views on how this issue could be addressed. One is to collect a database of search sessions, for instance through crowdsourcing or similar techniques. This is a method that was used by the TREC Session Track, but it has many problems associated with it. Another is to construct a database of simulated search sessions, based either on some specific, manipulable searcher models, or on data collected by search engines. A third is to give up on the idea of a test collection of search sessions, and to rely instead upon common methods for studying live search sessions, either in situ or in experimental settings. At the moment, not only is there no agreement on how to address this problem, but even how the proposed positions would be implemented is quite unclear.
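As one illustration of the second view, a simulated search session could be generated from a searcher model with a few manipulable parameters. The following is only a toy sketch under strong assumptions: the `SearcherModel` parameters (a fixed list of query reformulations, an examination depth, click probabilities, and a stopping target), the `search` callable, and the toy index are all hypothetical and are not drawn from any existing simulation framework.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Interaction:
    """One step of a session: a query, the results shown, and the items clicked."""
    query: str
    results: List[str]
    clicked: List[str]


@dataclass
class SearcherModel:
    """Hypothetical, manipulable parameters describing a simulated searcher."""
    queries: List[str]                  # reformulations tried, in order
    examine_depth: int = 3              # how many results are examined per query
    click_prob_relevant: float = 0.9    # chance of clicking a relevant item
    click_prob_nonrelevant: float = 0.1 # chance of clicking a non-relevant item
    target_relevant: int = 2            # stop once this many relevant items are found


def simulate_session(
    model: SearcherModel,
    search: Callable[[str], List[str]],
    relevant: Set[str],
    seed: int = 0,
) -> List[Interaction]:
    """Generate one simulated session; the same seed reproduces the same session."""
    rng = random.Random(seed)
    found: Set[str] = set()
    session: List[Interaction] = []
    for query in model.queries:
        results = search(query)
        clicked: List[str] = []
        for doc in results[: model.examine_depth]:
            is_relevant = doc in relevant
            p = model.click_prob_relevant if is_relevant else model.click_prob_nonrelevant
            if rng.random() < p:
                clicked.append(doc)
                if is_relevant:
                    found.add(doc)
        session.append(Interaction(query, results, clicked))
        if len(found) >= model.target_relevant:
            break  # the simulated searcher is satisfied and ends the session
    return session


# Toy "system": a fixed mapping from query strings to ranked lists (assumed data).
toy_index: Dict[str, List[str]] = {
    "q1": ["d1", "d2"],
    "q2": ["d3", "d4"],
    "q3": ["d5", "d6"],
}
model = SearcherModel(queries=["q1", "q2", "q3"])
session = simulate_session(model, lambda q: toy_index.get(q, []), {"d1", "d3"}, seed=42)
for step in session:
    print(step.query, step.clicked)
```

Fixing the seed makes a simulated session replicable, and swapping the `search` callable lets the same searcher model be run against different systems. Whether sessions generated in this way resemble real searcher behavior is exactly the kind of question left open above.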
We propose to address these and related issues in the meeting in a relatively structured way. The first day of the meeting would be devoted to discussion of the basic nature of IIR, and in particular of search sessions, their goals and their structure. This can build upon the results of the 2009 Dagstuhl Seminar on Interactive Information Retrieval. The second day would be devoted to considering measures for evaluation of whole search sessions and their components, and the third day to methods for conducting such evaluations, including the issues of test collections and their alternatives. The fourth day would involve the participants creating a summary document of the results of the meeting, and planning further activity in developing a new paradigm for whole-session-based evaluation of IIR. We hope in particular to develop proposals for whole search session evaluation for the TREC, NTCIR and perhaps INEX and CLEF evaluation exercises/conferences.