Mining & Modeling Unstructured Data in Software ‐ Challenges for the Future

NII Shonan Meeting:

@ Shonan Village Center, March 7-10, 2016

NII Shonan Meeting Report (ISSN 2186-7437): No. 2016-3


  • Sonia Haiduc, Florida State University, USA
  • Takashi Kobayashi, Tokyo Institute of Technology, Japan
  • Michele Lanza, University of Lugano, Switzerland
  • Andrian Marcus, University of Texas at Dallas, USA

Overview

Description of the meeting

To analyze, comprehend, and reverse engineer software projects and their software development processes, developers rely on various sources of information. Bug reports, execution logs, mailing lists, code review reports, change logs, requirements documents, and the actual source code contain implicit developer knowledge about the project and past development efforts. Most of this knowledge is captured as unstructured information, that is, natural language text used to exchange information among people.

Researchers in the Information Retrieval (IR), Data Mining (DM), and Natural Language Processing (NLP) fields have experimented with various techniques (such as Latent Dirichlet Allocation and the Vector Space Model) and ad‐hoc approaches to enable the mining of unstructured data from software artifacts. However, these techniques were not designed to work with the complexities and peculiarities of unstructured software engineering data, and thus are not readily applicable to the software engineering research domain.
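As a rough illustration of one such peculiarity, the sketch below applies a plain Vector Space Model (TF-IDF weights with cosine similarity) to two code snippets and a natural language query. The snippets, the query, and all function names are hypothetical and chosen only for this example; the identifier-splitting step is one common preprocessing choice, not a method prescribed by the meeting. With ordinary word tokenization, an identifier such as `parseConfigFile` is a single opaque term and shares nothing with the query; splitting camelCase and snake_case identifiers exposes the words inside it.

```python
import math
import re
from collections import Counter


def tokenize(text, split_identifiers=False):
    """Return lowercased tokens; optionally split camelCase and snake_case."""
    if split_identifiers:
        text = text.replace("_", " ")
        # Insert a space at each lower-to-upper boundary,
        # e.g. "parseConfigFile" becomes "parse Config File".
        text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return [t.lower() for t in re.findall(r"[A-Za-z0-9_]+", text)]


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents, split_identifiers=False):
    """Score each document against the query in a shared vector space."""
    token_lists = [tokenize(d, split_identifiers) for d in documents]
    token_lists.append(tokenize(query, split_identifiers))
    vectors = tfidf_vectors(token_lists)
    query_vector = vectors.pop()
    return [cosine(query_vector, v) for v in vectors]


# Hypothetical artifacts, invented for illustration only.
documents = [
    "def parseConfigFile(path): return loadSettings(path)",
    "def render_login_page(user): return template",
]
query = "cannot parse config file"

print("plain tokens:     ", rank(query, documents))
print("split identifiers:", rank(query, documents, split_identifiers=True))
```

In this toy setting the plain tokenizer finds no overlap between the query and either snippet, while the identifier-aware tokenizer gives the first snippet a non-zero score. Real artifacts raise further issues (abbreviations, stack traces, mixed code and prose) that this sketch does not attempt to handle.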

The challenges for both researchers and practitioners are to determine the appropriate set of techniques to tackle the problem at hand and to understand how to use them effectively.

The Shonan Meeting aims to tackle these challenges and make mining unstructured data clear, accessible, and applicable to the software engineering domain. We propose to achieve this via three paths:
1. First, we invite peers to give a written description of their experiences with mining unstructured data, in the form of short (2‐page) papers to be presented at the meeting, by sharing the techniques they used, the challenges they faced, and the solutions that they found successful.
2. Second, we encourage discussion and dissemination of the presented work in the extended group discussion sessions that follow.
3. Third, we organize a group discussion in the form of a panel, according to the “fishbowl” technique, to identify and discuss topics that are most relevant to the meeting participants. By collecting available techniques, solutions, and challenges yet to be overcome, we aim to advance the state‐of‐the‐art in mining unstructured software engineering data.

We will complement the above with at least two keynotes given by prominent researchers with both industrial and academic backgrounds, to “set the stage” for a highly productive meeting.

The meeting thus aims to address topics including, but not limited to, the following:
1) Applications of unstructured data mining techniques to support software maintenance and reverse engineering tasks (e.g., feature location, traceability) and to enhance software quality;
2) Novel sources of unstructured data, such as mobile app stores, phone records, screenshots, interviews, or wiki pages;
3) Usage of NLP, IR, and ML techniques for mining unstructured data;
4) Classification and dissemination of techniques for extracting unstructured data;
5) Identification of open research challenges and proposed solutions;
6) Approaches for handling imperfect data, such as summarization approaches;
7) Novel extractors for unstructured data and performance evaluation with respect to existing techniques;
8) Linking of unstructured and structured data for richer information;
9) Negative results (“what did not work”) when mining unstructured data, and experience reports;
10) Large‐scale mining of unstructured data in Big Data environments.

We aim to facilitate in‐depth discussions of techniques for mining unstructured data, their similarities and differences, applications in modern Data Mining, as well as potential pitfalls and problems. The intended outcomes of this meeting are to:
1) Facilitate knowledge‐exchange in the field of mining unstructured software data and practical applications of techniques through presentations of short (2‐page) paper submissions.
2) Establish connections between the various research communities that mine unstructured data, resulting in cross‐fertilization of techniques and methodologies.
3) Put techniques and methodologies for mining unstructured data in a common framework, enabling researchers and practitioners to find the appropriate tools that meet their particular data mining needs.
4) Identify open problems and challenges for mining unstructured data, providing the basis for a roadmap of future research opportunities in mining unstructured data research.
5) Educate on, discuss, and advance the state‐of‐the‐art in mining unstructured data.

Expected outcomes and impact

The goal of this meeting is to facilitate the cross‐fertilization of three diverse research communities, namely the one on mining software repositories, the one on mining unstructured data, and the one on software summarization. We believe that at the intersection of these three research fields lies a vast and still underexplored research territory, which can only be investigated if the approaches developed by the three communities are merged in a synergetic way.

Given the wide range of expertise needed in the research on mining unstructured data from software artifacts, collaborations are often necessary. This workshop will be the perfect tool to facilitate such collaborations.
