Mining & Modeling Unstructured Data in Software – Challenges for the Future


NII Shonan Meeting Seminar 084


To analyze, comprehend, and reverse engineer software projects and their software development processes, developers rely on various sources of information. Bug reports, execution logs, mailing lists, code review reports, change logs, requirements documents, and the actual source code contain implicit developer knowledge about the project and past development efforts. Most of this knowledge is captured as unstructured information, that is, natural language text used to exchange information among people.

Researchers in the Information Retrieval (IR), Data Mining (DM), and Natural Language Processing (NLP) fields have experimented with various techniques (such as Latent Dirichlet Allocation and the Vector Space Model) and ad‐hoc approaches to enable the mining of unstructured data from software artifacts. However, these techniques were not designed to work with the complexities and peculiarities of unstructured software engineering data, and thus are not readily applicable to the software engineering research domain.

The challenges for both researchers and practitioners are to determine the appropriate set of techniques to tackle the problem at hand and to understand how to use them effectively.


Schedule overview

Sunday, March 6, 2016 – check-in starts at 3 pm – welcome banquet 7-9 pm at Katsura Restaurant, 2nd floor

Monday, March 7, 2016 – full workshop day

Tuesday, March 8, 2016 – full workshop day

Wednesday, March 9, 2016 – half workshop day followed by social event (a trip near Shonan)

Thursday, March 10, 2016 – half workshop day – the workshop ends at lunch time


Monday March 7, 2016

7:00 – 9:00 am – Breakfast on site at the cafeteria

9:00 – 10:30 am – Introduction

10:30 – 11:00 am – Coffee Break 1

11:00 – 12:30 pm – Tutorial – Island Parsing – Luca Ponzanelli, Andrea Mocci

  • Artifacts containing natural language, like Q&A websites (e.g., Stack Overflow), tutorials, and development emails, are essential to support software development, and they have become a popular subject for software engineering research. The analysis of such artifacts is particularly challenging because of their heterogeneity: these resources consist of natural language interleaved with fragments of multiple programming and markup languages. In this tutorial, I will focus on our efforts towards a systematic approach to model the contents of such artifacts, enabling holistic analyses that fully exploit their intrinsic heterogeneous nature. In particular, I will illustrate our StORMeD framework and how its parsing service can be effectively used to implement a holistic summarizer for Stack Overflow discussions that takes into account both the narrative and the structured fragments extracted and modeled from the contents.
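As a rough illustration of the island-parsing idea described in this abstract, the Python sketch below treats code-like fragments as "islands" in a "sea" of natural language. It is not the StORMeD implementation or its API: the two regular expressions and the `split_islands` helper are assumptions made for this example, and a real island parser would use a grammar rather than regexes.

```python
import re

# Hypothetical "island" patterns, chosen only for illustration:
# a call such as list.add(item), or a Java-style declaration such as int n = 0;
ISLAND = re.compile(
    r"""
    (?P<call>\b[A-Za-z_][\w.]*\([^()]*\))
    |
    (?P<decl>\b(?:int|String|boolean|double)\s+\w+\s*=\s*[^;]+;)
    """,
    re.VERBOSE,
)


def split_islands(text):
    """Split text into (kind, fragment) pairs.

    kind is 'text' for natural language (the sea), or 'call' / 'decl'
    for a recognized code fragment (an island).
    """
    parts = []
    pos = 0
    for match in ISLAND.finditer(text):
        if match.start() > pos:
            # Everything between two islands is kept as plain text.
            parts.append(("text", text[pos:match.start()]))
        parts.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts


if __name__ == "__main__":
    sentence = "You should call list.add(item) after int n = 0; is set."
    for kind, fragment in split_islands(sentence):
        print(kind, repr(fragment))
```

Keeping the sea fragments alongside the islands, instead of discarding them, is what allows later analyses to use both the narrative and the code.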

12:30 – 2:00 pm – Lunch on site at the cafeteria

2:00 – 3:30 pm – Code inside the MUD – talks by Robillard, Treude, Washizaki, Hayashi – followed by fishbowl conversation

  • Martin Robillard (McGill University, Canada) – Finding the Meaning of Life in API Documentation
    Abstract: Learning resources are crucial for helping developers learn to use software development technologies. However, the gap between the information needs of developers and the externalized knowledge to meet these needs shows no sign of closing. I will discuss the problem of automatically finding information that explains a certain technology, and discuss how far we got with a technique to find information that explains how to use API types.
    Questions: 1) Do we even have a working definition of “information need” for software development? 2) How can we ascertain whether a certain document fulfills an information need? What is the criterion for deciding that the search for information can be over? Can we even simplify our characterization of the process to this extent? 3) What are the different approaches for explaining software development concepts?
  • Christoph Treude (University of Adelaide, Australia) – Using NLP to Identify Meaningful Sentences in Informal Documentation – Slides
    Abstract: Sentences on Stack Overflow or other informal documentation sites are often not meaningful on their own without their surrounding code snippets or the question that prompted a given answer. Based on a study to identify sentences from Stack Overflow that are related to a particular API type and that provide insight not contained in the API documentation of that type, we discuss a set of NLP features that can help identify sentences that are meaningful on their own, including co-occurring part-of-speech tags and the presence of the verb “to be”.
    Questions: 1) What is the ideal format/structure of documentation from the perspective of a documentation user? 2) What knowledge is impossible to communicate through source code examples but only through natural language text?
  • Hironori Washizaki (Waseda University, Japan) – Towards Trace-Any: Interactive and Transitive Recovery of Traceability Links – Slides
    Abstract: Recovering important missing links in software is key to successful maintenance, for example when identifying locations that need correction. Towards tracing any software material at any abstraction level, this talk discusses two techniques for recovering traceability links: log-based interactive recovery involving link recommendation and user feedback, and transitive recovery by connecting different links.
    Questions: 1) How can we recover and maintain links among structured and unstructured software artifacts? How different is unstructured-artifacts-link recovery from structured-artifacts-link recovery? 2) Do existing links support recovery and maintenance of other (possibly missing) links among structured and unstructured software artifacts? Can developers specify whether each link candidate among structured and unstructured software artifacts is correct or not? 3) Are there different types of links among structured and unstructured software artifacts? Do links among structured and unstructured software artifacts contribute to software development, maintenance and operation?
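The idea of transitive recovery mentioned in the Washizaki abstract, connecting existing links to propose new ones, can be sketched in a few lines of Python. This is only an illustration under simple assumptions (links are plain source/target pairs, and chaining two known links yields a candidate); it is not the technique presented in the talk, and the artifact names and the `transitive_candidates` helper are hypothetical.

```python
from collections import defaultdict


def transitive_candidates(links):
    """Propose indirect links obtained by chaining two known links.

    links: iterable of (source, target) pairs that are already known.
    Returns the set of (source, target) pairs reachable in exactly two
    steps that are not already known links.
    """
    links = list(links)
    outgoing = defaultdict(set)
    for src, dst in links:
        outgoing[src].add(dst)

    known = set(links)
    candidates = set()
    for src, mids in outgoing.items():
        for mid in mids:
            # .get() avoids inserting new keys while we iterate.
            for dst in outgoing.get(mid, ()):
                if dst != src and (src, dst) not in known:
                    candidates.add((src, dst))
    return candidates


if __name__ == "__main__":
    known_links = [
        ("REQ-1", "design.md"),        # requirement -> design document
        ("REQ-2", "design.md"),
        ("design.md", "Parser.java"),  # design document -> code
    ]
    for link in sorted(transitive_candidates(known_links)):
        print(link)
```

The output is a set of candidates, not confirmed links; in an interactive setting each candidate would still be shown to a developer for confirmation.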

3:30 – 4:00 pm – Coffee Break 2

4:00 – 5:30 pm – MUD outside the Code – talks by Gorla, Orso, Medvidovic – followed by fishbowl conversation

  • Alessandra Gorla (IMDEA, Spain) – Is code behaving as expected? Extracting expected behavior from natural language artefacts
    Abstract: Natural language artefacts often encode important information regarding the expected behavior of code artefacts. Analyzing natural language artefacts to extract such information and using it to check whether it matches the actual behavior of code artefacts can highlight the presence of faults or covert behavior. In this talk I will present three ideas along this line. The first one is about using Android app descriptions and comparing them against implementations to identify covert – and often malicious – behaviour. The second one has the same goals, but uses the text of UI element labels instead of app descriptions. Finally, the third one is about analyzing Javadoc comments to automatically generate test oracles to highlight faults in Java methods.
    Questions: 1) How can we improve the effectiveness of the NLP component of the first technique? 2) Can mismatches between UI element labels and actions identify other types of problems? 3) The main challenge of the third idea is to match concepts in the Javadoc with code elements. Can we do better?
  • Alex Orso (Georgia Institute of Technology, USA) – Generating Tests for Android Apps from Natural Language Bug Reports and App Reviews – Slides
    Abstract: As confirmed by a recent survey conducted among developers of the Apache, Eclipse, and Mozilla projects, two extremely challenging tasks during maintenance are reproducing and debugging field failures – failures that occur on user machines after release. Unfortunately, the information provided by users in bug reports or, even worse, app reviews is in most cases too limited to allow for reproducing, further investigating, and ultimately understanding a failure. To help developers with these tasks, we plan to leverage program analysis and NLP techniques to (1) infer a set of steps that can lead to a failure from a natural language bug report or app review, (2) match these steps to graphical elements of the app, source code elements, or both, and (3) synthesize test cases that mimic the reported field failure and can be used to debug it. Because this project is in its very initial phase, the main goal of this talk is to introduce the project’s motivation and goals, present some very preliminary results, and discuss the next steps of the work.
    Questions: 1) Is it feasible? What are the metrics of success? 2) What technologies are best suited for the tasks involved? 3) Can we leverage multiple similar reviews/bug reports? 4) Can we do this interactively (as the users enter their reviews)?
  • Nenad Medvidovic (University of Southern California, USA) – Extracting the Essence of Software Systems’ Architectures through Unstructured-Data Mining – Slides
    Abstract: Engineers frequently neglect to carefully consider the impact of their changes to a software system. As a result, the software system’s architecture eventually deviates from the original designers’ intent and degrades through unplanned introduction of new and/or invalidation of existing design decisions. Architectural decay increases the cost of making subsequent modifications and decreases a system’s dependability, until engineers are no longer able to effectively evolve the system. At that point, the system’s actual architecture must be recovered from the implementation artifacts. However, this is a time-consuming and error-prone process, and leaves critical issues unresolved: the problems caused by architectural decay will likely be obfuscated by the system’s many elements and their interrelationships – the epitome of unstructured data – thus risking further decay. In this talk I will focus on pinpointing the locations in a software system’s architecture that reflect architectural decay, the points in time when that decay tends to occur, and the reasons why that decay occurs. Specifically, I will present an emerging catalogue of commonly occurring symptoms of decay – architectural “smells”. I will illustrate the occurrence of smells identified in the process of recovering the architectures of a large number of real-world systems. I will also highlight the relationship between architectural smells and the much better understood code smells. Finally, I will touch upon several undesirable but common occurrences during the evolution of existing systems that directly contribute to decay. I will conclude by identifying a number of simple steps that engineers can undertake to stem software system decay.
    Questions: 1) How do you ensure that you are analyzing the right architectural view(s)? 2) Is there empirical evidence that architectural decay is really undesirable? 3) What is the relationship between code decay (“smells”) and architectural decay?

6:00 – 7:30 pm – Dinner on site at the cafeteria


Tuesday March 8, 2016

7:00 – 9:00 am – Breakfast on site at the cafeteria

9:00 – 10:30 am – MUD inside the Code – talks by Peng, Arnaoudova, Maletic – followed by fishbowl conversation

  • Xin Peng (Fudan University, China) – Interactive Code and Knowledge Search Supported by Text Analysis
    Abstract: Open-source code repositories and question-and-answer websites provide a huge amount of useful code and development knowledge. However, developers often find it hard to locate the code and answers they need using keyword-based search. To improve the current practice, we think it is beneficial to provide more advanced code and knowledge search support that can better capture both the intent of developers’ search requests and the meaning of code fragments and questions. In this talk, I will introduce our past and ongoing work on interactive code and development knowledge search supported by text analysis.
    Questions: 1) Can we provide better ways for developers to express what they need for code or development knowledge? 2) How can we identify the essential intent and meaning of a code fragment and question? 3) How can we combine developers’ judgement and tools’ automatic analysis to produce better search results?
  • Venera Arnaoudova (Washington State University, USA) – Quality of Source Code Lexicon – Slides
    Abstract: It has been well documented that a large portion of the cost of any software lies in the time spent by developers in understanding a program’s source code before maintenance, repairs, or updates can be undertaken. To understand software, developers spend a considerable amount of time reading the source code lexicon, i.e., the identifiers (names of programming entities such as classes or variables) and comments that are used by developers to embed domain concepts and to communicate with their teammates. In this talk we will review existing metrics for lexicon quality and how lexicon quality has been related to program understanding and to software quality.
    Questions: 1) What process should we follow to identify poor lexicon? 2) How should we prioritize the reporting of poor lexicon? 3) How do we define guidelines for writing good quality lexicon?
  • Jonathan Maletic (Kent State University, USA) – Part-Of-Speech Tagging of Source Code Identifiers and Comments
    Abstract: An approach that uses heuristics and static program analysis information to mark up program identifiers with part-of-speech tags is presented. It does not use a natural language part-of-speech tagger for identifiers within the code. A set of heuristics is defined, akin to natural language usage, for how identifiers are used in code. Additionally, method stereotype information, which is automatically derived, is used in the tagging process. The approach is built using the srcML infrastructure and adds part-of-speech information directly into the srcML markup.
    Questions: 1) Does it really make sense to use natural language parsing on source code, given that source code is not natural language? 2) Do we read and interpret source code differently from NL prose? 3) Do we need a Geek net, that is, a word net generated solely for the domain of source code?
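To make the notion of heuristic part-of-speech tagging of identifiers more concrete, here is a deliberately crude Python sketch. It is not the approach from the Maletic talk (which builds on srcML and method stereotypes); the verb list, the single "leading verb" rule, and the helper names are all assumptions chosen for illustration.

```python
import re

# Tiny hand-picked verb list, used only for this illustration.
COMMON_VERBS = {"get", "set", "is", "has", "add", "remove",
                "create", "parse", "find", "update"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for chunk in name.split("_"):
        # Acronyms (XML), capitalized or lowercase words (File, parse), digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [word.lower() for word in words]


def tag_method_name(name):
    """Tag a leading known verb as VERB and every other word as NOUN."""
    tags = []
    for index, word in enumerate(split_identifier(name)):
        if index == 0 and word in COMMON_VERBS:
            tags.append((word, "VERB"))
        else:
            tags.append((word, "NOUN"))
    return tags


if __name__ == "__main__":
    for identifier in ("getUserName", "parseXMLFile", "max_retry_count"):
        print(identifier, tag_method_name(identifier))
```

Even this toy version shows why a general-purpose tagger is a poor fit: identifiers have no sentence context, so position within the name and programming conventions carry most of the signal. The rule above also mislabels adjectives (for example the "empty" in `isEmpty`), which is the kind of case richer heuristics would need to handle.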

10:30 – 11:00 am – Coffee Break 1

11:00 – 12:30 pm – Tutorial – srcML for MUD – Jonathan Maletic

  • The tutorial is intended for those interested in constructing custom software analysis and manipulation tools to support research. srcML is an infrastructure consisting of an XML representation for C/C++/C#/Java source code along with efficient parsing technology to convert source code to-and-from the srcML format. The briefing describes srcML, the toolkit, and the application of XPath and XSLT to query and modify source code. Additionally, a hands-on tutorial of how to use srcML and XML tools to construct custom analysis and manipulation tools will be conducted.
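The kind of XPath query the tutorial refers to can be sketched with Python's standard library. The XML below is a small hand-written fragment that imitates srcML markup for a Java class; real output from the srcML toolkit carries more attributes and elements, so treat the fragment and the `function_names` helper as assumptions for illustration. Note also that `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified srcML-like markup (not produced by the srcML tool).
SRCML = """<unit xmlns="http://www.srcML.org/srcML/src" language="Java">
<class>class <name>Stack</name> <block>{
  <function><type><name>void</name></type> <name>push</name><parameter_list>(<parameter><decl><type><name>int</name></type> <name>x</name></decl></parameter>)</parameter_list> <block>{ }</block></function>
  <function><type><name>int</name></type> <name>pop</name><parameter_list>()</parameter_list> <block>{ }</block></function>
}</block></class>
</unit>"""

NS = {"src": "http://www.srcML.org/srcML/src"}


def function_names(srcml_text):
    """Return the names of all functions found in a srcML-like document."""
    root = ET.fromstring(srcml_text)
    # Select the <name> that is a direct child of each <function>, which
    # skips the names nested inside return types and parameters.
    names = root.findall(".//src:function/src:name", NS)
    # itertext() also handles names that contain nested elements.
    return ["".join(name.itertext()) for name in names]


if __name__ == "__main__":
    print(function_names(SRCML))
```

Because the source text is preserved inside the markup, the same tree can be modified and serialized back, which is the basis for using XML tooling to transform code as well as query it.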

12:30 – 2:00 pm – Lunch on site at the cafeteria

2:00 – 3:30 pm – Other MUDdy stuff – talks by Minelli, Maruyama – followed by fishbowl conversation

  • Katsuhisa Maruyama (Ritsumeikan University, Japan) – Mining Fine-Grained Code Changes to Resolve Merge Conflicts – Slides
    Abstract: I believe that fine-grained code changes behind merge conflicts are useful for resolving those conflicts. My talk presents an idea for supporting such resolution using a tool that both records fine-grained code changes of Java source code and extracts a particular part of them.
    Questions: 1) What unstructured data are useful for code change recommendation? 2) In your programming, do you accept code changes derived from statistical results?
  • Roberto Minelli (University of Lugano, Switzerland) – Mining IDE Interaction Data
    Abstract: Developers continuously interact with Integrated Development Environments (IDEs) while working. These interactions carry a lot of actionable information. For example, researchers recorded the navigation paths followed by developers, and used this information to support source code exploration. However, most of the potential of interaction data is largely unexplored and unused. In our research we developed DFlow, an IDE interaction profiler. We collected over 700 hours of development time recorded with DFlow. In this talk we will illustrate how to interpret and mine this novel and complex source of information. We will understand how developers spend their time inside the IDE, how they navigate source code, and how the user interface of an IDE might impact on their productivity. At the end of our talk, we will also illustrate our vision: Interaction-Aware IDEs, IDEs that improve programmers’ productivity by leveraging all the fine-grained IDE interactions.
    Questions: 1) How can we leverage interaction data? What can we do with it? 2) How do you imagine the IDE of the future? 3) Which data would you consider “sensible” during software development? Which level of privacy would you desire?
  • Shinpei Hayashi (Tokyo Institute of Technology, Japan) – Linking between Unstructured Software Artifacts with Structural Flavor – Slides
    Abstract: Linking different types of software artifacts, e.g., requirements-to-code, requirements-to-rationale, or code-to-code, is a key technique in software engineering, and their textual information is used to match the artifacts. Since typical software artifacts also have behavioral and/or structural aspects even if they are based on natural language descriptions, additionally considering such aspects is useful for better matching. In this talk, I will show some examples of using the (semi-)structure of the textual information in software artifacts to link them.
    Questions: 1) What kind of existing analyses for structured data are useful for enhancing MUD? 2) How can we manage the soundness of the results of MUD?

3:30 – 4:00 pm – Coffee Break 2

4:00 – 5:30 pm – Walk and Talk – informal session during hiking (weather permitting)

6:00 – 7:30 pm – Dinner on site at the cafeteria


Wednesday March 9, 2016

7:00 – 9:00 am – Breakfast on site at the cafeteria

9:00 – 10:30 am – Breakout sessions

10:30 – 11:00 am – Coffee Break 1

11:00 – 12:30 pm – Breakout sessions

12:30 – 2:00 pm – Lunch on site at the cafeteria

2:00 – 10:00 pm – Excursion & Dinner


Thursday March 10, 2016

7:00 – 9:00 am – Breakfast on site at the cafeteria

9:00 – 10:30 am – Breakout summaries

10:30 – 11:00 am – Coffee Break 1

11:00 – 12:30 pm – Breakout summaries + wrap up

12:30 – 2:00 pm – Lunch on site at the cafeteria

Check-out and departure.



Sonia Haiduc – Florida State University, USA

Takashi Kobayashi – Tokyo Institute of Technology, Japan

Michele Lanza – University of Lugano, Switzerland

Andrian Marcus – The University of Texas at Dallas, USA


Venera Arnaoudova – Washington State University, USA
Daniel German – University of Victoria, Canada
Michael Godfrey – University of Waterloo, Canada
Alessandra Gorla – IMDEA, Spain
Sonia Haiduc – Florida State University, USA
Shinpei Hayashi – Tokyo Institute of Technology, Japan
Takashi Kobayashi – Tokyo Institute of Technology, Japan
Nicholas Kraft – ABB Research, USA
Michele Lanza – University of Lugano, Switzerland
Jonathan Maletic – Kent State University, USA
Andrian Marcus – University of Texas at Dallas, USA
Katsuhisa Maruyama – Ritsumeikan University, Japan
Collin McMillan – University of Notre Dame, USA
Nenad Medvidovic – University of Southern California, USA
Marija Mikic-Rakic – Google, USA
Roberto Minelli – University of Lugano, Switzerland
Andrea Mocci – University of Lugano, Switzerland
Vincent Ng – University of Texas at Dallas, USA
Masao Ohira – Wakayama University, Japan
Alessandro Orso – Georgia Institute of Technology, USA
Xin Peng – Fudan University, China
Martin Pinzger – University of Klagenfurt, Austria
Luca Ponzanelli – University of Lugano, Switzerland
Martin Robillard – McGill University, Canada
Christoph Treude – University of Adelaide, Australia
Hironori Washizaki – Waseda University, Japan
Tom Zimmermann – Microsoft Research, USA