Software Analytics: Principles and Practice


NII Shonan Meeting Seminar 037

Fish-bowl Panel

The last session on Thursday (Oct 24) will be a fish-bowl panel (see here for the format of a fish-bowl panel).

For our fish-bowl panel, the panel chair (one of the meeting organizers) will pick a debate/discussion topic from a list of candidate topics, contributed either by the live audience or offline ahead of time (posted below). The chair will designate the line of seats on the left-hand side of the stage to answer “Yes” (i.e., the positive answer) and the line of seats on the right-hand side of the stage to answer “No” (i.e., the negative answer). Many debate/discussion topics may have “It depends” as the real answer (if any), but once a participant takes a seat in one line, that participant has to stick to and defend the assigned answer. Of course, a participant can freely change seats to the other line on the fly. Note that, just as in a real debate, picking a debate side doesn’t imply that the person holds that position in real life! So please participate and speak freely!

Privacy policy: whatever is said in the panel/room stays in the room, and no Tweet or Facebook status is allowed to quote concrete contents of the debate. (Please don’t ask the panel chair for the definition of “concrete”, because a privacy policy by design includes unclear terms! :)

Below are some (somewhat controversial) candidate topics to start with. Please edit this post to add more topic candidates for debate.

  • Software analytics research must focus on research that produces actionable results. (Answer: Yes or No)
  • Software analytics research must focus on research that produces impact on practice. (Answer: Yes or No)
  • Software analytics research should be part of software engineering research (e.g., research on using software data to help with non-software-engineering tasks should not be considered software analytics research). (Answer: Yes or No)
  • Constructing benchmark data for the community to focus on is harmful. (Answer: Yes or No)
  • Research on mining software repositories (MSR) should be limited to mining software repositories (e.g., mining streaming data without having a repository to store them is out of the scope of MSR). (Answer: Yes or No)
  • Software analytics has reached its peak; all the low-hanging fruit has already been collected. (Answer: Yes or No)

Proposed Talks (Speaker Name, Title, Abstract)

For the seminar participants, if you would like to give a talk at the seminar, please add your talk info below by editing this post, including (1) your talk title, (2) your name and affiliation, and (3) your talk abstract.

If your talk abstract is not ready yet, you can omit your talk abstract in your initial talk info, and later add your talk abstract once it is ready by editing this post.

In addition, if you would like to do a short tool demo, please add “[Tool Demo]” to prefix your talk title in your comment.


Title: Checking App Behavior Against App Descriptions
Andreas Zeller, Saarland University

How do we know a program does what it claims to do? After clustering mined Android apps by their description topics, we identify outliers in each cluster with respect to their API usage. A “weather” app that sends messages thus becomes an anomaly; likewise, a “messaging” app would not be expected to access the current location. Applied to a set of 22,000+ Android applications, our approach identified several anomalies, and classified known malware with high precision and recall.


Title: What Makes a Green Miner?
Abram Hindle, University of Alberta

This talk will discuss recent results and the infrastructure behind Green Mining on the Android platform.


Title: Knowledge Engineering for Software Engineering
Yingnong Dang, Microsoft Research Asia

This talk will briefly present a few software engineering projects that we conducted in the Software Analytics group of Microsoft Research Asia, including code clone analysis, change understanding, and API usage mining. The common thread of these projects is extracting knowledge from large-scale codebases to help boost software engineering productivity.


Title: Active Support for Clone Refactoring
Norihiro Yoshida, Nara Institute of Science and Technology

Clone refactoring (merging duplicate code) is a promising solution to improve the maintainability of source code. This talk will discuss research objectives and directions towards the advancement of clone refactoring from the perspective of active support.


Title: [demo] Using the Big Sky environment for software analytics
Robert DeLine, Microsoft Research

Big Sky is a new integrated environment for large-scale data analysis being developed at Microsoft Research. Big Sky is a collaborative web service that allows data scientists to carry out entire workflows from raw data to final charts and plots. The central metaphor is an indelible research notebook: every action in Big Sky is immediately stored to preserve provenance and to allow repeatable analyses. Big Sky also provides automation and visualization at every step to keep the analyst productive and informed. I look forward to lots of feedback, since the workshop participants are exactly our intended users.


Title: Disruptive Events on Software Projects
Peter C Rigby, Concordia University, Montreal

I will discuss events that disrupt software projects, how we can measure these events, and how different projects mitigate the risk and damage associated with disruption.


Title: Availability of Modification Patterns for Identifying Maintenance Opportunities
Yoshiki Higo, Osaka University

In code repositories, multiple commits often include the same modification, which we call a modification pattern. This talk will discuss the availability of modification patterns for identifying maintenance opportunities such as performing refactorings or finding latent bugs.


Title: On Rapid Releases and Software Testing
Bram Adams, Polytechnique Montréal

Large open and closed source organizations like Google, Facebook and Mozilla are migrating their products towards rapid releases. While this allows faster time-to-market and user feedback, it also implies less time for testing and bug fixing. Since initial research results indeed show that rapid releases fix proportionally fewer reported bugs than traditional releases, we investigated the changes in software testing effort after moving to rapid releases. We analyzed the results of 312,502 execution runs of the 1,547 mostly manual system-level test cases of Mozilla Firefox from 2006 to 2012 (5 major traditional and 9 major rapid releases), and triangulated our findings with a Mozilla QA engineer. We found that in rapid releases, testing has a narrower scope that enables deeper investigation of the features and regressions with the highest risk, while traditional releases run the whole test suite. Furthermore, rapid releases make it more difficult to build a large testing community, forcing Mozilla to increase contractor resources in order to sustain testing for rapid releases.


Title: 145 Questions for Data Scientists in Software Engineering
Thomas Zimmermann, Microsoft Research

I will present a catalog of 145 questions that software engineers would like to ask data scientists. The catalog was created based on feedback from 810 Microsoft employees. This is joint work with Andrew Begel.


Title: Querying, Transforming, and Synchronizing Software Artifacts
Zhenjiang Hu, National Institute of Informatics

I’d like to show how GRoundTram, a bidirectional graph transformation system, may be useful for querying, transforming, and synchronizing software artifacts in software development.



Title: Logical dependencies and others
Marco Aurelio Gerosa, University of Sao Paulo, Brazil

I will present some studies that we have been conducting in our group, covering the following topics: a method for the identification of logical dependencies, characteristics of automated tests versus code quality, design degradation, change prediction, and refactoring. I will also present a short demo of MetricMiner.



Title: Software Text Analytics: Moving from Correlation Towards Causation
Tao Xie, University of Illinois at Urbana-Champaign, USA

In recent years, the use of deep natural language processing (NLP) techniques to understand the semantics of natural language (NL) software artifacts has emerged in the software engineering and security communities. This movement goes beyond traditional text mining techniques, which typically treat NL sentences as a bag of words and then conduct statistical analysis on these words. In this talk, I will present some recent research efforts that we have conducted in developing/applying NLP techniques for discovering semantic information in NL software artifacts. These efforts hold great promise for moving from correlation towards causation, exploring the long-standing issue of “correlation does not imply causation”, commonly faced in software analytics.


Title: Automated Analysis of Load Testing Results
ZhenMing (Jack) Jiang, York University, Canada

Many software systems must be load tested to ensure that they can scale up under high load while maintaining functional and non-functional requirements. Current industrial practices for checking the results of a load test remain ad hoc, involving high-level manual checks. Few research efforts are devoted to the automated analysis of load testing results, mainly due to the limited access to large-scale systems for use as case studies. Approaches for the automated and systematic analysis of load tests are needed, as many services are being offered online to an increasing number of users. In this talk, I will present the general methodology that we have developed over the years to assess the quality of a system under load by mining the system behavior data (performance counters and execution logs).


Title: Making Defects Prediction More Pragmatic
Yasutaka Kamei, Kyushu University, Japan

The majority of quality assurance research has focused on defect prediction models that identify defect-prone modules (i.e., files or packages). Although such models can be useful in some contexts, they also have their drawbacks. I will present some defect prediction studies that we have conducted.


Title: Leveraging Performance Counters and Execution Logs to Diagnose Performance Issues
Mark D. Syer, Queen’s University, Canada

Load tests ensure that software systems are able to perform under the expected workloads. The current state of load test analysis requires significant manual review of performance counters and execution logs, and a high degree of system-specific expertise. In particular, memory-related issues (e.g., memory leaks or spikes), which may degrade performance and cause crashes, are difficult to diagnose. Performance analysts must correlate hundreds of megabytes or gigabytes of performance counters (to understand resource usage) with execution logs (to understand system behaviour). However, little work has been done to combine these two types of information to assist performance analysts in their diagnosis. In this talk, I will present an approach that combines performance counters and execution logs to diagnose memory-related issues in load tests.


Title: Automated Performance Analysis of Build Systems
Shane McIntosh, Queen’s University, Canada

Software developers rely on a fast and correct build system to compile their source code changes to produce modified deliverables for testing and deployment. Unfortunately, the scale and complexity of builds make build performance analysis necessary, yet difficult due to the absence of build performance analysis tools. In this paper, we propose an approach that analyzes the build dependency graph and the change history of a software system to pinpoint build hotspots, i.e., source files that change frequently and take a long time to rebuild. In conducting a case study on the GLib, PostgreSQL, Qt, and Ruby systems, we observe that: (1) our approach identifies build hotspots that are more costly than the files that rebuild the slowest, change the most frequently, or have the highest fan-in; (2) logistic regression models built using architectural and code properties of source files can explain 50%-75% of these build hotspots; and (3) build hotspots are more closely related to system architecture than to code properties. Furthermore, we identify build hotspot anti-patterns and offer advice on how to avoid and address them. Our approach helps developers to focus build performance optimization effort (e.g., refactoring) onto the files that will yield the most performance gain.


  • Ahmed E. Hassan, Queen’s University, Canada
  • Katsuro Inoue, Osaka University, Japan
  • Tao Xie, University of Illinois at Urbana-Champaign, USA
  • Dongmei Zhang, Microsoft Research Asia, China


  • Bram Adams (École Polytechnique de Montréal, Canada)
  • Marco Aurélio Gerosa (USP / UCI, Brazil)
  • Yingnong Dang (Microsoft Research Asia, China)
  • Robert DeLine (Microsoft Research, USA)
  • Daniel German (University of Victoria, Canada)
  • Mike Godfrey (University of Waterloo, Canada)
  • Ahmed E. Hassan (Queen’s University, Canada)
  • Yoshiki Higo (Osaka University, Japan)
  • Abram Hindle (University of Alberta, Canada)
  • Zhenjiang Hu (National Institute of Informatics, Japan)
  • Akinori Ihara (Nara Institute of Science and Technology, NAIST, Japan)
  • Katsuro Inoue (Osaka University, Japan)
  • Zhen Ming (Jack) Jiang (York University, Canada)
  • Yasutaka Kamei (Kyushu University, Japan)
  • Sung Kim (Hong Kong University of Science and Technology, China)
  • Takashi Kobayashi (Tokyo Institute of Technology, Japan)
  • Andrian Marcus (Wayne State University, USA)
  • Michele Lanza (University of Lugano, Switzerland)
  • Shane McIntosh (Queen’s University, Canada)
  • Akito Monden (Nara Institute of Science and Technology, NAIST, Japan)
  • Masao Ohira (Wakayama University, Japan)
  • Peter Rigby (Concordia University, Canada)
  • Mark Syer (Queen’s University, Canada)
  • Tao Xie (University of Illinois at Urbana-Champaign, USA)
  • Norihiro Yoshida (Nara Institute of Science and Technology, NAIST, Japan)
  • Andreas Zeller (Saarland University, Germany)
  • Dongmei Zhang (Microsoft Research Asia, China)
  • Thomas Zimmermann (Microsoft Research, USA)


20th October (Sunday)

  • 15:00 - Hotel Check In (early check-in from 12:00 is negotiable if informed in advance)
  • 19:00 - 21:00: Welcome Reception

21st October (Monday)

  • 8:30-10:30 Invited talk 1
  • 10:30-11:00 Coffee break
  • 11:00-12:30 Lightning talks
  • 12:30-2:00 Lunch
  • 2:00-3:30 Lightning talks
  • 3:30-4:00 Coffee break
  • 4:00-5:30 BreakOut Session Planning

22nd October (Tuesday)

  • 8:30-10:30 Lightning talks + BreakOut Session Planning
  • 10:30-11:00 Coffee break
  • 11:00-12:30 BreakOut Session
  • 12:30-2:00 Lunch
  • 2:00-3:30 BreakOut Summaries
  • 3:30-4:00 Coffee break
  • 4:00-5:30 Invited talks 2 and 3

23rd October (Wednesday)

  • 8:30-10:30 Invited talk 4
  • 10:30-11:00 Coffee break
  • 11:00-12:30 BreakOut Session
  • 11:30-1:30 Lunch
  • 1:15-5:30 Excursion

24th October (Thursday)

  • 8:30-10:30 BreakOut Session
  • 10:30-11:00 Coffee break
  • 11:00-12:30 BreakOut Summaries
  • 12:30-2:00 Lunch
  • 2:00-3:30 Free 1
  • 3:30-4:00 Coffee break
  • 4:00-5:30 Fish-bowl Panel

25th October (Friday)

  • 8:30-10:30 Free 3
  • 10:30-11:00 Coffee break
  • 11:00-12:30 Wrap up and future plans
  • 12:30-2:00 Lunch


A wealth of data (e.g., source change history, test cases, and bug reports) exists in the practice of software development. Furthermore, modern software and services in operation produce rich data (e.g., operation logs, field crashes, and support calls). Hidden in these unexplored data is rich and valuable information about the quality of software and services and the dynamics of software development. Companies (Microsoft, Google, Facebook, Cisco, Yahoo, IBM, RIM, etc.) are increasingly making analytics an important role in their organizations, leveraging the wealth of data produced around their software or services.

Software analytics is concerned with the use of data-driven approaches to obtain insightful and actionable information for completing various tasks around software systems, software users, and the software development process. Insightful information is information that conveys meaningful and useful understanding or knowledge. Actionable information is information upon which software practitioners can come up with concrete solutions (better than existing solutions, if any) towards completing tasks. Typically, such information cannot be easily obtained by direct investigation of the raw data without the aid of analytic technologies.

In recent years especially, the area of Big Data has emerged as a critical and strategic focus for society. Big data is everywhere now, but it is still under-utilized in the area of software engineering. However, leveraging big data is very relevant in software engineering as software and services get larger and more inter-connected, often being developed by a large number of engineers in a distributed fashion and being used by a huge number of users around the world. Software analytics needs to be prepared for the upcoming decade’s exciting and yet challenging problem of leveraging big data for software engineering tasks.

The proposed seminar will foster collaboration between industry and academia, bringing academic researchers working on the principles and practice of software analytics together with researchers from industry. The aim is not only to act as a forum for the exchange of ideas, but also as a vehicle to stimulate, deepen and widen partnership between academia and industry in software analytics internationally. In the age of Big Data, this seminar also serves as the first step to plan for the next decade of Big Data Analytics in Software Engineering, since it is impossible for individual groups or companies to tackle this challenging problem alone.

Software analytics is an ideal topic for this kind of interaction. It combines challenging research problems with real practical importance for the software industry, and the wider society that it serves. It presents an excellent and wide-ranging set of open research questions to academics concerning, amongst other things, analytic-algorithm design, data analysis, information visualization, scalable computing, software-artifact analysis and mining, social factors, empirical software engineering, measurement, process improvement, and technology transfer and adoption. Software analytics is also of critical practical significance to almost every organization involved in the production and use of software and services. Answers to the currently open research questions in software analytics can have a major impact upon industrial practice, with far-reaching implications for the development of the global economy. This combination of academic challenge and industrial relevance makes software analytics a natural topic for the proposed seminar.

In this seminar, we want to bring together software-analytics researchers in academia and industry. Our main focus is to exploit the synergy of these communities and to provide a platform to forge new collaborations. Participants are invited to present a few plenary talks and demos of new tools, besides which the seminar will provide ample opportunities for small working groups on themes suggested by the participants. We expect the seminar to result in ample cross-fertilization between the different research areas and to reveal exciting directions for improving software-engineering practices via practical software analytics.