Mining Unstructured Data From Code Reviews: A Hands-On Tutorial
Dr. Nicolas Bettenburg, Queen’s University, Kingston, Canada
We present a hands-on tutorial on mining unstructured data. Unstructured data consists of a mixture of natural language text and technical information, which traditional data mining tools fail to properly extract and discern. In this tutorial we will demonstrate the use of state-of-the-art lightweight tools and techniques for mining unstructured data and separating natural language text from technical information. We will then apply sentiment analysis techniques to the natural language portion of the unstructured data and link that measure with the technical information to create a new software metric.
Topic: Mining Unstructured Data. Issues around mining unstructured data from repositories, knowledge discovery, applications of mining unstructured data. Why the topic is of interest to a broad section of the software engineering and maintenance community: A plethora of software development artifacts is created during day-to-day software development activities. Many such artifacts consist of unstructured data, i.e., mixtures of natural language text and technical information. These artifacts may contain valuable information about development activities, the evolution of the software, and maintenance-related aspects. Unfortunately, unstructured data cannot be readily mined with the off-the-shelf techniques commonly used to mine either natural language text or structured data on its own.
Overall goals of this tutorial: Demonstrate applications of mining unstructured data for practitioners and researchers in the Software Engineering community. Concrete objectives include: (1) Discuss the unique properties of unstructured data, as well as the main differences from structured data such as source code and from natural language text such as that found in newspaper articles. (2) Demonstrate tools and techniques for mining unstructured data based on a concrete source of unstructured data in the form of source code reviews. (3) Demonstrate how to separate technical information from natural language text in unstructured data using lightweight tools and techniques that have been published in the literature. (4) Demonstrate how to analyze the natural language text portion of the unstructured data using sentiment analysis tools. In particular, we are interested in discovering positive or negative sentiments in text fragments that mention technical information such as code identifiers or version numbers.
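As a rough illustration of objective (3), the following minimal Java sketch classifies each line of a review comment as either technical information or natural language text. The class name ReviewSplitter and its three regular-expression heuristics (stack trace frames, diff markers, lines ending in code punctuation) are assumptions made for this example; they are not the published tools the tutorial refers to, and real review data would need a richer set of patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical line-level splitter; the heuristics are illustrative assumptions.
class ReviewSplitter {
    // A Java stack trace frame such as "at org.example.Foo.bar(Foo.java:42)".
    private static final Pattern STACK_FRAME =
        Pattern.compile("\\s*at\\s+[\\w.$<>]+\\(.*\\)\\s*");
    // Unified diff headers and hunk markers.
    private static final Pattern DIFF_LINE =
        Pattern.compile("(diff --git|index |--- |\\+\\+\\+ |@@ ).*");
    // Lines ending in typical code punctuation.
    private static final Pattern CODE_LIKE =
        Pattern.compile(".*[;{}]\\s*");

    // Holds the two portions of a review comment.
    static final class Result {
        final List<String> naturalLanguage = new ArrayList<>();
        final List<String> technical = new ArrayList<>();
    }

    // Returns true if the whole line matches any of the heuristics above.
    static boolean isTechnical(String line) {
        return STACK_FRAME.matcher(line).matches()
            || DIFF_LINE.matcher(line).matches()
            || CODE_LIKE.matcher(line).matches();
    }

    // Splits a comment into lines, skips blank lines, and sorts the rest.
    static Result split(String comment) {
        Result result = new Result();
        for (String line : comment.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            if (isTechnical(line)) {
                result.technical.add(line);
            } else {
                result.naturalLanguage.add(line);
            }
        }
        return result;
    }
}
```

Calling ReviewSplitter.split on a review comment returns the two portions as separate lists, so that the natural language lines can be passed on to a sentiment analysis step. Line-level classification is only one design choice; a comment that mixes prose and code within a single line would need finer-grained handling.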
Target audience: The target audience for this tutorial is software practitioners and researchers who want to understand state-of-the-art techniques for Mining Unstructured Data. This tutorial makes use of standard data mining techniques, the Unix command line, and the Java programming language. The tutorial should thus be understandable by developers, technical managers and researchers in the field of software engineering.
Outline: Part 1 – Characteristics of Unstructured Data: (1) What is Unstructured Data? (2) Differences between Unstructured Data and Natural Language Text; (3) Differences between Unstructured Data and Structured Data; (4) Why traditional techniques do not apply to Unstructured Data. Part 2 – Mining Code Review Data: (1) Mining Code Review data; (2) Using lightweight tools and techniques for mining unstructured data; (3) Separating Natural Language text and Technical Information from unstructured data. Part 3 – Practical Application: (1) Using sentiment analysis to characterize the natural language text; (2) Attaching sentiment information to the technical information; (3) Using the combination of both as a new software metric.
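One possible reading of Part 3 can be sketched in Java as follows: score each sentence of the natural language text with a sentiment measure, then attach that score to every code identifier the sentence mentions. Everything here is an assumption made for illustration: the class name IdentifierSentiment, the tiny word lists (a stand-in for a real sentiment analysis tool), the camelCase identifier pattern, and the choice to sum scores per identifier. The tutorial's actual metric may be defined differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: per-identifier sentiment summed over sentences.
class IdentifierSentiment {
    // Toy lexicon (assumption); a real sentiment tool would replace this.
    private static final Set<String> POSITIVE =
        new HashSet<>(Arrays.asList("good", "clean", "clear", "nice", "great"));
    private static final Set<String> NEGATIVE =
        new HashSet<>(Arrays.asList("bad", "broken", "confusing", "ugly", "wrong"));
    // camelCase identifiers such as parseHeader (assumed naming convention).
    private static final Pattern IDENTIFIER =
        Pattern.compile("\\b[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+\\b");

    // Counts positive words minus negative words in one sentence.
    static int score(String sentence) {
        int total = 0;
        for (String word : sentence.toLowerCase().split("[^a-z]+")) {
            if (POSITIVE.contains(word)) total++;
            if (NEGATIVE.contains(word)) total--;
        }
        return total;
    }

    // Adds each sentence's score to every distinct identifier it mentions.
    static Map<String, Integer> scoreIdentifiers(String text) {
        Map<String, Integer> scores = new TreeMap<>();
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            int sentenceScore = score(sentence);
            Set<String> mentioned = new LinkedHashSet<>();
            Matcher m = IDENTIFIER.matcher(sentence);
            while (m.find()) {
                mentioned.add(m.group());
            }
            for (String id : mentioned) {
                scores.merge(id, sentenceScore, Integer::sum);
            }
        }
        return scores;
    }
}
```

The resulting map associates each identifier with an aggregate sentiment value, which could then be joined with other per-identifier or per-file measures. Sentences that mention no identifier contribute nothing, which matches the stated interest in text fragments that mention technical information.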
Nicolas Bettenburg received his Ph.D. in Computer Science from Queen’s University (Canada) under the supervision of Dr. Ahmed E. Hassan. His research interests are in mining unstructured information from software repositories, with a focus on relating developer communication and collaboration to software quality. He has co-organized various conference tracks and the Workshop on Mining Unstructured Data (MUD). His work has been published at premier venues such as ICSE, FSE, TSE, ESEM, MSR, WCRE and ICSM.