Analyzing and Generating Commit Messages for Software Repositories
University of Delaware
Poor-quality documentation increases the time developers spend trying to understand and modify source code. If documentation can be automatically extracted from words and phrases in source code, it can therefore reduce maintenance costs when human-written documentation is poor. Commit messages are a type of documentation that specifically describes program changes. While methods exist both for finding differences between versions and for extracting linguistic information from source code, there has been little work on combining the two to produce natural-language output similar to developer-written commit messages. To lay the groundwork for such a model of output, in this thesis we performed an observational study of commit messages from open source software projects to determine their linguistic and non-linguistic properties. We also sent a survey to users of software repositories to learn how they use commit messages and what kinds of commit messages they find useful, and to present an initial model of output for natural-language commit messages using verb phrases and their associated direct objects. We found this model to be insufficient because it lacks important location information, which the original commit messages often convey in prepositional phrases. Finally, we performed an independent analysis of a distribution of DeltaDoc, a research tool that attempts to generate output to supplement developer-written commit messages. We found this distribution too problematic to use as is, but its output has the potential to be extended using natural language techniques if the concerns about its usability and performance can be addressed.