Abstract:
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. To facilitate research based on datasets in empirical software engineering, it must be possible to interpret the data correctly. Datasets contain measurements that are associated with metrics and entities. In some datasets, it is unclear which entities have been measured and exactly which metrics have been used, so measurements can be misinterpreted. The goal of this study is to determine a useful way to understand what datasets are intended to represent. We construct precise definitions of datasets and their potential elements. We develop a metamodel that describes the structure of a dataset, the concepts in it, and the relationships between those concepts. We apply the metamodel to existing datasets from the PROMISE repository. Of the 70 datasets we studied, 61 contained insufficient information to ensure correct interpretation of their metrics and entities. Our metamodel can be used to identify such datasets and to evaluate new ones. It will also form the foundation for a framework to evaluate the quality of datasets.