Abstract:
Software engineering data sets provide valuable information about software development and the evolution of software projects. Researchers often obtain such data sets from public data repositories, which contain data and metadata about software development artefacts, such as bug reports and source code. Typically, the term data refers to measurements associated with metrics and entities, whereas the term metadata describes the meaning and context of the data. However, there is growing concern about the quality of these data sets, because poor-quality data can lead to questionable empirical results. Researchers who create data sets describe the data and metadata in many different ways, which makes it difficult to interpret data sets or to report the quality of the data clearly. Moreover, researchers intending to use data sets from public data repositories rely on their quality. Although many techniques have been proposed in the literature to address data quality challenges, there is a lack of research evaluating the quality of data sets in public data repositories. The main research question in this study is, 'How can we help researchers to understand the quality of data sets in software engineering?' Accordingly, the aim of the study was to develop a data quality assessment framework that helps researchers interpret data sets better and evaluate their quality. The framework allows researchers to understand the quality of data and identify potential problems in a data set. The research was organised into six steps. The first step was a review of data quality in software engineering. The results of a systematic mapping study, carried out as part of the review, showed that many definitions of data quality issues are unclear because of the inconsistent terminology they use.
The second step was an observation of existing and artificial data sets to explore their different formats and structures. The observation indicated that few data sets contain metadata describing what the data mean, which can lead to misinterpretation. The third step was designing a metamodel to describe the structure of and concepts associated with data sets, and the relationships between those concepts. Every concept in the metamodel is defined using standard terminology to allow a common interpretation of data. The fourth step involved developing a data quality assessment framework to determine whether a data set contains sufficient information to support the correct interpretation of its data. The fifth step was constructing formal guidelines for creating good-quality data sets, based on the essential terminology of the dataset metamodel and the procedures of the framework. The final step was the evaluation of part of the framework through a user study; this part comprises the definitions of the elements in the dataset metamodel and formal definitions of data quality issues. The research makes four significant contributions: (i) a systematic mapping study on data quality in software engineering; (ii) a dataset metamodel for describing the structure of data sets; (iii) a data quality assessment framework for better understanding the quality of data sets; and (iv) formal guidelines for creating good-quality data sets.