dc.contributor.author | Kalliamvakou, E | en |
dc.contributor.author | Gousios, G | en |
dc.contributor.author | Blincoe, Kelly | en |
dc.contributor.author | Singer, L | en |
dc.contributor.author | German, DM | en |
dc.contributor.author | Damian, D | en |
dc.date.accessioned | 2016-11-08T03:10:14Z | en |
dc.date.issued | 2016-10 | en |
dc.identifier.citation | Empirical Software Engineering 21(5):2035-2071 Oct 2016 | en |
dc.identifier.issn | 1382-3256 | en |
dc.identifier.uri | http://hdl.handle.net/2292/30996 | en |
dc.description.abstract | With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40 % of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub. | en |
dc.publisher | Springer Verlag (Germany) | en |
dc.relation.ispartofseries | Empirical Software Engineering | en |
dc.rights | Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. | en |
dc.rights.uri | https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm | en |
dc.title | An in-depth study of the promises and perils of mining GitHub | en |
dc.type | Journal Article | en |
dc.identifier.doi | 10.1007/s10664-015-9393-5 | en |
pubs.issue | 5 | en |
pubs.begin-page | 2035 | en |
pubs.volume | 21 | en |
pubs.end-page | 2071 | en |
dc.rights.accessrights | http://purl.org/eprint/accessRights/RestrictedAccess | en |
pubs.subtype | Article | en |
pubs.elements-id | 526231 | en |
pubs.org-id | Engineering | en |
pubs.org-id | Department of Electrical, Computer and Software Engineering | en |
dc.identifier.eissn | 1573-7616 | en |
pubs.record-created-at-source-date | 2016-11-08 | en |