An in-depth study of the promises and perils of mining GitHub

dc.contributor.author Kalliamvakou, E en
dc.contributor.author Gousios, G en
dc.contributor.author Blincoe, Kelly en
dc.contributor.author Singer, L en
dc.contributor.author German, DM en
dc.contributor.author Damian, D en
dc.date.accessioned 2016-11-08T03:10:14Z en
dc.date.issued 2016-10 en
dc.identifier.citation Empirical Software Engineering 21(5):2035-2071 Oct 2016 en
dc.identifier.issn 1382-3256 en
dc.identifier.uri http://hdl.handle.net/2292/30996 en
dc.description.abstract With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers mine the information stored in GitHub’s event logs to understand how its users employ the site to collaborate on software, but so far there have been no studies describing the quality and properties of the available GitHub data. We document the results of an empirical study aimed at understanding the characteristics of the repositories and users in GitHub; we see how users take advantage of GitHub’s main features and how their activity is tracked on GitHub and related datasets to point out misalignment between the real and mined data. Our results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. For example, we show that the majority of the projects are personal and inactive, and that almost 40 % of all pull requests do not appear as merged even though they were. Also, approximately half of GitHub’s registered users do not have public activity, while the activity of GitHub users in repositories is not always easy to pinpoint. We use our identified perils to see if they can pose validity threats; we review selected papers from the MSR 2014 Mining Challenge and see if there are potential impacts to consider. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub. en
dc.publisher Springer Verlag (Germany) en
dc.relation.ispartofseries Empirical Software Engineering en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher. en
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.title An in-depth study of the promises and perils of mining GitHub en
dc.type Journal Article en
dc.identifier.doi 10.1007/s10664-015-9393-5 en
pubs.issue 5 en
pubs.begin-page 2035 en
pubs.volume 21 en
pubs.end-page 2071 en
dc.rights.accessrights http://purl.org/eprint/accessRights/RestrictedAccess en
pubs.subtype Article en
pubs.elements-id 526231 en
pubs.org-id Engineering en
pubs.org-id Department of Electrical, Computer and Software Engineering en
dc.identifier.eissn 1573-7616 en
pubs.record-created-at-source-date 2016-11-08 en

