Abstract:
The University of Auckland Library research repository holds over one thousand open access journal articles and thousands of open access theses. Many of these publications contain named entities, e.g., titles, subjects, authors, locations, figures, and area-specific keywords. Identifying, extracting, and classifying or describing these entities and elements will allow us to produce and publish open linked data related to each article and, in time, across University research. This project focuses on the automatic extraction of named entities and bibliographies from thesis submissions (in PDF format) and on the detection and classification of subject keywords. It delivers platform-independent, open source software to extract, analyse, and classify PDF documents, producing structured data as output in standard formats (e.g., JSON, XML). This is a joint project between the Library Digital Development team and the Department of Computer Science at the University of Auckland.