dc.contributor.advisor |
Dobbie, Gillian |
|
dc.contributor.advisor |
Koh, Yun Sing |
|
dc.contributor.author |
Wong, Ian Shane |
|
dc.date.accessioned |
2022-01-05T02:27:42Z |
|
dc.date.available |
2022-01-05T02:27:42Z |
|
dc.date.issued |
2021 |
en |
dc.identifier.uri |
https://hdl.handle.net/2292/57855 |
|
dc.description.abstract |
Machine learning typically learns from tabular data sets. However, data is commonly stored across multiple linked tables in relational databases. Relational data requires practitioners to hand-craft features that represent multi-table information before state-of-the-art classifiers can be applied. Feature engineering is the informal task of transforming data to improve machine learning performance, but it is difficult because the space of all possible feature transformations is intractably large. Improving feature engineering efficiency is an open challenge.
The research question we address is: how do we efficiently automate feature engineering in relational datasets to improve machine learning accuracy? Traditional approaches improve tractability by reducing the feature space to a limited set of transformations and/or applying heuristic search strategies. However, a typical source of inefficiency is the presence of multiple one-to-many relationships, which cause most relational feature generation methods to generate exponential numbers of features as successive transformations are required. To address the efficiency of relational feature engineering, we first propose heuristic relational feature generation algorithms, ChooseOne and DepthOne, that scale linearly or better with the number of one-to-many relationships. Second, we address the lack of sufficient data for training machine-learning-based feature engineering by proposing a relational data generator (RDG), which can generate data of varying characteristics and joint distributions across multiple tables. Joint distributions are satisfied using itemsets based on our novel Items2Data algorithm. Items2Data can generate synthetic boolean datasets with joint distributions by satisfying a set of itemsets. We find useful conditions within the NP-hard problem where a satisfying dataset can be generated from itemsets in O(n²) time with respect to the number of itemsets. Finally, we apply reinforcement learning to combine both single-table and relational feature transformations along with synthetic relational training data to learn an efficient feature engineering policy. We demonstrate the first successful transfer of feature engineering agents trained on synthetic relational data to benchmark datasets and show that the combination of single-table plus relational transformations can improve accuracy over traditional relational-transformation-only methods. |
|
dc.publisher |
ResearchSpace@Auckland |
en |
dc.relation.ispartof |
PhD Thesis - University of Auckland |
en |
dc.relation.isreferencedby |
UoA |
en |
dc.rights |
Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. |
en |
dc.rights.uri |
https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm |
en |
dc.rights.uri |
http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ |
|
dc.title |
Efficient Relational Feature Engineering |
|
dc.type |
Thesis |
en |
thesis.degree.discipline |
Computer Science |
|
thesis.degree.grantor |
The University of Auckland |
en |
thesis.degree.level |
Doctoral |
en |
thesis.degree.name |
PhD |
en |
dc.date.updated |
2021-12-08T12:33:05Z |
|
dc.rights.holder |
Copyright: The author |
en |
dc.rights.accessrights |
http://purl.org/eprint/accessRights/OpenAccess |
en |
dc.identifier.wikidata |
Q112957262 |
|