Efficient Relational Feature Engineering

Wong, Ian Shane

dc.contributor.advisor	Dobbie, Gillian
dc.contributor.advisor	Koh, Yun Sing
dc.contributor.author	Wong, Ian Shane
dc.date.accessioned	2022-01-05T02:27:42Z
dc.date.available	2022-01-05T02:27:42Z
dc.date.issued	2021	en
dc.identifier.uri	https://hdl.handle.net/2292/57855
dc.description.abstract	Machine learning typically learns from tabular datasets. However, data is commonly stored across multiple linked tables in relational databases. Relational data requires practitioners to hand-craft features that represent multi-table information before state-of-the-art classifiers can be applied. Feature engineering is the informal task of transforming data to improve machine learning performance, but it is difficult because the space of all possible feature transformations is intractably large. Improving feature engineering efficiency is an open challenge. The research question we address is: how do we efficiently automate feature engineering in relational datasets to improve machine learning accuracy? Traditional approaches improve tractability by reducing the feature space to a limited set of transformations and/or applying heuristic search strategies. However, a typical source of inefficiency is the presence of multiple one-to-many relationships, which cause most relational feature generation methods to generate exponential numbers of features as successive transformations are required. To address the efficiency of relational feature engineering, we first propose the heuristic relational feature generation algorithms ChooseOne and DepthOne, which scale linearly or better with the number of one-to-many relationships. Second, we address the lack of sufficient data for training machine-learning-based feature engineering by proposing a relational data generator (RDG), which can generate data of varying characteristics and joint distributions across multiple tables. Joint distributions are satisfied using itemsets based on our novel Items2Data algorithm. Items2Data can generate synthetic boolean datasets with joint distributions by satisfying a set of itemsets. We find useful conditions within the NP-hard problem where a satisfying dataset can be generated from itemsets in O(n^2) time with respect to the number of itemsets.
Finally, we apply reinforcement learning to combine both single-table and relational feature transformations along with synthetic relational training data to learn an efficient feature engineering policy. We demonstrate the first successful transfer of feature engineering agents trained on synthetic relational data to benchmark datasets and show that the combination of single-table plus relational transformations can improve accuracy over traditional relational-transformation-only methods.
dc.publisher	ResearchSpace@Auckland	en
dc.relation.ispartof	PhD Thesis - University of Auckland	en
dc.relation.isreferencedby	UoA	en
dc.rights	Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated.	en
dc.rights.uri	https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/nz/
dc.title	Efficient Relational Feature Engineering
dc.type	Thesis	en
thesis.degree.discipline	Computer Science
thesis.degree.grantor	The University of Auckland	en
thesis.degree.level	Doctoral	en
thesis.degree.name	PhD	en
dc.date.updated	2021-12-08T12:33:05Z
dc.rights.holder	Copyright: The author	en
dc.rights.accessrights	http://purl.org/eprint/accessRights/OpenAccess	en
dc.identifier.wikidata	Q112957262