Efficient Relational Feature Engineering


dc.contributor.advisor Dobbie, Gillian
dc.contributor.advisor Koh, Yun Sing
dc.contributor.author Wong, Ian Shane
dc.date.accessioned 2022-01-05T02:27:42Z
dc.date.available 2022-01-05T02:27:42Z
dc.date.issued 2021 en
dc.identifier.uri https://hdl.handle.net/2292/57855
dc.description.abstract Machine learning typically learns from tabular data sets. However, data is commonly stored across multiple linked tables in relational databases. Relational data requires practitioners to hand-craft features that represent multi-table information before state-of-the-art classifiers can be applied. Feature engineering is the informal task of transforming data to improve machine learning performance, but it is difficult because the space of all possible feature transformations is intractably large. Improving feature engineering efficiency is an open challenge. The research question we address is: how do we efficiently automate feature engineering in relational datasets to improve machine learning accuracy? Traditional approaches improve tractability by reducing the feature space to a limited set of transformations and/or by applying heuristic search strategies. However, a typical source of inefficiency is the presence of multiple one-to-many relationships, which cause most relational feature generation methods to generate exponential numbers of features as successive transformations are required. To address the efficiency of relational feature engineering, we first propose the heuristic relational feature generation algorithms ChooseOne and DepthOne, which scale linearly or better with the number of one-to-many relationships. Second, we address the lack of sufficient data for training machine-learning-based feature engineering by proposing a relational data generator (RDG), which can generate data of varying characteristics and joint distributions across multiple tables. Joint distributions are satisfied using itemsets based on our novel Items2Data algorithm. Items2Data can generate synthetic boolean datasets with joint distributions by satisfying a set of itemsets. We find useful conditions within the NP-hard problem where a satisfying dataset can be generated from itemsets in O(n^2) time with respect to the number of itemsets. Finally, we apply reinforcement learning to combine both single-table and relational feature transformations, along with synthetic relational training data, to learn an efficient feature engineering policy. We demonstrate the first successful transfer of feature engineering agents trained on synthetic relational data to benchmark datasets and show that the combination of single-table plus relational transformations can improve accuracy over traditional relational-transformation-only methods.
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof PhD Thesis - University of Auckland en
dc.relation.isreferencedby UoA en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. en
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/nz/
dc.title Efficient Relational Feature Engineering
dc.type Thesis en
thesis.degree.discipline Computer Science
thesis.degree.grantor The University of Auckland en
thesis.degree.level Doctoral en
thesis.degree.name PhD en
dc.date.updated 2021-12-08T12:33:05Z
dc.rights.holder Copyright: The author en
dc.rights.accessrights http://purl.org/eprint/accessRights/OpenAccess en
dc.identifier.wikidata Q112957262

