Knowledge-Driven Text Generation
Abstract
Natural Language Generation (NLG) is an automated process that produces human-like text. It can create this text either from scratch or from inputs such as natural language or structured data, for example database records, computer-generated reports, or keywords. The main objective of NLG is to generate coherent, fluent, and relevant natural language text, usually based on input data. NLG has various practical applications, such as content creation, text summarization, and dialogue generation. For instance, NLG can automatically produce news articles, product descriptions, or weather reports. Advancements in machine learning and NLP have led to the development of Large Language Models (LLMs) that are trained on massive amounts of data and can generate human-like text. Pretrained Language Models (PLMs), such as T5, BART, and ChatGPT, are examples of modern state-of-the-art models that have evolved beyond traditional grammar-based and statistical methods. These models can be improved by providing more data and increasing the number of neural network layers. In modern NLP pipelines, such models are typically first pretrained on large datasets and then fine-tuned for specific tasks.

This thesis studies the data-to-text generation task, which aims to generate textual descriptions of structured data with the help of PLMs. First, we investigate the ability of PLMs to generate grammatically correct and consistent text from different types of structured data, such as keywords, tables, and abstract meaning representations. Second, we explore how PLMs can retrieve useful information from incomplete datasets and generate text from multiple provided data sources. The study also investigates the possibility of fine-tuning only the first few layers of PLMs to save time and resources. Third, we examine hybrid PLMs that address both natural language generation and natural language understanding and compare them with pretrained seq2seq models. Finally, we investigate effective control mechanisms for the language model in epic-level text generation.

The study is divided into four parts, each with a specific scope of investigation. The first part is limited to implementing a method for keyword-to-text generation and evaluating the generated text from a syntactic and semantic perspective. The data sources used in this part are RACE and Wikimedia, and an English word-frequency list is used to identify keywords. The second part focuses on developing a system for table-to-text and RDF-to-text generation using PLMs; the data sources are E2E, WebNLG, and DART. The proposed method involves fine-tuning different PLMs for data-to-text generation tasks and developing a dynamic prompt tuning method for data augmentation. The third part studies text generation from tables and knowledge graphs, using WikiBio and Wikidata as data sources. It proposes a hybrid model combining PLMs and assesses its performance against a pretrained seq2seq model. A new dataset called TaKG is also created to address the incompleteness of the WikiBio dataset. In the fourth part, a framework is proposed to address the limitations of large-scale language models in generating epic-scale text. Our contributions include designing effective control mechanisms for the language model, optimizing GPT-3.5 for open-domain text, and evaluating the generated text against long-text generation requirements.
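As an illustration of the kind of setup described in the second part, the sketch below shows how a pretrained seq2seq PLM (here, T5) might be fine-tuned for RDF-to-text generation on WebNLG-style triples while updating only the first few encoder layers to save time and resources. This is not the thesis's code: the checkpoint name, the number of trainable layers, the task prefix, the triple linearisation format, and the example record are illustrative assumptions.

# Minimal sketch, assuming a Hugging Face T5 checkpoint and WebNLG-style
# (subject, predicate, object) triples; details are illustrative, not the
# thesis's actual configuration.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-small"          # assumed checkpoint; the thesis compares several PLMs
NUM_TRAINABLE_LAYERS = 2         # "first few layers" is a tunable assumption

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Freeze all parameters, then unfreeze only the first few encoder blocks.
for param in model.parameters():
    param.requires_grad = False
for block in model.encoder.block[:NUM_TRAINABLE_LAYERS]:
    for param in block.parameters():
        param.requires_grad = True

def linearise_triples(triples):
    """Flatten (subject, predicate, object) triples into a single input string."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

# Hypothetical WebNLG-style training example.
triples = [("Alan_Bean", "occupation", "Test_pilot"),
           ("Alan_Bean", "birthPlace", "Wheeler,_Texas")]
source = "translate Graph to Text: " + linearise_triples(triples)
target = "Alan Bean, born in Wheeler, Texas, worked as a test pilot."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One training step; gradients flow only to the unfrozen encoder layers.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

In practice, the frozen layers keep the general linguistic knowledge acquired during pretraining, while the few trainable layers adapt the input representation to the structured-data format, which is what makes partial fine-tuning attractive when compute is limited.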
In summary, this thesis contributes to the investigation of the impact of structured data on PLMs and offers valuable insights into the factors that influence the effectiveness of controlling PLMs. This includes the advancement of effective control mechanisms for PLMs.