Developing And Assessing Language Models For Logical Reasoning Over Natural Language
Abstract
Recent advancements in AI have highlighted the importance of integrating deep learning with symbolic logical reasoning. Language models such as RoBERTa, DeBERTa, LLaMA, Alpaca, Vicuna, GPT-3.5, and GPT-4 have raised the performance of AI systems on a wide range of natural language processing tasks to human-like levels. However, the generalisation of language models in logical reasoning remains underexplored, largely because of the lack of large, balanced, real-world datasets for logical reasoning. This thesis addresses this gap through three research objectives:
- To improve the models' out-of-distribution performance on multi-step logical reasoning tasks through logic-driven data augmentation.
- To enhance the models' performance on real-world logical reasoning datasets by constructing an Abstract Meaning Representation based logic-driven data augmentation method.
- To assess whether large language models, despite their impressive performance on current logical reasoning leaderboards, truly possess strong logical reasoning capabilities.
The first part of the thesis focuses on improving language models' ability to perform multi-step logical reasoning, particularly when reasoning depths are unevenly distributed. Inspired by DeepLogic, we present IMA-GloVe-GA, an RNN-based model with a gate attention mechanism developed to accommodate varying reasoning depths. Training and evaluation are supported by our PARARULE-Plus dataset, created for deeper reasoning tasks. Our results show notable improvements in model performance under both standard and out-of-distribution conditions.
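To make the gate attention idea concrete, the following is a minimal sketch of one reasoning step that attends over RNN-encoded rules and gates the update of a query state. The module names, dimensions, and update rule are illustrative assumptions for this sketch, not the IMA-GloVe-GA implementation itself.

```python
# A minimal sketch (not the thesis implementation) of a gate-attention step over
# RNN-encoded rules; dimensions and module names are hypothetical.
import torch
import torch.nn as nn


class GateAttentionStep(nn.Module):
    """One reasoning step: attend over rule encodings, gated by the current query state."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.rule_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)  # gate conditioned on query and attended context

    def forward(self, rule_embeddings: torch.Tensor, query_state: torch.Tensor) -> torch.Tensor:
        # rule_embeddings: (batch, num_rules, hidden_dim); query_state: (batch, hidden_dim)
        rule_states, _ = self.rule_encoder(rule_embeddings)                 # encode rules with a GRU
        scores = torch.bmm(rule_states, query_state.unsqueeze(-1))          # dot-product attention scores
        weights = torch.softmax(scores.squeeze(-1), dim=-1)                 # (batch, num_rules)
        context = torch.bmm(weights.unsqueeze(1), rule_states).squeeze(1)   # attended rule context
        gate = torch.sigmoid(self.gate(torch.cat([query_state, context], dim=-1)))
        return gate * context + (1.0 - gate) * query_state                  # gated update of the query state


# Example: three reasoning steps over a toy batch with 4 rules each.
step = GateAttentionStep(hidden_dim=64)
rules = torch.randn(2, 4, 64)
query = torch.randn(2, 64)
for _ in range(3):              # iterate the step for deeper (multi-step) reasoning
    query = step(rules, query)
print(query.shape)              # torch.Size([2, 64])
```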
The second part of the thesis focuses on generating diverse training data to address the scarcity of real-world logical reasoning datasets and to strengthen large language models (LLMs) on logical reasoning tasks. We introduce AMR-LDA, a logic-driven data augmentation method that converts text into Abstract Meaning Representation (AMR) graphs, modifies the graphs using logical equivalence laws, and generates the modified graphs back into text to augment reasoning datasets. The approach improves the performance of a range of models, including GPT-3.5 and GPT-4, and achieves the top rank on the ReClor leaderboard.
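The logical-equivalence transformation at the heart of such augmentation can be illustrated with contraposition ("if A then B" is equivalent to "if not B then not A"). The sketch below operates on templated sentences rather than AMR graphs so it stays self-contained; the helper names and the toy negation rule are hypothetical, whereas AMR-LDA performs the edit on the AMR graph before generating text back.

```python
# Illustrative sketch of contraposition-based augmentation on templated sentences.
# Helper names are hypothetical; AMR-LDA edits AMR graphs rather than raw strings.
def negate(clause: str) -> str:
    """Toy negation for templated clauses; real systems negate the graph instead."""
    prefix = "it is not the case that "
    return clause[len(prefix):] if clause.startswith(prefix) else prefix + clause


def contrapositive(antecedent: str, consequent: str) -> str:
    """Return a logically equivalent paraphrase of 'If <antecedent>, then <consequent>.'"""
    return f"If {negate(consequent)}, then {negate(antecedent)}."


original = ("the animal is a dog", "the animal barks")
print(f"If {original[0]}, then {original[1]}.")  # If the animal is a dog, then the animal barks.
print(contrapositive(*original))                 # If it is not the case that the animal barks, then ...
```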
The third part of the thesis examines how LLMs such as GPT-3.5 and GPT-4 respond to trivial changes in logical reasoning datasets. We created ReClor-plus, LogiQA-plus, and LogiQAv2-plus, which shuffle the answer options and modify the correct choices to probe LLMs' logical reasoning. Although LLMs excel on the standard datasets, their performance degrades on these modified versions. Our findings show that introducing task variations and perturbations into the training sets, together with logic-driven data augmentation, significantly enhances LLMs' generalisation and robustness in logical reasoning scenarios.
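As an illustration of the option-shuffling perturbation, the sketch below shuffles the answer options of a multiple-choice example and remaps the gold label so the question remains solvable; the field names and example are hypothetical and the exact "-plus" dataset construction may differ.

```python
# Minimal sketch of option shuffling with label remapping; field names are hypothetical.
import random


def shuffle_options(example: dict, seed: int = 0) -> dict:
    """Shuffle answer options and remap the gold label accordingly."""
    rng = random.Random(seed)
    order = list(range(len(example["options"])))
    rng.shuffle(order)
    return {
        "context": example["context"],
        "question": example["question"],
        "options": [example["options"][i] for i in order],
        "label": order.index(example["label"]),  # new index of the originally correct option
    }


example = {
    "context": "All birds can fly. Tweety is a bird.",
    "question": "Which statement follows?",
    "options": ["Tweety can fly.", "Tweety cannot fly.", "Tweety is a fish.", "No conclusion follows."],
    "label": 0,
}
print(shuffle_options(example))
```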
This thesis explores several approaches to building a more robust question-answering system that helps computers think and reason over natural language text through logical reasoning. Our methods have been evaluated on the public logical reasoning leaderboard ReClor, which they currently lead, and we are the first group in the world to score above 90% on the ReClor hidden test set.