Annotation of clinical datasets using composite information models

Degree Grantor

The University of Auckland

Abstract

Annotation of clinical datasets is the process of discovering clinical concepts and associating them with machine-readable constructs. Its purpose is to expose the clinical information held in the datasets to automated agents in a structured format that facilitates and accelerates processes such as document comparison, search and decision support. Numerous annotation techniques exist, ranging from simple matching of words against a reference dictionary to complex supervised machine learning. Their results vary, there is no gold standard, and the community constantly produces new techniques. We developed an annotation technique that uses the ontological features of SNOMED CT to discover complex concepts contained in clinical documents and to transform them, through term expansion, into composite information models that reveal the semantics of the messages conveyed in the document. Our algorithm converts clinical documents into segments of an ontology of reality and weights the concepts found according to their frequency and their location in the expanded structure. It relies on term expansion methods and pre-defined weights, and compares the resulting structures by measuring the distances between their x highest-weighted concepts, based on the position of those concepts in the graph structure of the ontology of reality represented by SNOMED CT. We test the effectiveness of our algorithm by comparing its outputs for the headers and bodies of clinical discharge summaries. A machine agent measures the similarity between the composite information models discovered in the body and in the diagnosis section of the matched entity. We tested the assumption that the model derived from the body of a discharge summary will be similar to the model derived from its diagnosis section, and the results show that our algorithm is a valid solution for comparing and finding similarities between clinical documents.
The results show lower distances between concepts in related bodies of text than between concepts in unrelated bodies of text. To further test our algorithm and our novel approach, we explore its utility in comparing openEHR Archetypes with clinical documents. The results are promising and confirm that our approach is a step in the right direction. Our method uses SNOMED CT as an ontology of reality for the expansion of clinical concepts found in the matched entity, and introduces a novel weighting technique that takes into consideration the relations between entities, their position in the ontology of reality and their frequency in the text of the clinical document. As ontologies are quickly becoming content-rich representations of reality, we consider it important to establish their utility as ontologies of reality in annotation processes. The method we developed is an alternative to currently used approaches such as search expansion and corpus annotation methods.
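
The pipeline described in the abstract (term expansion, weighting by frequency and position, then comparing the x highest-weighted concepts by graph distance) can be illustrated with a minimal sketch. It is not the thesis implementation: the toy `PARENTS` hierarchy stands in for SNOMED CT, the concept names are hypothetical, and the `decay` factor, the ancestor-only expansion and the averaging of pairwise distances are all assumptions made for illustration.

```python
from collections import Counter, deque

# Toy "is-a" hierarchy standing in for SNOMED CT (hypothetical concept names).
PARENTS = {
    "disorder": [],
    "heart disease": ["disorder"],
    "myocardial infarction": ["heart disease"],
    "angina": ["heart disease"],
    "lung disease": ["disorder"],
    "pneumonia": ["lung disease"],
    "asthma": ["lung disease"],
}


def ancestors(concept):
    """Map each ancestor of `concept` to its distance in is-a steps."""
    found, queue = {}, deque([(concept, 0)])
    while queue:
        node, dist = queue.popleft()
        for parent in PARENTS[node]:
            if parent not in found:
                found[parent] = dist + 1
                queue.append((parent, dist + 1))
    return found


def build_model(mentions, decay=0.5):
    """Expand mentions to their ancestors and weight every concept.

    Assumed weighting: a mentioned concept scores its frequency; each
    ancestor scores frequency * decay ** steps_from_the_mention.
    """
    weights = Counter()
    for concept, freq in Counter(mentions).items():
        weights[concept] += freq
        for ancestor, steps in ancestors(concept).items():
            weights[ancestor] += freq * decay ** steps
    return weights


def graph_distance(a, b):
    """Shortest path length between two concepts, ignoring edge direction."""
    neighbours = {c: set(p) for c, p in PARENTS.items()}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbours[parent].add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def model_distance(model_a, model_b, x=3):
    """Average graph distance between the x highest-weighted concepts."""
    top_a = [c for c, _ in model_a.most_common(x)]
    top_b = [c for c, _ in model_b.most_common(x)]
    pairs = [(a, b) for a in top_a for b in top_b]
    return sum(graph_distance(a, b) for a, b in pairs) / len(pairs)


# Concepts assumed to have been extracted from three pieces of text.
body = build_model(["myocardial infarction", "myocardial infarction", "angina"])
diagnosis = build_model(["myocardial infarction"])
unrelated = build_model(["pneumonia", "asthma"])

print("body vs diagnosis:", model_distance(body, diagnosis))
print("body vs unrelated:", model_distance(body, unrelated))
```

In this toy setting the body and diagnosis models sit closer together in the hierarchy than the body and the unrelated model, which is the kind of relationship the abstract reports for related and unrelated bodies of text.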
