Semantic Space Models for Classification of Consumer Webpages on Metadata Attributes

dc.contributor.advisor Warren, J en Chen, Guocai en 2011-02-10T21:48:21Z en 2010 en
dc.identifier.uri en
dc.description.abstract One means of dealing with the quantity and quality issues of web-based consumer health resources is the creation of web portals centred on particular health topics and/or communities of users, a strategy that provides access to a smaller, more manageable corpus of good-quality, relevant information. Breast Cancer Knowledge Online (BCKO) is an example of such a topic-centred portal; it provides a gateway to online information about breast cancer for patients, their families, friends and carers. Such portals are enhanced by metadata elements that help to focus user search, but the maintenance of such information, especially for dynamic Web 2.0 style resources, challenges the sustainability of the portal strategy. This thesis addresses this problem by exploring the feasibility of automated assessment of metadata attributes for consumer health webpages. In this thesis I use Hyperspace Analogue to Language (HAL) to model the language use patterns of webpages as Semantic Spaces. I present and demonstrate methods for automatically inferring non-trivial metadata attributes that have been encoded for BCKO: article tone ('supportive' versus 'medical'), author credentials and disease stage. I introduce a refined use of the classic Decision Forest and a novel Summed Similarity Measure (SSM) to automatically classify webpages on their Semantic Space models. For the purpose of comparison, I have applied these methods and the well-known support vector machine (SVM) algorithm to both BCKO and the popular Reuters-21578 dataset. In addition to evaluating performance on random sub-samples, I simulate real use by examining the datasets in their 'natural order', the order in which the cases occurred chronologically.
I find classification accuracy of 90% to 93% for the different BCKO metadata attributes (with SSM always among the top-performing methods, and significantly superior to SVM on the author credential attribute) and approximately 98% for distinguishing the two most frequent classes in Reuters-21578. In natural order, accuracies reach approximately 90%. These results indicate that language use patterns can be used to automate classification of consumer health webpages with acceptable accuracy. However, this study has been limited to webpages indexed by the BCKO consumer portal and to its metadata attributes. A wider range of websites and metadata attributes needs to be assessed, and the classification results should be compared with end-user feedback. en
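
The abstract names Hyperspace Analogue to Language (HAL) as the way webpages are modelled but does not spell out the construction. As a rough illustration only, the sketch below builds a HAL-style weighted co-occurrence space from a token list using the commonly described scheme (a sliding window in which nearer preceding words receive larger weights). The function names `build_hal` and `cosine`, the window size, and the toy sentence are assumptions made for this example and are not taken from the thesis. The cosine function is a generic vector similarity and is not the thesis's Summed Similarity Measure, which is not defined in this record.

```python
from collections import defaultdict
from math import sqrt


def build_hal(tokens, window=5):
    """Build a HAL-style space: hal[target][preceding] = weighted count.

    Assumption: weight = window - distance + 1, so the immediately
    preceding word gets weight `window` and the farthest gets 1.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        # Look back at up to `window` words before the target.
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            hal[target][tokens[j]] += window - distance + 1
    return hal


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy input (hypothetical text, not drawn from BCKO).
tokens = "breast cancer support group".split()
hal = build_hal(tokens, window=2)
similarity = cosine(hal["support"], hal["group"])
```

With a window of 2, the row for "support" holds weight 2 for "cancer" (distance 1) and weight 1 for "breast" (distance 2). A page-level classifier would need some further step to turn such word vectors into a page representation; how the thesis does this is not stated here.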
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof PhD Thesis - University of Auckland en
dc.relation.isreferencedby UoA99208314714002091 en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. en
dc.rights.uri en
dc.title Semantic Space Models for Classification of Consumer Webpages on Metadata Attributes en
dc.type Thesis en Computer Science en The University of Auckland en Doctoral en PhD en
dc.rights.holder Copyright: The author en
pubs.peer-review false en
pubs.elements-id 205724 en
dc.relation.isnodouble 19445 *
pubs.record-created-at-source-date 2011-02-11 en
