Semantic Space Models for Classification of Consumer Webpages on Metadata Attributes

dc.contributor.advisor Warren, J en
dc.contributor.author Chen, Guocai en
dc.date.accessioned 2011-02-10T21:48:21Z en
dc.date.issued 2010 en
dc.identifier.uri http://hdl.handle.net/2292/6357 en
dc.description.abstract One means of dealing with the quantity and quality issues of web-based consumer health resources is the creation of web portals centred on particular health topics and/or communities of users, a strategy that provides access to a more manageably sized corpus of good-quality, relevant information. Breast Cancer Knowledge Online (BCKO) is an example of such a topic-centred portal; it provides a gateway to online information about breast cancer for patients and their families, friends and carers. Such portals are enhanced by metadata elements that help to focus user search, but maintaining this metadata, especially for dynamic Web 2.0 style resources, challenges the sustainability of the portal strategy. This thesis addresses this problem by exploring the feasibility of automated assessment of metadata attributes for consumer health webpages. In this thesis I use Hyperspace Analogue to Language (HAL) to model the language use patterns of webpages as Semantic Spaces. I present and demonstrate methods for automatically inferring non-trivial metadata attributes that have been encoded for BCKO: article tone ('supportive' versus 'medical'), author credentials and disease stage. I introduce a refined use of the classic Decision Forest and a novel Summed Similarity Measure (SSM) to automatically classify online webpages on their Semantic Space models. For the purpose of comparison, I have applied these methods and the well-known SVM algorithm to both BCKO and the popular Reuters21578 dataset. In addition to evaluating performance on random sub-samples, I simulate real use by looking at the datasets in their 'natural order' - the order in which the cases occurred chronologically.
I find classification accuracies of 90% to 93% for the different BCKO metadata attributes (with SSM always among the top-performing methods, and significantly superior to SVM on the author credential attribute) and approximately 98% for distinguishing the two most frequent classes in Reuters21578. In natural order, accuracies reach approximately 90%. These results indicate that language use patterns can be used to automate classification of consumer health webpages with acceptable accuracy. However, this study has been limited to webpages indexed by the BCKO consumer portal and to its metadata attributes only. A wider range of websites and metadata attributes needs to be assessed, and the classification results should be compared to end-user feedback. en
dc.publisher ResearchSpace@Auckland en
dc.relation.ispartof PhD Thesis - University of Auckland en
dc.relation.isreferencedby UoA99208314714002091 en
dc.rights Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. en
dc.rights.uri https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm en
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ en
dc.title Semantic Space Models for Classification of Consumer Webpages on Metadata Attributes en
dc.type Thesis en
thesis.degree.discipline Computer Science en
thesis.degree.grantor The University of Auckland en
thesis.degree.level Doctoral en
thesis.degree.name PhD en
dc.rights.holder Copyright: The author en
pubs.peer-review false en
pubs.elements-id 205724 en
pubs.record-created-at-source-date 2011-02-11 en


Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by-nc-sa/3.0/nz/
