dc.contributor.advisor |
Warren, J |
en |
dc.contributor.author |
Chen, Guocai |
en |
dc.date.accessioned |
2011-02-10T21:48:21Z |
en |
dc.date.issued |
2010 |
en |
dc.identifier.uri |
http://hdl.handle.net/2292/6357 |
en |
dc.description.abstract |
One means of dealing with the quantity and quality issues of web-based consumer health resources is the creation of web portals centred on particular health topics and/or communities of users, a strategy that provides access to a reduced, more manageably sized corpus of good-quality, relevant information. Breast Cancer Knowledge Online (BCKO) is an example of such a topic-centred portal; it provides a gateway to online information about breast cancer for patients, their families, friends and carers. Such portals are enhanced by metadata elements that help to focus user search, but the maintenance of such information, especially for dynamic Web 2.0 style resources, challenges the sustainability of the portal strategy. This thesis addresses this problem by exploring the feasibility of automated assessment of metadata attributes for consumer health webpages. In this thesis I use Hyperspace Analogue to Language (HAL) to model the language use patterns of webpages as Semantic Spaces. I present and demonstrate methods for automatically inferring non-trivial metadata attributes that have been encoded for BCKO: article tone ('supportive' versus 'medical'), author credentials and disease stage. I introduce a refined use of the classic Decision Forest and a novel Summed Similarity Measure (SSM) to automatically classify online webpages based on their Semantic Space models. For the purpose of comparison, I have applied these methods and the well-known SVM algorithm to both BCKO and the popular Reuters21578 dataset. In addition to evaluating performance on random sub-samples, I simulate real use by examining the datasets in their 'natural order', the order in which the cases occurred chronologically.
I find classification accuracies of 90% to 93% for the different BCKO metadata attributes (with SSM always among the top-performing methods, and significantly superior to SVM on the author credential attribute) and approximately 98% for distinguishing the two most frequent classes in Reuters21578. In natural order, accuracies reach approximately 90%. These results indicate that language use patterns can be used to automate classification of consumer health webpages with acceptable accuracy. However, this study has been limited to webpages indexed by the BCKO consumer portal and to its metadata attributes only. A wider range of websites and metadata attributes needs to be assessed, and the classification results should be compared to end-user feedback. |
en |
dc.publisher |
ResearchSpace@Auckland |
en |
dc.relation.ispartof |
PhD Thesis - University of Auckland |
en |
dc.relation.isreferencedby |
UoA99208314714002091 |
en |
dc.rights |
Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. |
en |
dc.rights.uri |
https://researchspace.auckland.ac.nz/docs/uoa-docs/rights.htm |
en |
dc.rights.uri |
http://creativecommons.org/licenses/by-nc-sa/3.0/nz/ |
en |
dc.title |
Semantic Space Models for Classification of Consumer Webpages on Metadata Attributes |
en |
dc.type |
Thesis |
en |
thesis.degree.discipline |
Computer Science |
en |
thesis.degree.grantor |
The University of Auckland |
en |
thesis.degree.level |
Doctoral |
en |
thesis.degree.name |
PhD |
en |
dc.rights.holder |
Copyright: The author |
en |
pubs.peer-review |
false |
en |
pubs.elements-id |
205724 |
en |
dc.relation.isnodouble |
19445 |
* |
pubs.record-created-at-source-date |
2011-02-11 |
en |
dc.identifier.wikidata |
Q112882992 |
|