Clustering, Swarms and Recommender Systems
Abstract
Knowledge Discovery and Data Mining (KDD) helps us understand characteristics of data in large repositories by passing the data through a sequence of operations: data selection, preprocessing, transformation, data mining, and post-processing. Data mining, the core of KDD, extracts informative patterns such as clusters of relevant data, classification and association rules, sequential patterns, and prediction models from different types of data, such as textual, audio-visual, and microarray data. Recently, the use of Swarm Intelligence (SI) based optimization techniques for KDD has increased sharply. Particle Swarm Optimization (PSO) is one of the most highly cited SI techniques for KDD because it is flexible, simple, and extensible enough to be used for different data mining tasks. In this thesis, we exploited the potential of PSO for data clustering and analyzed its performance in different application areas. We extended the basic PSO model into Evolutionary PSO clustering (EPSO-clustering) and Hierarchical PSO clustering (HPSO-clustering). The latter technique arranges the data in a hierarchy of clusters, combining the benefits of hierarchical and partitional clustering and providing a clustering solution at different levels of granularity. We tested our approach on benchmark classification data and extended it to cluster web usage data. Besides performing well on classification data, HPSO-clustering outperformed traditional clustering techniques on web usage data. Web usage data differs from traditional data, so before clustering it must pass through sophisticated analysis and preprocessing phases. One problem with web usage data is its poor quality: it is sparse and noisy, and approximately 60% to 80% of it does not contribute to the pattern mining process.
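As an illustration of the basic idea of PSO-based clustering mentioned above, the following is a minimal sketch and not the EPSO-clustering or HPSO-clustering algorithms of the thesis. It assumes a plain global-best PSO in which each particle encodes k candidate centroids and fitness is the sum of squared distances from each point to its nearest centroid. The function names and the parameter values (w, c1, c2, swarm size, iteration count) are illustrative assumptions.

```python
import random


def sse(centroids, data):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in data:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


def pso_cluster(data, k, n_particles=15, iters=60, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Global-best PSO over sets of k centroids (hypothetical helper, assumed setup)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each particle is a candidate set of k centroids, started at random data points.
    pos = [[list(rng.choice(data)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_fit = [sse(p, data) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Velocity: inertia + pull toward personal best + pull toward global best.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(pos[i], data)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], fit
    return gbest, gbest_fit


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    centroids, error = pso_cluster(points, k=2)
    print(centroids, error)
```

Encoding a whole set of centroids in one particle keeps the fitness function simple; a hierarchical variant would need an additional mechanism for merging or splitting clusters across levels, which this sketch does not attempt.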
To clean and prepare web usage data for clustering, we proposed an analysis and preprocessing model. The model divides the cleaning process into sub-phases: log-based analysis, subject-based analysis, general preprocessing, and algorithm-specific preprocessing. We presented the results of each phase in two experimental studies, one based on data from the NASA website and the other on data from the web server of the Department of Computer Science, University of Auckland. Another problem with web usage data is that not all of the recorded activity comes from genuine web users. A number of known and unknown web crawlers browse the web pages, and their activity is recorded in the web log alongside that of genuine users. Web crawlers behave differently from humans, so their sessions need to be identified and analyzed separately. To tackle this problem, we extended our HPSO-clustering approach to perform cluster-based outlier detection. Besides verifying the technique on outliers in numeric data, we used it to detect web crawlers in web usage data. We selected recommender systems to demonstrate an application of the patterns generated from web usage data by our proposed analysis, clustering, and outlier detection model. Recommender systems have emerged as one of the major applications of the patterns extracted from the behavior of web users and the usage of web resources. Web-based recommender systems are tools that help users find a particular web resource based on the history of the web user or the usage of that resource. Building web-based implicit recommender systems is not a trivial task because of the amount of data and its poor quality. After the preprocessing and clustering phases, we generated recommendations for different scenarios of an active user. We present detailed results on the accuracy of the different recommendations, which verify the capability of our proposed technique.
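The idea of cluster-based outlier detection referred to above can be illustrated with a minimal sketch. It is not the HPSO-based method of the thesis. It assumes that centroids are already available from some clustering step and flags a point as an outlier when its distance to its nearest centroid exceeds a multiple of the mean distance within that cluster. The function names and the `factor` threshold are illustrative assumptions.

```python
import math


def nearest(point, centroids):
    """Return (index of nearest centroid, distance to it)."""
    dists = [math.dist(point, c) for c in centroids]
    j = min(range(len(centroids)), key=dists.__getitem__)
    return j, dists[j]


def flag_outliers(data, centroids, factor=2.0):
    """Indices of points lying unusually far from their own cluster centroid.

    Assumed rule: a point is an outlier if its distance exceeds
    `factor` times the mean member distance of its cluster.
    """
    assigned = [nearest(p, centroids) for p in data]
    outliers = []
    for j in range(len(centroids)):
        members = [(i, d) for i, (cj, d) in enumerate(assigned) if cj == j]
        if not members:
            continue  # empty cluster: nothing to flag
        mean_d = sum(d for _, d in members) / len(members)
        outliers.extend(i for i, d in members if d > factor * mean_d)
    return sorted(outliers)


if __name__ == "__main__":
    sessions = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]
    print(flag_outliers(sessions, centroids=[(0.5, 0.5)]))
```

In a web usage setting, each point would be a feature vector describing a session (for example, request rate or pages visited), and flagged sessions would be candidates for crawler activity to be inspected separately.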
In our research, besides proposing an analysis and preprocessing model, we investigated PSO as a method for efficient and flexible clustering and as a method for detecting outliers in web usage data. All of these issues are crucial for building efficient and accurate recommender systems based on implicit data.