Abstract:
The growing number of textual documents from diverse sources, such as websites (e.g. social networks, news sites, magazines, blogs and medical recommendation sites), publications, articles and medical prescriptions, produces massive amounts of complex data every day. This has led many researchers to analyse document content and to measure the similarity among documents and texts in order to cluster them. One popular approach represents the terms within each document as a vector and measures the similarity between each pair of vectors by the angle or the Euclidean distance between them. Considering only these two criteria may miss important underlying similarities. We propose a new method, TS-SS, to measure the similarity level among documents, with the aim of better understanding which documents are more (or less) similar. This similarity level can serve as a practical measure for document clustering and recommendation systems. Our study gives insight into the drawbacks of geometrical and non-geometrical similarity measures and provides a novel method that combines other geometric criteria to measure the similarity level among documents from a new perspective. We apply Euclidean distance, Cosine similarity and our new method to four labelled datasets. Finally, we report how these three geometrical similarity measures perform in terms of similarity level and clustering purity, using four evaluation techniques. The evaluation results show that our new model outperforms the other measures.
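To make the two baseline criteria concrete, the following minimal sketch computes Cosine similarity and Euclidean distance for small term vectors. It is an illustration only and does not implement TS-SS; the function names and the two-term vectors `doc_a`, `doc_b` and `doc_c` are hypothetical and chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two term vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical term-count vectors over a two-term vocabulary (assumed data).
doc_a = [1.0, 1.0]
doc_b = [2.0, 2.0]  # same direction as doc_a, twice the magnitude
doc_c = [2.0, 0.0]  # different direction from doc_a

# The angle alone cannot separate doc_b from doc_a: both point the same way.
print("cos(a, b) =", cosine_similarity(doc_a, doc_b))
print("cos(a, c) =", cosine_similarity(doc_a, doc_c))

# The distance alone cannot separate doc_b from doc_c: both are equally far from doc_a.
print("dist(a, b) =", euclidean_distance(doc_a, doc_b))
print("dist(a, c) =", euclidean_distance(doc_a, doc_c))
```

In this toy case the angle treats `doc_a` and `doc_b` as identical despite their different magnitudes, while the distance ranks `doc_b` and `doc_c` as equally close to `doc_a` despite their different directions. Each criterion, taken alone, ignores information that the other captures.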