Clustering (GALILEI)

Pascal Francq

December 30, 2011 (April 20, 2011)

Abstract

The clustering in GALILEI consists to build document topics and communities of interests. Since documents and profiles are object represented with the same knowledge model, the same approach is used for both, the document clustering and the profile clustering.

1 Introduction

The GALILEI framework deals with two categories of objects that could be clustered: the documents in topics (each document being assigned to one topic only) and the profiles in communities of interests (each profile being assigned to one community only)::
where ) returns the community of interest containing the profile, , and ) returns the topic of the document, .
In the knowledge model of the GALILEI framework, both categories are represented as tensors and a similarity measure exists. If we suppose that the algorithm used to cluster exploit only measures on the objects (which is the case as explained below), it can be used for both problems (the document clustering and the profile clustering).

2 Agreement and Disagreement Ratios

The GALILEI framework supposes that users assess the relevance of documents for a given profile. These assessments provide some information on the similarity between documents:
  • documents assessed as relevant by a same profile are probably similar;
  • documents assessed differently [A]  [A] One of the document is assessed as relevant, the other one as irrelevant or fuzzy relevant. by a same profile are probably dissimilar.
Of course, if two documents are assessed as irrelevant or fuzzy relevant by the same profile, it is impossible to infer a degree of similarity or dissimilarity. We can take an identical reasoning for the profiles:
  • profiles assessing as relevant the same document are probably similar;
  • profiles assessing differently [B]  [B] One of the profile assessed it as relevant, the other one as irrelevant or fuzzy relevant. the same document are probably dissimilar.
Based on these observations, four measure may be propose (using the subset notations):
  1. An agreement ratio between two documents, and , defined as the ratio between the number of profiles assessing both documents are relevant and the number of profiles assessing them: where represents a minimum number of profiles having assessed the two documents.
  2. A disagreement ratio between two documents, and , defined as the ratio between the number of profiles assessing both documents differently and the number of profiles assessing them: where represents a minimum number of profiles having assessed the two documents.
  3. An agreement ratio between two profile, and , defined as the ratio between the number of documents assessed as relevant by both profiles and the number of documents assessed by them: where represents a minimum number of documents assessed by the two profiles.
  4. A disagreement ratio between two profile, and , defined as the ratio between the number of documents assessed differently by both profiles and the number of documents assessed by them: where represents a minimum number of documents assessed by the two profiles.
The and parameters are used to avoid that a conclusion is drawn based on too few assessments. For example, if only one profile has assessed two documents as relevant, it is of course insufficient to conclude that the two documents belong to the same group. In practice, and can be set to 10.

3 Plug-ins

In the GALILEI project, two plug-ins implement algorithm for the object clustering :
gca The plug-in implements the Similarity-based Grouping Genetic Algorithm (SGGA) and the Nearest Neighbors Grouping Genetic Algorithm (NNGGA). The first one was initially used for the profiles clustering. The second one was developed to avoid the computing on the whole similarity, agreement and disagreement matrices when the number of objects to cluster becomes important (in particular for the documents clustering).
kmeans The plug-in implements the classical k-Means algorithm [1] and the k-Means++ variant [2]. These algorithms were mainly implemented for comparison with the SGGA and the NNGGA.

4 Results

A series of tests done for the profiles clustering shows that the SGGA performs better than the k-Means-based algorithms [3]. A complete evaluation of the SGGA and the NNGAA for both the profiles clustering and the documents clustering is planned.

References

[1] James B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281—297, 1967.

[2] David Arthur & Sergei Vassilvitskii, “K-Means++: The Advantages of Careful Seeding”, In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete algorithms, pp. 1027—1035, 2007.

[3] Pascal Francq, Collaborative and Structured Search: An Integrated Approach for Sharing Documents Among Users, PhD Thesis, Université libre de Bruxelles, 2003.