# Clustering Problems

Pascal Francq

December 19, 2011 (April 20, 2011)

Abstract

Clustering problems are a class of optimization problems where the goal is to group a set of objects into different groups, each object being assigned to exactly one group.

## 1 Description

A clustering problem, sometimes called cluster analysis, is the task of assigning a set of objects to groups (called clusters) according to some criteria, each object being assigned to one group only. In general, the criterion is to group similar objects in the same cluster (using some similarity measure), where each cluster can contain as many objects as necessary. But the goal may vary. For example, in the bin packing problem, each cluster has a maximal size (given by the sum of the sizes of the objects already grouped together), and the goal is to use as few clusters as possible.
The rigorous definition of the problem begins with a set of objects, $O=\{o_1,o_2,\ldots,o_n\}$, to place in a set of groups, $G=\{g_1,g_2,\ldots\}$.
Depending on the problem, the number of available groups can be infinite (in the bin packing problem, for example) or limited (such as when some tasks must be assigned to a fixed number of workers).
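The bin packing variant described above can be made concrete with a minimal Python sketch using the classic first-fit heuristic. The sizes and the capacity are hypothetical example values, and first-fit is just one possible strategy, not necessarily the optimal one:

```python
def first_fit(sizes, capacity):
    """Assign each object to the first bin that can still hold it,
    opening a new bin when no existing bin has enough room."""
    bins = []  # each bin is a list of object sizes
    for size in sizes:
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # open a new bin
    return bins

bins = first_fit([4, 8, 1, 4, 2, 1], capacity=10)
```

Each bin is a cluster, the capacity is a "capacity constraint", and the number of bins used is the criterion to minimize.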

## 2 Constraints

Clustering problems introduce several constraints, but it is always possible to assign each of them to one of three classes: the “capacity constraints”, the “homogeneity constraints” and the “profile constraints”.
The “capacity constraints” constitute the first class. It regroups the constraints related to a given “capacity” of each group. In the bin packing problem, for example, each group has a maximal size. These constraints may also impose a minimal or maximal number of objects per group. In the case of the assignment of tasks to workers, one constraint could be a homogeneous distribution of the tasks among them.
The “homogeneity constraints” are the second class. These constraints specify the conditions that must be respected by an object when it is put in an existing group. This may be, for example, a minimal similarity with the objects already assigned.
The “profile constraints” are the last class. To obtain a valid solution, the clustering must respect the “wishes” of each object. For example, if the goal is to assign students to work groups, a constraint could be to avoid that a given gender is represented only once in a group.

## 3 Criteria

The criteria that should be optimized vary from one clustering problem to another. For some problems, the criteria are well defined (for example, in the bin packing problem, the number of clusters should be minimized). But in the case of document clustering, where the goal is to regroup “similar” documents in the same topic, the problem is fuzzy: it is not easy to express the fact that “two documents are similar”. When several criteria must be optimized, the clustering problem is a multiple-objective one.
Some clustering algorithms work with one type of problem (criteria) only. The k-Means algorithm is designed to minimize the within-cluster sum of squares [1]. Other methods, such as the Grouping Genetic Algorithms (GGA), may solve a whole range of problems.
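The k-Means criterion can be stated concretely: a small sketch that computes the within-cluster sum of squares for a given assignment. The 1-D points and labels below are hypothetical example values:

```python
def wcss(points, labels):
    """Within-cluster sum of squares: for each cluster, the sum of
    squared distances of its members to the cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = sum(members) / len(members)
        total += sum((p - centroid) ** 2 for p in members)
    return total

# Two tight clusters: each contributes (0.5**2) * 2 = 0.5
score = wcss([1.0, 2.0, 9.0, 10.0], [0, 0, 1, 1])  # = 1.0
```

k-Means iteratively updates the assignment to reduce exactly this quantity.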

## 4 Algorithm Evaluation

To evaluate the performance of an algorithm, it is necessary to have an “ideal” solution for a given problem instance and some measures to compare the “computed” solution provided by the algorithm with this “ideal” solution. Two situations exist:
1. The quality of a solution depends on a computation based on the clustering. For example, in the bin packing problem, the number of clusters needed to group the objects.
2. The quality of a solution depends on the particular assignment of the objects.
In the first case, a comparison between the computations done for the “computed” and the “ideal” solutions evaluates the performance of the algorithm. For the second situation, we need some measures to compare the “computed” assignments to the ideal ones (each object is assigned to an “ideal cluster” in the “ideal” solution). At least three measures can be used for this latter comparison:
• the adjusted Rand index [4], used in conventional clustering problems;
• the precision, adapted from the precision of information retrieval [2];
• the recall, adapted from the recall of information retrieval [2].
The adjusted Rand Index is used to measure the global quality of a solution. When a clustering is not correct, the measures of precision and recall are necessary to determine the reasons. In fact, there is a relation between recall and precision:
• A high value of the recall and a low value of the precision indicate that the resulting clustering has too few clusters: most objects that belong to the same ideal cluster are grouped together, but they are also grouped with objects of other ideal clusters.
• A low value of the recall and a high value of the precision indicate that the resulting clustering has too many clusters: most objects that are grouped together belong to the same ideal cluster, but not all the objects of the same ideal cluster are grouped together.
A given object, $o$, is assigned by the algorithm to a computed cluster, and corresponds to an ideal cluster. Let $C(o)$ denote the set of objects of the computed cluster of $o$, and $I(o)$ the set of objects of its ideal cluster; $|C(o)|$ and $|I(o)|$ denote their numbers of objects. Moreover, $|C(o)\cap I(o)|$ is the number of objects that are in both the same computed and the same ideal cluster as object $o$ (including $o$ itself).
To verify whether two objects belong to the same ideal cluster, it is only necessary to verify whether their ideal clusters are identical, since each object belongs to exactly one ideal cluster.
The measures can be illustrated by means of the example in Table 1.
Table 1: Example of an ideal and a computed clustering.
Notice that, while there are three ideal clusters, the computed solution contains four clusters in this example.

### 4.1 Adjusted Rand index

The aim of the adjusted Rand index is to establish an overall comparison between the computed and the ideal clustering. It is based on the Rand index [3], where a comparison is made between the assignments of each pair of objects in the ideal and the computed clustering. Roughly speaking, the Rand index computes the percentage of pairs of objects for which both classifications, the computed and the ideal one, agree.
Let $a$ be the number of pairs of objects in the same ideal cluster and the same computed cluster, let $b$ be the number of pairs of objects in the same ideal cluster but not in the same computed cluster, let $c$ be the number of pairs of objects in the same computed cluster but not in the same ideal cluster, and let $d$ be the number of pairs of objects in different computed clusters and in different ideal clusters. These four counts can be obtained for the example in Table 1.
The Rand index [3] is defined by:
$$R=\frac{a+d}{a+b+c+d}$$
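This definition translates directly into a Python sketch that counts the four kinds of pairs over two labelings represented as flat lists. The labelings in the usage line are hypothetical examples, not those of Table 1:

```python
from itertools import combinations

def rand_index(ideal, computed):
    """Rand index: fraction of object pairs on which both clusterings agree."""
    a = b = c = d = 0
    for i, j in combinations(range(len(ideal)), 2):
        same_ideal = ideal[i] == ideal[j]
        same_computed = computed[i] == computed[j]
        if same_ideal and same_computed:
            a += 1  # together in both clusterings
        elif same_ideal:
            b += 1  # together only in the ideal clustering
        elif same_computed:
            c += 1  # together only in the computed clustering
        else:
            d += 1  # apart in both clusterings
    return (a + d) / (a + b + c + d)

r = rand_index([0, 0, 1, 1], [0, 0, 1, 2])  # 5 of the 6 pairs agree
```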
A problem with the Rand index is that two randomly generated clusterings do not have a constant expected index value, for example zero. Hubert and Arabie therefore introduced the adjusted Rand index [4], which is based on the assumption that the clusterings follow a generalized hypergeometric distribution, i.e. the ideal and computed clusterings are selected at random subject to having a fixed number of objects. Table 2 introduces some notations for the computation, where $n_{ij}$ represents the number of objects of an ideal cluster, $u_i$, grouped into a computed cluster, $v_j$.
| Ideal \ Computed | $v_1$ | $\cdots$ | $v_C$ | Sums |
|:---:|:---:|:---:|:---:|:---:|
| $u_1$ | $n_{11}$ | $\cdots$ | $n_{1C}$ | $a_1$ |
| $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $u_R$ | $n_{R1}$ | $\cdots$ | $n_{RC}$ | $a_R$ |
| Sums | $b_1$ | $\cdots$ | $b_C$ | $n$ |

Table 2: Notation for comparing two clusterings.
The general form of the adjusted Rand index is:
$$\mathrm{ARI}=\frac{\text{Index}-\text{Expected Index}}{\text{Maximum Index}-\text{Expected Index}}$$
It has been shown [4] that the adjusted Rand index can be written:
$$\mathrm{ARI}=\frac{\sum_{i,j}\binom{n_{ij}}{2}-E}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-E}$$
with
$$E=\frac{\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}}{\binom{n}{2}}$$
where $a_{i}$ and $b_{j}$ are the row and column sums of the $n_{ij}$, and $n$ is the total number of objects.
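The contingency-table computation can be sketched in Python for labelings given as flat lists. This is an illustration of the formula, not an official implementation, and the labelings in the usage line are hypothetical:

```python
from math import comb

def adjusted_rand_index(ideal, computed):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(ideal)
    # n_ij: number of objects of ideal cluster u_i placed in computed cluster v_j
    table = {}
    for u, v in zip(ideal, computed):
        table[(u, v)] = table.get((u, v), 0) + 1
    a = {}  # row sums a_i (sizes of the ideal clusters)
    b = {}  # column sums b_j (sizes of the computed clusters)
    for (u, v), count in table.items():
        a[u] = a.get(u, 0) + count
        b[v] = b.get(v, 0) + count
    index = sum(comb(c, 2) for c in table.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    return (index - expected) / (maximum - expected)

ari = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])  # = 4/7
```

A perfect match yields 1.0, while a random clustering has an expected value of zero by construction.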
Let us illustrate the adjusted Rand index on the example shown in Table 1. Table 3 shows the different values of the elements defined in Table 2.
Table 3: Adjusted Rand index computed on the example.
Let us compute the adjusted Rand index:
We can see that the Rand index is greater than the adjusted Rand index. In fact, the range of values is greater for the adjusted Rand index ($[-1,1]$) than for the Rand index ($[0,1]$), which makes it a better measure. Many different clustering measures were studied in [5], and the recommendation is to use the adjusted Rand index.
Let us now consider the examples shown in Table 4. The two computed clusterings are different but have the same quality: all the objects are correctly grouped except one object, which is put into another cluster (in clustering n°1 and in clustering n°2, respectively) rather than remaining alone.
Table 4: Another example, with two computed clusterings.
The adjusted Rand indexes for both computed clusterings can be computed:
It can be seen that the two values are not equal, even though the real quality of both computed clusterings is identical. This is called the disturbed measurement problem. It means that a small variation of the adjusted Rand index between two clusterings does not automatically imply that one of the solutions is better than the other.

### 4.2 Precision

This measure is derived from the concept of precision in information retrieval systems [2]. The precision of the clustering is the average of the precisions of each object. The precision, $P(o)$, of a given object, $o$, is the fraction of the other objects in its computed cluster that are issued from the same ideal cluster:
$$P(o)=\frac{|C(o)\cap I(o)|-1}{|C(o)|-1}$$
where $C(o)$ represents the set of objects attached to the same computed cluster as object $o$, and $I(o)$ the set of objects attached to the same ideal cluster as object $o$ (both sets including $o$ itself).
Let us consider the example in Table 1. The precision of an object is high when the other objects of its computed cluster also belong to its ideal cluster; it is zero when none of them do. After computing the precision of each object, the precision of the clustering, $P$, is the average of the precisions of the objects:
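A small Python sketch of this per-object precision and its average. The labelings are hypothetical lists, and the convention of returning 1.0 for an object alone in its cluster is an assumption, since the text does not cover that case:

```python
def object_precision(o, ideal, computed):
    """Fraction of the other objects in o's computed cluster
    that come from the same ideal cluster as o."""
    comp = [i for i in range(len(ideal)) if computed[i] == computed[o]]
    same = [i for i in comp if i != o and ideal[i] == ideal[o]]
    others = len(comp) - 1
    return len(same) / others if others else 1.0  # assumed convention for singletons

def clustering_precision(ideal, computed):
    """Average of the per-object precisions."""
    return sum(object_precision(o, ideal, computed)
               for o in range(len(ideal))) / len(ideal)

# One computed cluster merges two ideal clusters, so precision drops.
p = clustering_precision([0, 0, 1, 1], [0, 0, 0, 1])  # = 0.5
```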

### 4.3 Recall

This measure is derived from the concept of recall in information retrieval systems [2]. The clustering recall is the average of the recalls of each object. The recall, $R(o)$, of a given object, $o$, is the fraction of the other objects of the same ideal cluster that were grouped into the same computed cluster:
$$R(o)=\frac{|C(o)\cap I(o)|-1}{|I(o)|-1}$$
where $C(o)$ represents the set of objects attached to the same computed cluster as object $o$, and $I(o)$ the set of objects attached to the same ideal cluster as object $o$ (both sets including $o$ itself).
Let us again consider the example in Table 1. The recall of an object is high when the other objects of its ideal cluster are also in its computed cluster; it is zero when none of them are. After computing the recall of each object, the recall of the clustering, $R$, is the average of the recalls of the objects:
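The recall can be sketched symmetrically to the precision. The labelings are hypothetical lists, and the convention of returning 1.0 for an object alone in its ideal cluster is an assumption, since the text does not cover that case:

```python
def object_recall(o, ideal, computed):
    """Fraction of the other objects of o's ideal cluster
    that were grouped into the same computed cluster as o."""
    ideal_set = [i for i in range(len(ideal)) if ideal[i] == ideal[o]]
    same = [i for i in ideal_set if i != o and computed[i] == computed[o]]
    others = len(ideal_set) - 1
    return len(same) / others if others else 1.0  # assumed convention for singletons

def clustering_recall(ideal, computed):
    """Average of the per-object recalls."""
    return sum(object_recall(o, ideal, computed)
               for o in range(len(ideal))) / len(ideal)

# The second ideal cluster is split across two computed clusters.
r = clustering_recall([0, 0, 1, 1], [0, 0, 0, 1])  # = 0.5
```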

## References

[1] James B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281-297, 1967.

[2] Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.

[3] William M. Rand, “Objective criteria for the evaluation of clustering methods”, Journal of the American Statistical Association, 66(336), pp. 846-850, 1971.

[4] Lawrence Hubert & Phipps Arabie, “Comparing partitions”, Journal of Classification, 2(1), pp. 193-218, 1985.

[5] Glenn W. Milligan & Martha C. Cooper, “A study of the comparability of external criteria for hierarchical cluster analysis”, Multivariate Behavioral Research, 21(4), pp. 441-458, 1986.