Abstract
The clustering problems are a class of optimization problems where the goal is to group a set of objects in different groups, each object being assigned in one group only.
1 Description
A clustering problem, sometimes called cluster analysis, is the task to assigning a set of objects into groups (called clusters) according some criteria, each object being assigned in one group only. In general, the criteria is to group similar objects in the same cluster (using some similarity measure), where each cluster can contain as many objects as necessary. But the goal may vary. For example in the bin packing problem, each cluster can only group a maximal size (giving by the sum of the sizes of the object already group together), and the goal is to use as few cluster as possible.
The rigorous definition of the problem begins with a set of objects to place in a set of groups:
Depending of the problem, the number of available groups can be infinite (in the bin packing problem for example) or limited (such as when some tasks must be assigned to a fixed number of workers).
2 Constraints
The clustering problems introduce several constraints, but it is always possible to define them as a member of one of the three constraints classes, namely the “capacity constraints” class, the “homogeneity constraints” and the “profile constraints” classes.
The “capacity constraints” constitute the first class. It regroups the constraints related to a given “capacity” of each group. In the bin packing problem for example, each group has a maximal size. These constraints may also introduce a minimal or maximal number of objects that a group must have. In the case the assignment of tasks to workers, one constraint could be a homogeneous repartition of the task between them.
The “homogeneity constraints” are the second class. These constraints specify the conditions that must be respected by an object when it is put in an existing group. This may be, for example, a minimal similarity with the objects already assigned.
The “profile constraints” are the last class. To obtain a valid solution, the clustering must respect the “wishes” of each object. For example, if the goal is to assign students to work groups, a constraint could be to avoid that a given genre is represented only once in a group.
3 Criteria
The criteria that should be optimized for the clustering problem vary from one problem to another. For some problems, the criteria are well defined (for example in the bin packing problem where the number of clusters should be minimized). But, in the case of the document clustering, where the goal is to regroup “similar” documents in the same topic, the problem is
fuzzy: is it not easy to express the fact that “two documents are similar”. When several criteria must be optimized, the clustering problem is a
multiple objective one.
Some clustering algorithms work with one type of problems (criteria) only. The k-Means algorithm is designed to minimize the within-cluster sum of squares
[1]. Other methods, such as the
Grouping Genetic Algorithms (GGA), may solve a whole range of problems.
4 Algorithm Evaluation
To evaluate the performance of an algorithm, it is necessary to have an “ideal” solution for a given problem instance and some measures to compare the “computed” solution provided by the algorithm with this “ideal” solution. Two situations exist:
-
The quality of a solution depends on a computation based on the clustering. For example, in the bin packing problem, the number of clusters needed to group the objects.
-
The quality of a solution depends on the particular assignment of the objects.
In the first case, a comparison between the computation done for the “computed” and “ideal” solution evaluates the performance of the algorithm. For the second situation, we need some measures to compare the “computed” assignments to the ideal ones (each object is assigned to an “ideal cluster” in the “ideal” solution). At least, three measures can be used for this latest comparison:
-
the adjusted Rand index, , used in conventional clustering problems;
-
the precision, , adapted from the precision of the information retrieval problems [2];
-
the recall, , adapted from the recall of the information retrieval problems [2].
The adjusted Rand Index is used to measure the global quality of a solution. When a clustering is not correct, the measures of precision and recall are necessary to determine the reasons. In fact, there is a relation between recall and precision:
-
A high value of the recall and a low value of the precision indicate that the resulting clustering has too few clusters: most objects that belongs to the same ideal cluster are grouped together, but are also grouped with objects of other ideal clusters.
-
A low value of the recall and a high value of the precision indicate that the resulting clustering has too many clusters: most objects that are grouped together belongs to the same ideal cluster, but all the objects of the same ideal cluster are not grouped together.
A given object, , is assigned to a computed cluster, by the algorithm, and corresponds to a ideal cluster, . The number of objects in the cluster, , is denoted . represents the number of objects of “ideal” cluster . Moreover, is the number of objects that are in the same computed and ideal cluster as object .
To verify whether an object and a computed cluster belong to the same ideal cluster, it is only necessary to verify whether the two ideal cluster sets, , are identical because each one contains one element only.
The measures can be illustrated by means of the example in Table
1↓.
Ideal Clustering
|
Computed Clustering
|
|
|
Table 1 Example of an ideal and a computed clustering.
Notice that, while there are 3 ideal cluster, the computed solution provides 4 clusters in this example.
4.1 Adjusted Rand index
The aim of the adjusted
Rand index is to establish an overall comparison between the computed and the ideal clustering. It is based on the
Rand index
[3], where a comparison is made between the assignments of each pair of objects in the ideal and the computed clustering. Roughly speaking, the
Rand index computes the percentage of pairs of objects for which both classification methods, the computed and the ideal one, agree.
Let
be the number of pairs of objects in the same topic and the same computed cluster, let
be the number of pairs of objects in the same ideal cluster but not in the same computed cluster, let
be the number of pairs of objects in the same computed cluster but not in the same ideal cluster, and let
be the number of pairs of objects in different computed clusters and in different ideal clusters. For the example in Table
1↑:
The
Rand index
[3] is defined by:
A problem with the
Rand index is that two randomly computed clustering have not a constant index, for example zero.
Hubert and
Arabie therefore introduce the adjusted
Rand index
[4], which is based on the assumption that the process is the generalized hypergeometric distribution, i.e. the ideal and computed clustering are selected at random so that the number of objects in both clustering is fixed. Table
2↓ introduces some notations for the computation where
represents the number of objects of an ideal cluster,
, to be grouped into a computed cluster,
.
Ideal \ Computed
|
|
|
|
|
Sums
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sums
|
|
|
|
|
|
Table 2 Notation for comparing two clustering.
The general form of the adjusted Rand index is:
It has been show that the adjusted
Rand index can be written
[4]:
with
Let us illustrate the adjusted
Rand index on the example shown in Table
1↑. Table
3↓ shows the different values of the elements defined in Table
2↑.
Ideal \ Computed
|
|
|
|
|
Sums
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sums
|
|
|
|
|
|
Table 3 Adjusted Rand index computed on the example.
Let us compute the adjusted Rand index:
We can see that the
Rand index is greater than the adjusted
Rand index. In fact, the range of values is greater for the adjusted
Rand index (
) than for the
Rand index (
), which makes it a better measure. Many different clustering measures were studied in
[5], and the recommendation is to use adjusted
Rand index.
Let us consider now the examples shown in Table
4↓. The two computed clustering are different but have the same quality: all the objects are correctly grouped except object
which is put either into
(clustering n°
) or in
(clustering n°
) rather than to remain alone.
Ideal Clustering
|
Computed Clustering n°
|
Computed Clustering n°
|
|
|
|
Table 4 Another example.
The adjusted Rand Indexes, and , for both computed clustering can be computed:
It can be seen that the two values are not equal also if the real quality of both computed clustering is identical. This is called the disturbed measurement problem. This means that a little variation of the adjusted Rand Index between two clustering does not automatically implies that one of the solution is better than the other one.
4.2 Precision
This measure is derived from the concept of precision in the information retrieval systems
[2]. The precision of the clustering is the average of the precisions of each object. The precision,
, of a given object,
, is the fraction of the other objects in its computed cluster that are issued from the same ideal cluster:
where represents the set of objects attached to the same ideal cluster than object .
Let us consider the example in Table
1↑. The precision
because the object,
, in the computed cluster of
is also in the same ideal cluster. The precision
because none of the objects of computed cluster of
are in the same ideal cluster. The precision
because only two objects (
and
) of computed cluster of
are also in the same topic. After computing the precision of each object, the precision of the clustering,
, is the average of the precisions of the objects:
4.3 Recall
This measure is derived from the concept of recall in the information retrieval systems
[2]. The clustering recall is the average of the recalls of each object. The recall,
, of a given object,
, is the fraction of the other objects of the same ideal cluster that were grouped into the same computed cluster:
where represents the set of objects attached to the same ideal cluster than object .
Let us again consider the example in Table
1↑. Recall
because only one object,
, of the same ideal cluster as
is also in the same computed cluster. The recall
because none of the objects of the same ideal cluster as
(
and
) are in the same computed cluster. The recall
because all the objects of the same ideal cluster as
(
and
) are also in the same virtual community. After computing the recall of each object, the recall of the clustering,
, is the average of the recalls of the objects:
5 Related
References
[1] James B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281—297, 1967.
[2] Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.
[3] William M. Rand, “Objective criteria for the evaluation of clustering methods”, Journal of the American Statistical Association, 66(336), pp. 846—850, 1971.
[4] Lawrence Hubert & Phipps Arabie, “Comparing partitions”, Journal of Classification, 2(1), pp. 193—218, 1985.
[5] Glenn W. Milligan & Martha C. Cooper, “A study of the comparability of external criteria for hierarchical cluster analysis”, Multivariate Behavioral Research, 21(4), pp. 441—458, 1986.