Similaritybased Grouping Genetic Algorithm
Pascal Francq
June 24, 2013 (March 13, 2011)
Table of Contents
1 Introduction
The Similaritybased Grouping Genetic Algorithm (SGGA) is a semisupervised clustering to group a set of objects, . It is an application of the Grouping Genetic Algorihtms (GGA) developed by Falkenauer [1].
The algorithm supposes that three different matrices of measures between two objects and exist:
 A matrix containing the measures of similarity, .
 A matrix containing a humanbased agreement ratio between two objects, .
 A matrix containing a humanbased disagreement ratio between two objects, .
The agreement and disagreement ratios represent the semisupervised aspects of the algorithm. In practice, the SGGA needs at least the matrix of similarity for each object to group. The other matrices represent some additional constraints.
2 Coding
3 Heuristic
The SGGA implements a variation of a firstfit heuristic. The developed heuristic randomly handles the objects that must be grouped. For each object, the following steps heuristic are performed:
 The heuristic search the first object in that is already in a cluster. If one is found, is grouped in the same cluster.

The heuristic parses each cluster to find those that satisfy the following constraints:
 The cluster does not contains the maximum number of objects allowed (by default, this limit is infinitive).
 The cluster does not contained any objects of .
 The cluster does not contained an object having a similarity with lower than a minimum one.
 If no valid cluster is found, the heuristic creates a new cluster and puts in it. Otherwise, the cluster containing the object with the highest similarity with is choose.
Once all objects are grouped, the heuristic verifies that each cluster contains at least a minimum number of objects (by default, this value is zero). For each cluster that doesn’t contain this number, the heuristic looks for each objects, , can be moved in another cluster using the following method:
 The heuristic searches a valid cluster containing one of the objects of . If one is found, the object is moved to it.
 If no one is found, the heuristic searches a valid cluster that doesn’t contained objects of . If one is found, the object is moved to it.
 If all the objects of a cluster were moved, the cluster is removed from the clustering.
4 Operators
5 Evaluation
To evaluate the chromosomes, the NNGGA uses three criteria. Let us first define two functions:
 represents the average values of the measure, , for object with all the other objects grouped of the same cluster.
 represents the average values of the measure, , for object with all the other objects not grouped in the same cluster.
The three criteria to optimize are:

A similarity criteria to maximize defined by:where represents the number of objects.

An agreement criteria to maximize defined by: where represents the number of objects.

A disagreement criteria to maximize defined by: where represents the number of objects.
To combine these criteria, the NNGGA uses the PROMETHEE multicriteria method.
6 Validation
The NNGGA was used in the GALILEI framework for two problems:
 The document clustering.
 The profile clustering.
References
[1] Emanuel Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley, 1998.