Nearest Neighbors Grouping Genetic Algorithm
Pascal Francq
June 24, 2013 (March 13, 2011)
Table of Contents
1 Introduction
The Nearest Neighbors Grouping Genetic Algorithm (NNGGA) is a semisupervised clustering to group a set of objects, . It combines a Grouping Genetic Algorihtms (GGA) with an approach based on the nearest neighbors.
The algorithm supposes that three different measures between two objects and exist:
 A measure of similarity, .
 A humanbased agreement ratio between two objects, .
 A humanbased disagreement ratio between two objects, .
The agreement and disagreement ratios represent the semisupervised aspects of the algorithm. If they are defined for some objects, the NNGGA uses them as constraints for the clustering.
Most clustering algorithms suppose that a given matrix exists defining some kind of similarity measure between all the objects to cluster. Despite the fact that it is often a sparse matrix, it can still have a high price to compute and store. Moreover, in our case, we want to exploit three measures which implies to compute three such matrices. Therefore, as its name suggests, the NNGGA uses only the nearest neighbors of each object in regards with these three measures. In practice, for each object , the NNGGA exploits the following lists:
 The nearest neighbors of object (those objects we have the highest similarity with it), .
 A list of objects sharing relevance agreements with , .
 A list of objects having some relevance disagreements with , .
In practice, the NNGGA needs at least the nearest neighbors for each object to group. The other lists represent some additional constraints.
The NNGGA is a particular application of the GGA developed by Falkenauer [1]. In fact, it is adapted from the Similaritybased Grouping Genetic Algorihtms (SGGA) when the number of objects is too high to compute (or maintain) these matrices of measures.
2 Coding
3 Heuristic
The NNGGA implements a variation of a firstfit heuristic. The developed heuristic randomly handles the objects that must be grouped. For each object, the following steps heuristic are performed:
 The heuristic search the first object in that is already in a cluster. If one is found, is grouped in the same cluster.
 Otherwise, the heuristic counts the number of objects of in each cluster.

The clusters are then treated in descending order of these numbers. The heuristic finds the first "valid" cluster that satisfy the following constraints:
 The cluster does not contains the maximum number of objects allowed (by default, this limit is infinitive).
 The cluster does not contained any objects of .
 The cluster does not contained an object having a similarity with lower than a minimum one.
 If no valid cluster is found, the heuristic creates a new cluster and puts in it.
Once all objects are grouped, the heuristic verifies that each cluster contains at least a minimum number of objects (by default, this value is zero). For each cluster that doesn’t contain this number, the heuristic looks for each objects, , can be moved in another cluster using the following method:
 The heuristic searches a valid cluster containing one of the objects of . If one is found, the object is moved to it.
 If no one is found, the heuristic searches a valid cluster that doesn’t contained objects of . If one is found, the object is moved to it.
 If all the objects of a cluster were moved, the cluster is removed from the clustering.
4 Operators
5 Evaluation
To evaluate the chromosomes, the NNGGA uses three criteria. Let us first define two functions:
 represents the average values of the measure, , for object with all the other objects grouped of the same cluster.
 represents the average values of the measure, , for object with all the other objects not grouped in the same cluster.
The three criteria to optimize are:

A similarity criteria to maximize defined by:where represents the number of objects.

An agreement criteria to maximize defined by: where represents the number of objects.

A disagreement criteria to maximize defined by: where represents the number of objects.
To combine these criteria, the NNGGA uses the PROMETHEE multicriteria method.
6 Validation
The NNGGA was used in the GALILEI framework for two problems:
 The document clustering.
 The profile clustering.
References
[1] Emanuel Falkenauer, Genetic Algorithms and Grouping Problems, John Wiley, 1998.