# Hierarchical Problems

Pascal Francq

December 19, 2011 (December 15, 2011)

## 1 Description

### 1.1 General Problem

A hierarchical problem is the task of assigning a set of objects to a hierarchy (a tree of nodes) according to some criteria, each object being assigned to exactly one node. In general, the criterion is to group objects sharing similar characteristics in the same part of the tree, where each node can have as many child nodes as necessary. Nodes without children are called leaf nodes. But the goal may vary. For example, when a taxonomy must be built, the tree should be balanced: the words should be allocated homogeneously through the tree.

The rigorous definition of the problem begins with a set of objects to place in a set of nodes:

Each node, , has a set of zero or more children:

Depending on the problem, the number of available nodes can be infinite (when a taxonomy must be built, for example) or limited (such as when a set of resources must be allocated to an organizational structure).
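The basic structure described above, a tree of nodes where each object lives in exactly one node, can be sketched in a few lines of Python. The names (`Node`, `assign`) are illustrative, not taken from the text:

```python
# Minimal sketch of a hierarchical assignment: a tree of nodes where
# each object is placed in exactly one node.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []          # zero or more child nodes
        self.objects = []           # objects assigned to this node
        if parent is not None:
            parent.children.append(self)

def assign(index, obj, node):
    """Assign obj to node, enforcing the one-node-per-object rule."""
    if obj in index:
        raise ValueError(f"{obj!r} is already assigned")
    node.objects.append(obj)
    index[obj] = node

root = Node("root")
left = Node("left", parent=root)
leaf = Node("leaf", parent=left)    # a node without children is a leaf

index = {}
assign(index, "o1", leaf)
assign(index, "o2", left)
print(index["o1"].name)             # → leaf
```

The `index` dictionary makes the "one node per object" constraint cheap to enforce and gives constant-time lookup of an object's node.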

### 1.2 Set Theory Problem

A particular class of hierarchical problems, such as taxonomy building, consists in building a tree as defined in set theory. In this case, the objects are defined using a set of attributes , each object, , being defined by a vector, , where:

Each node is also defined by a similar vector, , where:

An object, , can then be assigned to a node, , if:

It is possible to build a rooted tree where each leaf node corresponds to a given object and each non-leaf node corresponds to a given subset of attributes shared by all its children (except the root, which is associated with no attributes). An example of such a tree is a thesaurus where each leaf node corresponds to a particular document, each non-leaf node corresponds to a topic, and the attributes are concepts.

For example, suppose we have the following set of objects where attributes are represented by identifiers:

Figure 1↓ shows three trees representing different configurations for these objects. The attributes “added” by each node are specified.

Building such a tree becomes difficult when the number of objects and attributes increases. In fact, depending on the goal (building a balanced tree, minimizing the number of non-leaf nodes, etc.), the resulting tree may differ.
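The set-theoretic assignment rule, an object may go into a node when the node's attribute set is a subset of the object's attributes, is a one-line test on Python sets. The attribute identifiers and node labels below are illustrative:

```python
# Sketch of the set-theoretic assignment rule: an object can be assigned
# to a node when the node's attributes are a subset of the object's.

objects = {
    "o1": {1, 2, 3},
    "o2": {1, 2, 4},
    "o3": {5, 6},
}

# Each non-leaf node carries the attributes shared by all its children;
# the root carries no attributes.
nodes = {
    "root": set(),
    "n1": {1, 2},       # children of n1 share attributes 1 and 2
    "n2": {5},
}

def can_assign(obj_attrs, node_attrs):
    """True when the node's attribute set is included in the object's."""
    return node_attrs <= obj_attrs      # subset test on Python sets

print(can_assign(objects["o1"], nodes["n1"]))   # True
print(can_assign(objects["o3"], nodes["n1"]))   # False
```

Since the root has no attributes, `can_assign` is trivially true for it, which matches the remark that the root is associated with no attributes.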

## 2 Constraints

Hierarchical problems introduce several constraints, but it is always possible to define each of them as a member of one of three classes: the “capacity constraints”, the “homogeneity constraints”, and the “profile constraints”.

The “capacity constraints” constitute the first class. It regroups the constraints related to a given “capacity” of each node. In the case of taxonomy building, one constraint could be a homogeneous distribution of the words among the nodes. These constraints may also impose a minimal or maximal number of objects that a node must hold.

The “homogeneity constraints” are the second class. These constraints specify the conditions that must be respected by an object when it is put in an existing node. This may be, for example, a minimal similarity with the objects already assigned.

The “profile constraints” are the last class. To obtain a valid solution, the clustering must respect the “wishes” of each object. For example, if the goal is to allocate persons to an organizational structure, a constraint could be that each person should be assigned to a unit where he or she knows someone working in it.
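The three constraint classes can each be expressed as a simple predicate checked before an assignment is accepted. The thresholds and the Jaccard similarity below are assumptions chosen for illustration, not definitions from the text:

```python
# Illustrative predicates for the three constraint classes.

def capacity_ok(node_objects, min_size=1, max_size=10):
    """Capacity constraint: bounded number of objects per node."""
    return min_size <= len(node_objects) <= max_size

def homogeneity_ok(candidate, node_objects, similarity, threshold=0.5):
    """Homogeneity constraint: minimal similarity with the objects
    already assigned to the node."""
    return all(similarity(candidate, o) >= threshold for o in node_objects)

def profile_ok(person, unit_members, knows):
    """Profile constraint: the person must know someone in the unit."""
    return any(knows(person, m) for m in unit_members)

def jaccard(a, b):
    """Toy similarity on attribute sets (Jaccard index)."""
    return len(a & b) / len(a | b) if a | b else 1.0

print(capacity_ok(["o1", "o2"], min_size=1, max_size=3))    # True
print(homogeneity_ok({1, 2}, [{1, 2, 3}], jaccard, 0.5))    # True (2/3 >= 0.5)
```

A solver would apply all three predicates to every candidate assignment; a solution is valid only when each assigned object satisfies each class.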

## 3 Criteria

The criteria that should be optimized for a hierarchical problem vary from one problem to another. For some problems, the criteria are well defined. But in the case of thesaurus building, for example, where the goal is to homogeneously organize terms in a hierarchy of concepts, the problem is *fuzzy*: it is not easy to express the fact that “a word is related to a concept”. When several criteria must be optimized, the hierarchical problem is a multiple-objective one.
Some hierarchical algorithms work with only one type of problem (criteria). Other methods, such as Hierarchical Genetic Algorithms (HGA), may solve a whole range of problems.

## 4 Algorithm evaluation

To evaluate the performance of an algorithm, it is necessary to have an “ideal” solution for a given problem instance and some measures to compare the “computed” solution provided by the algorithm with this “ideal” solution. Two situations exist:

- The quality of a solution depends on a computation based on the clustering.
- The quality of a solution depends on the particular assignment of the objects.

In the first case, a comparison between the computation done for the “computed” and “ideal” solution evaluates the performance of the algorithm. For the second situation, we need some measures to compare the “computed” assignments to the ideal ones (each object is assigned to an “ideal node” in the “ideal” solution).

I propose an adaptation of a similarity measure for phylogenetic trees to the case when the tree to build is related to set theory (section 1.2↑).

### 4.1 Similarity Measure for Phylogenetic Trees

In a work related to information retrieval in phylogenetic information systems, Wang, Shan, Shasha, and Piel propose a similarity measure for phylogenetic trees that satisfies certain properties [1]. Their method is based on an

*Up matrix*, , associated with each tree, whose values represent the number of up operations needed to go from node to node . Taking the trees , and of Figure 1↑ as examples, we have because one up operation (and one down operation) is necessary to go from node to node . Similarly, because it takes three up operations (and two down operations) to go from node to node . Table 1↓ shows the up matrices for the trees , and . Notice that .
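An Up-matrix entry is simply the number of steps from a labeled node up to its lowest common ancestor with another labeled node, so it can be computed from parent pointers alone. The tree shape below is illustrative, not the one in Figure 1:

```python
# Sketch of the Up matrix of Wang et al.: up(a, b) counts the up
# operations needed to climb from labeled node a to the lowest common
# ancestor of a and b.

parent = {                      # child -> parent; "r" is the root
    "x": "p", "y": "p", "p": "r", "z": "r",
}

def ancestors(n):
    """Path from n up to the root, n included."""
    path = [n]
    while n in parent:
        n = parent[n]
        path.append(n)
    return path

def up(a, b):
    """Number of up operations from a to the lowest common ancestor."""
    anc_b = set(ancestors(b))
    for steps, n in enumerate(ancestors(a)):
        if n in anc_b:
            return steps
    raise ValueError("nodes are not in the same tree")

leaves = ["x", "y", "z"]
up_matrix = {a: {b: up(a, b) for b in leaves} for a in leaves}
print(up_matrix["x"]["y"])      # 1: one up operation reaches the parent p
print(up_matrix["x"]["z"])      # 2: two up operations reach the root
```

Note that the matrix is not symmetric in general: here `up("z", "x")` is 1, since `z` hangs directly under the root.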
The authors propose a measure of similarity between a query tree and a data tree [A] ([A]: for their problem, they compute the similarities between the query tree and all the trees of the database and retrieve those that have a similarity greater than a given threshold). To do so, they first define four different sets of labeled nodes: is the set of those in , is the set of those in , and . Then, they introduce the

*Updown distance* from to , denoted , as:
Finally, they propose a similarity measure, denoted , given by [B] ([B]: in their original formula, the similarity is given as a percentage):
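The comparison itself can be sketched as follows, under the assumption (mine, not a quotation of the exact formula of Wang et al.) that the updown distance sums the absolute differences of the Up-matrix entries over the shared labeled nodes, and that a similarity is then obtained by normalizing this distance into (0, 1]:

```python
# Hedged sketch of an updown-style distance between two trees over the
# same labeled nodes: sum of absolute differences of Up-matrix entries.

def updown_distance(up_q, up_d, labels):
    """Sum of |Up_Q[a][b] - Up_D[a][b]| over all pairs of shared labels."""
    return sum(abs(up_q[a][b] - up_d[a][b]) for a in labels for b in labels)

# Toy Up matrices for two trees sharing the labeled nodes x and y.
up_ideal = {"x": {"x": 0, "y": 1}, "y": {"x": 1, "y": 0}}
up_comp  = {"x": {"x": 0, "y": 2}, "y": {"x": 2, "y": 0}}

d = updown_distance(up_ideal, up_comp, ["x", "y"])
print(d)                    # 2: each of the two cross entries differs by 1

sim = 1.0 / (1.0 + d)       # illustrative normalization, not the original
print(round(sim, 3))        # 0.333
```

Identical trees give a distance of 0 and thus the maximal similarity of 1 under this normalization.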

### 4.2 Extension to Trees of Objects Sharing Attributes

The similarity measure (↓) can be used for the hierarchical problem if we consider that the labeled nodes correspond to the objects, that the “ideal” tree represents a sort of ideal query, and that the computed trees are data trees. Moreover, in our problem, since all the trees contain all the objects, , and the second term of (↓) is always null. For the same reason, the reduction step described in the original paper is not necessary in our case [1].

Let us suppose that is the ideal tree and that and are two different computed trees. Using (↓) with and , we can compute a similarity between the computed trees and the ideal one:

The fact that is logical because the “structural differences” between and , and between and , are identical. Indeed, in both cases, the “left” part of the tree is composed of two leaf nodes, and the “right” part of the tree is divided into two parts containing, respectively, an object normally assigned to the “left” part ( in and in ) and the two objects and . But, regarding the attributes assigned to the non-leaf nodes, it is clear that and are not identical. Intuitively, we may consider that the sub-tree formed by , and their parent node is more similar between and than between and . In fact, in this sub-tree shares one attribute () with another object (); in , it shares two attributes ( and ) with another object (), while in it shares no attributes at all.

Wang, Shan, Shasha, and Piel propose a simple adaptation of their similarity measure for weighted trees [2]. They associate with each up operation a weight equal to the weight of the corresponding edge. The values of the Up matrix then become the sum of the weights associated with the up operations. We propose to use this extension and to define the weights as the number of non-shared attributes at a given node of the tree.

Let us take again the trees , and of Figure 1↑ and detail some examples. We have because the up operation of corresponds to two non-shared attributes ( and ). In , because the up operation of corresponds to four non-shared attributes (, , and ). Finally, in , because the two needed up operations of correspond to three non-shared attributes ( for the first up operation, and and for the second one). Table 2↓ shows the up matrices for the trees , and using the non-shared-attribute weights.
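The weighted variant only changes what each climbing step costs: instead of counting 1 per up operation, we count the attributes the departing node "adds" over its parent. The tree and attribute sets below are illustrative, not those of Figure 1:

```python
# Sketch of the weighted Up matrix: each up operation from a node is
# weighted by the number of attributes introduced at that node, i.e.
# the attributes it does not share with its parent.

parent = {"x": "p", "y": "p", "p": "r", "z": "r"}
attrs = {"x": {1, 2, 3}, "y": {1, 2, 4}, "p": {1, 2}, "z": {5}, "r": set()}

def added(n):
    """Attributes introduced at n, i.e. not already on its parent."""
    return attrs[n] - attrs.get(parent.get(n), set())

def weighted_up(a, b):
    """Sum of non-shared-attribute weights climbing from a to lca(a, b)."""
    anc_b, n = set(), b
    while True:                 # collect the ancestors of b (b included)
        anc_b.add(n)
        if n not in parent:
            break
        n = parent[n]
    total, n = 0, a
    while n not in anc_b:       # climb from a, accumulating edge weights
        total += len(added(n))
        n = parent[n]
    return total

print(weighted_up("x", "y"))    # 1: x adds attribute 3 over its parent p
print(weighted_up("x", "z"))    # 3: attribute 3 at x, then 1 and 2 at p
```

Plugging these weighted matrices into the same distance computation then separates trees that the unweighted measure considers equally similar, which is exactly the point of the extension.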

Using (↓) with these matrices, we can compute another similarity between these trees:

Now, as expected, .

## References

[1] Jason T. L. Wang, Huiyuan Shan, Dennis Shasha & William H. Piel, “TreeRank: A Similarity Measure for Nearest Neighbor Searching in Phylogenetic Databases”, In *Proceedings of the 15th International Conference on Scientific and Statistical Database Management*, pp. 171–180, 2003.

[2] Jason T. L. Wang, Huiyuan Shan, Dennis Shasha & William H. Piel, “Fast Structural Search in Phylogenetic Databases”, *Evolutionary Bioinformatics Online*, 1, pp. 37–46, 2005.