Similarity Measure

Pascal Francq

February 17, 2016 (April 20, 2011)

Abstract

Several algorithms for information-science problems (for example document clustering) rely on a similarity measure between objects. This article presents such a measure for the tensor space model, one that takes the different concept categories into account.

1 Introduction

Several tasks managed in the GALILEI framework are actually implemented through plug-ins that need a similarity measure to compare two objects, including:
1. The document and profile clustering.
2. The validation processes for the profile computing methods and for the topic and community of interests description computing methods.
A similarity measure, $sim(o_1,o_2)$, is a function that compares two objects, $o_1$ and $o_2$, such that it returns 0 if the objects have nothing in common and 1 if they are identical.
In the tensor space model, each object, $o$, is described by a tensor whose element $w_{ij}$ represents the weight of the concept $c_j$ for the vector associated with the meta-concept $m_i$. In the case of document descriptions, all these weights are greater than or equal to 0. But for the profile descriptions and the topic and community of interests descriptions, which are computed with some linear combination of document descriptions, some objects may be represented by tensors with negative weights. The similarity measure must therefore take these negative weights into account.
The following discussion, which leads to the proposed similarity measure, is based on document descriptions. But, because negative weights are managed, this similarity measure can also be used to compute similarities between profiles, between topics and between communities of interests, or between pairs of these objects (for example to compute the similarity between a document and a community of interests).
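As an illustration, such a tensor description can be sketched as a sparse map from meta-concepts to concept-weight vectors. All names below (`Description`, the meta-concepts, the weights) are hypothetical, not the GALILEI API:

```python
# Hypothetical sketch of a tensor-space description: each object maps a
# meta-concept (a context such as "abstract" or "dc:creator") to a sparse
# vector of concept weights.
Description = dict[str, dict[str, float]]

doc: Description = {
    "abstract": {"vector": 2.0, "space": 1.5, "model": 1.0},
    "dc:creator": {"pascal": 1.0, "francq": 1.0},
}

# A profile built as a linear combination of document descriptions may
# carry negative weights (concepts the profile is explicitly not about).
profile: Description = {
    "abstract": {"vector": 1.2, "poetry": -0.8},
}
```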

2 Concept Categories: A Similarity Perspective

In the tensor space model, each concept is associated with a given type and a given category. The categories represent the “nature” of the concepts (token, metadata, structure and link), i.e. different concepts from the same category have the same “nature”. Let us consider them with regard to the similarity between the corresponding objects (such as documents).

2.1 Token Concepts

Using token concepts (mostly index terms) to compare two documents is the basis of the classical vector space model. The underlying hypothesis is simple: two documents sharing some tokens that do not appear very often in the whole corpus probably deal with related topics, i.e. they are similar [1]. Even if we assume token independence, the appearance of the same token in two objects (such as documents or profiles) tells us something about the similarity between these objects.
Since document content may be in several languages, one approach consists in treating the languages (English, French, etc.) and the language-independent meaningful entities (such as names, cities, organizations, etc.) differently. Currently, another approach is used in the GALILEI framework: all terms, whatever their language, are considered as a single subspace (or concept type). This choice is motivated by simplicity and by the fact that most terms written similarly in different languages have the same meaning.
Remark: If some translation mechanism exists, it could be used during the indexing step to directly describe documents with terms of one language only.

2.2 Metadata Concepts

A metadata (often called “data on data”) is supposed to convey descriptive information on an object (such as a document). Examples of metadata include an author, a title, a place, etc. A metadata can be seen by computer programs as a sort of index of a given object (or a given part of a given object). Here is an XML fragment with three metadata:
<dc:creator> Pascal Francq </dc:creator>
<dc:description lang="en"> Vector </dc:description>
<dc:description lang="fr"> Vecteur </dc:description>

From the similarity point of view, objects (such as documents) sharing similar metadata are probably highly related (for example if two documents are written by the same author). It is particularly useful to treat the metadata separately (and not as usual text content) since there are standards defining them, which simplifies the comparison between objects. The Dublin Core Metadata Initiative (DCMI) is probably the best known effort [2].

2.3 Structure Concepts

Since knowledge must be stored, some structural rules are necessary to organize its digital form (file format, database schema, etc.). Even a simple text file uses some “special characters” to separate paragraphs. In fact, structural rules emerged as soon as knowledge began to be written down: punctuation marks and divisions of documents (such as chapters) are structural rules.
At first glance, we may identify three categories of structural rules:
1. The rules that structure and organize the content (mostly written language). Elements such as punctuation marks and divisions immediately come to mind. But other examples exist, such as most tags defined by the DocBook standard [3]:
<chapter>
<title>Bourgeois and Proletarians</title>
<para>
The history of all hitherto existing
society is the history of class
struggles.
</para>
</chapter>

2. The rules that specify how the content should be presented (on screen or on paper). Most tags defined by the well-known HTML standard illustrate this kind of rules:
<html>
<body>
This is a sentence with a word in <i>italic</i> and a word in <b>bold</b>.
</body>
</html>

3. The rules that provide some semantic information on the content. In particular, the XML technologies use tags to convey information that makes content machine-readable. Here is an example of an XML document:
<?xml version="1.0" standalone="no"?>
<discography xmlns="http://music.org/">
<group groupID="Deep Purple">
<groupName> Deep Purple </groupName>
<album>
<albumName> In Rock </albumName>
<published> 1969 </published>
</album>
</group>
</discography>

A particular semantic rule set is called a semantic model, and its elements can be seen as semantic concepts (the tags “group”, “groupName”, etc., in the above example). With regard to the similarity between objects, semantic models can be useful: we may suppose that objects (such as documents) sharing the same structure concepts are related to the same kind of topics. Indeed, the XML standard encourages the reuse of common semantic models, called XML schemas, across documents to describe the same kind of content.
Of course, not every structure rule conveys useful information. For example, presentation-related rules (such as HTML tags) are irrelevant regarding the similarity between two objects. It can therefore be interesting to specify which structure rules constitute useful structure concepts.

2.4 Link Concepts

As explained elsewhere, a link from an object $o_1$ to another object $o_2$ (for example a hyperlink from one Web page to another) may be seen as $o_1$ recognizing a given authority to $o_2$. Taking this comparison further, one may suppose that two documents containing identical links are probably related. For example, two academic papers citing the same referenced books are probably related to the same scientific field.
The objects and the links form a graph. Let us suppose that we have eleven academic papers forming the graph given at Figure 1↑. As just discussed, we can suppose that four of these documents are probably similar since they share a common set of references. But what can be said concerning the similarity between these four documents and the remaining ones? The fact that one document cites a reference that is only connected to the four documents through a chain of intermediate citations is probably not enough to conclude anything about their similarity.
Therefore, we may suppose that, when there are too many intermediate links between two objects and a common link, nothing can be concluded regarding their similarity.

3 Some Vector Similarity Measures

3.1 Classical Cosine Similarity

In the classical vector space model, a similarity, $sim(d_1,d_2)$, between two documents, $d_1$ and $d_2$, is defined as the cosine between their vectors:
$$sim(d_1,d_2)=\frac{\sum_{i}w_{i1}\,w_{i2}}{\sqrt{\sum_{i}w_{i1}^{2}}\sqrt{\sum_{i}w_{i2}^{2}}}$$
where $w_{i1}$ and $w_{i2}$ represent the weights of the corresponding vector elements computed using the tf and idf factors.
I proposed an adaptation for any objects, $o_1$ and $o_2$, represented by vectors that may have negative weights [4]. Let us define the factor $f_i$ as:
$$f_{i}=\begin{cases}0 & \text{if }w_{i1}<0\text{ and }w_{i2}<0\\1 & \text{otherwise}\end{cases}$$
A first measure to compare two objects may be defined by:
$$m(o_1,o_2)=\frac{\sum_{i}f_{i}\,w_{i1}\,w_{i2}}{\sqrt{\sum_{i}w_{i1}^{2}}\sqrt{\sum_{i}w_{i2}^{2}}}$$
where $w_{i1}$ and $w_{i2}$ represent the weights of the corresponding vectors computed using the tf and idf factors.
The introduction of the factor $f_i$ takes negative weights in object descriptions into account (such as for profile descriptions). In fact, if two objects have negative weights for a specific feature (meaning they are not related to it), without this factor the product would add a positive value and the corresponding similarity would increase. But knowing that two objects are both unrelated to an identical feature does not mean that they are related to a same subject. The factor therefore avoids this situation.
To obtain a similarity, $sim(o_1,o_2)\in[0,1]$, the previous measure is rescaled:
$$sim(o_1,o_2)=\frac{m(o_1,o_2)+1}{2}$$
This similarity assumes the independence of the features, i.e. the set of feature vectors is linearly independent and forms a basis for the subspace of interest.

3.2 Adapted Cosine Similarity

I propose an adaptation of the cosine similarity for two vectors, $v_1$ and $v_2$, of tensors representing objects, $o_1$ and $o_2$. Let us first define two vectors $\hat{v}_1$ and $\hat{v}_2$ by using the tensor operations and the concept weighting:
where $e_i$ is the coordinate vector corresponding to concept $c_i$, $e_j$ the coordinate vector of concept $c_j$, and $A^{T}$ represents the transpose of matrix $A$.
The vectors $\hat{v}_1$ and $\hat{v}_2$ correspond to the vectors $v_1$ and $v_2$ when the classical tf and idf factors are used to compute the weights. We can now adapt the similarity measure defined in section 3.1↑. Let us define the factor $f_i$ as:
$$f_{i}=\begin{cases}0 & \text{if }\hat{w}_{i1}<0\text{ and }\hat{w}_{i2}<0\\1 & \text{otherwise}\end{cases}$$
where $\hat{w}_{i1}$ and $\hat{w}_{i2}$ represent the elements of vectors $\hat{v}_1$ and $\hat{v}_2$.
The similarity measure, $sim(v_1,v_2)$, is then defined by [A]:
[A] The concept weighting implies that each vector is normalized to build $\hat{v}$: its elements are divided by the most weighted one. In practice, to compute the cosine similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator.
This measure defines three intervals:
1. A positive value: a majority of the discriminant concepts relevant to describe one vector (positive weights) are also relevant to describe the second one (positive weights).
2. Zero: nothing can be concluded about the similarity of the vectors (they have no concepts in common).
3. A negative value: a majority of the discriminant concepts relevant to describe one vector (positive weights) are irrelevant to describe the second one (negative weights), or conversely.
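A minimal sketch of this adapted cosine on sparse concept-weight vectors. The factor zeroes jointly negative products as described above; the rescaling from [-1, 1] to [0, 1] and all names are assumptions, not the GALILEI implementation:

```python
import math

def factor(w1: float, w2: float) -> float:
    # Zero when both weights are negative: sharing a "not about this
    # concept" signal must not raise the similarity.
    return 0.0 if (w1 < 0 and w2 < 0) else 1.0

def adapted_cosine(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine-style measure in [-1, 1] over sparse concept -> weight maps."""
    num = sum(factor(v1[c], v2[c]) * v1[c] * v2[c] for c in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return num / (n1 * n2) if n1 and n2 else 0.0

def similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    # Assumed rescaling of the measure from [-1, 1] to [0, 1].
    return (adapted_cosine(v1, v2) + 1.0) / 2.0
```

Identical vectors then yield 1, disjoint vectors 0.5, and opposed vectors tend towards 0, matching the three intervals.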

3.3 Overlap Similarity

Another possible similarity measure between two vectors, $v_1$ and $v_2$, of tensors representing objects, $o_1$ and $o_2$, consists in computing a weighted ratio of overlapping concepts. As for the adapted cosine similarity, let us first define two vectors $\hat{v}_1$ and $\hat{v}_2$ by using the tensor operations and the concept weighting:
Again, to avoid comparing concepts that have negative weights in both vectors, let us define the factor $f_i$ as:
$$f_{i}=\begin{cases}0 & \text{if }\hat{w}_{i1}<0\text{ and }\hat{w}_{i2}<0\\1 & \text{otherwise}\end{cases}$$
where $\hat{w}_{i1}$ and $\hat{w}_{i2}$ represent the elements of vectors $\hat{v}_1$ and $\hat{v}_2$.
Such a weight ratio, $r(v_1,v_2)$, can be computed with [B]:
[B] The concept weighting implies that each vector is normalized to build $\hat{v}$: its elements are divided by the most weighted one. In practice, to compute the metadata similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator.
where $\vee$ is the maximum operator.
This measure is adapted to obtain a similarity, $sim(v_1,v_2)$:
As for the adapted cosine similarity, this measure defines three intervals:
1. A positive value: a majority of the discriminant concepts relevant to describe one vector (positive weights) are also relevant to describe the second one (positive weights).
2. Zero: nothing can be concluded about the similarity of the vectors (they have no concepts in common).
3. A negative value: a majority of the discriminant concepts relevant to describe one vector (positive weights) are irrelevant to describe the second one (negative weights), or conversely.
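A sketch of one plausible form of this overlap ratio: the numerator sums the shared-weight products (joint negatives excluded by the factor) and the denominator takes the maximum of the two squared norms. The exact formula was not recoverable from the text, so treat this as illustrative only:

```python
def overlap_measure(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Weighted overlap ratio in [-1, 1] (assumed form)."""
    num = sum(0.0 if (v1[c] < 0 and v2[c] < 0) else v1[c] * v2[c]
              for c in v1.keys() & v2.keys())
    # The maximum operator keeps the ratio bounded by 1 (Cauchy-Schwarz).
    den = max(sum(w * w for w in v1.values()),
              sum(w * w for w in v2.values()))
    return num / den if den else 0.0

def overlap_similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    # Assumed rescaling to [0, 1], as for the adapted cosine.
    return (overlap_measure(v1, v2) + 1.0) / 2.0
```

Unlike the cosine, a vector that is a strict subset of another scores below 1, which is the desired behaviour for partially overlapping metadata values.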

4 Tensor Space Model Similarity

As explained in the previous section, each concept category provides some clues on the similarity between the corresponding objects. This suggests that we may define the similarity between two objects, $o_1$ and $o_2$, by:
$$sim(o_1,o_2)=F\left(sim_{token},sim_{meta},sim_{struct},sim_{link}\right)$$
where $sim_{token}$, $sim_{meta}$, $sim_{struct}$ and $sim_{link}$ represent a similarity based respectively on their token, metadata, structure and link concepts, and $F$ is some aggregating function of these similarities.
To comply with the convention in force in information retrieval, we impose that $sim(o_1,o_2)\in[0,1]$.

4.1 Token Similarity

Token contents are described through a set of vectors, $v_k$, each vector being associated with a meta-concept, $m_k$, and whose elements represent the importance of the tokens (possibly zero if a token is not contained or has no discriminant value). We may suppose that a meta-concept $m_k$, and therefore each vector $v_k$, embodies a particular context (usually of text content). For a document, such contexts could be an abstract, the body, the conclusion, etc. Here, at least three assumptions are possible:
1. We assume that comparing the tokens from two contexts (for example an abstract and a body, or a conclusion and an acknowledgment) provides some useful information regarding the similarity between the corresponding objects.
2. We assume that only tokens associated to a same context provide useful information regarding the similarity between the corresponding objects.
3. We assume that some pairs of contexts provide a clue regarding the similarity between the corresponding objects (for example an abstract and a body), and others do not (for example a conclusion and a bibliography). Moreover, we may associate with each pair of contexts a weight that quantifies its importance regarding the object similarity.
Intuitively, the third option sounds the most logical. After all, a human being can recognize that two articles are related by reading the abstract of the first one and the body or the conclusion of the other one. But it has a practical drawback: all the possible pairs to compare must be defined (possibly with a corresponding weight), which is very difficult to do in practice. The second choice is certainly the simplest one, but it seems a little too limited. Therefore, the first choice appears to be the best solution (in particular if all pairs of contexts are supposed to be of the same importance).
To compute the similarity between two vectors, $v_{1i}$ and $v_{2j}$, representing some tokens, the adapted cosine similarity can be used (section 3.2↑). To manage the different pairs of vectors, a linear combination of the similarities for each pair, weighted by the number of concepts used for the computation, is performed. Let us suppose that two vectors, $v_{1i}$ and $v_{2j}$, have $n_{ij}$ common positive weighted token concepts. We define the token similarity as:
$$sim_{token}(o_1,o_2)=\frac{\sum_{i,j\in T}n_{ij}\,sim(v_{1i},v_{2j})}{\sum_{i,j\in T}n_{ij}}$$
where $T$ represents the set of token concepts.
It should be noticed that, if the objects are described by just one text vector, the similarity is the classical one computed as the cosine between the corresponding vectors weighted by the tf and idf factors.
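A minimal sketch of this pairwise combination, assuming sparse concept-weight vectors, an adapted cosine rescaled to [0, 1], and weighting by common positive concepts; all names are illustrative, not the GALILEI implementation:

```python
import math

def _cosine_sim(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Adapted cosine rescaled to [0, 1]; jointly negative weights skipped."""
    num = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys()
              if not (v1[c] < 0 and v2[c] < 0))
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    m = num / (n1 * n2) if n1 and n2 else 0.0
    return (m + 1.0) / 2.0

def token_similarity(d1: dict[str, dict[str, float]],
                     d2: dict[str, dict[str, float]]) -> float:
    """Linear combination over all context pairs, each weighted by its
    number of common positively weighted token concepts."""
    total, weight = 0.0, 0
    for v1 in d1.values():       # every context of the first object...
        for v2 in d2.values():   # ...against every context of the second
            n = sum(1 for c in v1.keys() & v2.keys() if v1[c] > 0 and v2[c] > 0)
            if n:
                total += n * _cosine_sim(v1, v2)
                weight += n
    return total / weight if weight else 0.5
```

With single-vector descriptions the double loop collapses to one comparison, recovering the classical cosine case.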

4.2 Metadata Similarity

To define a similarity measure related to the metadata concepts, it is necessary to know how these concepts are represented in terms of vectors in an object description. When documents are described, each metadata (for example “dc:creator”) is a meta-concept associated with a vector representing its value (such as the terms “Pascal” and “Francq”).
To compute the metadata similarity, a first choice could be to use the same measure as for the token similarity. But a different measure will be defined. To understand why, let us start with some metadata describing two different documents:
<dc:creator> Marx </dc:creator>
<dc:creator> Marx, Engels </dc:creator>
Although the values differ, it is clear that they are related since the same author appears in both tags. Intuitively, if two objects (such as documents) were only defined by these two metadata, their similarity should reflect this partial overlap: we may suppose it should be greater if “Marx” is more discriminant than “Engels”, and lower in the other case. Moreover, the following two metadata are certainly different even if they share a same index term:
<dc:publisher>
World Health Organization (WHO)
</dc:publisher>
<groupName> The WHO </groupName>

Accordingly, we may suppose that different metadata cannot be compared, i.e. the metadata similarity depends only on the similarities between vectors corresponding to the same meta-concept (metadata):
To compute the similarity between two metadata vectors $v_{1k}$ and $v_{2k}$, the overlap similarity is appropriate (section 3.3↑):
To manage the different metadata, a linear combination can be used to define the metadata similarity:
$$sim_{meta}(o_1,o_2)=\frac{\sum_{k\in M}n_{k}\,sim(v_{1k},v_{2k})}{\sum_{k\in M}n_{k}}$$
where $n_k$ is the number of common positive weighted concepts in $v_{1k}$ and $v_{2k}$, and $M$ represents the set of metadata concepts.
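A sketch of this combination. The overlap form used here (shared-weight products over the larger squared norm, rescaled to [0, 1]) is an assumption, and only vectors sharing the same metadata meta-concept are ever compared:

```python
def _overlap_sim(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Assumed overlap ratio rescaled to [0, 1]; joint negatives excluded."""
    num = sum(v1[c] * v2[c] for c in v1.keys() & v2.keys()
              if not (v1[c] < 0 and v2[c] < 0))
    den = max(sum(w * w for w in v1.values()),
              sum(w * w for w in v2.values()))
    return ((num / den) + 1.0) / 2.0 if den else 0.5

def metadata_similarity(d1: dict[str, dict[str, float]],
                        d2: dict[str, dict[str, float]]) -> float:
    """Combine overlap similarities of vectors with the SAME metadata
    meta-concept only, weighted by common positive concepts."""
    total, weight = 0.0, 0
    for meta in d1.keys() & d2.keys():   # never compare different metadata
        v1, v2 = d1[meta], d2[meta]
        n = sum(1 for c in v1.keys() & v2.keys() if v1[c] > 0 and v2[c] > 0)
        if n:
            total += n * _overlap_sim(v1, v2)
            weight += n
    return total / weight if weight else 0.5
```

On the “Marx” / “Marx, Engels” example above, this yields a value between 0.5 and 1, as the discussion requires.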

4.3 Structure Similarity

As for the metadata concepts, the question of how structure is represented with vectors in an object description must be answered. In the tensor space model, semantic models (for example an XML schema) are structure concept types, each semantic model defining a set of possible structure concepts of that type (including the neutral one). When documents are described, each neutral concept of a given structure type is a meta-concept associated with a vector representing the semantic concepts of that type used to describe the document semantics. Since it makes no sense to compare structure concepts from different semantic models (because they represent distinct domains), we decide that the structure similarity depends only on the similarities between vectors corresponding to the same meta-concept:
If we suppose the independence of the structure concepts, the similarity between two vectors, $v_{1k}$ and $v_{2k}$, associated with a meta-concept, $m_k$, can be computed with the adapted cosine similarity (section 3.2↑). To manage the different vectors corresponding to semantic models, a linear combination is once again used. Let us suppose that two vectors, $v_{1k}$ and $v_{2k}$, have $n_k$ common positive weighted semantic concepts. We define the structure similarity as:
$$sim_{struct}(o_1,o_2)=\frac{\sum_{k\in S}n_{k}\,sim(v_{1k},v_{2k})}{\sum_{k\in S}n_{k}}$$
where $S$ represents the set of structure concepts.

4.4 Link Similarity

In the tensor space model, link schemes (such as URI or DOI) are link concept types, each scheme defining a concept for all possible links of that type and a neutral one. When documents are described, each neutral concept of a given scheme type is a meta-concept associated with a vector representing the links of that type contained in the document (for example hyperlinks in Web pages). Since it makes no sense to compare different link schemes, we decide that the link similarity depends only on the similarities between vectors corresponding to the same meta-concept (same scheme):
As explained in section 2.4↑, two objects are probably related if they share a given set of common links in their immediate neighborhood in the corresponding graph. We may suppose that if an object contains two links to a document and another object contains only one link to this document, the objects are less similar than if the second object also contains two links to the document. Moreover, we may also suppose that links that appear in many objects (for example a generic scientific book) convey less information regarding the similarity than links that appear often within a small subset of objects (for example an academic paper cited mostly by articles of its own field).
To compute the similarity $sim(v_{1k},v_{2k})$, a first step is to build the neighborhoods of the two vectors $v_{1k}$ and $v_{2k}$. The neighborhood of a vector is determined with a simple propagation approach in the graph formed by all links appearing in the vectors associated with the meta-concept, $m_k$. Moreover, if the object described by the vector is itself the target of a link, it is added to its own neighborhood. Figure 2↓ shows how this neighborhood is built when the number of steps is limited to 3. It should be noticed that loops can occur (a document has a link to another document that has a link to the first one).
Each neighborhood of a vector can be seen as a vector, $\nu$, where an element $\nu_i$ represents the weight of a link (represented by the concept $c_i$) in this neighborhood. This weight depends on the weights of the occurrences of each link in each vector and on the number of times it appears. Two examples on an imaginary graph limited to three nodes, $a$, $b$ and $c$, illustrate this principle:
1. Suppose that $a$ has only links to $b$, which has itself only links to $c$. The neighborhood of $a$ then contains both $b$ and $c$, the weight of $c$ depending on the weights of the links traversed to reach it.
2. Suppose that $a$ has several links to $b$ and one link to $c$, and that $b$ has one link to $c$. The neighborhood of $a$ again contains $b$ and $c$, but $c$ now accumulates weight from two different paths.
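Such neighborhoods can be sketched with a bounded propagation. The accumulation rule below (path-weight products, summed over paths) is an assumption, since the exact weighting was not recoverable from the text:

```python
def neighborhood(graph: dict[str, dict[str, float]], start: str,
                 max_steps: int = 3) -> dict[str, float]:
    """Build the 'neighbor vector' of `start` by propagating link weights
    through the graph for at most `max_steps` steps (assumed rule:
    multiply weights along a path, sum over all paths)."""
    weights: dict[str, float] = {}
    frontier = {start: 1.0}
    for _ in range(max_steps):
        nxt: dict[str, float] = {}
        for node, w in frontier.items():
            for target, link_w in graph.get(node, {}).items():
                nxt[target] = nxt.get(target, 0.0) + w * link_w
        for node, w in nxt.items():
            weights[node] = weights.get(node, 0.0) + w
        frontier = nxt
    return weights
```

Because the propagation is bounded by `max_steps`, loops in the graph terminate and a node reached through a loop simply ends up in its own neighborhood.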
Once the “neighbor vectors” are computed, the overlap similarity (section 3.3↑) can be used to compare the two neighborhoods:
To manage the different vectors corresponding to link concepts, again a linear combination is used. Let us suppose that two neighbor vectors, $\nu_{1k}$ and $\nu_{2k}$, have $n_k$ common positive weighted link concepts. We define the link similarity as:
$$sim_{link}(o_1,o_2)=\frac{\sum_{k\in L}n_{k}\,sim(\nu_{1k},\nu_{2k})}{\sum_{k\in L}n_{k}}$$
where $L$ represents the set of link concepts.

4.5 Similarity Aggregation

Once the category similarities are computed, they must be aggregated to obtain the global object similarity. The problem of aggregating different similarities can be seen as a multi-criteria decision problem. In fact, we can formulate the object similarity as follows: given a set of object similarities, how do we rank them depending on four different criteria (the category similarities)? The product and the sum as aggregating methods suppose the independence of the criteria (category similarities). Several multi-criteria decision methods exist that partially take the dependency between the criteria into account.
The Choquet integral is one of these methods, and it performs better than classical aggregating methods such as the weighted sum. The basic idea is to take into account not only the importance of each criterion (the different similarities in our case), but also the interactions between them. For example, one may suppose that a high semantic similarity alone is not very important (small weight for the semantic criterion), but that it becomes important if the text similarity is high too (important weight for the interaction between the text and semantic criteria). These interactions may also be negative (two criteria are important as long as they are not important at the same time). One problem with the Choquet integral is its large number of parameters (14 for 4 criteria). Therefore, the Choquet integral is often used with respect to a 2-additive capacity [5]. This supposes that the combinations of more than two criteria are not taken into account. In this case, the object similarity is then defined by:
where $\wedge$ is the minimum operator, $\vee$ is the maximum operator, $I_{ij}$ represents the weight of the interaction between two criteria, $i$ and $j$, and $v_i$ represents the weight of one criterion, $i$.
It is still necessary to specify 10 parameters (the weights for the 4 criteria, $v_i$, and their 6 interactions, $I_{ij}$). To ensure that the Choquet integral remains in $[0,1]$ (if all the category similarities are in $[0,1]$), the following constraints must be respected:
$$v_{i}-\frac{1}{2}\sum_{j\neq i}\left|I_{ij}\right|\geq0\quad\text{and}\quad\sum_{i}v_{i}=1$$
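A sketch of the 2-additive Choquet integral in the standard form given by Grabisch [5]; the criterion names and parameter values below are purely illustrative:

```python
def choquet_2additive(x: dict[str, float],
                      v: dict[str, float],
                      interactions: dict[tuple[str, str], float]) -> float:
    """2-additive Choquet integral: positive interactions reward criteria
    that are high together (min), negative interactions reward either
    criterion being high (max), and each criterion keeps its own
    (corrected) linear weight."""
    total = 0.0
    for (i, j), w in interactions.items():
        if w > 0:
            total += w * min(x[i], x[j])
        elif w < 0:
            total += -w * max(x[i], x[j])
    for i, xi in x.items():
        corrected = v[i] - 0.5 * sum(abs(w) for pair, w in interactions.items()
                                     if i in pair)
        total += corrected * xi
    return total

# Illustrative parameters for the four category similarities (they satisfy
# the constraints: corrected weights non-negative, weights summing to 1).
weights = {"token": 0.4, "meta": 0.3, "struct": 0.2, "link": 0.1}
inter = {("token", "struct"): 0.2}  # structure matters mostly with tokens
sims = {"token": 0.8, "meta": 0.5, "struct": 0.9, "link": 0.2}
score = choquet_2additive(sims, weights, inter)  # -> 0.66
```

With an empty interaction map the integral reduces to a plain weighted sum, which makes the added expressive power of the interaction terms easy to check.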

5 Implementation

The set of plug-ins (available in the multi-vector Subversion repository) provides the tensor space model similarity measure for different object pairs (in particular documents and profiles). In practice, all the plug-ins instantiate a class from the template GGenericSims. The main elements of the multi-vector Subversion repository are:
• The class GComputeSimCos implements the adapted cosine similarity (section 3.2↑) which is used for the token similarity (section 4.1↑) and the structure similarity (section 4.3↑);
• The class GComputeSimMeta implements the overlap similarity (section 3.3↑) and the metadata similarity (section 4.2↑);
• The method GGenericSims::ComputeSims computes the different category similarities (the results are stored in the array Cats);
• The method GGenericSims::SimilarityChoquet implements the 2-additive capacity Choquet integral as aggregating method (section 4.5↑).

References

[1] Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.

[2] Dublin Core Metadata Initiative, Dublin Core Metadata Element Set, Version 1.1: Reference Description, 1999.

[3] Norman Walsh & Leonard Muellner, Docbook 5.0: The Definitive Guide, O’Reilly, 2008.

[4] Pascal Francq, Collaborative and Structured Search: An Integrated Approach for Sharing Documents Among Users, PhD Thesis, Université libre de Bruxelles, 2003.

[5] Michel Grabisch, “L’utilisation de l’intégrale de Choquet en aide multicritère à la décision” (“The use of the Choquet integral in multi-criteria decision aid”), Newsletter of the European Working Group “Multicriteria Aid for Decisions”, 3(14), pp. 5–10, 2006.