Document Description

Pascal Francq

January 22, 2013 (April 20, 2011)


Computing a document description is the process of extracting concepts from digital files. This process involves solving several problems: analyzing the incoming characters, selecting the “correct” words, replacing some words with a more generic form, etc. This article describes how document descriptions are built within the GALILEI framework with respect to its tensor space model.
An electronic document corpus is a set of digital files, each file being formatted according to a given standard (PDF, HTML, XML, MS-Word, etc.). Most of these standards do not aim to provide a machine-understandable form of the knowledge encapsulated in the corresponding file. Today, XML is a widely used standard for providing some semantic information. The GALILEI framework proposes a tensor space model to represent any kind of object, and documents in particular. This article describes how such a tensor is built.

1 Documents in the Tensor Space Model

In the tensor space model, documents are described with tensors. These tensors represent a set of concept vectors, each vector being associated with a particular meta-concept. Currently, the analysis described below produces the following vectors:
  • a vector corresponding to the neutral concept “*” of the type “text block” that contains all the terms appearing in the body of the document;
  • a vector corresponding to the neutral concept “*” of the type “URI” that contains all the URIs appearing in the body of the document;
  • for each metadata item appearing in the document (“author”, “title”, etc.), a particular vector is built with the content treated as terms;
  • for each kind of semantic set used (for example an XML schema), a particular vector is built containing all the semantic information items (such as XML tags) used.
Of course, when a document doesn’t contain a concept of a given type (for example a URI), the corresponding vector isn’t built.
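The structure described above can be pictured as a map from meta-concepts to concept vectors. The following is only a minimal sketch under that assumption: the names `Tensor`, `MetaConcept` and `ConceptVector` are hypothetical and are not the classes of the GALILEI platform.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical names for illustration; these are not the GALILEI classes.
// A meta-concept is identified here by its concept type and its name.
using MetaConcept = std::pair<std::string, std::string>;   // {type, name}
// A concept vector maps each concept (here a simple term) to its weight.
using ConceptVector = std::map<std::string, double>;

// A tensor seen as a set of concept vectors, one per meta-concept.
class Tensor {
public:
    // Add a weight to a term in the vector of a given meta-concept.
    // A vector is only created on first use, so a type that never
    // appears in the document has no vector at all.
    void Add(const MetaConcept& meta, const std::string& term, double weight = 1.0) {
        vectors_[meta][term] += weight;
    }
    // Weight of a term in a vector (0 if the vector or the term is absent).
    double Weight(const MetaConcept& meta, const std::string& term) const {
        auto vec = vectors_.find(meta);
        if (vec == vectors_.end()) return 0.0;
        auto it = vec->second.find(term);
        return it == vec->second.end() ? 0.0 : it->second;
    }
    // Number of vectors actually built.
    std::size_t NbVectors() const { return vectors_.size(); }
private:
    std::map<MetaConcept, ConceptVector> vectors_;
};
```

With this layout, a document without any URI simply has no entry for the pair {"URI", "*"}.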

2 Document Analysis

The process of building a document description consists of analyzing a file formatted with a given standard in order to build a machine-understandable description of it (a tensor in the GALILEI framework). This process involves two tasks:
File Extraction Starting from a particular document, the information useful to build the description must be identified. This implies at least that some text streams are extracted. To associate them with multiple meta-concepts (such as metadata), it is necessary to understand the structural rules of that particular document.
Document Building Once some text streams are available, possibly associated with a specific meta-concept, they are processed to identify the “concepts” (the reader should remember that a “concept” in the tensor space model may be a simple term).
Figure 1 The process of building a document description.
As detailed below, Figure 1 shows that the description building in the GALILEI framework is divided into the following steps:
  1. The lexical analysis of text streams.
  2. The treatment of the stopwords.
  3. The stemming of the remaining words.
  4. The selection of the concepts used to build the description.
The process builds three outputs: the document descriptions using the tensor space model, the document trees (the hierarchical organization of the concepts they contain, including the positions and the depths of their occurrences), and an index. While the descriptions and trees are optimized to find all the concepts associated with a given document (and possibly hierarchical information), the index is optimized to search all the documents containing a given concept. The GALILEI platform provides mechanisms to store these outputs.
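The difference between the two access patterns can be illustrated with a toy in-memory structure. This is a sketch only: the class `Index` and its methods are hypothetical, documents are identified by plain integers, and the storage mechanisms of the GALILEI platform are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical class for illustration; it is not part of the GALILEI API.
class Index {
public:
    // Register the terms of a document in both structures.
    void AddDocument(int docId, const std::vector<std::string>& terms) {
        for (const std::string& term : terms) {
            ++descriptions_[docId][term];   // document -> term -> occurrences
            postings_[term].insert(docId);  // term -> documents
        }
    }
    // All the documents containing a given term (empty set if none).
    std::set<int> DocumentsContaining(const std::string& term) const {
        auto it = postings_.find(term);
        return it == postings_.end() ? std::set<int>() : it->second;
    }
    // Number of occurrences of a term in a document (0 if absent).
    std::size_t Occurrences(int docId, const std::string& term) const {
        auto doc = descriptions_.find(docId);
        if (doc == descriptions_.end()) return 0;
        auto it = doc->second.find(term);
        return it == doc->second.end() ? 0 : it->second;
    }
private:
    std::map<int, std::map<std::string, std::size_t>> descriptions_;
    std::map<std::string, std::set<int>> postings_;
};
```

The first map answers "which concepts describe this document?", the second one "which documents contain this concept?", at the price of storing the association twice.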

3 File Extraction

The role of the file extraction is not only to build text streams that will be treated by the description building task, but also to provide some information on how to interpret a particular text stream (simple content, metadata, (sub)title, etc.).

3.1 General case

As explained earlier, most file standards don’t provide semantic information related to their content. At best, they allow some metadata to be associated with a particular file (mostly the author, a description and some keywords). For example, the HTML format provides the <meta> tag to include some metadata in a Web page. Similarly, word processor formats (such as MS-Word or OpenDocument) associate several properties with a document, some of them being metadata. When analyzing such file formats, it is important to identify such metadata to enrich the document description. For the rest, these formats are composed of characters combined with structural rules without any semantics (specifying chapter titles, paragraphs, blocks in italic, etc.). In practice, they can be seen as one single text stream to process.
It is the role of filters to extract the information (structural rules and text stream) from the files.

3.2 XML Documents

For XML documents, the situation is more complex because the standard was defined to enrich the text content with semantic rules reified as tags. Several remarks can be made:
  • the XML technologies are based on XML schemas proposing semantic models composed from semantic concepts (the tags);
  • the combination of some tags and their content provides useful information to represent XML documents (for example the well-known “metadata tags” of the Dublin Core MetaData Initiative (DCMI) [1]);
  • not all the tags are needed to capture the main semantic information contained in an XML document, since some of them serve structuring purposes only (they group several child tags related to the same kind of information) or give information such as the language used.
Based on these remarks, and regarding the way documents are currently described with the tensor space model (section 1), the following simple ideas are proposed:
Structural rules For each XML schema, a corresponding concept type is created and named after the URI of the XML schema (its concept category is “semantic”). For each XML schema appearing in a given XML document, a vector is associated with the corresponding neutral concept, “*”. Each XML tag of that particular XML schema is considered as a concept, and the corresponding vector stores those appearing in the XML document analyzed, the weight of each tag being its number of occurrences.
Text stream The text content (except the XML tags and attributes) is assumed to form the text stream to be analyzed. In practice, some parts of this text stream contain metadata and others contain text blocks. The following XML example illustrates this:
    <bibliography>
      <entry>
        <author>Pascal Francq</author>
        <title>Internet : La construction d’un mythe</title>
        <comment>This is a very good book.</comment>
      </entry>
    </bibliography>
    It is clear that the <author> and <title> tags provide metadata, while the <comment> tag represents “normal text”. It therefore makes sense to define a list of tags whose content should be treated as metadata.

3.3 XML Metadata Detection

As explained above, not all the XML tags provide metadata for the corresponding document description. Ideally, such tags should be explicitly referenced as such, but in practice this isn’t always the case. I therefore propose a simple method to automatically detect which tags must be considered as “metadata tags” if this information is not available.
I suppose that, most of the time, such tags have small content (typically a few terms). I also consider that generic information is “on top” of the documents, which means that “metadata tags” are probably high in the XML document hierarchy. Finally, I suppose that they don’t constitute the major part of the tags in an XML document and that they have no child tags.
Figure 2 Heuristic to detect metadata tags
Based on these assumptions, I propose the heuristic illustrated in Figure 2 to guess which tags are “metadata tags”. It consists of verifying four constraints:
  1. A “metadata tag” is “high” in the XML hierarchy (for example it has a depth of maximum 3 starting with depth 0 for the root node).
  2. A “metadata tag” doesn’t appear too often in the whole document. In practice, the heuristic may limit the number of occurrences to an absolute value (for example fewer than 5 occurrences) or to a relative value (less than 5% of the total number of different tags in the document).
  3. A “metadata tag” has no child tags.
  4. A “metadata tag” contains a maximal number of terms (for example 20).
The reader should notice that the same tag may be considered as “metadata” or not inside the same document. In fact, except for the second constraint, all the others are “local” to a particular occurrence. I made this choice to avoid being too restrictive (all the occurrences must fulfill the criteria to be considered as “metadata tags”) or too permissive (if one occurrence fulfills the criteria, all the occurrences are considered as “metadata tags”).
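The four constraints can be sketched as a simple predicate evaluated for each occurrence of a tag. The structure and function names below are hypothetical, the default thresholds are the example values quoted above, and only the absolute variant of the second constraint is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one occurrence of an XML tag.
struct TagOccurrence {
    std::size_t depth;          // depth in the XML tree (root node = 0)
    std::size_t nbOccurrences;  // occurrences of this tag name in the whole document
    std::size_t nbChildTags;    // number of child tags of this occurrence
    std::size_t nbTerms;        // number of terms in its text content
};

// Thresholds of the heuristic; the defaults are the example values of the text.
struct MetadataLimits {
    std::size_t maxDepth = 3;
    std::size_t maxOccurrences = 4;   // i.e. fewer than 5 occurrences
    std::size_t maxTerms = 20;
};

// True if this particular occurrence satisfies the four constraints.
// Only the occurrence count is global; the other tests are local.
bool IsMetadataTag(const TagOccurrence& tag, const MetadataLimits& limits = MetadataLimits{}) {
    return tag.depth <= limits.maxDepth
        && tag.nbOccurrences <= limits.maxOccurrences
        && tag.nbChildTags == 0
        && tag.nbTerms <= limits.maxTerms;
}
```

Because the predicate is applied per occurrence, two occurrences of the same tag at different depths may be classified differently, as noted above.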

3.4 Document Tree

A simple look at a text document (on screen or on paper) shows that it always adopts a structure. In fact, a document is a tree where each node corresponds to a given granularity in the structure (part, chapter, section, paragraph, sentence, word, character, etc.). For several text mining problems, such as document fragment retrieval, this structure is of some importance. Therefore, a filter extracts not only the tokens appearing in a document, but also some information related to each of their occurrences (in particular their depth and position) in order to store, for each document, a document tree where each node corresponds to a given concept.

3.4.1 Node Depth

In practice, unlike semantic information, most file standards provide some structural and organizational information usable to infer the depth of a given node. At least, they assign a “style” to a block of text that fixes how it is presented, each “style” implicitly providing some information on the division of the document (for example “title1”, “title2”, “paragraph”, etc.). HTML and the word processor formats are such standards. Here is a text fragment expressed using HTML:
<h1>Bourgeois and Proletarians</h1>
<p>The history of all hitherto existing
   society is the history of class
Other standards propose an explicit division of the document, in particular XML documents. An example is the DocBook standard [2], here used to express the same text fragment as above:
<chapter>
  <title>Bourgeois and Proletarians</title>
  <para>The history of all hitherto existing
        society is the history of class
While these two cases seem similar at first glance, there is a difference: the DocBook standard clearly delimits the different parts of a document, while with the HTML standard we can only make suppositions (for example that the <p> tag following the <h1> tag corresponds to a paragraph that belongs to the chapter just before). Table 1 shows how the tag depths differ from one file format to the other when the document structure is represented as a tree: with the DocBook standard the depths correspond to the division levels of the document, while for the HTML standard all the tags have the same depth.
                DocBook             HTML
                Tag         Depth   Tag     Depth
chapter         <chapter>   1       -       -
chapter title   <title>     2       <h1>    1
paragraph       <para>      2       <p>     1
Table 1 Depth differences: A first look.
So, to recognize the different document divisions for some file formats (such as HTML), a heuristic is needed. The simplest one is to associate a given depth with each “division style”, and to suppose that each style applied marks the transition from one part to another. In the case of the HTML standard, we may define the following array of styles: [{<html>,0},{<body>,1},{<h1>,2},{<h2>,3},{<h3>,4},{<h4>,5},{<h5>,6},{<h6>,7},{<p>,8}]. But there is still a problem. Let us take another HTML example to explain it:
<h1>Title 1</h1>
<p>This is a first paragraph.</p>
<h3>Title 1.1</h3>
<p>This is a second paragraph.</p>
Table 2 compares the depths associated with the different tags with the logical division levels that they represent in this particular example.
Tag     Depth   Logical Level
<h1>    1       1
<p>     8       2
<h3>    3       2
<p>     8       3
Table 2 Depth differences: A second look.
It is therefore necessary to construct a slightly more complex heuristic, such as the one shown in Algorithm 1. It uses two structures: the first one represents a “division style” (a name and a corresponding depth) and the second one corresponds to a logical division (the associated style and the logical level). The logical divisions are stored in a last-in first-out stack (with an operation to get the element on the top).

Algorithm 1 Document Divisions Detection.
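One possible way to realize such a stack-based heuristic is sketched below. It is an assumption-laden illustration rather than a transcription of Algorithm 1: the names `Division` and `LogicalLevels` are hypothetical, the input is a flat sequence of tag names, and the style depths are those of the array of HTML styles given earlier. A new division closes every open division whose style is at the same depth or deeper, and is then nested one logical level below the division left on top of the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stack>
#include <string>
#include <vector>

// A logical division: the depth of its style and its logical level.
struct Division {
    std::size_t styleDepth;
    std::size_t level;
};

// Compute the logical level of each tag of a flat sequence of tags.
// 'styles' maps a tag name to its style depth; a tag that is not a
// division style is skipped and reported with level 0.
std::vector<std::size_t> LogicalLevels(const std::vector<std::string>& tags,
                                       const std::map<std::string, std::size_t>& styles) {
    std::stack<Division> open;              // divisions currently open
    std::vector<std::size_t> levels;
    for (const std::string& tag : tags) {
        auto style = styles.find(tag);
        if (style == styles.end()) {
            levels.push_back(0);
            continue;
        }
        // Close every division whose style is at the same depth or deeper.
        while (!open.empty() && open.top().styleDepth >= style->second)
            open.pop();
        // The new division is nested one level below the enclosing one.
        std::size_t level = open.empty() ? 1 : open.top().level + 1;
        open.push(Division{style->second, level});
        levels.push_back(level);
    }
    return levels;
}
```

Applied to the sequence <h1>, <p>, <h3>, <p> of the second HTML example, this sketch yields the logical levels 1, 2, 2 and 3 listed in Table 2.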

3.4.2 Node Type

A given token is not only characterized by its position in the file and its depth in the corresponding document structure, but also by the type of the node where it occurs. In fact, a token such as a word in an XML document can be interpreted differently when it occurs as an XML tag or as “normal” text. Therefore, when analyzing a document, a filter should also provide a type for each node extracted.
One design choice could have been to give each filter the freedom to create any node type. While this approach provides the greatest flexibility, it comes with a major drawback: an algorithm that would exploit this information must know every possible node type. Since this is difficult in practice, the choice was made to define a fixed set of node types (Table 3). This makes the exploitation of the node type independent of the particular file formats of the documents treated.
Type        Description
Text        The node is a “normal” textual token.
Semantic    The node provides some semantic information (for example it is an XML tag).
Division    The node provides document division information (for example the <h1> and <chapter> tags in the above examples).
Metadata    The node provides a metadata value (for example the <author> tag). In the case of XML, it is a particular case of the semantic node type. But some formats, such as PDF, store metadata outside the core of the document.
Attribute   The node represents an attribute associated with the element of its parent node. Typically, this is the case of an XML attribute.
Link        The node points to another document (or part of a document).
Table 3 Node types.

3.5 A complete example

To illustrate the different points developed above, I propose an example. Since the XML standard offers the richest information, it will be used. Let us suppose the following XML document:
<?xml version="1.0"?>
<doc xmlns="" xmlns:dc="/">
  • <dc:title>Document Title</dc:title>
    <dc:creator>John Cleese</dc:creator>
    <dc:creator>Michael Palin</dc:creator>
    • <para>Monty Python’s Flying Circus.</para>
      <para>And now, something completely different!</para>
We may identify different elements:
  1. The tags <dc:title> and <dc:creator> are defined as metadata by the Dublin Core MetaData Initiative XML schema.
  2. The tags <part> and <para> carry text content.
  3. We may suppose that the <support> tag doesn’t provide any interesting information (at least regarding the knowledge embodied in the document).
Type of the nodes Concept Category Concept Type Meta-concept Content extracted
Semantic Semantic * title, creator
Division Semantic * doc, part, para, support
Metadata Metadata title Document Title
Metadata Metadata creator
John Cleese
Michael Palin
Text Text text block *
Monty Python’s Flying Circus.
And now, something completely different!
Table 4 Example of file extraction.
Table 4↑ shows the result of a file extraction ( represents the URI of a XML schema). Five vectors should be created for the different elements of the document:
  1. A vector for the XML schema of the Dublin Core MetaData Initiative.
  2. A vector for the XML schema of the “mytags” XML application.
  3. A vector for the metadata “title”.
  4. A vector for the metadata “creator”.
  5. A vector for the text block corresponding to all the text content of the document.

4 Lexical Analysis

The role of the lexical analysis is to find the sequences of characters that can be considered as valid tokens.

4.1 Token Categories

The simplest method to identify a token is to consider it as a sequence of characters delimited by spaces (multiple spaces are considered to be a single space, and special characters like “tab”, “end-of-line” and “carriage return” are considered to be normal spaces). But, in practice, it is somewhat more complicated. Fox considers four choices for the establishment of tokens [3]:
Digits Most numbers do not offer a semantic interest for the content of documents, so digits are normally not included in the tokens. However, sequences containing numbers may have a meaning as in “ISO9002”. One easy solution consists of accepting sequences containing numbers only if the first character is a letter. But “510B.C” may be an important expression too, so another solution is to remove all the sequences containing numbers unless they match some regular expressions representing specific sequences.
Hyphens When a sequence contains one or more hyphens, a decision must be made. The first approach is to break the word into its hyphenated terms: the strings “state-of-the-art” and “state of the art” will then be considered as identical, for example. But strings such as “Jean-Claude” or “B-52” have a meaning by themselves, and in these cases it is better to consider them as an indivisible sequence. Hyphens are also used to mark a single word broken into syllables at the end of a line. The best solution is to adopt a general rule and to specify a list of exceptions on a case by case basis.
Other punctuation Some punctuation marks are often used as parts of tokens. A dot is used in most file names as in “index.html” or may appear in names like “X.25”. Emails are also formed with punctuation characters, as in “” for example. Generally, the punctuation is removed during the lexical analysis, e.g. “TCP/IP” is transformed into “TCPIP”. However, the best solution is again to adopt a general rule and to specify the exceptions.
Case The case of the letters mostly does not matter, so the lexical analysis converts the whole text into either lower or upper case. But some semantics might be lost during this conversion: the words “Bank” and “bank” can have two different meanings, for example.
It is well known that implementing such rules is not difficult as such, but the cost in computing time of this step depends directly on the number of rules [3]. That is why many systems avoid these text operations. Although no studies have been undertaken to evaluate the cost of these operations in information retrieval systems, it was shown that lexical analysis accounts for more than 50 percent of the computational expense of a document analysis [4].

4.2 Tokenizing Process

The tokenizing process works in two steps: it first identifies the character sequences that form text units, and it then verifies whether these units respect some constraints before considering them as valid tokens. The first step identifies the text units following some rules:
  1. A text unit cannot start with a space or a punctuation character (defined by the classical C ispunct function).
  2. A sequence of punctuation characters followed by a space cannot end a textual unit. So, “pascal.francq” is a text unit while “pascal. francq” are two text units (“pascal” and “francq”).
  3. A text unit always ends with a space.
Figure 3 shows that this first step works like a finite-state machine (the states are in normal text, the character types in italic).
Figure 3 Tokenizing using a finite-state machine.
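The three rules can be sketched in a few lines. The function below is a hypothetical illustration (`ExtractTextUnits` is not a GALILEI function): instead of an explicit finite-state machine, it splits the text on spaces and then trims the punctuation characters at both ends of each chunk, which yields the same text units under the rules stated above. Punctuation is tested with the C `ispunct` function, so only ASCII punctuation is recognized in the default locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Extract the text units of a text: sequences of characters delimited by
// spaces, without leading or trailing punctuation characters.
std::vector<std::string> ExtractTextUnits(const std::string& text) {
    std::vector<std::string> units;
    std::istringstream stream(text);
    std::string chunk;
    while (stream >> chunk) {   // operator>> splits on any whitespace
        std::size_t begin = 0, end = chunk.size();
        // Rule 1: a text unit cannot start with a punctuation character.
        while (begin < end && std::ispunct(static_cast<unsigned char>(chunk[begin])))
            ++begin;
        // Rule 2: punctuation followed by a space cannot end a text unit.
        while (end > begin && std::ispunct(static_cast<unsigned char>(chunk[end - 1])))
            --end;
        // Chunks made of punctuation only do not form a text unit.
        if (begin < end)
            units.push_back(chunk.substr(begin, end - begin));
    }
    return units;
}
```

Punctuation inside a chunk is kept, so “pascal.francq” remains a single text unit while “pascal. francq” produces two.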
Once a text unit is identified, it must be decided whether it is a valid token. Currently, two types of tokens are treated:
Link The text unit is recognized as a valid URI. This is the case for email addresses (such as “” and “”) and documents accessible via HTTP (such as “”). Note that an option allows URIs to be skipped.
Text A text unit that isn’t recognized as a valid URI is converted to its lowercase equivalent. To decide whether it is a valid token, two possibilities exist:
    1. A textual unit is considered as a token if it is made up of letters only. This option ensures that tokens are definitely words. It also has the advantage of limiting the “sizes” of the descriptions (thus less memory used and a reduced computational time).
    2. Some textual units containing non-alphabetical characters may carry structured information like norm designations or email addresses. To eliminate those that are not useful, a textual unit must respect three constraints:
      1. A given character cannot occur consecutively more than a given number of times (for example 3). This rule avoids considering the textual unit “loool” as a token.
      2. The textual unit must contain a minimum percentage of letters (for example 30%). This rule eliminates the textual unit “B-1070” (16.7%) but not “” (88%).
      3. A given token cannot contain more than a maximum number of non-letter characters (for example 5). This rule recognizes as a token the textual unit “iso6002” but not the textual unit “0123456789a”.
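The three constraints of the second possibility can be sketched as a predicate on a text unit. The names `TokenLimits` and `IsValidToken` are hypothetical, the default limits are the example values quoted above, and the letters are detected with the C `isalpha` function (ASCII letters only in the default locale), which is a simplification.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Limits of the validity test; the defaults are the example values of the text.
struct TokenLimits {
    std::size_t maxConsecutive = 3;   // longest accepted run of one character
    double minLetterRatio = 0.30;     // minimum proportion of letters
    std::size_t maxNonLetters = 5;    // maximum number of non-letter characters
};

// True if a text unit containing non-alphabetical characters may be a token.
bool IsValidToken(const std::string& unit, const TokenLimits& limits = TokenLimits{}) {
    if (unit.empty()) return false;
    std::size_t letters = 0, run = 0;
    char previous = '\0';
    for (char c : unit) {
        if (std::isalpha(static_cast<unsigned char>(c))) ++letters;
        // Length of the current run of identical consecutive characters.
        run = (c == previous) ? run + 1 : 1;
        if (run > limits.maxConsecutive) return false;
        previous = c;
    }
    std::size_t nonLetters = unit.size() - letters;
    if (nonLetters > limits.maxNonLetters) return false;
    return static_cast<double>(letters) / unit.size() >= limits.minLetterRatio;
}
```

With these defaults, “iso6002” is accepted (3 letters out of 7 characters, 4 non-letter characters), while “B-1070” and “0123456789a” are rejected.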

5 Stopword Treatment

The stopwords are the most frequently occurring words of a given language which have a poor semantic content in a given context. For example:
English the, then, or, and, a, an, …
French le, la, or, non, car, …

5.1 Stopword Removal

It was recognized early on that many of the stopwords are not characteristic of the document contents [5]. In fact, using such terms in search engine queries retrieves almost all of the documents because the discriminatory value of these stopwords is low [6, 7]. Moreover, eliminating all the stopwords decreases the number of processed words by 20 to 30 percent of the total number of words in English-language documents [8].
The stopwords may be words other than articles, prepositions and conjunctions such as some verbs, adverbs or adjectives. For example, in the Brown collection a list of 450 stopwords was defined and applied successfully [8]. The stopwords may also be specific to a given context. For example, the word “poem” has probably no discriminatory value in a collection of documents on poetry and can therefore be eliminated. But such an approach is unusable for general collections.
Of course, the elimination of stopwords destroys some important information like “to be or not to be” in a document on Shakespeare. This is the reason why some systems use a full text index, i.e. all the words in the document are indexed. However, in the GALILEI project, they are removed (but this option can be disabled).

5.2 Document Language Detection

But the stopwords have another usage: they can determine the language of a document (this information is needed to apply a stemming algorithm as detailed in section 6). Although some document formats indicate the language used, this information is not always accessible. It is therefore often necessary to guess the language from the content itself.
Let us define for each language, $l$, a list, $S_l$, of stopwords, $s$. Moreover, let us define $o(s,d)$ as the number of occurrences of a stopword, $s$, in a document, $d$. We can build for each language the set of stopwords, $S_l(d)$, that appear in $d$:
$$S_l(d) = \{ s \in S_l \mid o(s,d) > 0 \}$$
Moreover, let us define $r_l(d)$ as the ratio between the number of different stopwords of a language appearing in the document, $|S_l(d)|$, and the total number of different words (tokens containing only letters), $n(d)$:
$$r_l(d) = \frac{|S_l(d)|}{n(d)}$$
Since the stopwords represent the most frequently used words in every language, their distribution in a particular document gives a clue about the language used. In fact, two conditions are used to assign a language to a document:
  1. The document must contain a reasonable number of different stopwords. If a document contains only the stopwords “no” and “yes” for example, it is probably not an English-language document. This can be expressed with the ratio:
    $$r_l(d) \geq r_{min}$$
    where $r_{min}$ is a given threshold.
  2. Since the stopwords constitute a large amount of the words found in a document, we may suppose that the correct language is the one having the most stopwords in the document:
    $$l(d) = \arg\max_{l \in L(d)} \sum_{s \in S_l(d)} o(s,d)$$
    where $L(d)$ represents the set of all languages respecting the first condition for document $d$.
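The two conditions can be sketched as follows. This is a hypothetical illustration, not the GALILEI implementation: the function name, the language codes and the threshold passed by the caller are assumptions, the words are supposed to be lowercase tokens containing only letters, and “having the most stopwords” is interpreted here as the largest total number of stopword occurrences.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Guess the language of a document from its words and from one list of
// stopwords per language. Returns an empty string if no language
// respects the first condition (or if the document is empty).
std::string DetectLanguage(const std::vector<std::string>& words,
                           const std::map<std::string, std::set<std::string>>& stopwords,
                           double minRatio) {
    // Number of occurrences of each different word of the document.
    std::map<std::string, std::size_t> occurrences;
    for (const std::string& word : words) ++occurrences[word];
    if (occurrences.empty()) return "";

    std::string best;
    std::size_t bestCount = 0;
    for (const auto& lang : stopwords) {
        std::size_t different = 0, total = 0;
        for (const auto& entry : occurrences) {
            if (lang.second.count(entry.first)) {
                ++different;            // one more different stopword
                total += entry.second;  // all its occurrences
            }
        }
        // Condition 1: enough different stopwords among the different words.
        double ratio = static_cast<double>(different) / occurrences.size();
        // Condition 2 (assumed reading): most stopword occurrences wins.
        if (ratio >= minRatio && total > bestCount) {
            best = lang.first;
            bestCount = total;
        }
    }
    return best;
}
```

Returning an empty string when no language qualifies leaves it to the caller to decide whether the document is skipped or analyzed without stemming.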
When a document is updated, i.e. its content is modified, a new analysis is necessary. Most of the time, the modifications to the content do not imply a change in the language, i.e. this step is not necessary. On the other hand, there is no method to check this assertion other than determining the language again. So it must be decided whether the language has to be determined each time that a document is analyzed, or whether the languages are considered to be static, i.e. a document keeps its language once it is known.
Table 5 shows the average, the standard deviation, the minimum and the maximum of the value of the ratio for documents in different languages with respect to English and French (the “Newsgroup” collection was used for English and the “Le Soir” collection for French). As can be seen, the difference between the average ratio for the correct language and for the other one is significant. In practice, a good value for is for both languages.
French documents English documents
Table 5 Statistics on the stopword ratio.
Remark: The XML specification defines an attribute, xml:lang, which can be used to specify the language of a portion of a document, i.e. XML documents may contain different parts in different languages. In fact, every document may contain parts written in different languages. Currently, it is assumed that only one language is associated with each document.

6 Stemming

When analyzing the list of words contained in a document, it immediately appears that many variants of the same word occur in the document, through plurals or past tense suffixes for example. To avoid processing all these variants as different tokens, a solution is to replace all these variants by their stems. A stem is the portion of a word which is left after removing its prefixes and suffixes. An example is “connect” which is the stem for “connection”, “connections”, “connected” and “connecting”. In practice, the different tokens pointing to the same stem are grouped into a single one (and the corresponding occurrences are added). Another advantage is therefore that the number of indexed tokens decreases as a result of this operation.
To implement this stemming, one type of algorithm removes the suffix and/or the prefix of a word following a set of rules to give a stem, with only the first applicable rule being used. An example of such rules is given by Harman for the removal of plural suffixes [9]:
If a word ends in “ies” but not “eies” or “aies”, 
     replace “ies” by “y”.
If a word ends in “es” but not “aes”, “ees”, or “oes”, 
     replace “es” by “e”.
If a word ends in “s” but not “us” or “ss”, 
     remove “s”.
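These plural rules translate almost literally into code. The sketch below uses hypothetical function names and assumes lowercase input; the exceptions of the second rule are taken as “aes”, “ees” and “oes”. A rule is considered applicable here when the suffix matches and none of its exceptions does, which is one possible reading of “the first applicable rule”.

```cpp
#include <cassert>
#include <string>

// True if 'word' ends with 'suffix'.
static bool EndsWith(const std::string& word, const std::string& suffix) {
    return word.size() >= suffix.size()
        && word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Remove a plural suffix; only the first applicable rule is used.
std::string StemPlural(const std::string& word) {
    // "ies" -> "y", except for "eies" and "aies".
    if (EndsWith(word, "ies") && !EndsWith(word, "eies") && !EndsWith(word, "aies"))
        return word.substr(0, word.size() - 3) + "y";
    // "es" -> "e", except for "aes", "ees" and "oes".
    if (EndsWith(word, "es") && !EndsWith(word, "aes") && !EndsWith(word, "ees")
        && !EndsWith(word, "oes"))
        return word.substr(0, word.size() - 1);
    // Remove a final "s", except for "us" and "ss".
    if (EndsWith(word, "s") && !EndsWith(word, "us") && !EndsWith(word, "ss"))
        return word.substr(0, word.size() - 1);
    return word;
}
```

Such a stemmer is deliberately weak: it only merges singular and plural forms and leaves a word such as “connected” untouched.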
Several different stemming algorithms have been compared to evaluate the advantage of using them [10]. Most of the time, general stemming has either a positive effect or no effect at all on retrieval performance if the process is measured in terms of quality. But since there are still conflicting studies, many Web search engines do not include these algorithms.
This approach is, of course, language dependent, i.e. a specific algorithm must be developed for each language. Moreover, it is impossible to determine the set of rules necessary to handle all the possibilities of a language. This means that some imperfections cannot be avoided: either a common stem between several words isn’t found, or different words wrongly point to the same stem. Several stemmers exist. For English, Porter’s algorithm is the most widely used due to its compactness [11]. The GALILEI team developed the Carry algorithm for French [12], and a stemmer for Arabic. Moreover, the Snowball stemmers and resources page proposes several stemmers for multiple languages which are available as open source libraries in C and Java (and integrated in the GALILEI project).

7 Concept Selection

It is evident that, when the number of different concepts increases, the amount of memory and the computational time needed by the algorithms also increase. It can therefore be interesting to introduce a dimensionality reduction step. The idea is to transform the initial set of concepts, $C$, into another space, $C'$, called a reduced term set, where $|C'| < |C|$. One idea is to define some criteria to prevent certain concepts from being used for the document description.
Currently, two filtering rules are proposed for terms extracted from documents:
  1. Select only those that appear at least a given number of times in a document (for example 2). The assumption is of course that the terms best describing the content of a document are frequently used. An advantage of this filtering is that it reduces the number of concepts used to describe a document, which decreases the amount of memory needed and speeds up the computation. When set to 1, it means that only the concept weighting method selects which terms are the most important.
  2. Select only those which are formed from at least a given number of characters (for example 3).
By default, these filtering rules are not applied to terms appearing in metadata (but this is an option).
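The two filtering rules amount to a simple selection on the terms of a vector. The sketch below is a hypothetical illustration (`SelectTerms` is not a GALILEI function); it measures the length of a term in bytes, which only matches the number of characters for ASCII terms.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Keep only the terms appearing at least 'minOccurrences' times and formed
// from at least 'minLength' characters (counted in bytes in this sketch).
std::map<std::string, std::size_t> SelectTerms(const std::map<std::string, std::size_t>& occurrences,
                                               std::size_t minOccurrences,
                                               std::size_t minLength) {
    std::map<std::string, std::size_t> selected;
    for (const auto& entry : occurrences)
        if (entry.second >= minOccurrences && entry.first.size() >= minLength)
            selected.insert(entry);
    return selected;
}
```

Calling the function with a minimum number of occurrences of 1 and a minimum length of 1 keeps every term, which corresponds to leaving the whole selection to the weighting method.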

8 Implementation

The whole document analysis process described in the previous sections can be summarized in three task categories:
  1. Extract textual streams associated with semantic and division information from the document to analyze.
  2. Extract textual tokens from the textual streams.
  3. Analyze the textual tokens to decide which ones must be removed (for example stopwords), replaced (such as stems replacing words) or added (during a concept space transformation for example), i.e. choose the textual tokens that will become concepts describing the document.
In the GALILEI platform, these tasks are executed by plug-ins. The whole process is controlled by the GDocAnalyze class as shown in Figure 4. It maintains a list of tokens (represented by the GTextualToken class) and, for each token, the corresponding concept and its occurrences (the position in the file, the depth in the document structure and the vector (meta-concept) associated with a particular occurrence).
Figure 4 Document Analysis Implementation.
When a document must be analyzed, its URI is passed to the GDocAnalyze class. If the URI refers to an online document, it is downloaded to a temporary file. The document to analyze is therefore always available as a local file. The task categories involve the following classes:
  1. The GFilter class provides a generic plug-in that extracts textual streams from a local file. In practice, each plug-in provides a filter that can handle a set of MIME types (for example “text/email” or “application/msword”). The GDocAnalyze class searches for the filter corresponding to the MIME type of the document to analyze, and passes it the local file name (if no filter is found, the document cannot be analyzed). The GDocAnalyze class provides several methods that can be used by a filter, the most important ones being:
    ExtractDefaultText Extract some text content and associate it with the vector corresponding to the meta-concept “*” of the type “text block”.
    ExtractDefaultURI Extract some URI and associate it with the vector corresponding to the meta-concept “*” of the type “URI”.
    ExtractDMCI Extract some text content and associate it with a metadata item defined by the Dublin Core MetaData Initiative (DCMI) [1].
    ExtractText Extract some text content and associate it with a meta-concept to specify.
    AddConcept Associate a concept with a meta-concept to specify.
  2. The GTokenizer class provides a generic tokenizer plug-in that extracts tokens from textual streams. The GDocAnalyze class passes the text contents to the current tokenizer plug-in, which communicates the tokens extracted through the method GDocAnalyze::AddToken (if no tokenizer is selected, a document cannot be analyzed).
  3. The GAnalyzer class provides a generic plug-in that analyzes the text tokens. In practice, the GDocAnalyze class calls all the active analyzers in a specific order (for example to ensure that the plug-in dealing with stopwords and determining the document language is called before the one that does the stemming, which is language-dependent). At least one analyzer should associate a concept with the tokens through the GTextualToken::SetConcept method. This association can be done anytime during the analysis, but for efficiency it is recommended to do so as late as possible: if a stemming is done, thus replacing all the words with the corresponding stems, it isn’t useful to create concepts for these words since they will never be used. The concept can be obtained with the following simple code:
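    // Get the concept type representing the space of the terms (inserted if needed).
    GConceptType* TermsSpace(Session->GetInsertConceptType(ccCat,"Terms","Terms"));
    // Get the concept of that space corresponding to the token (inserted if needed).
    GConcept* Concept(TermsSpace->GetInsertConcept(Token->GetToken()));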
Once the analysis is done, the GDocAnalyze class builds the document tensor and updates the document index with all the tokens associated with a given concept, the corresponding weights in each vector being their number of occurrences.
Several Subversion repositories provide plug-ins for the different tasks:
filters The repository proposes GFilter plug-ins for different document format types (emails, MS-DOC, PDF, HTML, XML, etc.).
langs The repository offers a list of plug-ins for the different languages, each plug-in providing a stemming algorithm and a list of stopwords (English, French, Dutch, German, Italian, Arabic, etc.).
textanalyze The repository provides plug-ins that implement the different steps described above: the lexical analysis, the stopword treatment, the stemming and the concept filtering (which also associates tokens with concepts).


[1] Dublin Core Metadata Initiative, Dublin Core Metadata Element Set, Version 1.1: Reference Description, 1999.

[2] Norman Walsh & Leonard Muellner, Docbook 5.0: The Definitive Guide, O’Reilly, 2008.

[3] Christopher J. Fox, “Lexical Analysis and Stoplists”, In Information Retrieval: Data Structures & Algorithms, William B. Frakes & Ricardo Baeza-Yates (ed.), pp. 102–130, Prentice Hall, 1992.

[4] William McCastline Waite, “The Cost of Lexical Analysis”, Software: Practice and Experience, 16(5), pp. 473–488, 1986.

[5] Hans Peter Luhn, “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”, IBM Journal of Research and Development, 1(4), pp. 309–317, 1957.

[6] C. J. “Keith” van Rijsbergen, Information Retrieval, Butterworths, 1979.

[7] Gerard Salton & Michael McGill, Introduction to Modern Information Retrieval, McGraw-Hill Book Co., 1983.

[8] W. Nelson Francis & Henry Kucera, Frequency Analysis of English Usage, Houghton Mifflin, 1982.

[9] Donna Harman, “How Effective is Suffixing?”, Journal of the American Society for Information Science, 42(1), pp. 7–15, 1991.

[10] William B. Frakes, “Stemming Algorithms”, In Information Retrieval: Data Structures & Algorithms, William B. Frakes & Ricardo Baeza-Yates (ed.), pp. 131–160, Prentice Hall, 1992.

[11] Martin Porter, “An Algorithm for Suffix Stripping”, Program, 14(3), pp. 130–137, 1980.

[12] Marjorie Paternostre, Pascal Francq, Marco Saerens, Julien Lamoral & David Wartel, Carry, un algorithme de désuffixation pour le français, Scientific Report, Université libre de Bruxelles, 2002.