GALILEI Framework

Pascal Francq

April 9, 2012 (October 24, 2011)

Abstract

The GALILEI framework presents a coherent approach for information science. Based on a common knowledge model that describes objects (documents, profiles, etc.), it proposes an integrated approach for multiple tasks (document clustering, community detection, etc.). The framework is implemented in an open source research platform that is made available to the community. The platform is modular: the different components of the framework are implemented as plug-ins.

1 Introduction

It is commonplace to say that the success of the Internet has revolutionized access to human knowledge. Since the 1970s, several academic domains related to information science have emerged: information retrieval, (automatic) text categorization, collaborative filtering, data and text mining, community detection, etc. Despite the number of existing publications and the enormous scientific output, it is still difficult to take a global approach to information science. The fact that academic solutions are validated on diverse text corpora (treated and indexed differently) with different approaches and measures increases this difficulty.
This article presents the GALILEI framework, a coherent approach for information science, and an open source implementation of it, the GALILEI platform [A]. The framework provides a set of components related to information science, and defines a common validation system for these components.
[A] This work was supported by the HyperPRISME, GALILEI and STRATEGO projects funded by the Region wallonne.

2 The GALILEI Framework

The origin of the GALILEI framework is a research project started in 2000. The goal was to develop a system for automatic community building [1]. The main idea was to identify users’ interests and to group users accordingly. Concretely, the multiple interests of a user were modeled by profiles computed from the content of the documents consulted and from the user’s relevance feedback on these documents. These profiles were then grouped into communities of interests. The project showed that it was necessary to design a specific framework for this problem with the following elements:
  • a coherent set of notations and concepts;
  • the identification of processes and their decomposition into tasks and subtasks (for example the “building communities” task involves the subtasks: (i) extract features from documents, (ii) compute descriptions of user profiles based on these features, (iii) use these descriptions to build a similarity measure between profiles and (iv) detect similar profiles that form communities of interests);
  • a unique description model for different kinds of objects (documents, profiles and communities);
  • the management of multiple solutions for a given subtask;
  • a validation system to evaluate the performances of all proposed solutions and their combinations.
The initial framework has since been enhanced to solve other problems such as object clustering, semi-automatic thesaurus building and document fragment retrieval.
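The four subtasks of the “building communities” task listed above can be illustrated by a minimal C++ sketch. It is not the GALILEI API: the type Description and the functions extractFeatures, computeProfile, similarity and buildCommunities are hypothetical names, and the techniques used (raw term frequencies, summed document vectors, cosine similarity, threshold-based linking) are simple stand-ins chosen for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sparse description shared by documents and profiles:
// feature -> weight.
typedef std::map<std::string, double> Description;

// (i) Extract features: raw frequencies of whitespace-separated tokens.
Description extractFeatures(const std::string& text) {
    Description d;
    std::istringstream in(text);
    std::string token;
    while (in >> token) d[token] += 1.0;
    return d;
}

// (ii) Describe a profile as the sum of the descriptions of the documents
// that the user assessed as relevant.
Description computeProfile(const std::vector<Description>& relevantDocs) {
    Description p;
    for (const Description& doc : relevantDocs)
        for (const auto& fw : doc) p[fw.first] += fw.second;
    return p;
}

// (iii) Cosine similarity between two descriptions (0 if either is empty).
double similarity(const Description& a, const Description& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& fw : a) {
        na += fw.second * fw.second;
        Description::const_iterator it = b.find(fw.first);
        if (it != b.end()) dot += fw.second * it->second;
    }
    for (const auto& fw : b) nb += fw.second * fw.second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Union-find helper: representative of the group containing i.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];  // path halving
        i = parent[i];
    }
    return i;
}

// (iv) Detect communities: profiles whose similarity reaches the threshold
// are linked, and each connected group forms one community. Two profiles
// belong to the same community iff they receive the same label.
std::vector<std::size_t> buildCommunities(
        const std::vector<Description>& profiles, double threshold) {
    assert(threshold > 0.0 && threshold <= 1.0);
    std::vector<std::size_t> parent(profiles.size());
    for (std::size_t i = 0; i < parent.size(); ++i) parent[i] = i;
    for (std::size_t i = 0; i < profiles.size(); ++i)
        for (std::size_t j = i + 1; j < profiles.size(); ++j)
            if (similarity(profiles[i], profiles[j]) >= threshold) {
                std::size_t ri = findRoot(parent, i);
                std::size_t rj = findRoot(parent, j);
                parent[rj] = ri;
            }
    std::vector<std::size_t> labels(profiles.size());
    for (std::size_t i = 0; i < labels.size(); ++i)
        labels[i] = findRoot(parent, i);
    return labels;
}
```

Because documents and profiles share the same Description type in this sketch, the same similarity function serves both, which is the practical benefit of a unique description model.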
The global vision of the framework is shown in Figure 1. It defines four layers of components, where each layer represents an increasing abstraction level.
figs/galilei.svg
Figure 1 The GALILEI Framework.
From outside, the framework is considered as a “black box” that proposes a set of “high level” processes related to information science. Pull technologies (such as search engines) retrieve objects (documents or parts of documents, profiles, etc.) based on a query (such as a set of keywords for search engines) and present them to the user. One of the main issues is the ranking method used to place the most relevant retrieved objects first. Push technologies also execute retrieval tasks, but on an automatic basis (for example using a profile characterizing a user’s interest). The big challenge for such technologies is to “understand” the user’s interest by building a profile of it. As the amount of information and the number of documents grow, there is a crucial need for tools providing some knowledge organization (taxonomy building, document classification, document clustering, etc.). The “Web 2.0” phenomenon has demonstrated the importance of user collaboration to manage large amounts of information and to facilitate the dissemination of existing knowledge (in organizations or on the Internet). An important task is therefore to support these collaborations, for example by identifying users having similar interests, grouping them into communities of interests and sharing interesting documents between them.
Internally, these “high level” features involve solving a number of tasks (classes of problems). The task of knowledge extraction is central. Since most formalized knowledge is externalized through digital (mostly text-based) documents, the framework focuses on computing document descriptions. The problems related to this task include language detection, stemming algorithms, etc. Object comparison refers to the class of problems dealing with the evaluation of how similar different objects may be (for example by computing a similarity measure between two objects). Another important class is object profiling (for example user profiling), which is used by classification algorithms and push technologies. Object classification assigns a set of objects to categories (a given object may sometimes be assigned to several categories). Examples of such problems are the categorization of documents into topics or the grouping of users’ profiles into communities of interests. The classification can be unsupervised (objects are clustered into unknown clusters) or supervised (the categories are known and some training data exists). Object retrieval is one of the classes used by the pull and push technologies: the goal is to identify the objects relevant to a given query. Finally, confidence computing is the class of problems dedicated to the evaluation of the “quality” of a given object: a ranking (such as Google’s PageRank), a recommendation rating of some products, the concept of “reputation” that is currently emerging from the social software universe, etc.
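As an illustration of the object retrieval class, the following C++ sketch ranks documents against a keyword query. It is only a sketch under stated assumptions: the class InvertedIndex, the type ScoredDoc and the scoring formula (term frequency multiplied by log(1 + N/df)) are hypothetical choices made for this example and are not taken from the GALILEI platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// A retrieved object: (document id, relevance score).
typedef std::pair<std::size_t, double> ScoredDoc;

// Orders retrieved documents by decreasing score.
bool higherScore(const ScoredDoc& a, const ScoredDoc& b) {
    return a.second > b.second;
}

// Hypothetical inverted index: term -> (document id -> term frequency).
class InvertedIndex {
public:
    InvertedIndex() : nbDocs_(0) {}

    // Indexes a document; ids are assigned in insertion order from 0.
    std::size_t add(const std::string& text) {
        std::size_t id = nbDocs_++;
        std::istringstream in(text);
        std::string term;
        while (in >> term) postings_[term][id] += 1.0;
        return id;
    }

    // Returns the documents sharing at least one term with the query,
    // most relevant first.
    std::vector<ScoredDoc> retrieve(const std::string& query) const {
        std::map<std::size_t, double> scores;
        std::istringstream in(query);
        std::string term;
        while (in >> term) {
            Postings::const_iterator it = postings_.find(term);
            if (it == postings_.end()) continue;
            assert(!it->second.empty());  // an indexed term has a posting
            // Rare terms weigh more: idf = log(1 + N / df).
            double idf = std::log(1.0 + double(nbDocs_) /
                                            double(it->second.size()));
            for (const auto& docTf : it->second)
                scores[docTf.first] += docTf.second * idf;
        }
        std::vector<ScoredDoc> ranked(scores.begin(), scores.end());
        // stable_sort keeps equally scored documents in id order.
        std::stable_sort(ranked.begin(), ranked.end(), higherScore);
        return ranked;
    }

private:
    typedef std::map<std::string, std::map<std::size_t, double> > Postings;
    std::size_t nbDocs_;
    Postings postings_;
};
```

A push technology could reuse the same retrieve function by turning the heaviest features of a user profile into the query string, instead of keywords typed by the user.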
In practice, all these classes of problems presuppose a mathematical knowledge model. The major challenge is the complexity of such modeling. In fact, most of the time, simple models do not perform well, while very complex models are unable to manage large amounts of information (such as the millions of documents available in big organizations or the billions of Web pages). A well-balanced compromise between the complexity of the model and the performance of the system must be found. One aspect of this model is the adoption of a synthetic object representation.
Since the problem addressed by each component admits a multitude of solutions, the framework includes a validation system to compare the performance of several possible solutions for a given component. This system defines validation processes for the different components, as well as a fixed set of corpora. It ensures that all solutions are compared in exactly the same context (same processes and same corpora).
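The principle of such a validation system can be sketched in C++ as follows. The sketch assumes a clustering component, a toy corpus with one numeric feature per object, and the Rand index as quality measure. All names (Corpus, Labels, Solution, randIndex, validate) and the choice of measure are assumptions made for illustration, not a description of the validation processes actually defined by the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

typedef std::vector<double> Corpus;  // toy corpus: one feature per object
typedef std::vector<int> Labels;     // cluster label of each object

// A candidate solution for the clustering component.
typedef std::function<Labels(const Corpus&)> Solution;

// Rand index: fraction of object pairs on which two clusterings agree
// (both put the pair together, or both separate it).
double randIndex(const Labels& a, const Labels& b) {
    assert(a.size() == b.size() && a.size() >= 2);
    std::size_t agree = 0, pairs = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            ++pairs;
            if ((a[i] == a[j]) == (b[i] == b[j])) ++agree;
        }
    return double(agree) / double(pairs);
}

// Runs every solution on the same corpus and scores it against the same
// reference clustering, so that the scores are directly comparable.
std::map<std::string, double> validate(
        const std::map<std::string, Solution>& solutions,
        const Corpus& corpus, const Labels& reference) {
    std::map<std::string, double> scores;
    for (const auto& s : solutions)
        scores[s.first] = randIndex(s.second(corpus), reference);
    return scores;
}

// Two toy solutions to compare.
Labels splitAtZero(const Corpus& corpus) {
    Labels labels;
    for (double x : corpus) labels.push_back(x < 0.0 ? 0 : 1);
    return labels;
}

Labels allInOneCluster(const Corpus& corpus) {
    return Labels(corpus.size(), 0);
}
```

The point of the design is that validate, not the individual solutions, owns the corpus and the measure: a new solution only has to match the Solution signature to be compared with all existing ones.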
The advantages of the proposed framework are:
  1. The use of a common set of notations, concepts and definitions for all components.
  2. A faster development of new components by reusing existing components.
  3. Different solutions for a given component (for example document clustering) can be compared more “objectively”.
  4. Each component benefits from improvements to the others (for example a better similarity measure between documents directly increases the quality of the corresponding clustering).

3 The GALILEI Platform

The GALILEI platform is, as its name suggests, an open source C++ implementation of the GALILEI framework presented above. It runs on UNIX-like systems. At the time of writing, the platform contains more than 150,000 lines of C++ source code (representing approximately 40 person-years of development). Of course, there are several other open source projects related to information science, such as Lucene [2] or the Lemur project [3], but they are mainly focused on indexing and retrieving documents. However, given the quality of these projects, future releases of the GALILEI platform could build on them.
figs/galilei-platform.svg
Figure 2 Architecture of the GALILEI platform.
Figure 2 illustrates the architecture of the GALILEI platform. The central element is the GALILEI API (implemented as a GNU LGPL shared library), which provides a set of C++ classes representing the objects (documents, profiles, etc.) and the tasks and subtasks managing these objects (compute a profile, retrieve an ordered list of relevant documents for a query, etc.). As shown, the GALILEI API is based on the R Library (a GNU LGPL shared library providing general-purpose C++ classes) and other basic open source libraries. The different components of the GALILEI framework are developed as plug-ins, and the GALILEI API ensures their integration and interoperability. The GALILEI API also allows real-life applications (end-user applications, server-side applications, etc.) to implement the “high level” processes of the framework. The GALILEI platform therefore provides not only a research platform, but also a production environment that facilitates the transfer of academic results. QGALILEI, shown in Figure 3, is a Qt-based graphical application that lets researchers monitor the knowledge base (such as the descriptions of the documents) and launch different tasks (for example analyzing documents or computing users’ profiles).
figs/qgalilei.png
Figure 3 The QGALILEI Application.
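The plug-in approach described above, where several interchangeable solutions implement one component, can be sketched in C++ as an abstract interface plus a registry of factories. This is a generic illustration and not the actual GALILEI API: the names Tokenizer, WhitespaceTokenizer, LowercaseTokenizer, PluginRegistry and makeDefaultRegistry are all hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical component interface: every plug-in for this component
// must implement it.
class Tokenizer {
public:
    virtual ~Tokenizer() {}
    virtual std::vector<std::string> tokenize(const std::string& text) const = 0;
};

// First plug-in: splits on whitespace.
class WhitespaceTokenizer : public Tokenizer {
public:
    std::vector<std::string> tokenize(const std::string& text) const {
        std::vector<std::string> tokens;
        std::istringstream in(text);
        std::string token;
        while (in >> token) tokens.push_back(token);
        return tokens;
    }
};

// Second plug-in: splits on whitespace and lowercases each token.
class LowercaseTokenizer : public WhitespaceTokenizer {
public:
    std::vector<std::string> tokenize(const std::string& text) const {
        std::vector<std::string> tokens = WhitespaceTokenizer::tokenize(text);
        for (std::string& token : tokens)
            for (char& c : token)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return tokens;
    }
};

// Registry kept by the host library: maps a plug-in name to a factory.
template <class Interface>
class PluginRegistry {
public:
    typedef std::function<std::unique_ptr<Interface>()> Factory;

    void add(const std::string& name, Factory factory) {
        assert(static_cast<bool>(factory));  // reject empty factories
        factories_[name] = factory;
    }

    bool has(const std::string& name) const {
        return factories_.count(name) != 0;
    }

    // Creates the named plug-in, or returns a null pointer if unknown.
    std::unique_ptr<Interface> create(const std::string& name) const {
        typename std::map<std::string, Factory>::const_iterator it =
            factories_.find(name);
        if (it == factories_.end()) return std::unique_ptr<Interface>();
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};

// Registers the two example plug-ins.
PluginRegistry<Tokenizer> makeDefaultRegistry() {
    PluginRegistry<Tokenizer> registry;
    registry.add("whitespace",
                 []() { return std::unique_ptr<Tokenizer>(new WhitespaceTokenizer); });
    registry.add("lowercase",
                 []() { return std::unique_ptr<Tokenizer>(new LowercaseTokenizer); });
    return registry;
}
```

An application selects a solution by name, for example registry.create("lowercase"), and only depends on the Tokenizer interface. In a real platform the factories would typically be registered when a shared library is loaded, a step this sketch leaves out.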
One of the main design choices is that the knowledge model and the validation system are part of the library, i.e. all the components use the same model and the same validation scenarios. While this choice seems limiting at first glance, it ensures that different components can be integrated easily. This is important in a research environment where different research teams work on distinct components. The drawback of this choice is that, if a more complex knowledge model is adopted, the kernel of the platform must be changed (which can then have an impact on the plug-ins). That said, the knowledge model is not expected to change often, and possible modifications would be limited to major releases only. The library also manages the storage of objects.
Of course, the current status of the platform is far from complete, but it can already be used for development and research purposes. Moreover, the reader should remember that, for each component, different plug-ins can be developed, tested and compared.

4 Future Work

The GALILEI framework provides a global approach for information science that is implemented in the open source GALILEI platform. It is ongoing work, and much remains to be done. Here are some of the main challenges for the framework and/or the platform:
  1. Since several of the proposed solutions rely on parameters that must be estimated, a double cross-validation should be included in the framework.
  2. The platform is not parallelized, although several subtasks can be performed in parallel (document analysis, profile computing, etc.). The challenge here is to adopt an approach where the parallelization is managed by the platform and not by the individual plug-ins.
  3. The object storage is currently managed by the platform. This has two drawbacks: a crash of the platform produces an inconsistency in the data (in particular the object descriptions and indexes), and the access to the data is not concurrent. A solution is to develop a separate process that provides key-value stores to the platform.
  4. Most current solutions are enhancements of algorithms published in the past. A global and systematic testing is needed to automate the performance evaluations of all the implemented solutions.
  5. An integration of computational linguistics methods (for example the Unitex toolkit [4]) to extract more complex tokens (such as groups of words) from documents.
  6. The exploitation of an external ontology [5] to select high conceptual features from documents.
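The double cross-validation mentioned in the first challenge is read here as nested cross-validation: an outer loop holds out a test fold to estimate performance, and an inner loop splits the remaining objects again to estimate the parameters. The C++ sketch below only generates the index splits. The names Split, NestedSplit, kFold and doubleCrossValidation are hypothetical, and the deterministic fold assignment (object i goes to fold i mod k) is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the objects used for training and for testing.
struct Split {
    std::vector<std::size_t> train;
    std::vector<std::size_t> test;
};

// Partitions 'items' into k folds (the item at position i goes to fold
// i % k) and returns, for each fold, the split that uses it as test set.
std::vector<Split> kFold(const std::vector<std::size_t>& items, std::size_t k) {
    assert(k >= 2 && k <= items.size());
    std::vector<Split> splits(k);
    for (std::size_t i = 0; i < items.size(); ++i)
        for (std::size_t f = 0; f < k; ++f) {
            if (i % k == f) splits[f].test.push_back(items[i]);
            else splits[f].train.push_back(items[i]);
        }
    return splits;
}

// One outer fold together with the inner folds built on its training part.
struct NestedSplit {
    Split outer;               // outer.test is used only for the final score
    std::vector<Split> inner;  // used to choose the parameter values
};

// Builds the splits of a double (nested) cross-validation over the objects
// 0 .. nbObjects-1. The inner folds never contain an outer test object.
std::vector<NestedSplit> doubleCrossValidation(std::size_t nbObjects,
                                               std::size_t outerK,
                                               std::size_t innerK) {
    std::vector<std::size_t> all(nbObjects);
    for (std::size_t i = 0; i < nbObjects; ++i) all[i] = i;
    std::vector<Split> outer = kFold(all, outerK);
    std::vector<NestedSplit> result;
    for (std::size_t f = 0; f < outer.size(); ++f) {
        NestedSplit nested;
        nested.outer = outer[f];
        nested.inner = kFold(outer[f].train, innerK);
        result.push_back(nested);
    }
    return result;
}
```

For each outer fold, a solution would be run on every inner split for each candidate parameter value, the best value retrained on outer.train, and the result scored once on outer.test. A practical version would also shuffle the objects before assigning folds.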

References

[1] Pascal Francq & Alain Delchambre, “Using documents assessments to build communities of interests”, In Proceedings of the 2005 Symposium on Applications and the Internet (SAINT’05), pp. 327–333, 2005.

[2] Mike McCandless, Erik Hatcher & Otis Gospodnetić, Lucene in Action, 2nd edition, Manning Publications, 2010.

[3] Thi Truong Avrahami, Lawrence Yau, Luo Si & James Callan, “The FedLemur Project: Federated Search in the Real World”, Journal of the American Society for Information Science, 57(3), pp. 347–358, 2006.

[4] Sébastien Paumier, De la reconnaissance de formes linguistiques à l’analyse syntaxique, PhD thesis, Université de Marne-la-Vallée, 2003.

[5] Lucian Vlad Lita, Warren A. Hunt & Eric Nyberg, “Resource analysis for question answering”, In Proceedings of the 2004 Association for Computational Linguistics Conference (ACL), pp. 162–165, 2004.