Object Storage

Pascal Francq

April 12, 2012

Abstract

The GALILEI platform manipulates multiple internal data related to objects: the tensors representing objects, some object indexes (lists of objects described by a given concept), and document trees. This articles describes how this data is stored.

Table of Contents

Section 1: Introduction

Section 2: Descriptions

Section 3: Indexes

Section 4: Document Trees

Section 5: Implementation

References

1 Introduction

Solving information science problems with computers rely on multiple machine-readable information representations. For example, a search engine on documents needs an index that provides a list of documents containing a term or a group of terms.

In the case of the GALILEI framework, these representations form the knowledge model. Actually, three types of representations are managed by the GALILEI platform:

Descriptions The basic object representation in the GALILEI framework is the tensor. It is used to represent documents, profiles and groups (communities and topics)

Indexes This representation type is necessary to search for all the objects (documents, profiles, communities or topics) that contain a given concept.

Trees This representation type is only available for documents. It stores the hierarchical organization of each document by tracking the information related to all the occurrences of each concept appearing.

In practice, all these representations are managed through key-value stores: the key is the identifier of an object or of a concept, and the value corresponds to an object description, to a list of objects containing a given concept, or the tree of a document.

2 Descriptions

In the tensor space model, each object (document, profile, community or topic) is described as a tensor. These tensors are sparse, i.e. they contain a lot of null elements. It is therefore only necessary to store the

vectors having non null elements. For these vectors, we must store the identifier of the corresponding meta-concept and the

non-null elements. These elements are simply the identifier of a concept and its weight. Figure 1↓ shows the structure of a description (stored as value if a key representing the identifier of the corresponding object).

Figure 1 An object description.

As indicated on Figure 1↑, all the data stored are of the type size_t. To each data correspond a single call to the Read and Write methods.

3 Indexes

The indexing problem consists of finding all the objects (document, profile, community or topic) that are described by a given concept (such as keyword). An example is the classical Web search engines: the net surfer enters a query and the system retrieves all the Web pages corresponding to the query (the object retrieval problem implies also a ranking algorithm). For the query «“deep and “purple”», the list of objects to retrieved is given\strikeout off\uuline off\uwave off by the intersection of the list of objects described by “deep” and the list of those described by “purple”.

So, for given object type,

, an index,

, for concept,

, is a list of all the objects described by it:

where

are the elements of the tensor describing object,

The simplest way to store such an index is as shown at Figure 2↓: for each concept, we store the number

of objects containing it and their identifiers. All the data stored are of the type size_t. Regarding documents, most indexes add more information, in particular the position of all occurrences of a concept in the documents (which is useful to treat whole expression such as “deep purple”). The choice was made for the GALILEI platform to not store this information in the document index but in the document trees (section 4↓).

Figure 2 An object index.

Several studies have shown that indexing multiterms may increase the performance of search engines on large datasets. In practice, a second index is build for the most used combinations of keywords [1]. Actually, this is not automatically managed by the GALILEI platform, but an engine plug-in could implemented it (by using for example the RIndexesFile class to manage an index for pairs of concepts).

4 Document Trees

Unlike other objects, documents are not just describable by a tensor. In fact, a document is above all a tree of concepts. A given concept can appear multiple times in a documents at different positions and at different depth (for example once in a chapter title, and twice in paragraphs). Storing the kind of information allows to search for an expression such as “deep purple” (we can identify if two words appear next with the positions) and retrieve only a portion of the document (the depths allow to get a logical part of a document). Therefore, for documents, a specific key-value store stores the document trees.

Unsurprisingly, the storage of tree is recursive as shown in Figure 3↓. First, we store the number of nodes, the number of concepts referenced in the tree and the number of top nodes of document tree (for XML documents, there should be only root node). Secondly, the nodes are stored recursively. For each node, we store its type, the identifier of the concept, its syntactic position, its binary position, its depth and the number of child nodes. Then, each child node is stored the same way.

Figure 3 A document tree.

Expected from the node type (which is an enum), the type of all other data is size_t.

5 Implementation

The object storage is managed by the GALILEI platform. Each time an object description is computed, it is saved in the corresponding key-value store (at least if the SaveResults flag is set). In the configuration of the GALILEI platform, one can specify if an index should be created for different object type and if the document trees must be stored as well. If activated, the GALILEI platform manages automatically the saving process.

The GObjects template class implements all the code: the LoadDesc and SaveDesc methods manages the object descriptions; the Loadndex, UpdateIndex and BuildIndex methods manages the object indexes (the latest one is used to recreate entirely an given index); and the LoadTree and SaveTree methods manages the document trees.

References

[1] Christian von der Weth & Anwitaman Datta, “Multiterm Keyword Search in NoSQL Systems”, IEEE Internet Computing, 16(1), pp. 34—42, 2012.