#include <gdocanalyze.h>

[Figure: inheritance diagram for GDocAnalyze]

Public Member Functions

 GDocAnalyze (GSession *session)
 
GDoc * GetDoc (void) const
 
void * GetData (void) const
 
void SetData (void *data)
 
GSession * GetSession (void) const
 
const GDescription & GetDescription (void) const
 
size_t SkipToken (void)
 
GVector * GetCurrentVector (void) const
 
void SetCurrentVector (GVector *vector)
 
GTokenOccur * AddToken (const R::RString &token, tTokenType type=ttUnknown, double weight=0.0)
 
GTokenOccur * AddToken (const R::RString &token, GConcept *metaconcept, tTokenType type=ttUnknown, double weight=0.0)
 
GTokenOccur * AddToken (const R::RString &token, tTokenType type, GConcept *concept, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
GTokenOccur * AddDefaultNamedEntityToken (const R::RString &token, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractText (const R::RString &text, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractText (const R::RString &text, tTokenType type, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractDCMI (const R::RString &element, const R::RString &value, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractDefaultText (const R::RString &content, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractDefaultText (const R::RString &content, tTokenType type, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void ExtractDefaultURI (const R::RString &uri, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)
 
void AssignPlugIns (void)
 
R::RCursor< GToken > GetTokens (void) const
 
R::RCursor< GTokenOccur > GetOccurs (void) const
 
void DeleteToken (GToken *token)
 
void ReplaceToken (GToken *token, R::RString value)
 
void MoveToken (GTokenOccur *occur, R::RString value)
 
void MoveToken (GTokenOccur *occur, GConcept *concept)
 
GLang * GetLang (void) const
 
void SetLang (GLang *lang)
 
void Analyze (GDoc *doc, bool force, bool download)
 
virtual ~GDocAnalyze (void)
 

Private Member Functions

GToken * CreateToken (const R::RString &token, tTokenType type)
 
void BuildTensor (void)
 
void BuildRecords (GTokenOccur *occur)
 
void Print (GTokenOccur *occur)
 
Private Member Functions inherited from RDownloadFile
 RDownloadFile (void)
 
void Download (const RURI &uri, const R::RURI &local)
 
Private Member Functions inherited from RDownload
 RDownload (void)
 
void Download (const RURI &uri)
 
RString GetMIMEType (void)
 
virtual ~RDownload (void)
 

Private Attributes

GDoc * Doc
 
void * Data
 
GSession * Session
 
GDescription Description
 
R::RContainer< GConceptRecord, false, true > Records
 
size_t NbRecords
 
GLang * Lang
 
GTokenizer * Tokenizer
 
R::RCastCursor< GPlugIn, GAnalyzer > Analyzers
 
R::RContainer< GToken, true, false > MemoryTokens
 
size_t NbMemoryTokensUsed
 
R::RContainer< GTokenOccur, true, false > MemoryOccurs
 
size_t NbMemoryOccursUsed
 
R::RHashContainer< GToken, false > OrderTokens
 
R::RContainer< GToken, false, false > Tokens
 
R::RContainer< GTokenOccur, false, false > Occurs
 
R::RContainer< GTokenOccur, false, false > Top
 
R::RStack< GTokenOccur, false, true, true > Depths
 
GVector * CurVector
 
size_t CurPos
 
size_t CurDepth
 
bool DepthError
 
size_t CurSyntacticPos
 
tTokenType CurTokenType
 
double CurTokenWeight
 
R::RContainer< R::RNumContainer< size_t, false >, true, false > SyntacticPos
 
size_t NbTopRecords
 
size_t NbRefs
 

Detailed Description

The GDocAnalyze class analyzes a given document by coordinating the following steps:

  1. It determines the filter corresponding to the type of the document to analyze.
  2. It uses the current tokenizer to extract the tokens from the text provided by the filter (child classes of GFilter).
  3. The tokens are then passed to the analyzers, in the order specified in the configuration, for processing (stemming, filtering, etc.).
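
The tokenizing and analyzing steps above can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: TokenList, Analyzer, Tokenize, RunPipeline, ToLower and DropShort are hypothetical stand-ins and not part of the documented API, and the filter selection of step 1 is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

using TokenList = std::vector<std::string>;

// Hypothetical analyzer: any callable that rewrites the token list in place.
using Analyzer = std::function<void(TokenList&)>;

// Step 2 stand-in: split the text delivered by the filter on whitespace.
TokenList Tokenize(const std::string& text)
{
    TokenList tokens;
    std::istringstream in(text);
    for (std::string word; in >> word;)
        tokens.push_back(word);
    return tokens;
}

// Step 3 stand-in: run every analyzer in the configured order.
TokenList RunPipeline(const std::string& text, const std::vector<Analyzer>& analyzers)
{
    TokenList tokens = Tokenize(text);
    for (const Analyzer& analyzer : analyzers)
        analyzer(tokens);
    return tokens;
}

// Example analyzer: lowercase every token (ASCII only).
void ToLower(TokenList& tokens)
{
    for (std::string& token : tokens)
        for (char& c : token)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Example analyzer: drop tokens shorter than three characters.
void DropShort(TokenList& tokens)
{
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const std::string& t) { return t.size() < 3; }),
                 tokens.end());
}
```

Because the analyzers run in sequence, their order matters: each one sees the token list as left by the previous one.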

In practice, it manages the tokens that the filter extracts from the document, together with their occurrences (position, depth and syntactic position). Once the analysis steps are finished, it builds a vector and a concept tree from the tokens that have an associated concept.

It is assumed that each token of a given type in a document has a unique name and corresponds to exactly one concept. It is the responsibility of the filter to ensure this.

Constructor & Destructor Documentation

GDocAnalyze ( GSession * session )

Constructor of the document analysis method.

Parameters
session    Session.

virtual ~GDocAnalyze ( void  )
virtual

Destructs the document analyzer.

Member Function Documentation

GDoc* GetDoc ( void  ) const

Get the document currently analyzed.

Returns
a pointer to the document.

void* GetData ( void  ) const

Get the data assigned to the analyzer.

It is the responsibility of the caller of this function to cast the pointer correctly.

Returns
a raw pointer.

void SetData ( void *  data )

Assign some data to the analyzer.

Parameters
data    Raw pointer to the data.
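
A minimal sketch of the SetData/GetData round trip, assuming the analyzer only stores the raw pointer and never owns or inspects it. AnalyzerStub, PlugInState and GetState are hypothetical names used for illustration; only the void* contract comes from the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in exposing only the SetData/GetData pair described above.
class AnalyzerStub
{
    void* Data = nullptr;

public:
    void SetData(void* data) { Data = data; }
    void* GetData() const { return Data; }
};

// Hypothetical payload that a plug-in might want to keep during an analysis.
struct PlugInState
{
    std::string Name;
    int NbCalls = 0;
};

// The caller must cast back to exactly the type it stored; nothing checks this.
inline PlugInState* GetState(const AnalyzerStub& analyzer)
{
    return static_cast<PlugInState*>(analyzer.GetData());
}
```

Since the pointer is untyped, casting it to any type other than the one passed to SetData is undefined behaviour, which is why the responsibility lies with the caller.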

GSession* GetSession ( void  ) const

Returns
the session.

const GDescription& GetDescription ( void  ) const

Returns
the description that was just computed.

size_t SkipToken ( void  )

Inform the document analysis process that a potential token is skipped. In practice, it increments the current syntactic position.

Typically, it is called by the current tokenizer to indicate that an existing character sequence is not considered a valid token.

Returns
the syntactic position skipped.
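
The effect of a skipped token on the syntactic position can be sketched with a simple counter. This is an assumption-laden illustration: PositionCounter is hypothetical, and returning the position before the increment is one reading of "the syntactic position skipped", not something the description confirms.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counter mirroring the role of the current syntactic position.
struct PositionCounter
{
    std::size_t CurSyntacticPos = 0;

    // Assumed behaviour: hand out the current position, then advance by one.
    std::size_t AddToken() { return CurSyntacticPos++; }

    // A skipped sequence consumes a position too, so later tokens keep
    // the position they have in the original text.
    std::size_t SkipToken() { return CurSyntacticPos++; }
};
```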

GToken* CreateToken ( const R::RString & token,
tTokenType  type 
)
private

Create a token with a given name and a given type. In practice, if a free token exists, it is reused.

Parameters
token    Token.
type    Type.
Returns
a pointer to the created token.
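
Reusing a free token suggests a pool that grows on demand and is rewound between documents. The sketch below shows that idea only; TokenPool, its members and the simplified Token struct are hypothetical, and the actual roles of MemoryTokens and NbMemoryTokensUsed are assumed rather than documented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Simplified, hypothetical token: a name and an integer type code.
struct Token
{
    std::string Name;
    int Type = 0;
};

class TokenPool
{
    std::vector<std::unique_ptr<Token>> Memory;  // every token ever allocated
    std::size_t NbUsed = 0;                      // tokens handed out so far

public:
    // Reuse a free token when one exists, otherwise allocate a new one.
    Token* Create(const std::string& name, int type)
    {
        if (NbUsed == Memory.size())
            Memory.push_back(std::unique_ptr<Token>(new Token()));
        Token* token = Memory[NbUsed++].get();
        token->Name = name;
        token->Type = type;
        return token;
    }

    // Mark every token as free again without releasing the memory.
    void Reset() { NbUsed = 0; }

    std::size_t Allocated() const { return Memory.size(); }
};
```

With this layout, analyzing many documents in a row allocates only as many tokens as the largest document needs.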

GVector* GetCurrentVector ( void  ) const

Get the current vector.

Returns
a pointer to the current vector.

void SetCurrentVector ( GVector * vector )

Set the current vector.

Be careful with this method.

Parameters
vector    Pointer to the vector.

GTokenOccur* AddToken ( const R::RString & token,
tTokenType  type = ttUnknown,
double  weight = 0.0 
)

Add a token to the current vector.

The current syntactic position is incremented by one.

Warning
This method should only be called by child classes of GTokenizer.
Parameters
token    Token to add.
type    Token type. If ttUnknown, the current token type is used.
weight    Weight associated with the concept. If zero, the current weight is used.
Returns
the occurrence of the token added.
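
The fallback rules for the default arguments can be sketched as follows. TokenKind, Defaults, Resolved and Resolve are hypothetical names; the enumerators other than Unknown are invented for the example, and reading a "null" weight as 0.0 is an assumption based on the default argument value.

```cpp
#include <cassert>

// Hypothetical token kinds; Unknown plays the role of ttUnknown.
enum class TokenKind { Unknown, Text, Semantic };

// Stand-in for the current token type and weight held by the analyzer.
struct Defaults
{
    TokenKind CurTokenType;
    double CurTokenWeight;
};

struct Resolved
{
    TokenKind Type;
    double Weight;
};

// Replace "unspecified" arguments with the analyzer's current values.
Resolved Resolve(const Defaults& cur, TokenKind type = TokenKind::Unknown, double weight = 0.0)
{
    Resolved result;
    result.Type = (type == TokenKind::Unknown) ? cur.CurTokenType : type;
    result.Weight = (weight == 0.0) ? cur.CurTokenWeight : weight;
    return result;
}
```

One consequence of this convention is that a caller cannot explicitly request a weight of zero, since zero is the sentinel for "use the current weight".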

GTokenOccur* AddToken ( const R::RString & token,
GConcept * metaconcept,
tTokenType  type = ttUnknown,
double  weight = 0.0 
)

Add a token to a given vector.

The current syntactic position is incremented by one.

Warning
This method should only be called by child classes of GTokenizer.
Parameters
token    Token to add.
metaconcept    Meta-concept of the vector associated with the concept.
type    Token type. If ttUnknown, the current token type is used.
weight    Weight associated with the concept. If zero, the current weight is used.
Returns
the occurrence of the token added.

GTokenOccur* AddToken ( const R::RString & token,
tTokenType  type,
GConcept * concept,
double  weight,
GConcept * metaconcept,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Add a token of a given type that represents a concept. It is added to the vector associated with a given meta-concept.

The current syntactic position is incremented by one.

Parameters
token    Token to add. The name must be unique for a given document whatever its type.
type    Token type.
concept    Concept to add.
weight    Weight associated with the concept.
metaconcept    Meta-concept of the vector associated with the concept.
pos    Position of the concept.
depth    Depth of the concept.
spos    Syntactic position. If SIZE_MAX, the token is assumed to immediately follow the previous one.
Returns
the occurrence of the token added.

GTokenOccur* AddDefaultNamedEntityToken ( const R::RString & token,
double  weight,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Create a named-entity token and add it to the vector whose meta-concept corresponds to named entities. The method verifies that each part starts with an uppercase character and that the parts are separated by exactly one space.

Parameters
token    Token to add. The name must be unique for a given document whatever its type.
weight    Weight associated with the concept.
pos    Position of the concept.
depth    Depth of the concept.
spos    Syntactic position. If SIZE_MAX, the token is assumed to immediately follow the previous one.
Returns
the occurrence of the token added.
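
The named-entity check described above (each part capitalized, parts separated by exactly one space) might look like the sketch below. IsNamedEntity is a hypothetical helper working on ASCII std::string; the library uses R::RString, and how it treats empty input or non-ASCII characters is not stated, so those choices are assumptions.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Return true if every space-separated part starts with an uppercase
// letter and parts are separated by exactly one space (ASCII only).
bool IsNamedEntity(const std::string& token)
{
    if (token.empty())
        return false;  // assumption: an empty name is rejected
    bool startOfPart = true;
    for (char ch : token)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c == ' ')
        {
            if (startOfPart)
                return false;  // leading space or two spaces in a row
            startOfPart = true;
        }
        else
        {
            if (startOfPart && !std::isupper(c))
                return false;  // part does not start in uppercase
            startOfPart = false;
        }
    }
    return !startOfPart;  // reject a trailing space
}
```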

void ExtractText ( const R::RString & text,
GConcept * metaconcept,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract some tokens from a given text, and add them to the vector associated with a given meta-concept. If the vector already exists, the content is added to it.

The current syntactic position is incremented by the number of tokens extracted.

Parameters
text    Text to add.
metaconcept    Meta-concept of the vector associated with the text.
pos    Position of the text.
depth    Depth of the text.
spos    Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.
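
The "one vector per meta-concept, content is added" behaviour can be sketched with nested maps. Vector, Description and ExtractTextSketch are hypothetical; meta-concepts are reduced to strings, tokenizing to whitespace splitting, and summing weights for repeated tokens is an assumption about what "added" means.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

using Vector = std::map<std::string, double>;       // token -> weight
using Description = std::map<std::string, Vector>;  // meta-concept -> vector

// Add the tokens of `text` to the vector of `metaconcept`, creating the
// vector on first use. Returns the number of tokens extracted, which is
// also how far the syntactic position would advance.
std::size_t ExtractTextSketch(Description& description, const std::string& metaconcept,
                              const std::string& text, double weight = 1.0)
{
    Vector& vector = description[metaconcept];
    std::istringstream in(text);
    std::size_t nbTokens = 0;
    for (std::string word; in >> word; ++nbTokens)
        vector[word] += weight;
    return nbTokens;
}
```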

void ExtractText ( const R::RString & text,
tTokenType  type,
double  weight,
GConcept * metaconcept,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract some tokens from a given text, and add them to the vector associated with a given meta-concept. If the vector already exists, the content is added to it.

The current syntactic position is incremented by the number of tokens extracted.

Parameters
text    Text to add.
type    Token type.
weight    Weight associated with the concept.
metaconcept    Meta-concept of the vector associated with the text.
pos    Position of the text.
depth    Depth of the text.
spos    Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.

void ExtractDCMI ( const R::RString & element,
const R::RString & value,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract some tokens from a given text, and add them to the vector associated with a given metadata element defined by the Dublin Core. In practice, each metadata element corresponds to one vector; multiple values for the same element are simply added to it.

The only allowed elements are: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, type.

The current syntactic position is incremented by the number of tokens extracted.

Parameters
element    Element of the DCMI (without namespace and/or prefix).
value    Value of the metadata.
pos    Position of the metadata.
depth    Depth of the metadata.
spos    Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.
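
A caller that parses metadata may want to check an element name against the allowed list before calling ExtractDCMI. IsDCMIElement below is a hypothetical helper, not part of the class; the case-sensitive comparison is an assumption, and the names are expected without namespace or prefix, as the parameter description requires.

```cpp
#include <cassert>
#include <set>
#include <string>

// Return true if `element` is one of the fifteen allowed Dublin Core
// element names listed above (assumed to be matched case-sensitively).
bool IsDCMIElement(const std::string& element)
{
    static const std::set<std::string> allowed = {
        "contributor", "coverage", "creator", "date", "description",
        "format", "identifier", "language", "publisher", "relation",
        "rights", "source", "subject", "title", "type"};
    return allowed.count(element) != 0;
}
```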

void ExtractDefaultText ( const R::RString & content,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract some tokens from a given text, and add them to a '*' (neutral) meta-concept of the type 'text block'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept. The tokens are assumed to be text.

The current syntactic position is incremented by the number of tokens extracted.

Parameters
content    Content.
pos    Position of the content.
depth    Depth of the content.
spos    Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.

void ExtractDefaultText ( const R::RString & content,
tTokenType  type,
double  weight,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract some tokens from a given text, and add them to a '*' (neutral) meta-concept of the type 'text block'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept. The tokens are assumed to be text.

The current syntactic position is incremented by the number of tokens extracted.

Parameters
content    Content.
type    Token type.
weight    Weight associated with the concept.
pos    Position of the content.
depth    Depth of the content.
spos    Syntactic position. If SIZE_MAX, the first token is assumed to immediately follow the previous one.

void ExtractDefaultURI ( const R::RString & uri,
size_t  pos,
size_t  depth = 0,
size_t  spos = SIZE_MAX 
)

Extract a token that represents a URI, and add it to the '*' (neutral) meta-concept of the type 'URI'. Each time the method is called, the content is added to the vector corresponding to the '*' meta-concept.

The current syntactic position is incremented by one.

Parameters
uri    URI.
pos    Position of the URI.
depth    Depth of the URI.
spos    Syntactic position. If SIZE_MAX, the URI is assumed to immediately follow the previous token.

void AssignPlugIns ( void  )

Assign the plug-ins. An exception is generated if no plug-ins are defined.

R::RCursor<GToken> GetTokens ( void  ) const

Get a cursor over the tokens extracted. The order of the container reflects the order of the first occurrence of each token.

Returns
a cursor.

R::RCursor<GTokenOccur> GetOccurs ( void  ) const

Get a cursor over the occurrences of the different tokens extracted, in the order in which they appear in the document.

Returns
a cursor.
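
The difference between the two orderings (distinct tokens by first occurrence, occurrences in document order) is easy to see on a short word sequence. Extraction and Collect are hypothetical stand-ins built on standard containers, not the cursors returned by GetTokens and GetOccurs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Extraction
{
    // Distinct tokens, in the order of their first occurrence.
    std::vector<std::string> Tokens;
    // Every occurrence as (token, position), in document order.
    std::vector<std::pair<std::string, std::size_t>> Occurs;
};

Extraction Collect(const std::vector<std::string>& words)
{
    Extraction result;
    for (std::size_t pos = 0; pos < words.size(); ++pos)
    {
        const std::string& word = words[pos];
        // Linear search keeps the sketch short; a real table would use a hash.
        if (std::find(result.Tokens.begin(), result.Tokens.end(), word) == result.Tokens.end())
            result.Tokens.push_back(word);
        result.Occurs.push_back(std::make_pair(word, pos));
    }
    return result;
}
```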

void DeleteToken ( GToken * token )

Delete a given token. In practice, it sets its type to ttDeleted.

Warning
This method may modify the cursor over the tokens.
Parameters
token    Token to delete.

void ReplaceToken ( GToken * token,
R::RString  value 
)

Replace a given token by a given value (for example, a word by its stem). If the new value corresponds to an existing token, the occurrences are merged and the type of the current token is set to ttDeleted.

Warning
This method may modify the cursor over the tokens.
Parameters
token    Token to replace.
value    New value.
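
The rename-or-merge behaviour of ReplaceToken can be sketched with a small token table. Everything here is hypothetical (Token, TokenTable, the Deleted flag standing in for the ttDeleted type, positions as plain integers); keeping the merged occurrences sorted by position is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Token
{
    std::string Name;
    bool Deleted = false;             // stands in for the ttDeleted type
    std::vector<std::size_t> Occurs;  // positions of the occurrences
};

class TokenTable
{
    std::vector<Token> Tokens;                 // order of first occurrence
    std::map<std::string, std::size_t> Index;  // live name -> index in Tokens

public:
    // Record an occurrence and return the index of its token.
    std::size_t Add(const std::string& name, std::size_t pos)
    {
        std::map<std::string, std::size_t>::iterator it = Index.find(name);
        if (it == Index.end())
        {
            Token token;
            token.Name = name;
            Tokens.push_back(token);
            it = Index.insert(std::make_pair(name, Tokens.size() - 1)).first;
        }
        Tokens[it->second].Occurs.push_back(pos);
        return it->second;
    }

    // Rename a token, or merge it into the token already named `value`.
    void Replace(std::size_t id, const std::string& value)
    {
        Token& source = Tokens[id];
        if (source.Deleted || source.Name == value)
            return;
        std::map<std::string, std::size_t>::iterator it = Index.find(value);
        Index.erase(source.Name);
        if (it == Index.end())
        {
            source.Name = value;  // no clash: a plain rename
            Index[value] = id;
            return;
        }
        Token& target = Tokens[it->second];
        target.Occurs.insert(target.Occurs.end(), source.Occurs.begin(), source.Occurs.end());
        std::sort(target.Occurs.begin(), target.Occurs.end());
        source.Occurs.clear();
        source.Deleted = true;  // the merged-away token is only flagged
    }

    const Token& Get(std::size_t id) const { return Tokens[id]; }
};
```

Flagging instead of erasing keeps the indices of the other tokens stable, which is one plausible reason for the ttDeleted convention.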

void MoveToken ( GTokenOccur * occur,
R::RString  value 
)

Move a token occurrence from its current token to the token given by a value. If the new value corresponds to an existing token, the occurrence is added to it. If the current token has no more occurrences, its type is set to ttDeleted.

Warning
This method may modify the cursor over the tokens.
Parameters
occur    Token occurrence to change.
value    New value.
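
Moving a single occurrence differs from replacing a whole token: the source token is flagged only once its last occurrence has left. The sketch below illustrates that rule with hypothetical types (Token, Occur, TokenMap, AddOccur, MoveOccur); creating the target token when the value is new is an assumption, since the description only covers the case of an existing token.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct Token
{
    std::string Name;
    bool Deleted = false;      // stands in for the ttDeleted type
    std::size_t NbOccurs = 0;  // number of occurrences still attached
};

struct Occur
{
    Token* Tok;       // token this occurrence belongs to
    std::size_t Pos;  // position in the document
};

// std::map never relocates its elements, so Token pointers stay valid.
using TokenMap = std::map<std::string, Token>;

Occur AddOccur(TokenMap& tokens, const std::string& name, std::size_t pos)
{
    Token& token = tokens[name];
    token.Name = name;
    token.Deleted = false;
    ++token.NbOccurs;
    return Occur{&token, pos};
}

// Attach `occur` to the token named `value`, creating it if needed.
void MoveOccur(TokenMap& tokens, Occur& occur, const std::string& value)
{
    Token* source = occur.Tok;
    Token& target = tokens[value];
    if (&target == source)
        return;
    target.Name = value;
    target.Deleted = false;
    ++target.NbOccurs;
    --source->NbOccurs;
    if (source->NbOccurs == 0)
        source->Deleted = true;  // last occurrence left: flag the token
    occur.Tok = &target;
}
```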

void MoveToken ( GTokenOccur * occur,
GConcept * concept 
)

Move a token occurrence from its current token to another existing concept. If necessary, a new token is created. If the current token has no more occurrences, its type is set to ttDeleted.

Warning
This method may modify the cursor over the tokens.
Parameters
occur    Token occurrence to change.
concept    Concept.

GLang* GetLang ( void  ) const

Get the language currently determined.

void SetLang ( GLang * lang )

Set the language for the document currently analyzed.

Parameters
lang

void BuildTensor ( void  )
private

Create the descriptions.

void BuildRecords ( GTokenOccur * occur )
private

Build the records starting with a given token occurrence.

Parameters
parent    Parent node.
occur    Token occurrence.

void Print ( GTokenOccur * occur )
private

Print the concept tree starting with a given token occurrence.

Parameters
occur    Token occurrence.

void Analyze ( GDoc * doc,
bool  force,
bool  download 
)

Analyze a document.

Parameters
doc    Pointer to the document to analyze.
force    Force the analysis of the document?
download    Try to download the document locally?

Member Data Documentation

GDoc* Doc
private

Current document analysed.

void* Data
private

Some data that can be assigned to the analyser.

GSession* Session
private

Corresponding session.

GDescription Description
private

Description to build during the analysis.

R::RContainer<GConceptRecord,false,true> Records
private

Records to build during the analysis.

size_t NbRecords
private

Number of records actually used to represent the document.

GLang* Lang
private

Language associated to the document.

GTokenizer* Tokenizer
private

The tokenizer.

R::RCastCursor<GPlugIn,GAnalyzer> Analyzers
private

The analyzers.

R::RContainer<GToken,true,false> MemoryTokens
private

Memory of tokens.

size_t NbMemoryTokensUsed
private

Number of tokens from the memory used.

R::RContainer<GTokenOccur,true,false> MemoryOccurs
private

Memory of occurrences.

size_t NbMemoryOccursUsed
private

Number of occurrences from the memory used.

R::RHashContainer<GToken,false> OrderTokens
private

Ordered list of the tokens currently added.

R::RContainer<GToken,false,false> Tokens
private

List of tokens currently added.

R::RContainer<GTokenOccur,false,false> Occurs
private

The occurrences of the tokens.

R::RContainer<GTokenOccur,false,false> Top
private

Top occurrences.

R::RStack<GTokenOccur,false,true,true> Depths
private

A stack representing the "active" tokens at each depth.

GVector* CurVector
private

Vector for which new concepts should be added.

size_t CurPos
private

Current position.

size_t CurDepth
private

Current depth.

bool DepthError
private

Is there a depth error in the current situation?

size_t CurSyntacticPos
private

Current syntactic position.

tTokenType CurTokenType
private

Current token type.

double CurTokenWeight
private

Current token weight.

R::RContainer<R::RNumContainer<size_t,false>,true,false> SyntacticPos
private

size_t NbTopRecords
private

Number of top records.

size_t NbRefs
private

Number of valid concepts referenced.