GALILEI Research Project: GDocAnalyze Class Reference

#include <gdocanalyze.h>

Public Member Functions
	GDocAnalyze (GSession *session)

GDoc *	GetDoc (void) const

void *	GetData (void) const

void	SetData (void *data)

GSession *	GetSession (void) const

const GDescription &	GetDescription (void) const

size_t	SkipToken (void)

GVector *	GetCurrentVector (void) const

void	SetCurrentVector (GVector *vector)

GTokenOccur *	AddToken (const R::RString &token, tTokenType type=ttUnknown, double weight=0.0)

GTokenOccur *	AddToken (const R::RString &token, GConcept *metaconcept, tTokenType type=ttUnknown, double weight=0.0)

GTokenOccur *	AddToken (const R::RString &token, tTokenType type, GConcept *concept, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

GTokenOccur *	AddDefaultNamedEntityToken (const R::RString &token, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractText (const R::RString &text, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractText (const R::RString &text, tTokenType type, double weight, GConcept *metaconcept, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractDCMI (const R::RString &element, const R::RString &value, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractDefaultText (const R::RString &content, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractDefaultText (const R::RString &content, tTokenType type, double weight, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	ExtractDefaultURI (const R::RString &uri, size_t pos, size_t depth=0, size_t spos=SIZE_MAX)

void	AssignPlugIns (void)

R::RCursor< GToken >	GetTokens (void) const

R::RCursor< GTokenOccur >	GetOccurs (void) const

void	DeleteToken (GToken *token)

void	ReplaceToken (GToken *token, R::RString value)

void	MoveToken (GTokenOccur *occur, R::RString value)

void	MoveToken (GTokenOccur *occur, GConcept *concept)

GLang *	GetLang (void) const

void	SetLang (GLang *lang)

void	Analyze (GDoc *doc, bool force, bool download)

virtual	~GDocAnalyze (void)

Private Member Functions
GToken *	CreateToken (const R::RString &token, tTokenType type)

void	BuildTensor (void)

void	BuildRecords (GTokenOccur *occur)

void	Print (GTokenOccur *occur)

Private Member Functions inherited from RDownloadFile
	RDownloadFile (void)

void	Download (const RURI &uri, const R::RURI &local)

Private Member Functions inherited from RDownload
	RDownload (void)

void	Download (const RURI &uri)

RString	GetMIMEType (void)

virtual	~RDownload (void)

Private Attributes
GDoc *	Doc

void *	Data

GSession *	Session

GDescription	Description

R::RContainer< GConceptRecord, false, true >	Records

size_t	NbRecords

GLang *	Lang

GTokenizer *	Tokenizer

R::RCastCursor< GPlugIn, GAnalyzer >	Analyzers

R::RContainer< GToken, true, false >	MemoryTokens

size_t	NbMemoryTokensUsed

R::RContainer< GTokenOccur, true, false >	MemoryOccurs

size_t	NbMemoryOccursUsed

R::RHashContainer< GToken, false >	OrderTokens

R::RContainer< GToken, false, false >	Tokens

R::RContainer< GTokenOccur, false, false >	Occurs

R::RContainer< GTokenOccur, false, false >	Top

R::RStack< GTokenOccur, false, true, true >	Depths

GVector *	CurVector

size_t	CurPos

size_t	CurDepth

bool	DepthError

size_t	CurSyntacticPos

tTokenType	CurTokenType

double	CurTokenWeight

R::RContainer < R::RNumContainer< size_t, false >, true, false >	SyntacticPos

size_t	NbTopRecords

size_t	NbRefs

Detailed Description

The GDocAnalyze class analyzes a given document by coordinating the following steps:

It determines the filter corresponding to the type of the document to analyze.
It uses the current tokenizer to extract the tokens from the text provided by the filter (child classes of GFilter).
The tokens are then passed to the analyzers in the order specified in the configuration to be treated (stemming, filtering, etc.).
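
The tokenizer and analyzer steps above can be pictured with a small self-contained sketch. It does not use the GALILEI classes: Tokens, Analyzer, Tokenize, RunPipeline, ToLower and DropStopWords are hypothetical names introduced only for illustration, and the whitespace tokenizer and two toy analyzers are assumptions, not the library's behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names come from GALILEI.
using Tokens = std::vector<std::string>;
using Analyzer = std::function<void(Tokens&)>;

// Tokenizer step: a trivial whitespace tokenizer.
Tokens Tokenize(const std::string& text)
{
    Tokens tokens;
    std::istringstream in(text);
    for (std::string word; in >> word;)
        tokens.push_back(word);
    return tokens;
}

// Analyzer step: every analyzer sees the tokens in the configured order.
Tokens RunPipeline(const std::string& text, const std::vector<Analyzer>& analyzers)
{
    Tokens tokens = Tokenize(text);
    for (const Analyzer& analyzer : analyzers) {
        assert(analyzer && "each configured analyzer must be callable");
        analyzer(tokens);
    }
    return tokens;
}

// Toy analyzer: lowercase every token.
void ToLower(Tokens& tokens)
{
    for (std::string& t : tokens)
        std::transform(t.begin(), t.end(), t.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
}

// Toy analyzer: remove a tiny, hard-coded stop-word list.
void DropStopWords(Tokens& tokens)
{
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const std::string& t) { return t == "the" || t == "a"; }),
                 tokens.end());
}
```

Calling RunPipeline("The quick Fox", {ToLower, DropStopWords}) leaves the tokens "quick" and "fox". Swapping the two analyzers keeps "the", because the stop-word list is compared before lowercasing, which is why the configured order matters.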

In practice, it manages the tokens extracted from the documents by the filter and their occurrences (position, depth and syntactic position). Once the analysis steps are finished, it builds a vector and a concept tree using the tokens with which a concept is associated.

It is assumed that each token of a given type in a document has a unique name and corresponds to one concept only. It is the responsibility of the filter to ensure this.

Constructor & Destructor Documentation

GDocAnalyze ( GSession * session )

Constructor of the document analysis method.

Parameters

session Session.

virtual ~GDocAnalyze ( void )

virtual

Destructor of the document analyzer.

Member Function Documentation

GDoc* GetDoc ( void ) const

Get the document currently analyzed.

Returns: pointer to the document.

void* GetData ( void ) const

Get the data assigned to the analyzer.

It is the responsibility of the caller of this function to cast the pointer correctly.

Returns: a raw pointer.

void SetData ( void * data )

Assign some data to the analyzer.

Parameters

data	Raw pointer to the data.
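
The SetData()/GetData() pair stores an untyped pointer, so the caller has to cast it back to the type that was stored. A minimal sketch of that round trip follows. DataHolder, ParserState and GetParserState are hypothetical names that only mirror the two accessors; they are not GALILEI classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the analyzer's Data slot; not the GALILEI class.
class DataHolder
{
    void* Data = nullptr;
public:
    void SetData(void* data) { Data = data; }
    void* GetData() const { return Data; }
};

// Hypothetical payload that a filter might attach during an analysis.
struct ParserState
{
    std::string CurrentTag;
    int NbParagraphs = 0;
};

// The caller must cast back to exactly the type that was stored.
ParserState* GetParserState(const DataHolder& holder)
{
    void* raw = holder.GetData();
    assert(raw != nullptr && "SetData() must be called first");
    return static_cast<ParserState*>(raw);
}
```

The holder never owns the pointed-to object in this sketch, so the payload must outlive every call to GetParserState().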

GSession* GetSession ( void ) const

Returns: the session.

const GDescription& GetDescription ( void ) const

Returns: the description that was just computed.

size_t SkipToken ( void )

Inform the document analysis process that a potential token is skipped. In practice, it increments the current syntactic position.

Typically, it is called by the current tokenizer to indicate that an existing character sequence is not considered as a valid token.

Returns: the syntactic position skipped.
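
The effect of skipping on syntactic positions can be sketched as follows. This is an illustration under assumptions, not the library's code: PositionTracker and KeptPositions are hypothetical names, the rule "a sequence without any letter is skipped" is invented for the example, and returning the position before incrementing is one reading of "the syntactic position skipped".

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the analyzer's syntactic-position bookkeeping.
class PositionTracker
{
    std::size_t CurSyntacticPos = 0;
public:
    // Accept a token: it takes the current position, then the counter advances.
    std::size_t AddToken() { return CurSyntacticPos++; }
    // Reject a character sequence: its position is consumed but nothing is recorded.
    std::size_t SkipToken() { return CurSyntacticPos++; }
};

// Toy tokenizer rule (an assumption): sequences without any letter are skipped.
std::vector<std::size_t> KeptPositions(const std::vector<std::string>& sequences)
{
    PositionTracker tracker;
    std::vector<std::size_t> kept;
    for (const std::string& s : sequences) {
        bool hasLetter = false;
        for (unsigned char c : s)
            if (std::isalpha(c))
                hasLetter = true;
        if (hasLetter)
            kept.push_back(tracker.AddToken());
        else
            tracker.SkipToken();
    }
    assert(kept.size() <= sequences.size());
    return kept;
}
```

For the sequences "Hello", "--", "world", "42", "again", the kept positions are 0, 2 and 4: the skipped sequences leave gaps, so the distance between two kept tokens still reflects the original text.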

GToken* CreateToken ( const R::RString & token, tTokenType type )

private

Create a token with a given name and a given type. In practice, if a free token exists, it is used.

Parameters

token	Token.
type	Type.

Returns: a pointer to the created token.
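
Reusing a free token instead of allocating a new one is a classic object-pool pattern. The sketch below shows one way it could work. TokenPool, Token and TokenType are hypothetical names, and the "used counter over a growing container" layout is only suggested by the MemoryTokens and NbMemoryTokensUsed attributes; the actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for tTokenType and GToken.
enum class TokenType { Unknown, Term, NamedEntity };

struct Token
{
    std::string Name;
    TokenType Type = TokenType::Unknown;
};

class TokenPool
{
    std::vector<std::unique_ptr<Token>> Memory;  // every token ever allocated
    std::size_t NbUsed = 0;                      // tokens handed out for the current document
public:
    Token* Create(const std::string& name, TokenType type)
    {
        assert(NbUsed <= Memory.size());
        if (NbUsed == Memory.size())
            Memory.push_back(std::unique_ptr<Token>(new Token));  // no free token: allocate one
        Token* token = Memory[NbUsed++].get();                    // otherwise reuse a free one
        token->Name = name;
        token->Type = type;
        return token;
    }

    // Start of a new document: every allocated token becomes free again.
    void Reset() { NbUsed = 0; }

    std::size_t Allocated() const { return Memory.size(); }
};
```

After Reset(), the next Create() call returns the same address as the first token of the previous document, and Allocated() does not grow until more tokens are needed than any earlier document required.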

GVector* GetCurrentVector ( void ) const

Get the current vector.

Returns: a pointer to the current vector.

void SetCurrentVector ( GVector * vector )

Set the current vector.

Be careful with this method.

Parameters

vector Pointer to the vector.

GTokenOccur* AddToken ( const R::RString & token, tTokenType type = ttUnknown, double weight = 0.0 )

Add a token to the current vector.

The current syntactic position is incremented by one.

Warning: This method should only be called by child classes of GTokenizer.

Parameters

token	Token to add.
type	Token type. If ttUnknown, the current token type is used.
weight	Weight associated with the concept. If 0.0, the current weight is used.

Returns: the occurrence of the token added.
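
The two default arguments act as sentinels: ttUnknown and a weight of 0.0 both mean "use the current value". A minimal sketch of that resolution follows. CurrentState, Resolved and ResolveDefaults are hypothetical names, and the TokenType enumeration is a stand-in for tTokenType, not its real definition.

```cpp
#include <cassert>

// Hypothetical stand-ins for tTokenType and the analyzer's current state.
enum class TokenType { Unknown, Term, Number };

struct CurrentState
{
    TokenType CurTokenType;
    double CurTokenWeight;
};

struct Resolved
{
    TokenType Type;
    double Weight;
};

Resolved ResolveDefaults(const CurrentState& cur,
                         TokenType type = TokenType::Unknown,
                         double weight = 0.0)
{
    // This sketch assumes the current type has itself been set to a real type.
    assert(cur.CurTokenType != TokenType::Unknown);
    Resolved r;
    r.Type = (type == TokenType::Unknown) ? cur.CurTokenType : type;
    r.Weight = (weight == 0.0) ? cur.CurTokenWeight : weight;
    return r;
}
```

One consequence of this design is that a caller cannot request an explicit weight of 0.0 through these overloads, since that value is reserved as the "use the current weight" marker.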

GTokenOccur* AddToken ( const R::RString & token, GConcept * metaconcept, tTokenType type = ttUnknown, double weight = 0.0 )

Add a token to a given vector.

The current syntactic position is incremented by one.

Warning: This method should only be called by child classes of GTokenizer.

Parameters

token	Token to add.
metaconcept	Meta-concept of the vector associated to the concept.
type	Token type. If ttUnknown, the current token type is used.
weight	Weight associated with the concept. If 0.0, the current weight is used.

Returns: the occurrence of the token added.

GTokenOccur* AddToken ( const R::RString & token, tTokenType type, GConcept * concept, double weight, GConcept * metaconcept, size_t pos, size_t depth = 0, size_t spos = SIZE_MAX )

Add a token of a given type and representing a concept. It is added to a vector associated with a given meta-concept.

The current syntactic position is incremented by one.

Parameters

token	Token to add. The name must be unique for a given document whatever its type.
type	Token type.
concept	Concept to add.
weight	Weight associated with the concept.
metaconcept	Meta-concept of the vector associated to the concept.
pos	Position of the concept.
depth	Depth of the concept.
spos	Syntactic position. If SIZE_MAX, the token is assumed to immediately follow the previous one.

Returns: the occurrence of the token added.

GTokenOccur* AddDefaultNamedEntityToken ( const R::RString & token, double weight, size_t pos, size_t depth = 0, size_t spos = SIZE_MAX )

Add a named-entity token to the vector whose meta-concept corresponds to named entities. The method verifies that each part starts with an uppercase character and that the parts are separated by exactly one space.

Parameters

token	Token to add. The name must be unique for a given document whatever its type.
weight	Weight associated with the concept.
pos	Position of the concept.
depth	Depth of the concept.
spos	Syntactic position. If SIZE_MAX, the token is assumed to immediately follow the previous one.

Returns: the occurrence of the token added.
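
The verification described for named entities (each part starts with an uppercase character, and parts are separated by a single space) can be sketched as a standalone check. LooksLikeNamedEntity and CheckExamples are hypothetical names; the sketch handles ASCII only through std::isupper and says nothing about what the real method does when the check fails.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: true if every space-separated part starts with an
// uppercase ASCII letter and parts are separated by exactly one space.
bool LooksLikeNamedEntity(const std::string& name)
{
    if (name.empty())
        return false;
    bool startOfPart = true;  // the next character begins a new part
    for (unsigned char c : name) {
        if (c == ' ') {
            if (startOfPart)
                return false;  // leading space or two spaces in a row
            startOfPart = true;
        } else {
            if (startOfPart && !std::isupper(c))
                return false;  // a part does not start with an uppercase letter
            startOfPart = false;
        }
    }
    return !startOfPart;  // a trailing space leaves an empty last part
}

// Usage: expectations that hold for this sketch.
void CheckExamples()
{
    assert(LooksLikeNamedEntity("New York"));
    assert(!LooksLikeNamedEntity("New  York"));  // two spaces between parts
    assert(!LooksLikeNamedEntity("new York"));   // first part starts in lowercase
}
```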

void ExtractText ( const R::RString & text, GConcept * metaconcept, size_t pos, size_t depth = 0, size_t spos = SIZE_MAX )

Extract tokens from a given text and add them to the vector associated with a given meta-concept. If the vector already exists, the tokens are added to it.