Text Tokenizer.

#include <gtokenizer.h>

Inheritance diagram for GTokenizer: GTokenizer inherits from GPlugIn.

Public Member Functions

 GTokenizer (GSession *session, GPlugInFactory *fac)
 
void AddChar (const R::RChar &car)
 
R::RString Extract (size_t begin, size_t end)
 
size_t GetPos (void) const
 
virtual void Start (void)
 
virtual bool TreatChar (GDocAnalyze *analyzer, const R::RChar &car)=0
 
Public Member Functions inherited from GPlugIn
 GPlugIn (GSession *session, GPlugInFactory *fac)
 
virtual void ApplyConfig (void)
 
void InsertParam (R::RParam *param)
 
template<class T >
T * FindParam (const R::RString &name)
 
R::RCursor< R::RParam > GetParams (const R::RString &cat=R::RString::Null)
 
void GetCategories (R::RContainer< R::RString, true, false > &cats)
 
virtual void Init (void)
 
virtual void CreateConfig (void)
 
virtual void Reset (void)
 
GPlugInFactory * GetFactory (void) const
 
int Compare (const GPlugIn &plugin) const
 
int Compare (const R::RString &plugin) const
 
R::RString GetName (void) const
 
R::RString GetDesc (void) const
 
GSession * GetSession (void) const
 
virtual void Done (void)
 
virtual ~GPlugIn (void)
 

Private Attributes

R::RString Buffer
 
size_t CurPos
 

Additional Inherited Members

Protected Attributes inherited from GPlugIn
GPlugInFactory * Factory
 
GSession * Session
 
size_t Id
 

Detailed Description

Text Tokenizer.

The GTokenizer class provides methods that break a sequence of characters into tokens.

It provides a framework for a finite-state machine with memory.

It is used during the analysis of a document to determine how to extract the basic elements (words, abbreviations, e-mails, etc.).

See the documentation related to GPlugIn for more general information.
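
To illustrate what "a finite-state machine with memory" can mean for tokenizing, here is a minimal stand-alone sketch. It does not use the library: `std::string` and `char` stand in for `R::RString` and `R::RChar`, and the names `JoinerMachine`, `Feed`, `Finish`, and `Tokenize` are hypothetical. The rule it applies (a single '.' or '@' belongs to a token only if an alphanumeric character follows) is an assumption chosen for the example, not the behaviour of any shipped tokenizer.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch, independent of the library's classes.
enum class State { Outside, InToken, AfterJoiner };

class JoinerMachine {
    State state = State::Outside;
    std::string memory;               // characters accepted for the current token
    char pending = '\0';              // joiner seen but not yet accepted
    std::vector<std::string> tokens;  // tokens emitted so far

    static bool IsJoiner(char c) { return c == '.' || c == '@'; }

    // Store the remembered characters as a token and return to the idle state.
    void Emit() {
        tokens.push_back(memory);
        memory.clear();
        state = State::Outside;
    }

public:
    // Treat one character and update the state of the machine.
    void Feed(char c) {
        const bool alnum = std::isalnum(static_cast<unsigned char>(c)) != 0;
        switch (state) {
        case State::Outside:
            if (alnum) { memory += c; state = State::InToken; }
            break;
        case State::InToken:
            if (alnum) memory += c;
            else if (IsJoiner(c)) { pending = c; state = State::AfterJoiner; }
            else Emit();
            break;
        case State::AfterJoiner:
            // The joiner is kept only if the token continues after it.
            if (alnum) { memory += pending; memory += c; state = State::InToken; }
            else Emit();
            break;
        }
    }

    // Flush a token that is still open at the end of the input.
    void Finish() { if (state != State::Outside) Emit(); }

    const std::vector<std::string>& Tokens() const { return tokens; }
};

// Run the machine over a whole string and return the emitted tokens.
std::vector<std::string> Tokenize(const std::string& text) {
    JoinerMachine m;
    for (char c : text) m.Feed(c);
    m.Finish();
    return m.Tokens();
}
```

The memory is what lets the machine postpone a decision: the '.' in "site.org" is kept, while the final '.' of a sentence is dropped, and the two cases can only be told apart once the next character is seen.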

Constructor & Destructor Documentation

GTokenizer ( GSession * session,
GPlugInFactory * fac 
)

Construct the tokenizer.

Parameters
session  Session.
fac  Factory.

Member Function Documentation

void AddChar ( const R::RChar & car )

Add a character to the memory.

Parameters
car  Character to save.

R::RString Extract ( size_t  begin,
size_t  end 
)

Extract a string from the memory.

Parameters
begin  Beginning position.
end  Ending position. If it is cNoRef, the end is the last character of the memory. Otherwise, the extraction stops at the ending position.
Returns
the extracted string.

size_t GetPos ( void  ) const

Get the position currently being treated.

Returns
the current position.

virtual void Start ( void  )
virtual

Method called each time the tokenizer starts analyzing some text. Every overriding method in a derived class must call it.
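
The requirement to call the base method from an override can be sketched as follows. This is a stand-alone illustration with hypothetical stand-in classes (`SketchTokenizer`, `QuoteAwareTokenizer`), not the library's code. In particular, the idea that the base `Start` clears the memory and resets the position is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the base class (not the library's GTokenizer).
class SketchTokenizer {
    std::string buffer;       // plays the role of the private memory
    std::size_t curPos = 0;   // plays the role of the current position

public:
    virtual ~SketchTokenizer() = default;

    // Assumption: starting a new text clears the memory and the position.
    virtual void Start() {
        buffer.clear();
        curPos = 0;
    }

    void AddChar(char c) { buffer += c; }
    std::size_t MemorySize() const { return buffer.size(); }
};

// Hypothetical derived tokenizer with a state of its own.
class QuoteAwareTokenizer : public SketchTokenizer {
    bool insideQuotes = false;

public:
    void Start() override {
        SketchTokenizer::Start();  // reset the inherited state first
        insideQuotes = false;      // then reset the state added here
    }

    // Remember the character and track whether a quotation is open.
    void Feed(char c) {
        if (c == '"') insideQuotes = !insideQuotes;
        AddChar(c);
    }

    bool InsideQuotes() const { return insideQuotes; }
};
```

If the override skipped the call to the base method, the inherited memory in this sketch would still hold characters from the previous text when the next one begins.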

virtual bool TreatChar ( GDocAnalyze * analyzer,
const R::RChar & car 
)
pure virtual

This method is called each time the analyzer treats a character.

The method should call the AddToken method of the analyzer to add valid tokens.

Parameters
analyzer  Analyzer.
car  Character treated.
Returns
true if the character starts a token.
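
The contract described above (add tokens through the analyzer, return true when the character starts a token) might look like this in a concrete implementation. The sketch is self-contained and uses hypothetical stand-ins: `SketchAnalyzer` replaces `GDocAnalyze`, `char` replaces `R::RChar`, and the signature of `AddToken` is assumed, since only its name appears in this page. The helper `CountTokenStarts` is likewise invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the analyzer; the AddToken signature is assumed.
struct SketchAnalyzer {
    std::vector<std::string> tokens;
    void AddToken(const std::string& token) { tokens.push_back(token); }
};

// Hypothetical tokenizer that treats runs of letters as tokens.
class SketchWordTokenizer {
    std::string memory;  // letters of the token being read

public:
    void Start() { memory.clear(); }

    // Returns true if the character starts a token.
    bool TreatChar(SketchAnalyzer* analyzer, char c) {
        assert(analyzer != nullptr);
        if (std::isalpha(static_cast<unsigned char>(c))) {
            const bool startsToken = memory.empty();
            memory += c;
            return startsToken;
        }
        // A non-letter closes the token in progress, if any.
        if (!memory.empty()) {
            analyzer->AddToken(memory);
            memory.clear();
        }
        return false;
    }
};

// Hypothetical driver: feeds a text and counts the characters that start a token.
std::size_t CountTokenStarts(const std::string& text, SketchAnalyzer& analyzer) {
    SketchWordTokenizer tokenizer;
    tokenizer.Start();
    std::size_t starts = 0;
    for (char c : text)
        if (tokenizer.TreatChar(&analyzer, c)) ++starts;
    tokenizer.TreatChar(&analyzer, ' ');  // trailing separator closes the last token
    return starts;
}
```

In this sketch the return value and the call to `AddToken` are decoupled: true is reported on the first letter of a word, while the token itself is only added once the word has ended.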

Member Data Documentation

R::RString Buffer
private

Memory.

size_t CurPos
private

Current position.