Falcon::TokenizerParams Class Reference

Parameters for the tokenizer.

#include <tokenizer.h>

Public Member Functions

TokenizerParams & bindSep (bool mode=true)

Appends separators to the token that precedes them.

TokenizerParams & groupSep (bool mode=true)

Makes the Tokenizer return only once for a run of identical separators.

bool isBindSep () const

bool isGroupSep () const

bool isReturnSep () const

bool isTrim () const

bool isWsToken () const

int32 maxToken () const

TokenizerParams & maxToken (int32 size)

Sets the maximum size of the returned tokens.

TokenizerParams & returnSep (bool mode=true)

Returns found separators as separate tokens.

TokenizerParams ()

TokenizerParams & trim (bool mode=true)

Trims whitespace from the returned tokens.

TokenizerParams & wsIsToken (bool mode=true)

Treats a run of whitespace of any length as a single token.

Detailed Description

Parameters for the tokenizer.

This class implements the variable parameter idiom for initializing the Tokenizer class. Pass an instance of this class directly to configure the target Tokenizer.

The setter methods in this class return a reference to the instance itself, so several behaviors and settings can be set in cascade within a single expression.
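The cascading style can be sketched as follows. `ParamsSketch` and `makeExampleParams` are hypothetical stand-ins written for illustration, not the actual Falcon class; the sketch mirrors only a few of the member names listed in this reference, and its default values (all flags off, no size limit) are assumptions.

```cpp
#include <cstdint>

// Hypothetical stand-in for illustration; not the real Falcon::TokenizerParams.
class ParamsSketch {
public:
    // Each setter returns *this so that calls can be chained.
    ParamsSketch& groupSep(bool mode = true) { m_groupSep = mode; return *this; }
    ParamsSketch& trim(bool mode = true) { m_trim = mode; return *this; }
    ParamsSketch& maxToken(std::int32_t size) { m_maxToken = size; return *this; }

    bool isGroupSep() const { return m_groupSep; }
    bool isTrim() const { return m_trim; }
    std::int32_t maxToken() const { return m_maxToken; }

private:
    // Assumed defaults: the real defaults are not stated in this reference.
    bool m_groupSep = false;
    bool m_trim = false;
    std::int32_t m_maxToken = 0;
};

// Builds a configuration in a single chained expression.
inline ParamsSketch makeExampleParams() {
    return ParamsSketch().groupSep().trim().maxToken(64);
}
```

With the real class, a chained expression of this kind would presumably be passed directly to the Tokenizer, whose constructor signature is not shown in this reference.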

Constructor & Destructor Documentation

Falcon::TokenizerParams::TokenizerParams ( ) [inline]

Member Function Documentation

TokenizerParams& Falcon::TokenizerParams::bindSep ( bool mode = true ) [inline]

Appends separators to the token that precedes them.

The separators are appended to the preceding token when that token is returned. If grouping is activated, more than one separator may be appended.
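As a rough illustration of this binding behavior, the sketch below splits on a single separator character and keeps each separator attached to the token before it. `splitBindingSeps` is a hypothetical helper, not part of Falcon, and it does not model grouping.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: illustrates the documented behavior, not Falcon's code.
std::vector<std::string> splitBindingSeps(const std::string& input, char sep) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : input) {
        cur.push_back(c);
        if (c == sep) {
            // The separator stays at the end of the token it terminates.
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) {
        out.push_back(cur);  // trailing token with no separator after it
    }
    return out;
}
```

Under these assumptions, splitting `a,b,c` on `,` yields `a,` then `b,` then `c`.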

TokenizerParams& Falcon::TokenizerParams::groupSep ( bool mode = true ) [inline]

Makes the Tokenizer return only once for a run of identical separators.

For example, if the separator list includes a space, the Tokenizer returns only once no matter how many consecutive spaces are encountered. If this option is not set, an empty string is returned as a token whenever two separators are found one after another.
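The difference between the two modes can be sketched with a small stand-alone splitter. `splitOn` is a hypothetical helper, not the Falcon Tokenizer, and it handles a single separator character only.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits on one separator character.
// When group is true, a run of adjacent separators counts as one.
std::vector<std::string> splitOn(const std::string& input, char sep, bool group) {
    std::vector<std::string> out;
    std::string cur;
    bool lastWasSep = false;
    for (char c : input) {
        if (c == sep) {
            // With grouping, a separator directly after another is skipped.
            if (!(group && lastWasSep)) {
                out.push_back(cur);
                cur.clear();
            }
            lastWasSep = true;
        } else {
            cur.push_back(c);
            lastWasSep = false;
        }
    }
    out.push_back(cur);
    return out;
}
```

In this sketch, the input `a  b` (two spaces) gives `a`, an empty string, and `b` without grouping, and just `a` and `b` with grouping.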

bool Falcon::TokenizerParams::isBindSep ( ) const [inline]

bool Falcon::TokenizerParams::isGroupSep ( ) const [inline]

bool Falcon::TokenizerParams::isReturnSep ( ) const [inline]

bool Falcon::TokenizerParams::isTrim ( ) const [inline]

bool Falcon::TokenizerParams::isWsToken ( ) const [inline]

int32 Falcon::TokenizerParams::maxToken ( ) const [inline]

TokenizerParams& Falcon::TokenizerParams::maxToken ( int32 size ) [inline]

Sets the maximum size of the returned tokens.

If the input read while searching for a token exceeds this size, an item is returned as if a separator had been found.
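A minimal sketch of such a size limit follows. `splitWithMax` is a hypothetical helper, not Falcon code. Two details are assumptions: a token is cut as soon as it reaches `maxLen` characters, and a `maxLen` of 0 means no limit. The real Tokenizer may place the cut differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits on sep and also cuts any token that
// reaches maxLen characters (assumption: maxLen == 0 means no limit).
std::vector<std::string> splitWithMax(const std::string& input, char sep,
                                      std::size_t maxLen) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : input) {
        if (c == sep) {
            out.push_back(cur);
            cur.clear();
            continue;
        }
        cur.push_back(c);
        if (maxLen > 0 && cur.size() >= maxLen) {
            // Emit the token as if a separator had been found here.
            out.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) {
        out.push_back(cur);
    }
    return out;
}
```

With a limit of 3, the input `abcdefgh,ij` is returned in this sketch as `abc`, `def`, `gh`, `ij`.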

TokenizerParams& Falcon::TokenizerParams::returnSep ( bool mode = true ) [inline]

Returns found separators as separate tokens.

This forces the tokenizer to return each separator in a separate call. For example, if "," is a separator:

         "a, b, c"

would be returned as "a" - "," - " b" - "," - " c".
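The example above can be reproduced with a short stand-alone sketch. `splitReturningSeps` is a hypothetical helper, not the Falcon Tokenizer; it collects into a vector the items that the Tokenizer would hand out one call at a time.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: emits each separator as an item of its own,
// between the tokens it delimits.
std::vector<std::string> splitReturningSeps(const std::string& input, char sep) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : input) {
        if (c == sep) {
            out.push_back(cur);                   // token before the separator
            out.push_back(std::string(1, sep));   // the separator itself
            cur.clear();
        } else {
            cur.push_back(c);
        }
    }
    out.push_back(cur);
    return out;
}
```

Note that the spaces after each comma stay in the tokens, as in the sequence shown above; removing them is the job of the trim option.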

TokenizerParams& Falcon::TokenizerParams::trim ( bool mode = true ) [inline]

Trims whitespace from the returned tokens.

Whitespace characters are tab, space, carriage return and line feed. If this option is activated, the returned tokens won't include whitespace found at the beginning or at the end of the token. For example, if the separator is ':' and trim is enabled, the following sequence:

         : a: b : :c

will be parsed as the sequence of tokens "a", "b", "", "c"; otherwise, it would be parsed as " a", " b ", " ", "c".
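The trimming step alone can be sketched as below, using the four whitespace characters named above. `trimWs` is a hypothetical helper, not part of Falcon; applied to each untrimmed token from the example, it produces the trimmed sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips tab, space, carriage return and line feed
// from both ends of a token.
std::string trimWs(const std::string& token) {
    const std::string ws = " \t\r\n";
    const std::size_t first = token.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";  // the token is empty or whitespace only
    }
    const std::size_t last = token.find_last_not_of(ws);
    return token.substr(first, last - first + 1);
}
```

A token consisting only of whitespace, such as the " " between the two adjacent separators, becomes an empty string rather than being dropped.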

TokenizerParams& Falcon::TokenizerParams::wsIsToken ( bool mode = true ) [inline]

Treats a run of whitespace of any length as a single token.

This separates the words found between whitespace and other tokens. For example, a text analyzer may use this mode to get words and punctuation with a single "next" call.

The documentation for this class was generated from the following file:

/home/gian/Progetti/falcon/core/include/falcon/tokenizer.h

