#include <tokenizer.h>
Public Member Functions | |
TokenizerParams & | bindSep (bool mode=true) |
Appends separators to the preceding non-separator element. | |
TokenizerParams & | groupSep (bool mode=true) |
Makes the Tokenizer return only once for a run of identical separators. | |
bool | isBindSep () const |
bool | isGroupSep () const |
bool | isReturnSep () const |
bool | isTrim () const |
bool | isWsToken () const |
int32 | maxToken () const |
TokenizerParams & | maxToken (int32 size) |
Sets the maximum size of the returned tokens. | |
TokenizerParams & | returnSep (bool mode=true) |
Returns found separators as separate items. | |
TokenizerParams () | |
TokenizerParams & | trim (bool mode=true) |
Trims whitespace from the returned tokens. | |
TokenizerParams & | wsIsToken (bool mode=true) |
Treats a run of whitespace of any length as a single token. |
This class implements the variable-parameter idiom for initializing the Tokenizer class. Pass an instance of this class directly to configure the target Tokenizer.
The setter methods in this class return a reference to the object itself, so several behaviors and settings can be configured in a single chained expression.
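The chaining described above can be illustrated with a small stand-in class. `ParamsSketch`, its private members and its default values are assumptions made for this sketch; only the method names are taken from this page, and the code is not the Falcon implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for illustration only. It mirrors a few method
// names listed on this page but is NOT the Falcon::TokenizerParams source.
class ParamsSketch {
public:
    ParamsSketch() = default;

    // Each setter stores its value and returns *this so calls can be chained.
    ParamsSketch& groupSep(bool mode = true) { m_groupSep = mode; return *this; }
    ParamsSketch& trim(bool mode = true)     { m_trim = mode; return *this; }
    ParamsSketch& maxToken(std::int32_t size) { m_maxToken = size; return *this; }

    // Getters report the current configuration.
    bool isGroupSep() const { return m_groupSep; }
    bool isTrim() const { return m_trim; }
    std::int32_t maxToken() const { return m_maxToken; }

private:
    bool m_groupSep = false;      // assumed default
    bool m_trim = false;          // assumed default
    std::int32_t m_maxToken = 0;  // assumed default
};

// Usage: build a temporary and configure several options in one expression.
inline void usageExample() {
    ParamsSketch p = ParamsSketch().groupSep().trim().maxToken(64);
    assert(p.isGroupSep() && p.isTrim() && p.maxToken() == 64);
}
```

Because every setter returns a reference to the same object, a temporary can be configured inline and handed straight to whatever consumes the parameters.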
Falcon::TokenizerParams::TokenizerParams () [inline]
TokenizerParams& Falcon::TokenizerParams::bindSep (bool mode = true) [inline]
Appends separators to the preceding non-separator element.
When a token is returned, the separators that follow it are appended to it. If grouping is activated, more than one separator may be returned with the token.
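One possible reading of this behavior, without grouping, is sketched below. `splitBindSketch` is a hypothetical helper written for this page, not a Falcon function, and the handling of a trailing token is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of Falcon: splits `text` on `sep` and keeps
// each separator attached to the end of the token that precedes it.
std::vector<std::string> splitBindSketch(const std::string& text, char sep) {
    std::vector<std::string> out;
    std::string current;
    for (char c : text) {
        current += c;             // the separator itself is kept
        if (c == sep) {           // a separator closes the current token
            out.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {       // trailing token without a separator
        out.push_back(current);
    }
    return out;
}
```

With this sketch, `splitBindSketch("a,b,c", ',')` yields `"a,"`, `"b,"` and `"c"`.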
TokenizerParams& Falcon::TokenizerParams::groupSep (bool mode = true) [inline]
Makes the Tokenizer return only once for a run of identical separators.
For example, if the separator list includes a space, only one token is returned no matter how many consecutive spaces are encountered. If this option is not set, an empty string is returned as a token whenever two separators are found one after another.
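The difference between grouped and ungrouped separators can be sketched with a small splitter. `splitSketch` and its parameters are hypothetical names introduced for illustration; the code only imitates the behavior described above and is not the Falcon Tokenizer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Falcon: splits `text` on any character
// found in `seps`. With group == true, a run of identical separators counts
// as one separator; with group == false, adjacent separators produce empty
// tokens between them.
std::vector<std::string> splitSketch(const std::string& text,
                                     const std::string& seps,
                                     bool group) {
    assert(!seps.empty());
    std::vector<std::string> out;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (seps.find(c) == std::string::npos) {
            current += c;  // ordinary character: extend the current token
            continue;
        }
        out.push_back(current);  // separator found: emit the token so far
        current.clear();
        if (group) {
            // Skip the rest of a run made of the same separator character.
            while (i + 1 < text.size() && text[i + 1] == c) {
                ++i;
            }
        }
    }
    out.push_back(current);  // emit whatever follows the last separator
    return out;
}
```

For the input `"a  b"` (two spaces) with a space as separator, the grouped call returns `"a"` and `"b"`, while the ungrouped call returns `"a"`, an empty string, and `"b"`.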
bool Falcon::TokenizerParams::isBindSep () const [inline]
bool Falcon::TokenizerParams::isGroupSep () const [inline]
bool Falcon::TokenizerParams::isReturnSep () const [inline]
bool Falcon::TokenizerParams::isTrim () const [inline]
bool Falcon::TokenizerParams::isWsToken () const [inline]
int32 Falcon::TokenizerParams::maxToken () const [inline]
TokenizerParams& Falcon::TokenizerParams::maxToken (int32 size) [inline]
Sets the maximum size of the returned tokens.
If the size of the input data exceeds this size while searching for a token, an item is returned as if a separator was found.
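A size limit of this kind can be sketched as follows. `splitWithLimit` is a hypothetical helper, and the exact boundary handling (flushing just before a character would push the token past the limit) is an assumption, since this page does not specify it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Falcon: splits `text` on `sep`, but also
// ends the current token, as if a separator had been found, when one more
// character would make it longer than `maxSize`.
std::vector<std::string> splitWithLimit(const std::string& text,
                                        char sep,
                                        std::size_t maxSize) {
    assert(maxSize > 0);
    std::vector<std::string> out;
    std::string current;
    for (char c : text) {
        if (c == sep) {
            out.push_back(current);  // real separator: emit the token
            current.clear();
            continue;
        }
        if (current.size() == maxSize) {
            out.push_back(current);  // limit reached: emit as if separated
            current.clear();
        }
        current += c;
    }
    out.push_back(current);  // emit the remainder of the input
    return out;
}
```

With a limit of 4, the input `"abcdef,gh"` is returned as `"abcd"`, `"ef"` and `"gh"`.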
TokenizerParams& Falcon::TokenizerParams::returnSep (bool mode = true) [inline]
Returns found separators as separate items.
This forces the tokenizer to return each separator in a separate call. For example, if "," is a separator:
"a, b, c"
TokenizerParams& Falcon::TokenizerParams::trim (bool mode = true) [inline]
Trims whitespace from the returned tokens.
Whitespace characters are tab, space, carriage return and line feed. If this option is active, the returned tokens do not include whitespace found at the beginning or at the end of the token. For example, if the separator is ':' and trim is enabled, the following sequence:
: a: b : :c
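Independently of the tokenizer, the trimming rule itself can be sketched on a single token. `trimSketch` is a hypothetical helper; it uses the four whitespace characters listed above and says nothing about what the Tokenizer returns for the sequence shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Falcon: removes leading and trailing
// space, tab, carriage return and line feed characters from one token.
// Whitespace inside the token is left untouched.
std::string trimSketch(const std::string& token) {
    const char* ws = " \t\r\n";
    const std::size_t first = token.find_first_not_of(ws);
    if (first == std::string::npos) {
        return std::string();  // the token is empty or whitespace only
    }
    const std::size_t last = token.find_last_not_of(ws);
    return token.substr(first, last - first + 1);
}
```

Under this sketch, `" b "` becomes `"b"`, a whitespace-only token becomes an empty string, and `"a b"` is returned unchanged.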
TokenizerParams& Falcon::TokenizerParams::wsIsToken (bool mode = true) [inline]
Treats a run of whitespace of any length as a single token.
This separates words between spaces and other tokens. For example, a text analyzer may use this mode to get words and punctuation with a single "next" call.
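One way to picture the text-analyzer use case is the sketch below, where any run of whitespace ends a word and each punctuation character is returned as its own item. `wordsAndPunct` and its `punct` parameter are hypothetical, and whether the Falcon Tokenizer reports items in exactly this way is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of Falcon: returns words and punctuation
// marks as separate items. A whitespace run of any length acts as a single
// boundary and never produces empty items.
std::vector<std::string> wordsAndPunct(const std::string& text,
                                       const std::string& punct) {
    std::vector<std::string> out;
    std::string word;
    // Emit the pending word, if any, and start a new one.
    auto flush = [&]() {
        if (!word.empty()) {
            out.push_back(word);
            word.clear();
        }
    };
    for (char c : text) {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            flush();                           // whitespace only ends a word
        } else if (punct.find(c) != std::string::npos) {
            flush();
            out.push_back(std::string(1, c));  // punctuation is its own item
        } else {
            word += c;
        }
    }
    flush();
    return out;
}
```

For `"Hello,  world!"` with `",!"` as punctuation, the sketch returns `"Hello"`, `","`, `"world"` and `"!"`.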