The Boost tokenizer library provides four predefined tokenizer function objects. char_delimiters_separator is deprecated; the other three are as follows:
1. char_separator
char_separator has two constructors.
1. explicit char_separator()
Uses std::isspace() to identify dropped delimiters and std::ispunct() to identify kept delimiters. Empty tokens are discarded. (See example 2)
2. explicit char_separator(
// Dropped delimiters
const char* dropped_delims,
// Kept delimiters
const char* kept_delims = 0,
// Empty tokens are dropped by default; pass keep_empty_tokens to keep them.
empty_token_policy empty_tokens = drop_empty_tokens)
This constructor creates a char_separator object, which is used to create a token_iterator or tokenizer that performs the tokenization. dropped_delims and kept_delims are both strings in which every character acts as a delimiter. When a delimiter is encountered in the input sequence, the current token ends and a new one begins. Delimiters in dropped_delims never appear in the output, while delimiters in kept_delims are output as tokens of their own. If empty_tokens is drop_empty_tokens, empty tokens do not appear in the output; if it is keep_empty_tokens, they do. (See example 3)
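The rules above can be modeled with the standard library alone. The following is a minimal sketch, not Boost's implementation: the helper name split_chars and its parameters are hypothetical, and it only assumes the behavior described in the preceding paragraph (dropped delimiters vanish, kept delimiters become tokens, and the empty-token policy decides whether empty fields are emitted).

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: models the char_separator rules
// described above using only the standard library.
std::vector<std::string> split_chars(const std::string& input,
                                     const std::string& dropped_delims,
                                     const std::string& kept_delims,
                                     bool keep_empty_tokens) {
    std::vector<std::string> tokens;
    std::string current;

    // Emit the field collected so far, honoring the empty-token policy.
    auto flush = [&]() {
        if (!current.empty() || keep_empty_tokens) tokens.push_back(current);
        current.clear();
    };

    for (char ch : input) {
        const bool is_kept = kept_delims.find(ch) != std::string::npos;
        const bool is_dropped = dropped_delims.find(ch) != std::string::npos;
        if (is_kept) {
            flush();
            tokens.push_back(std::string(1, ch));  // kept delimiter is a token
        } else if (is_dropped) {
            flush();                               // dropped delimiter vanishes
        } else {
            current += ch;
        }
    }
    flush();  // the field after the last delimiter
    return tokens;
}
```

Calling split_chars(str, "-;", "|", true) on the string from example 3 yields empty strings wherever two delimiters are adjacent, which is why keep_empty_tokens produces more tokens than the default policy.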
2. escaped_list_separator
escaped_list_separator has two constructors.
By default, the following three characters have special meaning: '\' (escape), ',' (field separator), and '"' (quotation mark).
1. explicit escaped_list_separator(char e = '\\', char c = ',', char q = '\"');
Parameter | Description
e | Specifies the escape character. The C-style \ (backslash) is the default, but you can pass a different character to override it. This is useful when the fields contain many Windows-style file names: choosing another escape character avoids having to escape every \ in each path.
c | Specifies the character that separates fields.
q | Specifies the character used as the quotation mark.
2. escaped_list_separator(string_type e, string_type c, string_type q);
Parameter | Description
e | Every character in string e is treated as an escape character. If the string is empty, there are no escape characters.
c | Every character in string c is treated as a field separator. If the string is empty, there are no separators.
q | Every character in string q is treated as a quotation mark. If the string is empty, there are no quotation marks.
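To make the interplay of the three character sets concrete, here is a simplified standard-library sketch. It is not Boost's implementation: the function name split_escaped is hypothetical, and the handling of escapes (the character after an escape is copied literally) is an assumption made to keep the sketch short, so it is more permissive than the real library.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: a simplified model of
// quote- and escape-aware field splitting.
std::vector<std::string> split_escaped(const std::string& input,
                                       const std::string& escapes = "\\",
                                       const std::string& separators = ",",
                                       const std::string& quotes = "\"") {
    std::vector<std::string> fields;
    if (input.empty()) return fields;

    std::string current;
    bool in_quotes = false;
    for (std::string::size_type i = 0; i < input.size(); ++i) {
        const char ch = input[i];
        if (escapes.find(ch) != std::string::npos) {
            // Assumption: the escaped character is taken literally.
            if (i + 1 < input.size()) current += input[++i];
        } else if (quotes.find(ch) != std::string::npos) {
            in_quotes = !in_quotes;      // quotes toggle state, never output
        } else if (!in_quotes && separators.find(ch) != std::string::npos) {
            fields.push_back(current);   // separator outside quotes ends field
            current.clear();
        } else {
            current += ch;               // ordinary or quoted character
        }
    }
    fields.push_back(current);           // last field
    return fields;
}
```

Passing an empty string for any of the three sets disables that role, because no character can be found in an empty set. That mirrors the table above.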
3. offset_separator
offset_separator has one useful constructor.
template <typename Iter>
offset_separator(Iter begin, Iter end, bool bwrapoffsets = true, bool breturnpartiallast = true);
Parameter | Description
begin, end | Specify the sequence of integer offsets.
bwrapoffsets | Determines whether to start again from the beginning of the offset sequence once all offsets have been used. For example, take the string "1225200101012002" and the offsets (2, 2, 4). If bwrapoffsets is true, it is split into 12 25 2001 01 01 2002. If bwrapoffsets is false, it is split into 12 25 2001 and tokenization stops because the offsets are used up.
breturnpartiallast | Determines whether a token is created or ignored when the input ends before the current offset has collected all of its characters. For example, take the string "122501" and the offsets (2, 2, 4). If breturnpartiallast is true, it is split into 12 25 01. If it is false, it is split into 12 25 and tokenization stops, because only 2 characters remain where 4 are required.
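The two flags are easier to follow in code. The sketch below models the behavior described in the table using only the standard library. The function name split_offsets and its parameter names are hypothetical, and it is an illustration of the documented rules rather than Boost's implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost: models fixed-width splitting
// with the wrap and partial-last rules described above.
std::vector<std::string> split_offsets(const std::string& input,
                                       const std::vector<int>& offsets,
                                       bool wrap_offsets = true,
                                       bool return_partial_last = true) {
    std::vector<std::string> tokens;
    if (offsets.empty()) return tokens;
    for (int o : offsets)
        if (o <= 0) return tokens;  // sketch only handles positive offsets

    std::string::size_type pos = 0;
    std::vector<int>::size_type index = 0;
    while (pos < input.size()) {
        if (index == offsets.size()) {
            if (!wrap_offsets) break;  // offsets used up, stop
            index = 0;                 // otherwise start the sequence again
        }
        const std::string::size_type want =
            static_cast<std::string::size_type>(offsets[index++]);
        const std::string token = input.substr(pos, want);
        // A short token can only occur at the end of the input.
        if (token.size() < want && !return_partial_last) break;
        tokens.push_back(token);
        pos += token.size();
    }
    return tokens;
}
```

With the table's inputs, split_offsets("122501", {2, 2, 4}) keeps the trailing "01", while passing false for return_partial_last drops it.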
Example
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>  // BOOST_AUTO

void test_string_tokenizer()
{
    using namespace boost;

    // 1. Create the tokenizer with the default template parameters.
    //    By default, all whitespace and punctuation act as delimiters.
    {
        std::string str("link raise the master-sword.");
        tokenizer<> tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [link][raise][the][master][sword]
    }

    // 2. char_separator()
    {
        std::string str("link raise the master-sword.");
        // Default-constructed char_separator: punctuation is a delimiter
        // but is kept in the output.
        char_separator<char> sep;
        tokenizer<char_separator<char> > tok(str, sep);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [link][raise][the][master][-][sword][.]
    }

    // 3. char_separator(const char* dropped_delims,
    //                   const char* kept_delims = 0,
    //                   empty_token_policy empty_tokens = drop_empty_tokens)
    {
        std::string str = ";!!;Hello|world||-foo--bar;yow;baz|";

        char_separator<char> sep1("-;|");
        tokenizer<char_separator<char> > tok1(str, sep1);
        for (BOOST_AUTO(pos, tok1.begin()); pos != tok1.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [!!][Hello][world][foo][bar][yow][baz]

        char_separator<char> sep2("-;", "|", keep_empty_tokens);
        tokenizer<char_separator<char> > tok2(str, sep2);
        for (BOOST_AUTO(pos, tok2.begin()); pos != tok2.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [][!!][Hello][|][world][|][][|][][foo][][bar][yow][baz][|][]
    }

    // 4. escaped_list_separator
    {
        std::string str = "Field 1,\"Putting quotes around fields, allows commas\",Field 3";
        tokenizer<escaped_list_separator<char> > tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Field 1][Putting quotes around fields, allows commas][Field 3]
        // The comma inside the quotation marks is not treated as a separator.
    }

    // 5. offset_separator
    {
        std::string str = "12252001400";
        int offsets[] = {2, 2, 4};
        offset_separator f(offsets, offsets + 3);
        tokenizer<offset_separator> tok(str, f);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
    }
}