Boost: tokenizer
The Boost Tokenizer library provides four predefined tokenizer function objects. One of them, char_delimiters_separator, is deprecated. The other three are as follows:
1. char_separator
char_separator has two constructors.
char_separator()
The default constructor uses std::isspace() to identify dropped delimiters and std::ispunct() to identify kept delimiters. In addition, empty tokens are dropped. (See example 2)
char_separator(
    // delimiters that are dropped
    const Char* dropped_delims,
    // delimiters that are kept (output as tokens)
    const Char* kept_delims = 0,
    // empty tokens are dropped by default; pass keep_empty_tokens to keep them
    empty_token_policy empty_tokens = drop_empty_tokens)
This constructor creates a char_separator object that can be passed to a token_iterator or tokenizer to perform the tokenization. dropped_delims and kept_delims are strings in which every character acts as a delimiter. When a delimiter is encountered in the input sequence, the current token ends and a new token begins. Delimiters in dropped_delims do not appear in the output, whereas delimiters in kept_delims are output as tokens of their own. If empty_tokens is drop_empty_tokens, empty tokens are omitted from the output. If empty_tokens is keep_empty_tokens, empty tokens appear in the output. (See example 3)
2. escaped_list_separator
escaped_list_separator has two constructors. It relies on three special characters, which default to the escape character '\', the field separator ',', and the quote character '"'.
explicit escaped_list_separator(Char e = '\\', Char c = ',', Char q = '\"');
escaped_list_separator(string_type e, string_type c, string_type q);
3. offset_separator
offset_separator has one constructor of interest:
template<typename Iter>
offset_separator(Iter begin, Iter end, bool bwrapoffsets = true, bool breturnpartiallast = true);
The following function exercises each separator. The numbered comments are the examples referred to in the descriptions of the constructors.

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>  // BOOST_AUTO

void test_string_tokenizer()
{
    using namespace boost;

    // 1. Use the default template parameters to create the tokenizer.
    //    By default, whitespace and punctuation act as delimiters and are dropped.
    {
        std::string str("Link raise the master-sword.");

        tokenizer<> tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][sword]
    }

    // 2. char_separator()
    {
        std::string str("Link raise the master-sword.");

        // Default-constructed char_separator: whitespace is dropped,
        // punctuation is kept and output as separate tokens.
        char_separator<char> sep;
        tokenizer<char_separator<char> > tok(str, sep);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][-][sword][.]
    }

    // 3. char_separator(const Char* dropped_delims,
    //                   const Char* kept_delims = 0,
    //                   empty_token_policy empty_tokens = drop_empty_tokens)
    {
        std::string str = ";!!;Hello|world||-foo--bar;yow;baz|";

        // '-', ';' and '|' are all dropped; empty tokens are dropped.
        char_separator<char> sep1("-;|");
        tokenizer<char_separator<char> > tok1(str, sep1);
        for (BOOST_AUTO(pos, tok1.begin()); pos != tok1.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [!!][Hello][world][foo][bar][yow][baz]

        // '-' and ';' are dropped, '|' is kept; empty tokens are kept.
        char_separator<char> sep2("-;", "|", keep_empty_tokens);
        tokenizer<char_separator<char> > tok2(str, sep2);
        for (BOOST_AUTO(pos, tok2.begin()); pos != tok2.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [][!!][Hello][|][world][|][][|][][foo][][bar][yow][baz][|][]
    }

    // 4. escaped_list_separator
    {
        std::string str = "Field 1,\"putting quotes around fields, allows commas\",Field 3";

        tokenizer<escaped_list_separator<char> > tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Field 1][putting quotes around fields, allows commas][Field 3]
        // The comma inside the quotation marks is not treated as a separator.
    }

    // 5. offset_separator
    {
        std::string str = "12252001400";

        int offsets[] = {2, 2, 4};
        offset_separator f(offsets, offsets + 3);
        tokenizer<offset_separator> tok(str, f);

        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [12][25][2001][40][0]
    }
}