Boost: tokenizer and boosttokenizer

Source: Internet
Author: User


The Boost.Tokenizer library provides four predefined tokenizer function objects. char_delimiters_separator has been deprecated; the other three are described below.

1. char_separator

char_separator has two constructors.

char_separator()

The default constructor uses std::isspace() to identify dropped delimiters and std::ispunct() to identify kept delimiters. Empty tokens are discarded. (See example 2.)

explicit char_separator(
    const Char* dropped_delims,                            // delimiters that are dropped
    const Char* kept_delims = 0,                           // delimiters that are kept
    empty_token_policy empty_tokens = drop_empty_tokens);  // empty tokens are dropped by default; pass keep_empty_tokens to keep them

This constructor creates a char_separator object, which is passed to a token_iterator or tokenizer to perform the splitting. dropped_delims and kept_delims are strings in which every character acts as a delimiter. When a delimiter is encountered in the input sequence, the current token ends and a new one begins. Delimiters in dropped_delims do not appear in the output, whereas delimiters in kept_delims are output as tokens of their own. If empty_tokens is drop_empty_tokens, empty tokens do not appear in the output; if it is keep_empty_tokens, they do. (See example 3.)

2. escaped_list_separator

escaped_list_separator has two constructors. By default, three characters have special meaning: the escape character '\', the field separator ',', and the quote character '"'.

explicit escaped_list_separator(Char e = '\\', Char c = ',', Char q = '\"');

escaped_list_separator(string_type e, string_type c, string_type q);

3. offset_separator

offset_separator has one useful constructor.

template <typename Iter>
offset_separator(Iter begin, Iter end, bool bwrapoffsets = true, bool breturnpartiallast = true);

The following function exercises each separator; the numbered comments are the examples referred to above.

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>  // BOOST_AUTO

void test_string_tokenizer()
{
    using namespace boost;

    // 1. Default template parameters. Whitespace and punctuation both act
    //    as separators and are not returned.
    {
        std::string str("Link raise the master-sword.");

        tokenizer<> tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][sword]
    }

    // 2. char_separator()
    {
        std::string str("Link raise the master-sword.");

        // Default-constructed char_separator: punctuation is a delimiter
        // but is kept as a token of its own.
        char_separator<char> sep;
        tokenizer<char_separator<char> > tok(str, sep);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][-][sword][.]
    }

    // 3. char_separator(const Char* dropped_delims,
    //                   const Char* kept_delims = 0,
    //                   empty_token_policy empty_tokens = drop_empty_tokens)
    {
        std::string str = ";!!;Hello|world||-foo--bar;yow;baz|";

        // All three characters are dropped; empty tokens are discarded.
        char_separator<char> sep1("-;|");
        tokenizer<char_separator<char> > tok1(str, sep1);
        for (BOOST_AUTO(pos, tok1.begin()); pos != tok1.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [!!][Hello][world][foo][bar][yow][baz]

        // '-' and ';' are dropped, '|' is kept, and empty tokens are kept.
        char_separator<char> sep2("-;", "|", keep_empty_tokens);
        tokenizer<char_separator<char> > tok2(str, sep2);
        for (BOOST_AUTO(pos, tok2.begin()); pos != tok2.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [][!!][Hello][|][world][|][][|][][foo][][bar][yow][baz][|][]
    }

    // 4. escaped_list_separator
    {
        std::string str = "Field 1,\"putting quotes around fields, allows commas\",Field 3";

        tokenizer<escaped_list_separator<char> > tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Field 1][putting quotes around fields, allows commas][Field 3]
        // The comma inside the quotation marks is not treated as a separator.
    }

    // 5. offset_separator
    {
        std::string str = "12252001400";

        int offsets[] = {2, 2, 4};
        offset_separator f(offsets, offsets + 3);
        tokenizer<offset_separator> tok(str, f);

        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [12][25][2001][40][0]
    }
}

 
