[Boost] boost::tokenizer

Source: Internet
Author: User

The Boost.Tokenizer library provides four predefined tokenizer function objects. One of them, char_delimiters_separator, has been deprecated. The other three are described below:

1. char_separator

char_separator has two constructors.
1. char_separator()
Uses std::isspace() to identify dropped delimiters and std::ispunct() to identify kept delimiters. Empty tokens are dropped. (See example 2)
2. char_separator(
       // Delimiters that are dropped from the output
       const char* dropped_delims,
       // Delimiters that are kept and returned as tokens
       const char* kept_delims = 0,
       // Empty tokens are dropped by default; pass keep_empty_tokens to keep them
       empty_token_policy empty_tokens = drop_empty_tokens)

This constructor creates a char_separator object, which can be used with a token_iterator or a tokenizer to split a sequence into tokens. dropped_delims and kept_delims are both strings, and every character in them acts as a delimiter. When a delimiter is encountered in the input sequence, the current token ends and a new token begins. Delimiters in dropped_delims never appear in the output, while delimiters in kept_delims are returned as tokens of their own. If empty_tokens is drop_empty_tokens, empty tokens do not appear in the output. If empty_tokens is keep_empty_tokens, empty tokens do appear in the output. (See example 3)

2. escaped_list_separator

escaped_list_separator has two constructors.
By default, the following three characters have special meaning: '\' (escape), ',' (field separator), and '"' (quotation mark).
1. explicit escaped_list_separator(char e = '\\', char c = ',', char q = '\"');

Parameter   Description
e           Specifies the escape character. The default is the C-style \ (backslash), but you can pass a different character to override it.
            This is useful, for example, when many fields are Windows-style file names: instead of escaping every \ in each path,
            you can choose another character as the escape character.
c           Specifies the character that separates fields.
q           Specifies the character used as the quotation mark.

2. escaped_list_separator(string_type e, string_type c, string_type q);

Parameter   Description
e           Every character in string e is treated as an escape character. If the string is empty, there are no escape characters.
c           Every character in string c is treated as a field separator. If the string is empty, there are no separators.
q           Every character in string q is treated as a quotation mark. If the string is empty, there are no quotation marks.

3. offset_separator

offset_separator has one useful constructor.
template <typename Iter>
offset_separator(Iter begin, Iter end, bool bwrapoffsets = true, bool breturnpartiallast = true);

Parameter            Description
begin, end           Specify the sequence of integer offsets.
bwrapoffsets         Determines whether to start again from the beginning of the offset sequence once all offsets have been used.
                     For example, take the string "1225200101012002" with offsets (2, 2, 4).
                     If bwrapoffsets is true, it is split into 12 25 2001 01 01 2002.
                     If bwrapoffsets is false, it is split into 12 25 2001, and parsing then stops because the offsets are used up.
breturnpartiallast   Determines whether a token is returned or discarded when the input ends before the current offset has been filled.
                     For example, take the string "122501" with offsets (2, 2, 4).
                     If breturnpartiallast is true, it is split into 12 25 01.
                     If it is false, it is split into 12 25, and parsing then stops because only 2 characters remain, which is fewer than the 4 required.

Example

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp>  // BOOST_AUTO

void test_string_tokenizer()
{
    using namespace boost;

    // 1. Create a tokenizer with the default template parameters.
    //    By default, whitespace and punctuation act as delimiters and are dropped.
    {
        std::string str("Link raise the master-sword.");
        tokenizer<> tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][sword]
    }

    // 2. char_separator()
    {
        std::string str("Link raise the master-sword.");
        // Default-constructed char_separator: punctuation acts as a
        // delimiter but is kept as a token; whitespace is dropped.
        char_separator<char> sep;
        tokenizer<char_separator<char> > tok(str, sep);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Link][raise][the][master][-][sword][.]
    }

    // 3. char_separator(const char* dropped_delims,
    //                   const char* kept_delims = 0,
    //                   empty_token_policy empty_tokens = drop_empty_tokens)
    {
        std::string str = ";!!;Hello|world||-foo--bar;yow;baz|";

        // '-', ';' and '|' are all dropped; empty tokens are dropped.
        char_separator<char> sep1("-;|");
        tokenizer<char_separator<char> > tok1(str, sep1);
        for (BOOST_AUTO(pos, tok1.begin()); pos != tok1.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [!!][Hello][world][foo][bar][yow][baz]

        // '-' and ';' are dropped, '|' is kept; empty tokens are kept.
        char_separator<char> sep2("-;", "|", keep_empty_tokens);
        tokenizer<char_separator<char> > tok2(str, sep2);
        for (BOOST_AUTO(pos, tok2.begin()); pos != tok2.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [][!!][Hello][|][world][|][][|][][foo][][bar][yow][baz][|][]
    }

    // 4. escaped_list_separator
    {
        std::string str = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
        tokenizer<escaped_list_separator<char> > tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Field 1][putting quotes around fields, allows commas][Field 3]
        // The comma inside the quotation marks is not treated as a separator.
    }

    // 5. offset_separator
    {
        std::string str = "12252001400";
        int offsets[] = {2, 2, 4};
        offset_separator f(offsets, offsets + 3);
        tokenizer<offset_separator> tok(str, f);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [12][25][2001][40][0]
    }
}
