[Boost] boost::tokenizer

Last Update:2018-12-04 Source: Internet

Author: User

The Tokenizer library provides four predefined tokenizer function objects. One of them, char_delimiters_separator, is deprecated. The other three are described below:

1. char_separator

char_separator has two constructors.
1. char_separator()
Uses std::isspace() to identify dropped delimiters and std::ispunct() to identify kept delimiters. Empty tokens are dropped. (See example 2.)
2. char_separator(// dropped delimiters
const Char* dropped_delims,
// kept delimiters
const Char* kept_delims = 0,
// empty tokens are dropped by default; pass keep_empty_tokens to keep them
empty_token_policy empty_tokens = drop_empty_tokens)
This constructor creates a char_separator object, which is passed to a token_iterator or tokenizer to perform the tokenization. dropped_delims and kept_delims are both strings, and every character in them acts as a delimiter. When a delimiter is encountered in the input sequence, the current token ends and a new token begins. Delimiters in dropped_delims never appear in the output, while delimiters in kept_delims are output as tokens of their own. If empty_tokens is drop_empty_tokens, empty tokens do not appear in the output. If empty_tokens is keep_empty_tokens, empty tokens do appear in the output. (See example 3.)

2. escaped_list_separator

escaped_list_separator has two constructors.
By default, three characters have a special meaning: '\' is the escape character, ',' is the field separator, and '"' is the quotation mark.
1. explicit escaped_list_separator(Char e = '\\', Char c = ',', Char q = '\"');

Parameters	Description
e	Specifies the escape character. The C-style \ (backslash) is used by default, but you can pass in a different character to override it. This is useful if you have many Windows-style file names, where every \ in a path would otherwise have to be escaped.
c	Specifies the character used to separate fields.
q	Specifies the character used as the quotation mark.

2. escaped_list_separator(string_type e, string_type c, string_type q);

Parameters	Description
e	Every character in string e is treated as an escape character. If the string is empty, there is no escape character.
c	Every character in string c is treated as a separator. If the string is empty, there is no separator.
q	Every character in string q is treated as a quotation mark. If the string is empty, there is no quotation mark.

3. offset_separator

offset_separator has one useful constructor.
template <typename Iter>
offset_separator(Iter begin, Iter end, bool bWrapOffsets = true, bool bReturnPartialLast = true);

Parameters	Description
begin, end	Specify the sequence of integer offsets.
bWrapOffsets	Determines whether to start again from the beginning of the offset sequence after all offsets have been used. For example, take the string "1225200101012002" and the offsets (2, 2, 4). If bWrapOffsets is true, the string is split into 12 25 2001 01 01 2002. If bWrapOffsets is false, it is split into 12 25 2001 and tokenization then stops because the offsets are used up.
bReturnPartialLast	Determines whether a token is created or dropped when the input sequence ends before the number of characters required by the current offset has been read. For example, take the string "122501" and the offsets (2, 2, 4). If bReturnPartialLast is true, the string is split into 12 25 01. If it is false, the string is split into 12 25 and tokenization then stops, because only 2 characters remain and the current offset requires 4.

Example

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/typeof/typeof.hpp> // BOOST_AUTO

void test_string_tokenizer()
{
    using namespace boost;

    // 1. Default template parameters: whitespace and punctuation
    //    act as separators and are dropped.
    {
        std::string str("link raise the master-sword.");
        tokenizer<> tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [link][raise][the][master][sword]
    }

    // 2. char_separator()
    {
        std::string str("link raise the master-sword.");
        // Default constructor: punctuation is a separator but is kept as a token.
        char_separator<char> sep;
        tokenizer<char_separator<char> > tok(str, sep);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [link][raise][the][master][-][sword][.]
    }

    // 3. char_separator(const Char* dropped_delims,
    //                   const Char* kept_delims = 0,
    //                   empty_token_policy empty_tokens = drop_empty_tokens)
    {
        std::string str = ";!!;Hello|world||-foo--bar;yow;baz|";

        // All of '-', ';' and '|' are dropped; empty tokens are dropped.
        char_separator<char> sep1("-;|");
        tokenizer<char_separator<char> > tok1(str, sep1);
        for (BOOST_AUTO(pos, tok1.begin()); pos != tok1.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [!!][Hello][world][foo][bar][yow][baz]

        // '-' and ';' are dropped, '|' is kept; empty tokens are kept.
        char_separator<char> sep2("-;", "|", keep_empty_tokens);
        tokenizer<char_separator<char> > tok2(str, sep2);
        for (BOOST_AUTO(pos, tok2.begin()); pos != tok2.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [][!!][Hello][|][world][|][][|][][foo][][bar][yow][baz][|][]
    }

    // 4. escaped_list_separator
    {
        std::string str = "Field 1,\"Putting quotes around fields, allows commas\",Field 3";
        tokenizer<escaped_list_separator<char> > tok(str);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
        // [Field 1][Putting quotes around fields, allows commas][Field 3]
        // A comma inside quotation marks is not treated as a separator.
    }

    // 5. offset_separator
    {
        std::string str = "12252001400";
        int offsets[] = {2, 2, 4};
        offset_separator f(offsets, offsets + 3);
        tokenizer<offset_separator> tok(str, f);
        for (BOOST_AUTO(pos, tok.begin()); pos != tok.end(); ++pos)
            std::cout << "[" << *pos << "]";
        std::cout << std::endl;
    }
}

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
