Lua string patterns and captures


Every time I use string pattern matching, I end up going back through the manual on the official Lua website. There is no detailed write-up online, and many friends have asked me about it. In practice there are three difficulties:

  • unfamiliarity with Lua's pattern syntax, its counterpart to regular expressions;
  • unfamiliarity with the string library functions that take patterns;
  • the string library's concept of "capture", which is not really a new idea, but whose interaction with the function calls is hard to follow.

Here I will summarize.

Let's start with the less familiar functions in Lua's built-in string library (based on Lua 5.1; Lua 5.2 is essentially unchanged).

Four functions in Lua's built-in string library take patterns:

  • string.find()

  • string.match()

  • string.gmatch()

  • string.gsub()

1. string.find(s, pattern, start, plain)

This function is used to find the specified pattern in string s.

If the pattern matches, the function returns the start and end positions of the first match in s; otherwise it returns nil. Only the first match is reported, so on success the return value is two numbers.

The third parameter, start, is an optional number giving the position in s at which the search begins. It defaults to 1 and may be negative: -1 means start from the last character, -2 from the second-to-last, and so on. Since the string ends at the last character, a start of -1 searches only that one character.

The fourth parameter, plain, is a boolean that controls whether the special characters in pattern are honored. If it is true, the special characters (^ $ * + ? . ( [ % -, defined in lstrlib.c in the Lua source) are treated as ordinary characters, and the call is a plain substring search with no pattern matching. If it is false, pattern is interpreted as a Lua pattern. An example makes this clearer, although it relies on one special character; if it is still unclear, come back after reading the section on patterns:

    local s = "am+df"
    print(string.find(s, "m+", 1, false))   -- 2 2
    print(string.find(s, "m+", 1, true))    -- 2 3

In a Lua pattern, + means that the character before it matches one or more times, so m+ matches m, mm, mmm, and so on. When the fourth argument of string.find is false, only the letter m is found in s, and the result is 2 2.

When the fourth argument is true, + is treated as an ordinary character, so the string being searched for is the literal m+, found at positions 2 to 3. Omitting the fourth argument is the same as passing false.

That is the basic behavior of find, but the function does not always behave this way. The reason is the Lua capture mentioned at the start of this article, which is woven into all of these string library functions, so it has to be introduced first.

The second argument of find is a pattern, which plays the role of a regular expression. Lua patterns add a feature called capture: inside a pattern, parentheses () mark the part of the match we want to save. What is inside the parentheses is still a pattern; the parentheses merely ask for the text it matches to be kept for later use. In the example above the pattern is m+. To capture the text that matches m+, enclose it in parentheses. In addition to the start and end positions of the whole match, find then returns every captured string. For example:

    local s = "am+df"
    print(string.find(s, "(m+)", 1, false))   -- 2 2 m

To capture more, just add more parentheses, for example:

    local s = "am+df"
    print(string.find(s, "((m+))", 1, false))     -- 2 2 m m
    print(string.find(s, "(((m+)))", 1, false))   -- 2 2 m m m

Note also that captures are returned only when the whole pattern matches. For example, suppose I also want to capture the letter a. The pattern as a whole cannot match here, so nothing is captured and nil is returned:

    local s = "am+df"
    print(string.find(s, "(m+)(a)", 1, false))   -- nil

In addition, the order of the returned captures is determined by the position of each left parenthesis. In the example above that captures m three times, the first m comes from the outermost parentheses. The order matters because %n refers to the nth capture. What that is useful for will be covered later.
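As a small sketch of this numbering rule (the date string and variable names are made up for illustration), nested parentheses are numbered by where each one opens, so the outer capture comes first:

```lua
-- Captures are numbered by the position of their opening parenthesis:
-- capture 1 is the outer group, captures 2..4 are the inner ones.
local date = "2014-04-09"
local from, to, whole, year, month, day =
  string.find(date, "((%d+)-(%d+)-(%d+))")
print(from, to, whole, year, month, day)
--> 1  10  2014-04-09  2014  04  09
```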

An empty capture, with nothing between the parentheses, returns the current position in the string, for example:

    local s = "am+df"
    print(string.find(s, "()(m+)()", 1, false))   -- 2 2 2 m 3

One thing that must be mentioned: in the Lua 5.1 source the number of captures is limited, 32 by default. You cannot add parentheses without limit; at most 32 captures are allowed, and exceeding that raises an error, for example:

    local s = "am+df"
    -- 33 empty captures, one more than the limit
    print(string.find(s, string.rep("()", 33), 1, false))   -- error: too many captures

Of course, you can change the Lua source to raise the limit. It is defined in luaconf.h as the LUA_MAXCAPTURES macro.

For everyday use the analysis could stop here. But Lua's source is compact, elegant C, and it would be a pity not to read it and really understand what is going on.

I will not cover how Lua loads its built-in libraries; plenty of articles by more expert authors do. Let's look at the string.find() function, which lives in lstrlib.c:

    static int str_find (lua_State *L) {
      return str_find_aux(L, 1);
    }

    static int str_find_aux (lua_State *L, int find) {
      size_t l1, l2;
      const char *s = luaL_checklstring(L, 1, &l1);
      const char *p = luaL_checklstring(L, 2, &l2);
      ptrdiff_t init = posrelat(luaL_optinteger(L, 3, 1), l1) - 1;
      if (init < 0) init = 0;
      else if ((size_t)(init) > l1) init = (ptrdiff_t)l1;
      if (find && (lua_toboolean(L, 4) ||  /* explicit request? */
          strpbrk(p, SPECIALS) == NULL)) {  /* or no special characters? */
        /* do a plain search */
        const char *s2 = lmemfind(s+init, l1-init, p, l2);
        if (s2) {
          lua_pushinteger(L, s2-s+1);
          lua_pushinteger(L, s2-s+l2);
          return 2;
        }
      }
      else {
        MatchState ms;
        int anchor = (*p == '^') ? (p++, 1) : 0;
        const char *s1 = s+init;
        ms.L = L;
        ms.src_init = s;
        ms.src_end = s+l1;
        do {
          const char *res;
          ms.level = 0;
          if ((res = match(&ms, s1, p)) != NULL) {
            if (find) {
              lua_pushinteger(L, s1-s+1);  /* start */
              lua_pushinteger(L, res-s);   /* end */
              return push_captures(&ms, NULL, 0) + 2;
            }
            else
              return push_captures(&ms, s1, res);
          }
        } while (s1++ < ms.src_end && !anchor);
      }
      lua_pushnil(L);  /* not found */
      return 1;
    }

This function looks long at first, but on closer inspection it is quite simple. The first few lines read the first three arguments and clamp the start position so that it cannot run past the end of the string. The important part is the if/else that follows. find is a parameter passed in by the caller; for string.find it is 1, so for now treat it as always true. Since it is a parameter, does anything else call str_find_aux? A quick search shows that string.match() calls the same function with find set to 0. So is string.match really the same thing as string.find?

    static int str_match (lua_State *L) {
      return str_find_aux(L, 0);
    }

    static int str_find (lua_State *L) {
      return str_find_aux(L, 1);
    }

We will come back to this when we reach string.match. Returning to the if/else logic: the if condition checks the fourth argument of string.find. The plain branch is taken if that argument is true, meaning, as described above, that pattern matching is switched off, or if the pattern contains no special characters at all. The SPECIALS macro, defined in the same file, lists those characters:

    #define SPECIALS "^$*+?.([%-"

If none of these characters is present, or they are not to be treated specially, the search is a plain substring search: lmemfind() is called, and if it finds the substring, the start and end positions are returned. The else branch is the one that honors special characters. The key function there is match(), which matches the string against the pattern and records the captures along the way; it is left for the section on patterns. If a match is found, the start and end positions are again returned, plus one extra step: the captured strings are pushed onto the stack. That is why captures are returned only by calls that actually perform pattern matching.

All of this is easy enough to follow. Out of curiosity, though, I wanted to know how lmemfind() does its substring search. Is it the legendary KMP or BM algorithm? If you are not familiar with those two algorithms, see these two articles by Ruan Yifeng:

http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html

With some excitement, I opened lmemfind():

    static const char *lmemfind (const char *s1, size_t l1,
                                 const char *s2, size_t l2) {
      if (l2 == 0) return s1;  /* empty strings are everywhere */
      else if (l2 > l1) return NULL;  /* avoids a negative 'l1' */
      else {
        const char *init;  /* to search for a '*s2' inside 's1' */
        l2--;  /* 1st char will be checked by 'memchr' */
        l1 = l1-l2;  /* 's2' cannot be found after that */
        while (l1 > 0 && (init = (const char *)memchr(s1, *s2, l1)) != NULL) {
          init++;   /* 1st char is already checked */
          if (memcmp(init, s2+1, l2) == 0)
            return init-1;
          else {  /* correct 'l1' and 's1' to try again */
            l1 -= init-s1;
            s1 = init;
          }
        }
        return NULL;  /* not found */
      }
    }

Overall the approach is quite ordinary. It scans from the start for the first character of the search string using memchr; once found, it uses memcmp to compare the rest. On a mismatch it skips the character just checked and continues. Compared with the sophisticated string-search algorithms this is simple and rather endearing :). memcmp is naturally faster than the str family of functions, because it does not need to test for '\0' at every step. The following article covers how memcmp is implemented; it is about optimization, but its code gives a general idea: http://blog.chinaunix.net/uid-25627207-id-3556923.html
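The same strategy can be imitated at the Lua level. The sketch below is only an illustration: plain_find is a hypothetical helper name, and it leans on string.find with plain = true to play the role of memchr and on a substring comparison to play the role of memcmp.

```lua
-- Hypothetical helper: a Lua-level imitation of lmemfind's strategy.
local function plain_find(s, p, init)
  init = init or 1
  if #p == 0 then return init, init - 1 end   -- empty strings are everywhere
  local first = p:sub(1, 1)
  local last_start = #s - #p + 1              -- p cannot start after this index
  local i = init
  while i <= last_start do
    -- like memchr: jump to the next occurrence of p's first character
    local pos = string.find(s, first, i, true)
    if not pos or pos > last_start then return nil end
    -- like memcmp: compare the candidate window with p
    if s:sub(pos, pos + #p - 1) == p then
      return pos, pos + #p - 1
    end
    i = pos + 1                               -- skip the checked character
  end
  return nil
end

print(plain_find("am+df", "m+"))             --> 2  3
print(string.find("am+df", "m+", 1, true))   --> 2  3
```

In real code string.find(s, p, init, true) is the thing to call; the helper exists only to make the scan-then-compare loop visible.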

2. string.match(s, pattern, start)

This function searches string s for the given pattern and returns the captures specified in pattern. The third parameter gives the position at which the search starts and defaults to 1. Compared with string.find, string.match is simpler: there is no choice about whether special characters are honored; they always are. pattern is a Lua pattern, just as for string.find. The difference is that string.match returns the captures instead of the start and end positions of the match. If pattern specifies no captures, the whole match is returned. As before, nil is returned if nothing matches.
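A minimal sketch of these return values, using a made-up input line: when the pattern has several captures, string.match returns each of them; with no captures it returns the whole match.

```lua
local line = "width = 640"

-- Two captures: both are returned, in order.
local key, value = string.match(line, "(%a+)%s*=%s*(%d+)")
print(key, value)                        --> width  640

-- No captures: the whole match is returned.
local whole = string.match(line, "%d+")
print(whole)                             --> 640

-- The third argument moves the starting position.
local later = string.match("a1b2", "%d", 3)
print(later)                             --> 2
```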

As noted in the discussion of string.find, string.match and string.find call the same function in the Lua source, str_find_aux(lua_State *L, int find). The only difference is that string.match passes 0 for find, so it never enters the plain-search branch of str_find_aux() and always does pattern matching.
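Because string.match always does pattern matching, literal text containing special characters has to be escaped first. One common approach, sketched here with a hypothetical helper named escape_pattern, is to prefix every special character with %:

```lua
-- Hypothetical helper: prefix every special character with %.
local function escape_pattern(text)
  -- the outer parentheses drop gsub's second result (the count)
  return (string.gsub(text, "[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

local s = "am+df"
print(escape_pattern("m+"))                    --> m%+
print(string.match(s, escape_pattern("m+")))   --> m+
print(string.match(s, "m+"))                   --> m
```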

3. string.gmatch(s, pattern)

Both functions above, string.find and string.match, stop at the first substring that matches the pattern and return. Often, though, we need every substring that matches the pattern, and string.gmatch() does exactly that. It returns an iterator that yields each match in turn. The official Lua manual gives this example:

    s = "hello world from Lua"
    for w in string.gmatch(s, "%a+") do
      print(w)
    end
    --[[
    hello
    world
    from
    Lua
    ]]

As for %a+: the + was explained in the section on string.find(), and %a matches any letter. The string s contains four words separated by three spaces, so each step of the iterator returned by string.gmatch(s, "%a+") matches exactly one word, because a space does not match %a.
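When the pattern contains captures, each step of the iterator yields the captures instead of the whole match. A short sketch, modeled on the key=value example in the Lua manual:

```lua
local t = {}
-- Each iteration yields the two captures of one key=value pair.
for k, v in string.gmatch("from=world, to=Lua", "(%w+)=(%w+)") do
  t[k] = v
end
print(t.from, t.to)   --> world  Lua
```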

string.gmatch() is essentially always used as the iterator of a loop; I have not found another use for it. The one thing to watch is that the special character ^ behaves differently in string.gmatch than in the other functions. Elsewhere, a ^ at the start of a pattern anchors the match to the position where the search starts; if the match fails there, later positions are not tried:

    local s = "am+df"
    print(string.find(s, "(m+)", 1, false))    -- 2 2 m
    print(string.find(s, "^(m+)", 1, false))   -- nil

In the second call, the ^ in front of the pattern forces the match to begin at the very start of s, at the letter a. Since a cannot match (m+), nil is returned immediately. This is handled in the else branch of str_find_aux(), shown in the string.find section above, which contains code dedicated to the ^ character.

What happens when ^ is placed at the front of a pattern passed to string.gmatch? It is not treated as an anchor, and with the pattern above nothing matches, so the iterator returns at once. Note that it does not return nil; it returns with no values on the stack. We can guess that string.gmatch() is implemented much like the code above, but it is safer to check gmatch() and gmatch_aux() in the source.

In gmatch(), the first lines set up the state and the iterator function that a Lua iterator needs, so they can be ignored. Look at gmatch_aux(lua_State *L): apart from the iterator bookkeeping, its implementation is no different from string.match(), and in the end it calls match() to do the pattern matching. The difference is that the special handling of ^ described above is absent here.
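A quick way to observe this, sketched with the same string as before: with a leading ^ in the pattern, the loop body below never runs, to the best of my knowledge on every stock Lua version from 5.1 onward.

```lua
local s = "am+df"
local count = 0
-- The leading ^ is not an anchor for gmatch; nothing in s matches,
-- so the iterator yields no values and the body is skipped.
for m in string.gmatch(s, "^(m+)") do
  count = count + 1
end
print(count)   --> 0
```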

4. string.gsub(s, pattern, rep, n)

Like string.gmatch(), this function has a g in its name, which suggests that it too processes every match, unlike string.find() and string.match(). That is indeed the case: it finds every substring of s that matches the pattern and replaces it with a string produced from the rep parameter. Why "produced from" rep rather than rep itself? Because rep can be a string, a function, or a table.

When rep is a string, it is normally used as is: each match is replaced by that string. The exception is the character %. When % is followed by a digit from 1 to 9, it stands for the capture with that number, and the capture's text is used in the replacement. An example:

    local s = "am+dmf"
    print(string.gsub(s, "()(m+)", "%1"))   -- a2+d5f 2
    print(string.gsub(s, "()(m+)", "%2"))   -- am+dmf 2
    print(string.gsub(s, "()(m+)", "%3"))   -- error: invalid capture index

The pattern is ()(m+), which has two captures: the current position in the string, and m+. The first match found by string.gsub() is the m at position 2 of am+dmf, where the two captures are 2 and m. %1 is the first capture, 2, so after this replacement the string is a2+dmf. The second match is the m at position 5, where the captures are 5 and m; %1 is now 5, and the result is a2+d5f, as shown. The trailing 2 is the number of substitutions made. By the same reasoning, replacing with %2 leaves the string unchanged, because m is replaced by m. The third print() raises an error because there are only two captures and %3 refers to a capture that does not exist. Note also that % combines with only the single digit that follows it. That is why the range is 1 to 9: although up to 32 captures are allowed by default, only the first 9 can be referenced in the replacement string. %0 is special: it stands for the whole match. A simple use is to repeat the matched text:

    local s = "am+dmf"
    print(string.gsub(s, "()(m+)", "%0%0%0"))   -- ammm+dmmmf 2

The matched string is m, and m in the original string is replaced with mmm.
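A more typical use of numbered captures in the replacement string is reordering parts of the match. A minimal sketch with made-up input:

```lua
-- %1 and %2 refer to the first and second capture of each match.
local swapped, n = string.gsub("hello world", "(%w+) (%w+)", "%2 %1")
print(swapped, n)   --> world hello  1
```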

Since % is treated specially, how do you put a literal % in the replacement? Write %% to represent % itself. For example:

    local s = "am+dmf"
    print(string.gsub(s, "()(m+)", "%%"))   -- a%+d%f 2

When rep is a table, the first capture is used as the key to look up in the table, and the match is replaced with the value found. If the pattern has no captures, the whole match is used as the key. If the key is not found, or its value is false, no replacement is made; a value that is neither a string, a number, false, nor nil raises an error:

    local s = "am+dmf"
    local t1 = {[2] = "hh", [5] = "xx"}
    local t2 = {}
    print(string.gsub(s, "()(m+)", t1))   -- ahh+dxxf 2
    print(string.gsub(s, "()(m+)", t2))   -- am+dmf 2
    local t3 = {[2] = false}
    print(string.gsub(s, "()(m+)", t3))   -- am+dmf 2
    local t4 = {[2] = {123}}
    print(string.gsub(s, "()(m+)", t4))   -- error: invalid replacement value (a table)
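A table as rep lends itself to simple template expansion. In this sketch the table vars and the $name syntax are made up for illustration; a key with no entry leaves its placeholder untouched, while still counting as a match:

```lua
local vars = { name = "Lua", version = "5.1" }
-- The capture (the word after $) is the lookup key.
local out, n = string.gsub("$name $version $missing", "%$(%w+)", vars)
print(out, n)   --> Lua 5.1 $missing  3
```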

When rep is a function, it is called for every match, with all the captures passed as arguments in order. If the pattern has no captures, the whole match is passed. If the function returns a string or a number, that value replaces the match; if it returns false or nil, the match is left unchanged; any other return value raises an error:

    local s = "am+dmf"
    function f1(...)
      print(...)   -- 2 m
                   -- 5 m
      return "hh"
    end
    function f2()
      return {123}
    end
    print(string.gsub(s, "()(m+)", f1))   -- ahh+dhhf 2
    print(string.gsub(s, "()(m+)", f2))   -- error: invalid replacement value (a table)
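A function that returns nil for some matches gives a selective replacement. A small sketch with made-up input; note that the count still includes the matches that were left unchanged:

```lua
local out, n = string.gsub("a bb ccc", "%a+", function(word)
  if #word > 1 then
    return string.upper(word)
  end
  -- falling through returns nil, which keeps the original match
end)
print(out, n)   --> a BB CCC  3
```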

The fourth parameter, n, limits the number of substitutions: at most the first n matches are replaced. For example:

    local s = "am+dmf"
    print(string.gsub(s, "()(m+)", "%%", -1))   -- am+dmf 0
    print(string.gsub(s, "()(m+)", "%%", 0))    -- am+dmf 0
    print(string.gsub(s, "()(m+)", "%%", 1))    -- a%+dmf 1
    print(string.gsub(s, "()(m+)", "%%", 2))    -- a%+d%f 2
    print(string.gsub(s, "()(m+)", "%%", 3))    -- a%+d%f 2

Let's take a look at how the source code is written:

    static int str_gsub (lua_State *L) {
      size_t srcl;
      const char *src = luaL_checklstring(L, 1, &srcl);
      const char *p = luaL_checkstring(L, 2);
      int max_s = luaL_optint(L, 4, srcl+1);
      int anchor = (*p == '^') ? (p++, 1) : 0;
      int n = 0;
      MatchState ms;
      luaL_Buffer b;
      luaL_buffinit(L, &b);
      ms.L = L;
      ms.src_init = src;
      ms.src_end = src+srcl;
      while (n < max_s) {
        const char *e;
        ms.level = 0;
        e = match(&ms, src, p);
        if (e) {
          n++;
          add_value(&ms, &b, src, e);
        }
        if (e && e > src)  /* non empty match? */
          src = e;  /* skip it */
        else if (src < ms.src_end)
          luaL_addchar(&b, *src++);
        else break;
        if (anchor) break;
      }
      luaL_addlstring(&b, src, ms.src_end-src);
      luaL_pushresult(&b);
      lua_pushinteger(L, n);  /* number of substitutions */
      return 2;
    }

As the code shows, it handles the ^ anchor and then matches in a loop. On each match it appends the replacement value, built according to the type of rep, to the result buffer, and at the end it pushes the resulting string and the substitution count onto the stack.
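The effect of the anchor handling is easy to see from Lua. In this sketch, the anchored call stops after the first attempt, so only a match at the very start of the string is replaced:

```lua
local all, n_all = string.gsub("aaa", "a", "b")
local first, n_first = string.gsub("aaa", "^a", "b")
print(all, n_all)       --> bbb  3
print(first, n_first)   --> baa  1
```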

Overall, string.gsub() does what replacement usually means. You may wonder why it is not called string.greplace. So do I.

Now that the four pattern-using functions have been covered, let's see what Lua patterns themselves have to offer.

Patterns

First, the special characters, which the official Lua documentation describes clearly: ^ $ ( ) % . [ ] * + - ?

Every character other than these special characters represents itself. Note that this applies when the character appears on its own.

Secondly, Lua defines the following character classes:

.: any character.

%a: any letter.

%c: any control character.

%d: any digit.

%l: any lowercase letter.

%p: any punctuation character.

%s: any whitespace character (such as space and tab).

%u: any uppercase letter.

%w: any letter or digit.

%x: any hexadecimal digit.

%z: the character whose value is 0.

%x (where x is any character that is not a letter or digit): represents the character x itself. This is how the special characters above are escaped, and any punctuation character can be written this way.

[set]: a user-defined set of characters. A range is written with -, such as 1-9 or a-z. The classes listed above may also be used inside a set, but a class cannot be the endpoint of a range: [%a-z] has no meaning.

[^set]: the complement of [set].

In addition, for every class written as % plus a letter, the uppercase letter denotes the complement of the class. For example, %S means any non-whitespace character. The Lua manual also stresses that these definitions depend on the current locale, so the set [a-z] is not necessarily equal to %l.
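A few of these classes in action, as a minimal sketch with made-up inputs (results assume the default C locale):

```lua
local word   = string.match("  hello", "%S+")           -- run of non-space
local digits = string.gsub("a1b2", "%D", "")            -- drop non-digits
local ident  = string.match("x_1 = 5", "[%a_][%w_]*")   -- classes inside a set
local key    = string.match("key: value", "[^:]+")      -- complemented set
print(word, digits, ident, key)   --> hello  12  x_1  key
```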

Any single-character class, including the % plus letter classes, can be followed by one of four symbols: *, +, -, and ?.

*: matches 0 or more repetitions of the preceding class, as many as possible.

+: matches 1 or more repetitions of the preceding class, as many as possible.

-: matches 0 or more repetitions of the preceding class, as few as possible.

?: matches 0 or 1 occurrence of the preceding class.

For example:

    local a = "ammmf"
    print(string.match(a, "%a"))    -- a
    print(string.match(a, "%a*"))   -- ammmf
    print(string.match(a, "%a+"))   -- ammmf
    print(string.match(a, "%a-"))   -- (empty string)
    print(string.match(a, "%a?"))   -- a

After this example you may ask what difference *, +, and ? make compared with no suffix at all. There is one, because whether zero occurrences count as a match can decide whether the whole pattern succeeds. With ?, the match succeeds even if the class matches nothing, and any captures in the pattern then take effect. For example:

    local a = "ammmf"
    print(string.match(a, "()c"))    -- nil
    print(string.match(a, "()c?"))   -- 1

If the behavior of string.match() is unclear, see its section above.
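The difference between * (as many as possible) and - (as few as possible) shows up most clearly when a later part of the pattern can match in more than one place. A short sketch with a made-up string:

```lua
local html = "<b>bold</b>"
local greedy = string.match(html, "<.*>")   -- runs to the last >
local lazy   = string.match(html, "<.->")   -- stops at the first >
print(greedy)   --> <b>bold</b>
print(lazy)     --> <b>
```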

One more special item needs introducing: %b followed by two different characters x and y. It matches a substring that starts with x, ends with y, and in which x and y are balanced, that is, they occur the same number of times. For example, %b() matches properly balanced parentheses. As follows:

    local a = "aaabb"
    print(string.match(a, "%bab"))   -- aabb
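With actual parentheses the behavior is easier to read. In this sketch (the expression string is made up), each top-level balanced group is collected, including its nested content:

```lua
local expr = "f(a(b)c) + g(d)"
local groups = {}
for group in string.gmatch(expr, "%b()") do
  groups[#groups + 1] = group
end
print(table.concat(groups, " "))   --> (a(b)c) (d)
```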

Finally, the character ^ was introduced in the string.gmatch section: at the start of a pattern it anchors the match to the beginning of the subject string. The character $ is its counterpart: at the end of a pattern it anchors the match to the end of the subject string. At any other position, $, like ^, has no special meaning.
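Using both anchors together gives the well-known trim idiom. The sketch below uses a hypothetical helper named trim, and also shows a $ that is not at the end of the pattern being matched as an ordinary character:

```lua
-- Hypothetical helper: strip leading and trailing whitespace.
local function trim(s)
  return (string.match(s, "^%s*(.-)%s*$"))
end

local trimmed = trim("  am+df \t")
local from, to = string.find("a$c", "$c")   -- $ is not last, so it is literal
print("[" .. trimmed .. "]")   --> [am+df]
print(from, to)                --> 2  3
```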

Capture

The meaning of capture was covered in detail in the string.find section; to recap: a capture is a sub-pattern enclosed in parentheses. When a match succeeds, the substring matched by the parenthesized sub-pattern is saved. Up to 32 captures can be saved by default, and the limit can be changed in the Lua source. The order of captures is determined by their left parentheses. For how captures are used, see the four pattern-using functions above.

This article is from the "cainiao surfaced" blog; please keep this source link: http://rangercyh.blog.51cto.com/1444712/1393067
