Capture in Lua

Source: Internet
Author: User
Tags: processing text

Capture

The capture mechanism lets a pattern pick out the parts of the subject string that match parts of the pattern, for later use. To specify a capture, enclose the part of the pattern you want to capture in parentheses.
When string.find is called with a pattern that has captures, it returns the captured values as extra results. A typical use of this facility is to break a string into parts:

pair = "name = Anna"
_, _, key, value = string.find(pair, "(%a+)%s*=%s*(%a+)")
print(key, value)  --> name  Anna

The pattern '%a+' denotes a non-empty sequence of letters; the pattern '%s*' denotes a possibly empty sequence of spaces. So, in the example above, the whole pattern denotes a sequence of letters, followed by optional spaces, followed by '=', again followed by optional spaces, and then another sequence of letters. Both sequences of letters are sub-patterns enclosed in parentheses, so they are captured whenever a match occurs. The find function always returns first the indices where the match happened (stored in the dummy variable _ in the example above) and then the captures made during the match. The following example is similar:

date = "17/7/1990"
_, _, d, m, y = string.find(date, "(%d+)/(%d+)/(%d+)")
print(d, m, y)  --> 17  7  1990

We can also use captures inside the pattern itself. In a pattern, an item like '%d', where d is a single digit from 1 to 9, matches only a copy of the d-th capture.
As a typical use, suppose you want to find, inside a string, a substring enclosed in single or double quotes. You could try a pattern such as '["'].-["']', that is, a quote followed by anything followed by another quote, but you would have problems with strings like "it's all right". To solve this problem, you can capture the first quote and use it to specify the second one:

s = [[then he said: "it's all right"!]]
a, b, c, quotedPart = string.find(s, "([\"'])(.-)%1")
print(quotedPart)  --> it's all right
print(c)           --> "

The first capture is the quote character itself, and the second capture is the contents of the quote (the substring matching '.-').
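To see why the back-reference matters, the following sketch compares the naive pattern with the '%1' version on the same input. The helper name firstquoted is hypothetical and exists only for this illustration.

```lua
-- Sketch: extract the first quoted substring of s.
-- `firstquoted` is an illustrative name, not part of the original text.
local function firstquoted (s)
  -- capture 1 is the opening quote; %1 demands the same character to close
  local _, _, quote, body = string.find(s, "([\"'])(.-)%1")
  return body, quote
end

local s = [[then he said: "it's all right"!]]

-- naive pattern: any quote opens, any quote closes
local _, _, naive = string.find(s, "[\"'](.-)[\"']")
print(naive)            --> it
print(firstquoted(s))   --> it's all right  "
```

The naive pattern stops at the apostrophe in "it's", while the back-reference version keeps going until it meets the same kind of quote that opened the substring.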

The third use of captured values is in the replacement string of gsub. Like the pattern, the replacement string may contain items like '%d', which are replaced by the respective captures when the substitution is made. (Because of this, a '%' in the replacement string must be escaped as "%%".) As an example, the following command duplicates every letter in a string, with a hyphen between the copies:

print(string.gsub("hello Lua!", "(%a)", "%1-%1"))
--> h-he-el-ll-lo-o L-Lu-ua-a!  8

The following code swaps adjacent characters:

print(string.gsub("hello Lua", "(.)(.)", "%2%1"))
--> ehll ouLa  4

As a more useful example, let us write a primitive format converter, which gets a string with commands written in a LaTeX style, such as
\command{some text}
and changes them to a format in XML style:
<command>some text</command>
For this specification, the following line does the job:
s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")
For instance, if s is the string
the \quote{task} is to \em{change} that.
that gsub call will change it to
the <quote>task</quote> is to <em>change</em> that.

Another useful example is how to trim a string, that is, how to remove the spaces at its beginning and end:
function trim (s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
end

Note the judicious use of pattern formats. The two anchors ('^' and '$') ensure that we get the whole string. Because the two '%s*' match all spaces at both extremities, the '.-' matches the rest of the string. Note also that gsub returns two values; we use an extra pair of parentheses to discard the second result (the number of substitutions).
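The effect of the extra parentheses can be checked directly. The sketch below assumes Lua 5.1 or later (it uses the '#' length operator); trim_raw is a hypothetical variant added only to show the difference.

```lua
-- Sketch, assuming Lua 5.1+ (uses the # operator).
local function trim (s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))   -- parentheses keep one result
end

-- Hypothetical variant without the parentheses: both gsub results leak out.
local function trim_raw (s)
  return string.gsub(s, "^%s*(.-)%s*$", "%1")
end

local r1 = { trim("  hello world \n") }
local r2 = { trim_raw("  hello world \n") }
print(#r1, #r2)   --> 1  2
```

The leaked second value matters in calls such as table.insert(t, trim_raw(s)), where an extra argument changes the meaning of the call.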

The last use of captured values is perhaps the most powerful. We can call string.gsub with a function as its third argument, instead of a replacement string. When invoked this way, string.gsub calls the given function every time it finds a match; the arguments to each call are the captures, and the value that the function returns is used as the replacement string. As a first example, the following function does variable expansion: it substitutes the value of the global variable varname for every occurrence of $varname in a string:

function expand (s)
  s = string.gsub(s, "$(%w+)", function (n)
        return _G[n]
      end)
  return s
end

name = "Lua"; status = "great"
print(expand("$name is $status, isn't it?"))

--> Lua is great, isn't it?

If you are not sure whether the given variables have string values, you can apply tostring to their values:

function expand (s)
  return (string.gsub(s, "$(%w+)", function (n)
            return tostring(_G[n])
          end))
end

print(expand("print = $print; a = $a"))

--> print = function: 0x8050ce0; a = nil
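Expanding from the global table is convenient but exposes every global to the text. A minimal sketch of a more contained variant follows; it assumes Lua 5.1 or later, where a replacement function that returns nil leaves the matched text unchanged. The vars parameter is hypothetical and not part of the original example.

```lua
-- Sketch, assuming Lua 5.1+ semantics for nil results in gsub callbacks.
-- `vars` is an illustrative parameter: a table mapping names to values.
local function expand (s, vars)
  return (string.gsub(s, "$(%w+)", function (n)
    local v = vars[n]
    if v ~= nil then
      return tostring(v)
    end
    -- no return value: gsub keeps the original "$name" text
  end))
end

print(expand("$name is $status, $unknown", {name = "Lua", status = "great"}))
--> Lua is great, $unknown
```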

A more advanced example uses loadstring to evaluate whole expressions that we write in the text as a dollar sign followed by a pair of square brackets (loadstring is the Lua 5.0 and 5.1 name; Lua 5.2 and later use load for this job):

s = "sin(3) = $[math.sin(3)]; 2^5 = $[2^5]"
print((string.gsub(s, "$(%b[])", function (x)
  x = "return " .. string.sub(x, 2, -2)
  local f = loadstring(x)
  return f()
end)))

--> sin(3) = 0.1411200080598672; 2^5 = 32

The first match is the string "$[math.sin(3)]", whose capture is "[math.sin(3)]". The call to string.sub removes the brackets from the captured string, so the string loaded for execution is "return math.sin(3)". The same happens for the match "$[2^5]".
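The same idea can be sketched with some protection against malformed expressions. This is an illustration under stated assumptions, not the original code: it assumes Lua 5.1 or later (a nil result keeps the matched text), and the names compile and evalbrackets are hypothetical.

```lua
-- Assumption: use whichever chunk loader the running Lua version provides.
local compile = loadstring or load

-- Sketch: replace every $[expr] by the value of expr; leave it alone on errors.
local function evalbrackets (s)
  return (string.gsub(s, "$(%b[])", function (x)
    local f = compile("return " .. string.sub(x, 2, -2))
    if not f then return nil end          -- syntax error: keep the text
    local ok, value = pcall(f)
    if not ok then return nil end         -- runtime error: keep the text
    return tostring(value)
  end))
end

print(evalbrackets("1 + 2 = $[1 + 2]; max = $[math.max(2, 7)]"))
--> 1 + 2 = 3; max = 7
print(evalbrackets("broken: $[1 +]"))
--> broken: $[1 +]
```

Because the bracketed text is executed as Lua code, a function like this should only be applied to text from a trusted source.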
Often we want that kind of string.gsub call only to iterate over a string, without any interest in the resulting string. For instance, we could collect the words of a string into a table with the following code:

words = {}
string.gsub(s, "(%a+)", function (w)
  table.insert(words, w)
end)

If s were the string "hello hi, again!", after that command the words table would be

{"hello", "hi", "again"}

The string.gfind function offers a simpler way to write that code (in Lua 5.1 and later this function is named string.gmatch):

words = {}
for w in string.gfind(s, "(%a+)") do
  table.insert(words, w)
end

The gfind function fits perfectly with the generic for loop. It returns a function that iterates over all occurrences of a pattern in a string. We can simplify that code a little further: when we call gfind with a pattern without any explicit capture, the function captures the whole match. Therefore, we can rewrite the previous example like this:

words = {}
for w in string.gfind(s, "%a+") do
  table.insert(words, w)
end
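Since the iterator has a different name depending on the Lua version, a small portability sketch may help. The local alias gmatch and the helper words are assumptions made for this illustration.

```lua
-- string.gfind is the Lua 5.0 name; string.gmatch is the name in 5.1 and later.
local gmatch = string.gmatch or string.gfind

-- Sketch: collect all words (letter sequences) of s into a new table.
local function words (s)
  local t = {}
  for w in gmatch(s, "%a+") do
    table.insert(t, w)
  end
  return t
end

local list = words("hello hi, again!")
print(table.concat(list, "|"))   --> hello|hi|again
```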

Our next example uses URL encoding, which is the encoding used by HTTP to send parameters in a URL. This encoding converts special characters (such as '=', '&', and '+') into the form "%XX", where XX is the hexadecimal representation of the character. After that, it changes spaces to '+'. For instance, it encodes the string "a+b = c" as "a%2Bb+%3D+c". Finally, it writes each parameter name and parameter value with an '=' in between and joins all the resulting name=value pairs with an ampersand ('&') in between. For instance, the values
name = "al"; query = "a+b = c"; q = "yes or no"
are encoded as
name=al&query=a%2Bb+%3D+c&q=yes+or+no
Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding name. The following function does the basic decoding:

function unescape (s)
  s = string.gsub(s, "+", " ")
  s = string.gsub(s, "%%(%x%x)", function (h)
        return string.char(tonumber(h, 16))
      end)
  return s
end

The first statement changes each '+' in the string to a space. The second gsub matches all two-digit hexadecimal numerals preceded by '%' and calls an anonymous function, which converts the hexadecimal numeral into a number (tonumber, with base 16) and returns the corresponding character (string.char). For instance,

print(unescape("a%2Bb+%3D+c"))  --> a+b = c

To decode the name=value pairs we use gfind. Because neither names nor values can contain '&' or '=', we can match them with the pattern '[^&=]+':

cgi = {}
function decode (s)
  for name, value in string.gfind(s, "([^&=]+)=([^&=]+)") do
    name = unescape(name)
    value = unescape(value)
    cgi[name] = value
  end
end

The call to gfind matches all pairs in the form name=value and, for each pair, the iterator returns the corresponding captures as the values for name and value. The loop body simply calls unescape on both strings and stores the pair in the cgi table.
The corresponding encoding is also easy to write. First, we write the escape function. It encodes every special character as a '%' followed by the character's ASCII code in hexadecimal (the format option "%02X" produces a two-digit hexadecimal number, padded with 0), and then changes spaces to '+':

function escape (s)
  s = string.gsub(s, "([&=+%c])", function (c)
        return string.format("%%%02X", string.byte(c))
      end)
  s = string.gsub(s, " ", "+")
  return s
end

The encode function traverses the table to be encoded, building the resulting string:

function encode (t)
  local s = ""
  for k, v in pairs(t) do
    s = s .. "&" .. escape(k) .. "=" .. escape(v)
  end
  return string.sub(s, 2)     -- remove first '&'
end
t = {name = "al", query = "a+b = c", q = "yes or no"}
print(encode(t))  --> q=yes+or+no&query=a%2Bb+%3D+c&name=al   (pair order may vary)
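A quick way to gain confidence in an encoder and decoder pair is a round trip. The sketch below is self-contained and makes two labeled departures from the functions above: escape also encodes '%' itself (an assumption added here so that values containing '%' survive the trip), and decode returns a fresh table instead of filling a global one.

```lua
local gmatch = string.gmatch or string.gfind   -- gfind is the Lua 5.0 name

local function escape (s)
  -- assumption: '%' is escaped too, so the round trip is lossless
  s = string.gsub(s, "([&=+%%%c])", function (c)
        return string.format("%%%02X", string.byte(c))
      end)
  return (string.gsub(s, " ", "+"))
end

local function unescape (s)
  s = string.gsub(s, "%+", " ")
  return (string.gsub(s, "%%(%x%x)", function (h)
    return string.char(tonumber(h, 16))
  end))
end

local function encode (t)
  local parts = {}
  for k, v in pairs(t) do
    table.insert(parts, escape(k) .. "=" .. escape(v))
  end
  return table.concat(parts, "&")
end

local function decode (s)
  local t = {}
  for name, value in gmatch(s, "([^&=]+)=([^&=]+)") do
    t[unescape(name)] = unescape(value)
  end
  return t
end

local original = {name = "al", query = "a+b = c", q = "yes or no", rate = "100%"}
local copy = decode(encode(original))
for k, v in pairs(original) do
  print(k, v, copy[k] == v)   -- third column is true for every pair
end
```

Note that the pattern '[^&=]+' requires at least one character, so a pair with an empty value is skipped by this decoder.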

Tricks of the Trade
Pattern matching is a powerful tool for manipulating strings. You can perform many complex operations with only a few calls to string.gsub and string.find. However, as with any power, you must use it carefully.
Pattern matching is not a replacement for a proper parser. For quick-and-dirty programs you can do useful manipulations on source code, but it is hard to build a quality product that way. As a good example, consider the pattern used to match comments in a C program: '/%*.-%*/'. If your program has a string containing "/*", you will get a wrong result:

test = [[char s[] = "a /* here";  /* a tricky string */]]
print(string.gsub(test, "/%*.-%*/", "<COMMENT>"))
--> char s[] = "a <COMMENT>  1

Strings with such contents are rare and, for your own use, that pattern will probably do its job. But you cannot sell a program with such a flaw.
Usually, pattern matching is efficient enough for Lua programs: a Pentium 333MHz machine takes less than a tenth of a second to match all the words (30K words) in a text. But you can take precautions. You should always make the pattern as specific as possible; loose patterns are slower than specific ones. An extreme example is '(.-)%$', which gets all the text in a string up to the first dollar sign. If the subject string has a dollar sign, everything goes fine. But suppose that the string does not contain any dollar sign. The algorithm first tries to match the pattern starting at the first position of the string. It goes through the whole string looking for a dollar sign, and when the string ends the pattern fails for the first position. Then the algorithm repeats the whole search starting at the second position, only to discover that the pattern does not match there either, and so on. This takes quadratic time, which on a long text can mean more than three hours on a Pentium 333MHz. You can correct this problem simply by anchoring the pattern at the first position of the string, with '^(.-)%$'. The anchor tells the algorithm to stop the search if it cannot find a match at the first position. With the anchor, the same search runs in less than a tenth of a second.
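As a small illustration of the anchored form, here is a sketch of a helper that returns the text before the first dollar sign. The function name beforedollar is hypothetical.

```lua
-- Sketch: text before the first '$', or nil when there is no '$'.
-- The '^' anchor makes the search give up after a single attempt at position 1.
local function beforedollar (s)
  local _, _, prefix = string.find(s, "^(.-)%$")
  return prefix
end

print(beforedollar("total: $15"))             -- prints the prefix "total: "
print(beforedollar("no currency sign here"))  --> nil
```

Both the anchored and the unanchored pattern return the same capture when a dollar sign is present; the difference shows only in how much work is done when it is absent.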
Beware also of empty patterns, that is, patterns that match the empty string. For instance, if you try to match names with a pattern like '%a*', you will find names everywhere:

i, j = string.find(";$%  **#$hello13", "%a*")
print(i, j)  --> 1  0

In this example, the call to string.find has correctly found an empty sequence of letters at the beginning of the string. It never makes sense to write a pattern that begins or ends with the modifier '-', because it will match only the empty string: this modifier always needs something around it to anchor its expansion. Similarly, a pattern that includes '.*' is tricky, because this construction can expand much more than you intended.
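The difference between a pattern that may match the empty string and one that may not is easy to see side by side. The subject string below is made up for this sketch.

```lua
local s = "  ;hello13"

-- '%a*' accepts zero letters, so it succeeds immediately at position 1
print(string.find(s, "%a*"))   --> 1  0

-- '%a+' needs at least one letter, so it finds the actual word
print(string.find(s, "%a+"))   --> 4  8
```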
Sometimes it is useful to use Lua itself to build a pattern. As an example, let us see how to find long lines in a text, say lines with 70 or more characters. A long line is a sequence of 70 or more characters different from newline. We can match a single character different from newline with the character class '[^\n]'. Therefore, we can match a long line with a pattern that repeats the pattern for one character 70 times, followed by zero or more of those characters. Instead of writing this pattern by hand, we can create it with string.rep:

pattern = string.rep("[^\n]", 70) .. "[^\n]*"
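A short sketch of how such a generated pattern might be used follows. The helper haslongline and the sample texts are assumptions made for this illustration.

```lua
-- Pattern built by Lua itself: 70 non-newline characters, then any more.
local longline = string.rep("[^\n]", 70) .. "[^\n]*"

-- Sketch: true when some line of text has 70 or more characters.
local function haslongline (text)
  return string.find(text, longline) ~= nil
end

local short = string.rep("x", 69) .. "\n" .. string.rep("y", 10)
local long  = "short line\n" .. string.rep("z", 70)
print(haslongline(short), haslongline(long))   --> false  true
```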

Another example is how to make a case-insensitive search. One way to do that is to change any letter x in the pattern into the class '[xX]', that is, a class containing both the lower-case and the upper-case versions of the original letter. We can automate that conversion with a function:

function nocase (s)
  s = string.gsub(s, "%a", function (c)
        return string.format("[%s%s]", string.lower(c),
                                       string.upper(c))
      end)
  return s
end

print(nocase("Hi there!"))
--> [hH][iI] [tT][hH][eE][rR][eE]!

Sometimes you want to change every plain occurrence of s1 to s2, without regarding any character as magic. If the strings s1 and s2 are literals, you can add the proper escapes to magic characters while you write them. But if these strings are variable values, you can use gsub to do the escaping for you:

s1 = string.gsub(s1, "(%W)", "%%%1")
s2 = string.gsub(s2, "%%", "%%%%")
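The two escaping calls can be wrapped into a small helper that performs a literal search and replace. The name plaingsub is hypothetical; the body is only a sketch built from the same two gsub calls.

```lua
-- Sketch: replace every literal occurrence of s1 in s by the literal text s2.
local function plaingsub (s, s1, s2)
  s1 = string.gsub(s1, "(%W)", "%%%1")   -- escape non-alphanumerics in the pattern
  s2 = string.gsub(s2, "%%", "%%%%")     -- escape '%' in the replacement
  return (string.gsub(s, s1, s2))
end

print(plaingsub("cost: 5.0 (approx)", "5.0", "100%"))   --> cost: 100% (approx)
print(plaingsub("a+b+c", "+", "-"))                     --> a-b-c
```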

In the search string, we escape all non-alphanumeric characters. In the replacement string, we escape only the '%'.
Another useful technique for pattern matching is to pre-process the subject string before the real work. A simple example of pre-processing is to change to upper case all quoted strings in a text, where a quoted string starts and ends with a double quote ('"') but may contain escaped quotes ('\"'):
follows a typical string: "This is \"great\"!".
Our approach to handling such cases is to pre-process the text so as to encode the problematic sequence as something else. For instance, we could code '\"' as '\1'. However, if the original text already contains a '\1', we are in trouble. An easy way to do the encoding and avoid this problem is to code all sequences '\x' as '\ddd', where ddd is the decimal representation of the character x:

function code (s)
  return (string.gsub(s, "\\(.)", function (x)
            return string.format("\\%03d", string.byte(x))
          end))
end

Now any sequence '\ddd' in the encoded string must have come from the coding, because any '\ddd' in the original string has been coded too. So the decoding is an easy task:

function decode (s)
  return (string.gsub(s, "\\(%d%d%d)", function (d)
            return "\\" .. string.char(d)
          end))
end

Now we can complete our task. As the encoded string does not contain any escaped quote ('\"'), we can search for quoted strings simply with '".-"':

s = [[follows a typical string: "This is \"great\"!".]]
s = code(s)
s = string.gsub(s, '(".-")', string.upper)
s = decode(s)
print(s)
--> follows a typical string: "THIS IS \"GREAT\"!".

Or, in a more compact notation:

print(decode(string.gsub(code(s), '(".-")', string.upper)))
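The whole encode, transform, decode sequence can be packaged as one function. The sketch below repeats code and decode so that it is self-contained; upperquoted is a hypothetical name, and tonumber is added in decode only to make the conversion explicit.

```lua
local function code (s)
  return (string.gsub(s, "\\(.)", function (x)
    return string.format("\\%03d", string.byte(x))
  end))
end

local function decode (s)
  return (string.gsub(s, "\\(%d%d%d)", function (d)
    return "\\" .. string.char(tonumber(d))
  end))
end

-- Sketch: upper-case every double-quoted string, honoring \" inside it.
local function upperquoted (s)
  s = code(s)                                   -- hide all escaped characters
  s = string.gsub(s, '(".-")', string.upper)    -- now ".-" is safe
  return decode(s)                              -- restore the escapes
end

print(upperquoted([[say "hi \"bob\"" and \n stays]]))
--> say "HI \"BOB\"" and \n stays
```

Because escaped characters are hidden as digits during the transformation, string.upper cannot alter them, even when they fall inside a quoted string.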

As a more complex task, let us return to our example of a primitive format converter, which changes format commands written as \command{string} to XML style:
<command>string</command>
But now our original format is more powerful and uses the backslash character as a general escape, so that we can represent the characters '\', '{', and '}' by writing "\\", "\{", and "\}". To keep our pattern matching from mixing up commands and escaped characters, we should recode these sequences in the original string. However, this time we cannot code all sequences \x, because that would code our commands (written as \command) too. Instead, we code \x only when x is not a letter:

function code (s)
  return (string.gsub(s, '\\(%A)', function (x)
            return string.format("\\%03d", string.byte(x))
          end))
end

The decode is like that of the previous example, but it does not include the backslashes in the final string; therefore, we can call string.char directly:

function decode (s)
  return (string.gsub(s, '\\(%d%d%d)', string.char))
end

s = [[a \emph{command} is written as \\command\{text\}.]]
s = code(s)
s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")

print(decode(s))
--> a <emph>command</emph> is written as \command{text}.

Our last example here deals with Comma-Separated Values (CSV), a text format supported by many programs, such as Microsoft Excel, to represent tabular data. A CSV file represents a list of records, where each record is a list of string values written in a single line, with commas between the values. Values that contain commas must be written between double quotes; if such values also contain quotes, each quote is written as two quotes. As an example, the array

{'a b', 'a,b', 'a,"b"c', 'hello "world"!', ''}
can be represented as
a b,"a,b","a,""b""c",hello "world"!,
To transform an array of strings into CSV is easy. All we have to do is concatenate the strings with commas between them:

function toCSV (t)
  local s = ""
  for _, p in ipairs(t) do      -- ipairs keeps the fields in array order
    s = s .. "," .. escapeCSV(p)
  end
  return string.sub(s, 2)      -- remove first comma
end

If a string has commas or quotes inside, we enclose it between quotes and escape its original quotes:

function escapeCSV (s)
  if string.find(s, '[,"]') then
    s = '"' .. string.gsub(s, '"', '""') .. '"'
  end
  return s
end

To break a CSV line into an array is more difficult, because we must avoid mixing up the commas written between quotes with the commas that separate fields. We could try to escape the commas between quotes. However, not all quote characters act as quotes: only quote characters after a comma start a quoted field, and only commas that are not between quotes are real separators. There are other subtleties too. For instance, two quotes may represent a single quote, may represent two quotes, or may represent nothing:
"hello "" hello", "",""
The first field in this example is the string 'hello " hello', the second field is the string ' ""' (that is, a space followed by two quotes), and the last field is an empty string.
We could try multiple gsub calls to handle all those cases, but it is easier to program this task with a more conventional approach, using an explicit loop over the fields. The main task of the loop body is to find the next comma; it also stores the field contents in a table. For each field, we explicitly test whether the field starts with a quote. If it does, we loop looking for the closing quote. In this loop, we use the pattern '"("?)' to find the closing quote of a field: if a quote is followed by another quote, the second quote is captured and assigned to the variable c, meaning that this is not the closing quote yet.

function fromCSV (s)
  s = s .. ','        -- ending comma
  local t = {}        -- table to collect fields
  local fieldstart = 1
  repeat
    -- next field is quoted? (starts with '"'?)
    if string.find(s, '^"', fieldstart) then
      local a, c
      local i = fieldstart
      repeat
        -- find closing quote
        a, i, c = string.find(s, '"("?)', i+1)
      until c ~= '"'    -- quote not followed by quote?
      if not i then error('unmatched "') end
      local f = string.sub(s, fieldstart+1, i-1)
      table.insert(t, (string.gsub(f, '""', '"')))
      fieldstart = string.find(s, ',', i) + 1
    else                -- unquoted; find next comma
      local nexti = string.find(s, ',', fieldstart)
      table.insert(t, string.sub(s, fieldstart,
                                    nexti-1))
      fieldstart = nexti + 1
    end
  until fieldstart > string.len(s)
  return t
end

t = fromCSV('"hello "" hello", "",""')
for i, s in ipairs(t) do print(i, s) end
--> 1  hello " hello
--> 2   ""
--> 3
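The writer and the reader can be checked against each other with a round trip. The following sketch is self-contained: it repeats the parsing loop with local functions and, as an assumption made for brevity, builds the line with table.concat instead of repeated concatenation.

```lua
local function escapeCSV (s)
  if string.find(s, '[,"]') then
    s = '"' .. string.gsub(s, '"', '""') .. '"'
  end
  return s
end

local function toCSV (t)
  local parts = {}
  for _, p in ipairs(t) do
    table.insert(parts, escapeCSV(p))
  end
  return table.concat(parts, ",")
end

local function fromCSV (s)
  s = s .. ','                      -- ending comma
  local t = {}
  local fieldstart = 1
  repeat
    if string.find(s, '^"', fieldstart) then      -- quoted field
      local a, c
      local i = fieldstart
      repeat
        a, i, c = string.find(s, '"("?)', i + 1)  -- look for closing quote
      until c ~= '"'
      if not i then error('unmatched "') end
      local f = string.sub(s, fieldstart + 1, i - 1)
      table.insert(t, (string.gsub(f, '""', '"')))
      fieldstart = string.find(s, ',', i) + 1
    else                                          -- unquoted field
      local nexti = string.find(s, ',', fieldstart)
      table.insert(t, string.sub(s, fieldstart, nexti - 1))
      fieldstart = nexti + 1
    end
  until fieldstart > string.len(s)
  return t
end

local fields = {'a b', 'a,b', 'a,"b"c', 'hello "world"!', ''}
local line = toCSV(fields)
local back = fromCSV(line)
print(line)   --> a b,"a,b","a,""b""c","hello ""world""!",
for i, v in ipairs(back) do print(i, v == fields[i]) end
```

Note that this writer quotes every field that contains a quote character, so its output for the fourth field differs from the hand-written line shown earlier while still parsing back to the same value.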
