The library used for regular expression processing
regexp
The regexp functions most commonly used to build a pattern are MustCompile and Compile
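MustCompile does the same job as Compile: it parses a regular expression and returns a Regexp object. The accepted syntax is RE2, which is largely the general syntax used by Perl. The resulting object uses leftmost-first matching: the match begins as early as possible in the input and, among the matches that start there, it is the one a backtracking search would find first. The two functions differ only in how they report an invalid expression str: Compile returns an error value, whereas MustCompile panics.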
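The difference in error handling can be checked with a small sketch. The helper names tryCompile and mustCompilePanics are made up for this illustration and are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// tryCompile (hypothetical helper) reports whether Compile accepts the pattern.
func tryCompile(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

// mustCompilePanics (hypothetical helper) reports whether MustCompile panics
// for the pattern, using recover to catch the panic.
func mustCompilePanics(pattern string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	regexp.MustCompile(pattern)
	return false
}

func main() {
	fmt.Println(tryCompile(`a(b`))         // false: missing closing parenthesis
	fmt.Println(mustCompilePanics(`a(b`))  // true
	fmt.Println(mustCompilePanics(`a(b)`)) // false
}
```

MustCompile is usually reserved for patterns fixed at compile time (for example package-level variables), where a bad pattern is a programming error; Compile suits patterns that come from user input.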
FindString
Returns the text of the leftmost match of the regular expression in the input string. If nothing matches, it returns an empty string.
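A minimal sketch of this behavior, using an illustrative pattern and input strings that are not from the examples that follow:

```go
package main

import (
	"fmt"
	"regexp"
)

// digits matches one or more ASCII digits.
var digits = regexp.MustCompile(`[0-9]+`)

func main() {
	fmt.Printf("%q\n", digits.FindString("tel 0635-3262315")) // "0635"
	fmt.Printf("%q\n", digits.FindString("no numbers here"))  // ""
	// An empty result can also mean "matched the empty string", so
	// MatchString is the unambiguous way to test for a match.
	fmt.Println(digits.MatchString("no numbers here")) // false
}
```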
Example 1 Extracting the company name, address and telephone number from the following text
<ul class="t3">
<li class="company">山东东阿阿胶股份有限公司</li>
<li class="address">山东省东阿县阿胶街78号</li>
<li class="telephone">0635-3262315</li>
</ul>
Regular expressions
<li class="company">(.+)</li>
<li class="address">(.+)</li>
<li class="telephone">(.+)</li>
Code implementation
package main

import (
    "fmt"
    "regexp"
    "strings"
)

var (
    company   = regexp.MustCompile(`<li class="company">(.+)</li>`)
    address   = regexp.MustCompile(`<li class="address">(.+)</li>`)
    telephone = regexp.MustCompile(`<li class="telephone">(.+)</li>`)
)

// Each <li> sits on its own line, so the greedy (.+) cannot run past its own </li>.
var respBody = `<ul class="t3">
<li class="company">山东东阿阿胶股份有限公司</li>
<li class="address">山东省东阿县阿胶街78号</li>
<li class="telephone">0635-3262315</li>
</ul>`

func main() {
    // strings.Trim treats its second argument as a set of characters, not as a
    // prefix, so TrimPrefix and TrimSuffix are used to strip the surrounding tags.
    companyMatches := company.FindString(respBody)
    companyRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(companyMatches, `<li class="company">`), `</li>`))
    addressMatches := address.FindString(respBody)
    addressRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(addressMatches, `<li class="address">`), `</li>`))
    telephoneMatches := telephone.FindString(respBody)
    telephoneRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(telephoneMatches, `<li class="telephone">`), `</li>`))
    fmt.Println(companyRst, addressRst, telephoneRst)
}
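Explanation:
The regular expression `<li class="company">(.+)</li>` matches text that starts with <li class="company"> and ends with </li>, with one or more characters other than a newline in between. '.' matches any single character except the newline \n (in Go it does match \r). To match any character including '\n', set the s flag with (?s) or use a pattern such as "(.|\n)". '+' matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.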
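Since the pattern already contains a capture group, the inner text can also be read directly with FindStringSubmatch instead of trimming the tags afterwards. The sketch below is one possible variant, not the original code: the helper name extractCompany is hypothetical, and the pattern uses the non-greedy `(.+?)` so that it also works when all the `<li>` elements sit on a single line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// companyRe captures the text between the opening and closing tag.
// (.+?) is non-greedy, so it stops at the first </li> it meets.
var companyRe = regexp.MustCompile(`<li class="company">(.+?)</li>`)

// extractCompany (hypothetical helper) returns the captured company name,
// or an empty string when the pattern does not match.
func extractCompany(html string) string {
	m := companyRe.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(m[1]) // m[0] is the whole match, m[1] the first group
}

func main() {
	html := `<ul class="t3"> <li class="company">山东东阿阿胶股份有限公司</li> <li class="address">山东省东阿县阿胶街78号</li> </ul>`
	fmt.Println(extractCompany(html)) // 山东东阿阿胶股份有限公司
}
```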
Example 2 Converting all HTML tags to lowercase
re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
or, with a raw string literal:
re, _ = regexp.Compile(`\<[\S\s]+?\>`)
respBody = re.ReplaceAllStringFunc(respBody, strings.ToLower)
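Explanation:
The pattern matches text that starts with '<' and ends with '>', with one or more arbitrary characters in between, and it matches as few characters as possible because the trailing ? makes the quantifier non-greedy.
For example, the regular expression `\<[\S\s]+?\>` applied to the string '<test1> nice<test2>' returns '<test1>', whereas `\<[\S\s]+\>` applied to the same string returns '<test1> nice<test2>'.
? When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. A non-greedy quantifier matches as little of the searched string as possible, while the default greedy quantifier matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all of the 'o' characters.
\s matches any whitespace character, including space, tab and form feed. In Go it is equivalent to [\t\n\f\r ].
\S matches any non-whitespace character. In Go it is equivalent to [^\t\n\f\r ].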
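The greedy and non-greedy behavior described above can be reproduced with the following sketch. The variable names and the helper lowerTags are chosen for this illustration, and the backslashes before '<' and '>' are left out because those characters need no escaping in Go's regexp syntax.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	lazyTag   = regexp.MustCompile(`<[\S\s]+?>`) // stops at the first '>'
	greedyTag = regexp.MustCompile(`<[\S\s]+>`)  // runs to the last '>'
)

// lowerTags (hypothetical helper) lowercases everything inside angle
// brackets, including attribute values, and leaves other text untouched.
func lowerTags(s string) string {
	return lazyTag.ReplaceAllStringFunc(s, strings.ToLower)
}

func main() {
	s := "<test1> nice<test2>"
	fmt.Println(lazyTag.FindString(s))   // <test1>
	fmt.Println(greedyTag.FindString(s)) // <test1> nice<test2>

	fmt.Println(lowerTags(`<DIV Class="A">Text</DIV>`)) // <div class="a">Text</div>
}
```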
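Example 3 Removing style blocks
re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
or, with a raw string literal:
re, _ = regexp.Compile(`\<style[\S\s]+?\</style\>`)
respBody = re.ReplaceAllString(respBody, "")
Explanation:
Same as Example 2. The non-greedy +? stops at the first </style>, so each style block is removed separately.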
Example 4 Removing script blocks
re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
or, with a raw string literal:
re, _ = regexp.Compile(`\<script[\S\s]+?\</script\>`)
respBody = re.ReplaceAllString(respBody, "")
Explanation:
Same as Example 3.
Example 5 Replacing every HTML tag (everything inside angle brackets) with a line break
re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
respBody = re.ReplaceAllString(respBody, "\n")
Explanation:
Same as Example 3.
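Example 6 Collapsing consecutive line breaks
re, _ = regexp.Compile("\\s{1,}")
respBody = re.ReplaceAllString(respBody, "\n")
Explanation:
{n,} n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all of the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'. Because \s matches any whitespace character, runs of spaces and tabs are replaced by a single line break as well, not only runs of line breaks.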
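Examples 2 to 6 are typically applied one after another to turn a page into plain lines of text. The sketch below chains them under a few assumptions: the helper name htmlToLines and the sample page are invented for illustration, the final strings.TrimSpace call is an addition, and MustCompile replaces the `re, _ = regexp.Compile(...)` form so that a bad pattern cannot be silently ignored.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	tagRe    = regexp.MustCompile(`<[\S\s]+?>`)
	styleRe  = regexp.MustCompile(`<style[\S\s]+?</style>`)
	scriptRe = regexp.MustCompile(`<script[\S\s]+?</script>`)
	spaceRe  = regexp.MustCompile(`\s{1,}`)
)

// htmlToLines (hypothetical helper) applies Examples 2 to 6 in order.
func htmlToLines(body string) string {
	body = tagRe.ReplaceAllStringFunc(body, strings.ToLower) // Example 2
	body = styleRe.ReplaceAllString(body, "")                // Example 3
	body = scriptRe.ReplaceAllString(body, "")               // Example 4
	body = tagRe.ReplaceAllString(body, "\n")                // Example 5
	body = spaceRe.ReplaceAllString(body, "\n")              // Example 6
	return strings.TrimSpace(body) // addition: drop leading/trailing breaks
}

func main() {
	page := "<HTML><STYLE>p { color: red }</STYLE><P>Hello</P>\n\n" +
		"<SCRIPT>var x = 1;</SCRIPT><P>World</P></HTML>"
	fmt.Printf("%q\n", htmlToLines(page)) // "Hello\nWorld"
}
```

The lowercase step runs first so that the style and script patterns, which are written in lowercase, also catch uppercase tags.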
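Example 7 Extracting the last page number (15) from the link below
<a target='_self' href='/search/不孕症-p15/' class='last'>尾页</a>
Code:
var (
    allPage       = regexp.MustCompile(`<a target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p[0-9]+/'[ ]*class='last'>尾页</a>`)
    allPagePrefix = regexp.MustCompile(`<a target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p`)
    allPageSuffix = regexp.MustCompile(`/'[ ]*class='last'>\x{5c3e}\x{9875}</a>`)
)
// Find the whole link, then strip everything before and after the number.
numPage := allPage.FindString(body)
numPage = allPagePrefix.ReplaceAllString(numPage, "")
numPage = allPageSuffix.ReplaceAllString(numPage, "")
Match details:
Regular expression `<a target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p[0-9]+/'[ ]*class='last'>尾页</a>`:
It matches text that starts with "<a target='_self' href='/search/" and ends with class='last'>尾页</a>. In between, the character class [.\x{4e00}-\x{9fa5}0-9] matches one or more literal dots, Chinese characters or digits (inside square brackets '.' is a literal dot, not a wildcard), followed by -p and the page number. The page number is written as [0-9]+ so that numbers with more than one digit, such as 15, are matched. \x{5c3e}\x{9875} in the suffix pattern is the text 尾页 ("last page") written as code points.
Code for a simple crawler project that uses these techniques: https://github.com/KenmyZhang/medicine-manual-spider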