Go Regular Expressions: A Practical Tutorial

Source: Internet
Author: User

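Library used for regular expression processing
regexp
The regexp functions used to build a regular expression are MustCompile and Compile
MustCompile does the same job as Compile: it compiles a regular expression into a Regexp object. The syntax is RE2, which is largely Perl-compatible. The resulting object uses "leftmost-first" matching: the match starts as early as possible in the input, and among the candidates at that position it picks the one a backtracking search would find first. The two functions differ only in how they report an invalid expression: Compile returns an error value, while MustCompile panics.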
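The difference between the two functions can be seen with a deliberately invalid pattern. This is a minimal sketch; the helper names compileOK and mustCompilePanics are made up for illustration and are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileOK reports whether pattern is a valid regular expression.
// Compile returns an error for an invalid pattern instead of panicking.
func compileOK(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

// mustCompilePanics reports whether MustCompile panics on pattern.
// The deferred recover turns the panic into a boolean result.
func mustCompilePanics(pattern string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	_ = regexp.MustCompile(pattern)
	return false
}

func main() {
	bad := `a(b` // unbalanced parenthesis, so the pattern is invalid
	fmt.Println(compileOK(bad))
	fmt.Println(mustCompilePanics(bad))
	fmt.Println(compileOK(`a(b)`))
}
```

MustCompile is usually reserved for patterns written as constants in the source, where a typo should fail loudly at start-up. Compile suits patterns that arrive at run time, such as user input.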
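FindString
Returns the text of the leftmost match of the regular expression in the input string. If there is no match, it returns an empty string.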
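A small sketch of FindString in both situations, a match and no match. The helper firstNumber and the sample strings are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// digits matches one or more ASCII digits.
var digits = regexp.MustCompile(`[0-9]+`)

// firstNumber returns the leftmost run of digits in s,
// or the empty string when s contains no digits.
func firstNumber(s string) string {
	return digits.FindString(s)
}

func main() {
	fmt.Printf("%q\n", firstNumber("tel: 0635-3262315"))
	fmt.Printf("%q\n", firstNumber("no digits here"))
}
```

Because an empty result can also come from a pattern that legitimately matches the empty string, code that must tell the two cases apart can use FindStringIndex, which returns nil when nothing matches.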
Example 1: Extract the company name, address, and telephone number from the following text
    <ul class="t3">
        <li class="company">山东东阿阿胶股份有限公司</li>
        <li class="address">山东省东阿县阿胶街78号</li>
        <li class="telephone">0635-3262315</li>
    </ul>

Regular expressions

    <li class="company">(.+)</li>
    <li class="address">(.+)</li>
    <li class="telephone">(.+)</li>

Code implementation

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    var (
        company   = regexp.MustCompile(`<li class="company">(.+)</li>`)
        address   = regexp.MustCompile(`<li class="address">(.+)</li>`)
        telephone = regexp.MustCompile(`<li class="telephone">(.+)</li>`)
    )

    // respBody is the sample text from the start of this example.
    var respBody = `<ul class="t3">
        <li class="company">山东东阿阿胶股份有限公司</li>
        <li class="address">山东省东阿县阿胶街78号</li>
        <li class="telephone">0635-3262315</li>
    </ul>`

    func main() {
        // FindString returns the whole matched element, tags included, so the
        // tags are stripped with TrimPrefix and TrimSuffix. strings.Trim is
        // avoided here: it treats its second argument as a set of characters
        // and could also remove characters that belong to the value.
        companyMatches := company.FindString(respBody)
        companyRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(companyMatches, `<li class="company">`), "</li>"))

        addressMatches := address.FindString(respBody)
        addressRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(addressMatches, `<li class="address">`), "</li>"))

        telephoneMatches := telephone.FindString(respBody)
        telephoneRst := strings.TrimSpace(strings.TrimSuffix(strings.TrimPrefix(telephoneMatches, `<li class="telephone">`), "</li>"))

        fmt.Println(companyRst)
        fmt.Println(addressRst)
        fmt.Println(telephoneRst)
    }

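Explanation:

The regular expression `<li class="company">(.+)</li>` matches text that starts with <li class="company"> and ends with </li>, with one or more characters other than a newline in between.

'.' matches any single character except the newline \n. To match any character including '\n', use a pattern such as "(.|\n)", or set the s flag as in "(?s).+".

'+' matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.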
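Two consequences of this explanation are worth seeing in code. First, the parentheses form a capture group, so FindStringSubmatch can return the inner text directly. Second, '+' is greedy, so when several <li> elements share one line, (.+) runs on to the last </li> on that line. The sketch below uses a hypothetical helper named capture and made-up sample strings.

```go
package main

import (
	"fmt"
	"regexp"
)

// capture returns the text of the first capture group of pattern in s,
// or "" when there is no match or the pattern has no capture group.
func capture(pattern, s string) string {
	m := regexp.MustCompile(pattern).FindStringSubmatch(s)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	multiLine := "<li class=\"company\">ACME</li>\n<li class=\"address\">Main St</li>"
	oneLine := `<li class="company">ACME</li><li class="address">Main St</li>`

	// '.' stops at the newline, so the greedy group ends on the first line.
	fmt.Printf("%q\n", capture(`<li class="company">(.+)</li>`, multiLine))

	// On a single line the greedy group runs to the last </li>.
	fmt.Printf("%q\n", capture(`<li class="company">(.+)</li>`, oneLine))

	// The non-greedy form stops at the first </li>.
	fmt.Printf("%q\n", capture(`<li class="company">(.+?)</li>`, oneLine))
}
```

The patterns in Example 1 therefore rely on each <li> element sitting on its own line; for input of unknown layout, the non-greedy (.+?) is the safer choice.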
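Example 2: Convert all HTML tags to lowercase
    // Interpreted string literal: every backslash must be doubled.
    re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
    // Or, equivalently, a raw string literal:
    re, _ = regexp.Compile(`\<[\S\s]+?\>`)
    respBody = re.ReplaceAllStringFunc(respBody, strings.ToLower)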

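Explanation:

    The pattern matches text that starts with '<' and ends with '>', with one or more arbitrary characters in between. It consumes as few characters as possible, because the trailing ? makes the quantifier non-greedy. The backslashes before '<' and '>' are optional, since neither character is special in Go regular expressions.

    For example:
        `\<[\S\s]+?\>` applied to '<test1> nice<test2>' returns '<test1>'
        `\<[\S\s]+\>` applied to '<test1> nice<test2>' returns '<test1> nice<test2>'

?   When this character immediately follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy match consumes as little of the searched string as possible, while the default greedy match consumes as much as possible. For example, on the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the 'o's.

\s  Matches any whitespace character, including space, tab, and form feed. In Go it is equivalent to [\t\n\f\r ]; unlike in some other engines, the vertical tab \v is not included.

\S  Matches any non-whitespace character. In Go it is equivalent to [^\t\n\f\r ].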
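The greedy and non-greedy results described above can be checked with a short program. This is a sketch: the helpers first and lowerTags are illustrative names, and the patterns are written without the optional backslashes.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// first returns the leftmost match of pattern in s ("" if none).
func first(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// tagRe matches one HTML tag, as few characters as possible.
var tagRe = regexp.MustCompile(`<[\S\s]+?>`)

// lowerTags lowercases every tag in html and leaves the text between tags
// alone. The whole tag is lowercased, attribute values included.
func lowerTags(html string) string {
	return tagRe.ReplaceAllStringFunc(html, strings.ToLower)
}

func main() {
	s := "<test1> nice<test2>"
	fmt.Printf("%q\n", first(`<[\S\s]+?>`, s)) // non-greedy
	fmt.Printf("%q\n", first(`<[\S\s]+>`, s))  // greedy
	fmt.Printf("%q\n", first(`o+?`, "oooo"))
	fmt.Printf("%q\n", first(`o+`, "oooo"))
	fmt.Printf("%q\n", lowerTags("<DIV>Hello World</DIV>"))
}
```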
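Example 3: Remove style blocks
    // Interpreted string literal: every backslash must be doubled.
    re, _ = regexp.Compile("\\<style[\\S\\s]+?\\</style\\>")
    // Or, equivalently, a raw string literal:
    re, _ = regexp.Compile(`\<style[\S\s]+?\</style\>`)
    respBody = re.ReplaceAllString(respBody, "")

Explanation:
Same as Example 2. The non-greedy [\S\s]+? stops at the first </style>, so each style block is removed separately.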

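Example 4: Remove script blocks
    // Interpreted string literal: every backslash must be doubled.
    re, _ = regexp.Compile("\\<script[\\S\\s]+?\\</script\\>")
    // Or, equivalently, a raw string literal:
    re, _ = regexp.Compile(`\<script[\S\s]+?\</script\>`)
    respBody = re.ReplaceAllString(respBody, "")

Explanation:
Same as Example 3.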
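Examples 3 and 4 can be combined into one helper. The following is a sketch under the assumption that the tags are already lowercase (for instance after Example 2); the function name stripStyleAndScript and the sample input are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Non-greedy, so each block ends at its own closing tag.
	styleRe  = regexp.MustCompile(`<style[\S\s]+?</style>`)
	scriptRe = regexp.MustCompile(`<script[\S\s]+?</script>`)
)

// stripStyleAndScript removes every <style>...</style> and
// <script>...</script> block from html. Matching is case-sensitive.
func stripStyleAndScript(html string) string {
	html = styleRe.ReplaceAllString(html, "")
	return scriptRe.ReplaceAllString(html, "")
}

func main() {
	in := `<p>a</p><style>p{color:red}</style><script>var x = 1 < 2;</script><p>b</p>`
	fmt.Println(stripStyleAndScript(in))
}
```

Regular expressions are adequate for this kind of rough clean-up of scraped pages. For input that must be handled exactly, an HTML parser is the more robust tool.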

Example 5: Replace every HTML tag (everything inside angle brackets) with a line break
    re, _ = regexp.Compile("\\<[\\S\\s]+?\\>")
    respBody = re.ReplaceAllString(respBody, "\n")

Explanation:
Same as Example 3, except that each match is replaced with "\n" instead of being deleted.
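A minimal sketch of this replacement on a small input, to make the resulting line breaks visible. The function name tagsToNewlines and the sample markup are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe matches a single tag: '<', one or more characters, then the
// nearest '>'.
var tagRe = regexp.MustCompile(`<[\S\s]+?>`)

// tagsToNewlines replaces each tag in html with one newline.
func tagsToNewlines(html string) string {
	return tagRe.ReplaceAllString(html, "\n")
}

func main() {
	out := tagsToNewlines("<p>Hello</p><p>World</p>")
	fmt.Printf("%q\n", out)
}
```

Adjacent tags produce runs of consecutive newlines, which is why a clean-up step such as the one in Example 6 normally follows.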

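Example 6: Collapse consecutive line breaks
    re, _ = regexp.Compile("\\s{1,}")
    respBody = re.ReplaceAllString(respBody, "\n")

Explanation:

Note that \s{1,} matches any run of whitespace, not only line breaks, so spaces and tabs between words are replaced with "\n" as well.

{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'.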
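The {n,} examples and the whitespace replacement can be verified as follows. This is a sketch: the helper names are made up, and collapseNewlines with the pattern \n{2,} is an alternative introduced here (not taken from the original code) for the case where spaces inside a line should be kept.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anySpaceRe = regexp.MustCompile(`\s{1,}`) // same pattern as in Example 6
	newlinesRe = regexp.MustCompile(`\n{2,}`) // assumption: newlines only
)

// first returns the leftmost match of pattern in s ("" if none).
func first(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// collapseWhitespace replaces every run of whitespace with one newline.
func collapseWhitespace(s string) string {
	return anySpaceRe.ReplaceAllString(s, "\n")
}

// collapseNewlines replaces runs of two or more newlines with one newline
// and leaves spaces and tabs untouched.
func collapseNewlines(s string) string {
	return newlinesRe.ReplaceAllString(s, "\n")
}

func main() {
	fmt.Printf("%q\n", first(`o{2,}`, "Bob"))
	fmt.Printf("%q\n", first(`o{2,}`, "foooood"))
	fmt.Printf("%q\n", collapseWhitespace("a \n\n  b\tc"))
	fmt.Printf("%q\n", collapseNewlines("a\n\n\nb c"))
}
```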
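Example 7: Find the last page number, 15, in the following link
<a  target='_self' href='/search/不孕症-p15/' class='last'>尾页</a>

Code:

    var (
        // [0-9]+ accepts page numbers with more than one digit, such as 15.
        allPage       = regexp.MustCompile(`<a  target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p[0-9]+/'[ ]*class='last'>尾页</a>`)
        allPagePrefix = regexp.MustCompile(`<a  target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p`)
        allPageSuffix = regexp.MustCompile(`/'[ ]*class='last'>\x{5c3e}\x{9875}</a>`)
    )

    // Find the whole link, then strip everything before and after the number.
    numPage := allPage.FindString(body)
    numPage = allPagePrefix.ReplaceAllString(numPage, "")
    numPage = allPageSuffix.ReplaceAllString(numPage, "")

Match details:
Regular expression `<a  target='_self' href='/search/[.\x{4e00}-\x{9fa5}0-9]+-p[0-9]+/'[ ]*class='last'>尾页</a>`:

    It matches text that starts with "<a  target='_self' href='/search/" and ends with "class='last'>尾页</a>". In between, the character class [.\x{4e00}-\x{9fa5}0-9] matches one or more literal dots, Chinese characters (U+4E00 to U+9FA5), or digits, followed by "-p" and the digits of the page number. Inside a character class '.' is a literal dot, not "any character". \x{5c3e}\x{9875} is the code point form of 尾页 ("last page").

    Code for a simple crawler project that puts these techniques into practice: https://github.com/KenmyZhang/medicine-manual-spider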
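The same page number can also be read with a single pattern and a capture group, instead of one match followed by two replacements. This is an alternative sketch, not the approach used in the code above: the name lastPage, the looser [^']* for the search term, and \s+ between attributes are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lastPageRe captures the digits after "-p" in the href of the link
// whose class is 'last'.
var lastPageRe = regexp.MustCompile(
	`<a\s+target='_self'\s+href='/search/[^']*-p([0-9]+)/'\s*class='last'>`)

// lastPage returns the page number of the "last page" link in body.
// The boolean is false when no such link is found.
func lastPage(body string) (int, bool) {
	m := lastPageRe.FindStringSubmatch(body)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	body := `<a  target='_self' href='/search/不孕症-p15/' class='last'>尾页</a>`
	n, ok := lastPage(body)
	fmt.Println(n, ok)
}
```

FindStringSubmatch returns the full match at index 0 and the first group at index 1, so no prefix or suffix stripping is needed.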
