Objective
How do you implement a crawler system, or even just a simple small script? Usually you define an entry page; that page contains URLs pointing to other pages, so you extract those URLs from the current page, add them to the crawler's queue, move on to the new pages, and repeat the same operation recursively. In essence it is the same as depth-first or breadth-first traversal.
Golang, thanks to its fast compilation, native support for concurrency (goroutines), and channels for coordination, is a good fit for building a stable and efficient crawler system.
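As a rough sketch of the breadth-first idea described above (this is not the article's actual code, and fetchLinks is a hypothetical helper that downloads a page and returns the URLs it links to):

```go
// Breadth-first crawl sketch: fetchLinks is a hypothetical helper that
// downloads a page and returns the URLs it links to.
func crawl(entry string, fetchLinks func(string) []string) {
	queue := []string{entry}                // crawl queue seeded with the entry page
	visited := map[string]bool{entry: true} // avoid crawling the same URL twice
	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		for _, next := range fetchLinks(url) { // URLs found on the current page
			if !visited[next] {
				visited[next] = true
				queue = append(queue, next) // add newly discovered URLs to the queue
			}
		}
	}
}
```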
Packages used
Without the help of any third-party framework, the crawler is implemented using only the Go SDK's standard library. The main packages used are:
- net/http: the standard library has built-in support for the HTTP protocol and implements an HTTP client, so a URL can be requested directly via Get or Post.
- strings: unlike Java, where String is a reference type, string in Go is a built-in value type, and Go strings are UTF-8 encoded by default; the strings package implements simple functions for manipulating UTF-8 strings.
- regexp: the regular expression package in the Go SDK.
- io/ioutil: I/O helper utilities.
- encoding/xml: the package for parsing XML.
Channel mechanism
For its concurrency design, Golang draws on C. A. R. Hoare's CSP (Communicating Sequential Processes) concurrency model. Message passing in the CSP model involves a rendezvous between the sending and receiving processes: the sender can only send a message when the receiver is ready to receive it.
In its concurrency implementation, Golang mainly uses channels for communication. Channels come in two kinds: buffered and unbuffered.
- Buffered channel: values are stored in the buffer before they are received; a send only blocks when the buffer is full, and a receive blocks when the buffer is empty.
- Unbuffered channel: a send can only complete when another goroutine is ready to receive the value, otherwise the sender blocks (and, symmetrically, a receiver blocks until someone sends), as the sketch after this list illustrates.
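A minimal standalone sketch (not part of the crawler) illustrating the two blocking behaviours:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered: a receive blocks until another goroutine sends (and vice versa).
	unbuffered := make(chan int)
	go func() {
		time.Sleep(time.Second) // simulate a slow producer
		unbuffered <- 1
	}()
	fmt.Println("unbuffered receive:", <-unbuffered) // blocks for about one second

	// Buffered: sends succeed immediately while the buffer (capacity 2) has room.
	buffered := make(chan int, 2)
	buffered <- 1
	buffered <- 2
	fmt.Println("buffered receives:", <-buffered, <-buffered)
}
```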
Basic Go channel operation syntax
The basic syntax for operating on a Go channel is as follows:
```go
c := make(chan bool) // create an unbuffered channel of type bool

c <- x      // send a value on a channel
<-c         // receive a value from a channel
x = <-c     // receive a value from channel c and store it in x
x, ok = <-c // receive from c; ok is false once the channel is closed and drained
```
An unbuffered channel combines communication with synchronization, which makes it well suited to coordinating multiple goroutines.
Basic use of for/select
When we use select, we rarely evaluate it just once; we usually use it inside a for {} loop and choose an appropriate moment to exit the loop.
```go
for {
	select {
	case x := <-somechan:
		// ... do something with x
	case y, ok := <-someOtherchan:
		// ... do something with y,
		// and check ok to see whether someOtherchan has been closed
	case outputChan <- z:
		// ... runs when z has been successfully sent on the channel
	default:
		// ... runs when none of the cases above can proceed
	}
}
```
The range operation
In Golang, for range can iterate not only over collection types but also over the values received from a channel, blocking the loop whenever the channel has no data and exiting once the channel is closed.
```go
for url := range urlChannel {
	fmt.Println("routines num = ", runtime.NumGoroutine(), "chan len = ", len(urlChannel))
	go Spy(url)
}
```
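For completeness, a tiny standalone example (not from the article): for range keeps receiving until the channel is closed. The crawler never closes urlChannel, so its loop simply blocks whenever the channel is empty.

```go
package main

import "fmt"

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "b"
	close(ch) // without close, the loop below would block once the channel is drained
	for v := range ch {
		fmt.Println(v) // prints "a" then "b", then the loop exits because ch is closed
	}
}
```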
Goroutine
The Go language provides very clear and straightforward support for concurrent programming through goroutines. Goroutines are a feature of the Go runtime, not something provided by the operating system, and they are not implemented with threads. A goroutine is a piece of code, a function entry point, and a stack allocated for it on the heap. It is therefore very cheap, and we can easily create tens of thousands of goroutines, but they are not scheduled directly by the operating system.
Apart from threads blocked in system calls, the Go runtime starts at most $GOMAXPROCS threads to run goroutines.
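A small standalone illustration (nothing to do with the crawler itself): tens of thousands of goroutines can be started cheaply, and the runtime package reports the GOMAXPROCS setting and the number of live goroutines.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0)) // passing 0 just reads the current value

	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ { // goroutines are cheap enough to start by the thousand
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	fmt.Println("goroutines alive:", runtime.NumGoroutine())
	wg.Wait()
}
```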
Implementing a CSDN blog crawler
Since the crawler's functionality is simple, all of the code lives in the main package.
First we declare a global urlChannel in the main package; multiple goroutines use it to share the href attributes of the <a> tags collected from a page:
```go
var urlChannel = make(chan string, 200) // holds href attributes as strings, with a buffer of 200
```
Next, declare a regular expression for matching <a> tags in an HTML document:
```go
// Must-prefixed functions and methods are expected to always succeed; otherwise they panic.
var atagRegExp = regexp.MustCompile(`<a[^>]+[(href)|(HREF)]\s*\t*\n*=\s*\t*\n*[(".+")|('.+')][^>]*>[^<]*</a>`)
```
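As an illustration only (printATags and the HTML fragment below are made up for this example and are not part of the article's crawler), the expression can be exercised like this, assuming the atagRegExp variable declared above:

```go
// printATags is a hypothetical helper, used only to demonstrate the regexp.
func printATags(doc string) {
	for _, a := range atagRegExp.FindAllString(doc, -1) { // -1 means return every match
		fmt.Println(a) // each match is a full <a ...>...</a> tag
	}
}

// Example:
//   printATags(`<div><a href="http://blog.csdn.net/x/article/details/1">a post</a></div>`)
//   // prints: <a href="http://blog.csdn.net/x/article/details/1">a post</a>
```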
The entry function main
When main starts, a goroutine is launched to begin the crawl (the Spy function) from the portal URL "http://blog.csdn.net", analysing the <a> tags on that page. Then a for range loop over urlChannel reads each href attribute collected from the crawled <a> tags and starts another goroutine to fetch and parse the HTML document behind it.
```go
func main() {
	go Spy("http://blog.csdn.net")
	//go Spy("http://www.iteye.com/")
	for url := range urlChannel {
		// the runtime package exposes information about the running program,
		// e.g. the current number of goroutines
		fmt.Println("routines num = ", runtime.NumGoroutine(), "chan len = ", len(urlChannel))
		go Spy(url)
	}
	fmt.Println("a")
}
```
Spy function
Each crawl goroutine calls the Spy function to fetch and parse the HTML document for one URL, so at the beginning of the function we defer an anonymous function that recovers from a possible panic, preventing a single failure from terminating the whole program. A deferred function runs after the surrounding function has finished, just before its result is returned, whether that function panicked or completed normally.
```go
defer func() {
	if r := recover(); r != nil {
		log.Println("[E]", r)
	}
}()
```
Because Go has built-in support for the HTTP protocol, a URL can be requested directly with http.Get or http.Post. However, most sites have protections against DDoS and similar abuse, so we need to customise the request header and possibly go through a proxy server (CSDN seems to restrict requests from the same IP, and iteye is strict: tens of thousands of requests per minute will get an IP blocked). We can build the request with http.NewRequest(method, urlStr string, body io.Reader) (*Request, error), set header fields such as User-Agent and Host on it, and finally call the Do method of the http package's built-in DefaultClient to perform the request.
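The code below sticks to http.DefaultClient without a proxy. As a hedged sketch of the proxy idea mentioned above (fetchViaProxy is not part of the article's code, and the proxy address and User-Agent value are placeholders), a client routed through a proxy could look roughly like this, using net/http, net/url, and time:

```go
// fetchViaProxy requests target through a fixed proxy with a forged User-Agent.
// The proxy address below is a placeholder.
func fetchViaProxy(target string) (*http.Response, error) {
	proxyURL, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		return nil, err
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}
	req, err := http.NewRequest("GET", target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT)")
	return client.Do(req)
}
```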
When the server's response (*http.Response) arrives, its body is converted to a string using a helper from the ioutil package; then the <a> tags in the document are located, the href attribute of each one is parsed out, and the value is sent into urlChannel.
```go
func Spy(url string) {
	defer func() {
		if r := recover(); r != nil {
			log.Println("[E]", r)
		}
	}()

	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("User-Agent", GetRandomUserAgent())
	client := http.DefaultClient
	res, e := client.Do(req)
	if e != nil {
		fmt.Printf("GET request %s returned error: %s\n", url, e)
		return
	}
	if res.StatusCode == 200 {
		body := res.Body
		defer body.Close()
		bodyByte, _ := ioutil.ReadAll(body)
		resStr := string(bodyByte)
		atag := atagRegExp.FindAllString(resStr, -1)
		for _, a := range atag {
			href, _ := GetHref(a)
			if strings.Contains(href, "article/details/") {
				fmt.Println("☆", href)
			} else {
				fmt.Println("-", href)
			}
			urlChannel <- href
		}
	}
}
```
Randomly forging the User-Agent
```go
var userAgent = [...]string{
	"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT; DigExt)",
	"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
	"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
	"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
	"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
	"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
	"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
	"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
	"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
	"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
	"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
	"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
	"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
}

var r = rand.New(rand.NewSource(time.Now().UnixNano()))

func GetRandomUserAgent() string {
	return userAgent[r.Intn(len(userAgent))]
}
```
Parsing <a> elements
An <a> tag can be treated as a small XML document (a simple XML fragment with a single root element), so its href/HREF attribute can be resolved with xml.NewDecoder from the Go standard library.
```go
func GetHref(atag string) (href, content string) {
	inputReader := strings.NewReader(atag)
	decoder := xml.NewDecoder(inputReader)
	for t, err := decoder.Token(); err == nil; t, err = decoder.Token() {
		switch token := t.(type) {
		// handle the start of an element (start tag)
		case xml.StartElement:
			for _, attr := range token.Attr {
				attrName := attr.Name.Local
				attrValue := attr.Value
				if strings.EqualFold(attrName, "href") || strings.EqualFold(attrName, "HREF") {
					href = attrValue
				}
			}
		// handle the end of an element (end tag)
		case xml.EndElement:
		// handle character data (the element's text)
		case xml.CharData:
			content = string([]byte(token))
		default:
			href = ""
			content = ""
		}
	}
	return href, content
}
```
Summary
With the code above, a simple web crawler is implemented, and we have gained a basic understanding of goroutines, channels, and range. But as you can see, the few lines of code above do not reveal the true workings of goroutine scheduling or the design principles behind channels.