In the previous article, we took a URL from a command-line parameter, crawled the corresponding page, and saved its content to the local disk. Today let's record how to use concurrency to crawl the content of multiple sites.
First, let's make a small change to the previous code so that it can fetch content from multiple sites. In the code below, we first define three URLs, then send a network request for each one, fetch and save the data, and finally count the total time spent:
fetch.go

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
	"time"
)

// create the regular expression once, as a package-level variable
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}
	start := time.Now()
	for _, url := range urls {
		urlStart := time.Now()
		// send the network request
		res, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1)
		}
		// read the resource data
		body, err := ioutil.ReadAll(res.Body)
		// close the resource
		res.Body.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fileName := getFileName(url)
		// write to file
		ioutil.WriteFile(fileName, body, 0644)
		// time spent on this URL
		elapsed := time.Since(urlStart).Seconds()
		fmt.Printf("%.2fs %s\n", elapsed, fileName)
	}
	// total time spent
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// getFileName derives a file name from the URL
func getFileName(url string) string {
	// match the last part of the domain name in the URL
	return re.FindString(url) + ".txt"
}
```
In the code above, we use a regular expression to match the last part of the domain name in each URL and use it as the file name. Regular expressions themselves will be summarized in a later article.
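As a standalone illustration (a minimal sketch, separate from the crawler itself), the pattern `\w+\.\w+$` anchors at the end of the string and grabs the last two dot-separated word groups, which for these URLs is the bare domain:

```go
package main

import (
	"fmt"
	"regexp"
)

// the same pattern the crawler uses: the last two dot-separated
// word groups at the end of the string
var re = regexp.MustCompile(`\w+\.\w+$`)

// getFileName turns a URL into a file name,
// e.g. "http://www.qq.com" becomes "qq.com.txt"
func getFileName(url string) string {
	return re.FindString(url) + ".txt"
}

func main() {
	urls := []string{"http://www.qq.com", "http://www.163.com", "http://www.sina.com"}
	for _, url := range urls {
		fmt.Println(getFileName(url)) // qq.com.txt, 163.com.txt, sina.com.txt
	}
}
```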
Here is the console output after running the program:
```
$ ./fetch
0.12s qq.com.txt
0.20s 163.com.txt
0.27s sina.com.txt
0.59s elapsed
```
As the printed output shows, the total time is the sum of the three individual fetches. This approach is inefficient and does not take full advantage of the computer's resources, so let's transform the program so that it performs the three crawl operations concurrently:
fetch.go

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
	"time"
)

// create the regular expression once
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}
	// create a channel
	ch := make(chan string)
	start := time.Now()
	for _, url := range urls {
		// start a goroutine for each URL
		go fetch(url, ch)
	}
	for range urls {
		// print the messages arriving on the channel
		fmt.Println(<-ch)
	}
	// total time spent
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// fetch gets the resource content for the given URL
// and reports the result on ch
func fetch(url string, ch chan<- string) {
	start := time.Now()
	// send the network request
	res, err := http.Get(url)
	if err != nil {
		// report the error; return instead of os.Exit so the
		// other goroutines can still deliver their results
		ch <- fmt.Sprint(err)
		return
	}
	// read the resource data
	body, err := ioutil.ReadAll(res.Body)
	// close the resource
	res.Body.Close()
	if err != nil {
		// report the error
		ch <- fmt.Sprintf("while reading %s: %v", url, err)
		return
	}
	// write to file
	ioutil.WriteFile(getFileName(url), body, 0644)
	// time spent on this URL
	elapsed := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %s", elapsed, url)
}

// getFileName derives a file name from the URL
func getFileName(url string) string {
	// match the domain name part of the URL
	return re.FindString(url) + ".txt"
}
```
In the code above, we first create a channel, then start a goroutine for each fetch operation. When a goroutine finishes its crawl, it sends a message through the channel to notify the main goroutine, which then does the corresponding processing. The details of how goroutines and channels work will be summarized in a later article.
Running the program above produces the following result:
```
$ ./fetch
0.10s http://www.qq.com
0.19s http://www.163.com
0.29s http://www.sina.com
0.29s elapsed
```
As the results show, the total elapsed time equals the time of the slowest fetch rather than the sum of all three, so concurrency brings a significant performance improvement.
Golang Series article: Crawling Web content concurrently