Golang Series article: Crawling Web content concurrently


In the previous article, we took a URL from a command-line argument, crawled the corresponding page content, and saved it to the local disk. Today we will record how to use concurrency to crawl the web content of multiple sites.

First, we make a small change to the previous code so that it can fetch content from multiple sites. In the code below, we first define three URLs, then send a network request for each, save the returned data, and finally print the total time spent:

```go
// fetch.go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
	"time"
)

// Compile the regular expression once, as a package-level variable.
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}

	start := time.Now()
	for _, url := range urls {
		start := time.Now()
		// Send the network request.
		res, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1)
		}
		// Read the resource data.
		body, err := ioutil.ReadAll(res.Body)
		// Close the resource.
		res.Body.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fileName := getFileName(url)
		// Write the data to a file.
		ioutil.WriteFile(fileName, body, 0644)
		// Time consumed by this URL.
		elapsed := time.Since(start).Seconds()
		fmt.Printf("%.2fs %s\n", elapsed, fileName)
	}
	// Total time consumed.
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// getFileName matches the last part of the domain name in the URL
// and uses it as the file name.
func getFileName(url string) string {
	return re.FindString(url) + ".txt"
}
```

In the above code, we use a regular expression to match the last part of the domain name in the URL and use it as the final file name. Regular expressions will be summarized in a follow-up article.
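As a quick illustration (a minimal sketch, not part of the article's program), here is how the pattern behaves on one of the URLs above:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`\w+\.\w+$`)
	// FindString returns the leftmost match of the pattern,
	// which here is the trailing "name.tld" portion of the URL.
	fmt.Println(re.FindString("http://www.qq.com")) // qq.com
}
```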

Here is the console output after the program runs:

```
$ ./fetch
0.12s qq.com.txt
0.20s 163.com.txt
0.27s sina.com.txt
0.59s elapsed
```

As you can see from the printed messages, the total time spent is equal to the sum of the three individual fetches. This approach is inefficient and does not take full advantage of the computer's resources, so let's transform the program to perform the three crawl operations concurrently:

```go
// fetch.go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
	"time"
)

// Compile the regular expression once, as a package-level variable.
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}

	// Create a channel for the results.
	ch := make(chan string)

	start := time.Now()
	for _, url := range urls {
		// Start a goroutine for each URL.
		go fetch(url, ch)
	}
	for range urls {
		// Print the messages received from the channel.
		fmt.Println(<-ch)
	}
	// Total time consumed.
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// fetch gets the resource content for the URL and reports the result on ch.
func fetch(url string, ch chan<- string) {
	start := time.Now()
	// Send the network request.
	res, err := http.Get(url)
	if err != nil {
		// Report the error on the channel; returning (rather than
		// exiting the whole process from a goroutine) lets the
		// other fetches finish.
		ch <- fmt.Sprint(err)
		return
	}
	// Read the resource data.
	body, err := ioutil.ReadAll(res.Body)
	// Close the resource.
	res.Body.Close()
	if err != nil {
		ch <- fmt.Sprintf("while reading %s: %v", url, err)
		return
	}
	// Write the data to a file.
	ioutil.WriteFile(getFileName(url), body, 0644)
	// Time consumed by this URL.
	elapsed := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %s", elapsed, url)
}

// getFileName matches the last part of the domain name in the URL
// and uses it as the file name.
func getFileName(url string) string {
	return re.FindString(url) + ".txt"
}
```

In the above code, we first create a channel and then start a goroutine for each fetch operation. When a goroutine finishes crawling, it sends a message through the channel to notify the main goroutine, which then does the corresponding processing, as the sketch below shows. The details of this mechanism will be summarized in a follow-up article.
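The pattern reduces to a minimal sketch (illustrative only, with the real fetch work elided): a goroutine sends on a channel when it finishes, and the main goroutine blocks on the receive until that message arrives:

```go
package main

import "fmt"

func main() {
	ch := make(chan string)
	go func() {
		// ... the fetch work would happen here ...
		ch <- "done" // notify the main goroutine
	}()
	// The receive blocks until the goroutine sends.
	fmt.Println(<-ch)
}
```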

Running the above program produces the following output:

```
$ ./fetch
0.10s http://www.qq.com
0.19s http://www.163.com
0.29s http://www.sina.com
0.29s elapsed
```

As you can see from the results, the total elapsed time is the same as that of the slowest single operation, which shows that concurrency brings a significant performance improvement.
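This behavior is easy to verify with a small timing sketch (hypothetical sleep durations standing in for the three network requests): the total elapsed time tracks the slowest goroutine rather than the sum of all three.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan string)
	start := time.Now()
	// Simulate three fetches that take different amounts of time.
	delays := []time.Duration{120 * time.Millisecond, 200 * time.Millisecond, 270 * time.Millisecond}
	for _, d := range delays {
		go func(d time.Duration) {
			time.Sleep(d)
			ch <- fmt.Sprintf("%v done", d)
		}(d)
	}
	for i := 0; i < len(delays); i++ {
		fmt.Println(<-ch)
	}
	// Prints roughly 0.27s (the slowest sleep), not 0.59s (the sum).
	fmt.Printf("%.2fs elapsed\n", time.Since(start).Seconds())
}
```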
