In the previous article, we took a URL from a command-line parameter, crawled the corresponding page, and saved its content to the local disk. Today let's record how to use concurrency to crawl the content of multiple sites.
First, let's make a small change to the previous code so that it can fetch content from multiple sites. In the code below, we first define three URLs, then send a network request for each one, fetch and save the data, and finally count the total time spent:
fetch.go

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
	"time"
)

// create the regular expression once, as a package-level variable
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}
	start := time.Now()
	for _, url := range urls {
		urlStart := time.Now()
		// send the network request
		res, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1)
		}
		// read the resource data
		body, err := ioutil.ReadAll(res.Body)
		// close the resource
		res.Body.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}
		fileName := getFileName(url)
		// write to file
		ioutil.WriteFile(fileName, body, 0644)
		// time spent on this URL
		elapsed := time.Since(urlStart).Seconds()
		fmt.Printf("%.2fs %s\n", elapsed, fileName)
	}
	// total time spent
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// getFileName derives a file name from the URL
func getFileName(url string) string {
	// match the last part of the domain name in the URL
	return re.FindString(url) + ".txt"
}
```
In the code above, we use a regular expression to match the last part of the domain name in each URL and use it as the file name. Regular expressions themselves will be summarized in a later article.
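As a standalone illustration (a minimal sketch, separate from the crawler itself), the pattern `\w+\.\w+$` anchors at the end of the string and grabs the last two dot-separated word groups, which for these URLs is the bare domain:

```go
package main

import (
	"fmt"
	"regexp"
)

// the same pattern the crawler uses: the last two dot-separated
// word groups at the end of the string
var re = regexp.MustCompile(`\w+\.\w+$`)

// getFileName turns a URL into a file name,
// e.g. "http://www.qq.com" becomes "qq.com.txt"
func getFileName(url string) string {
	return re.FindString(url) + ".txt"
}

func main() {
	urls := []string{"http://www.qq.com", "http://www.163.com", "http://www.sina.com"}
	for _, url := range urls {
		fmt.Println(getFileName(url)) // qq.com.txt, 163.com.txt, sina.com.txt
	}
}
```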
Here is the console output after running the program:
```
$ ./fetch
0.12s qq.com.txt
0.20s 163.com.txt
0.27s sina.com.txt
0.59s elapsed
```
As the printed output shows, the total time is the sum of the three individual fetches. This approach is inefficient and does not take full advantage of the computer's resources, so let's transform the program so that it performs the three crawl operations concurrently:
fetch.go

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
	"time"
)

// create the regular expression once
var re = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}
	// create a channel
	ch := make(chan string)
	start := time.Now()
	for _, url := range urls {
		// start a goroutine for each URL
		go fetch(url, ch)
	}
	for range urls {
		// print the messages arriving on the channel
		fmt.Println(<-ch)
	}
	// total time spent
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// fetch gets the resource content for the given URL
// and reports the result on ch
func fetch(url string, ch chan<- string) {
	start := time.Now()
	// send the network request
	res, err := http.Get(url)
	if err != nil {
		// report the error; return instead of os.Exit so the
		// other goroutines can still deliver their results
		ch <- fmt.Sprint(err)
		return
	}
	// read the resource data
	body, err := ioutil.ReadAll(res.Body)
	// close the resource
	res.Body.Close()
	if err != nil {
		// report the error
		ch <- fmt.Sprintf("while reading %s: %v", url, err)
		return
	}
	// write to file
	ioutil.WriteFile(getFileName(url), body, 0644)
	// time spent on this URL
	elapsed := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %s", elapsed, url)
}

// getFileName derives a file name from the URL
func getFileName(url string) string {
	// match the domain name part of the URL
	return re.FindString(url) + ".txt"
}
```
In the code above, we first create a channel, then start a goroutine for each fetch operation. When a goroutine finishes its crawl, it sends a message through the channel to notify the main goroutine, which then does the corresponding processing. The details of how goroutines and channels work will be summarized in a later article.
Running the program above produces the following result:
```
$ ./fetch
0.10s http://www.qq.com
0.19s http://www.163.com
0.29s http://www.sina.com
0.29s elapsed
```
As the results show, the total elapsed time equals the time of the slowest fetch rather than the sum of all three, so concurrency brings a significant performance improvement.
Golang Series article: Crawling Web content concurrently