Golang Series: Fetching Web Pages Concurrently

Last updated: 2018-09-06. Source: Internet


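In the previous post, we fetched the web page for a URL passed on the command line and saved it to local disk. This post records how to use concurrency to fetch pages from several sites.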

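First, we adapt the previous code slightly so that it can fetch several sites. The code below defines three URLs, requests each one in turn, saves the response, and finally reports the total time taken: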

```go
// fetch.go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
	"time"
)

// RE matches the trailing "name.tld" part at the end of a URL.
var RE = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}

	// Start time of the whole run.
	start := time.Now()

	for _, url := range urls {
		// Start time of this single fetch.
		urlStart := time.Now()

		// Send the network request.
		res, err := http.Get(url)
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: %v\n", err)
			os.Exit(1)
		}

		// Read the response body, then close it.
		body, err := ioutil.ReadAll(res.Body)
		res.Body.Close()
		if err != nil {
			fmt.Fprintf(os.Stderr, "fetch: reading %s: %v\n", url, err)
			os.Exit(1)
		}

		// Write the body to a file.
		fileName := getFileName(url)
		if err := ioutil.WriteFile(fileName, body, 0644); err != nil {
			fmt.Fprintf(os.Stderr, "fetch: writing %s: %v\n", fileName, err)
			os.Exit(1)
		}

		// Time taken by this fetch.
		elapsed := time.Since(urlStart).Seconds()
		fmt.Printf("%.2fs %s\n", elapsed, fileName)
	}

	// Total time taken.
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// getFileName builds the output file name from the URL.
func getFileName(url string) string {
	// Match the trailing part of the domain name in the URL.
	return RE.FindString(url) + ".txt"
}
```

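The code above uses a regular expression to match the trailing part of the domain name in the URL (for example, qq.com) and uses it as the file name. Regular expressions will be covered in a later post.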
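To make the file-name step easier to follow, here is a minimal sketch that applies the same pattern to a few inputs. The helper name `fileNameFor` and the last two URLs are assumptions added for illustration; they are not part of the original program. The pattern is anchored at the end of the string, so it only yields a domain-like name when the URL has no path or trailing slash.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostSuffix is the same pattern as RE in the article:
// two word groups joined by a dot, anchored at the end of the string.
var hostSuffix = regexp.MustCompile(`\w+\.\w+$`)

// fileNameFor is a hypothetical helper that mirrors getFileName.
func fileNameFor(url string) string {
	return hostSuffix.FindString(url) + ".txt"
}

func main() {
	for _, u := range []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com/index.html", // assumed input: the match comes from the path
		"http://www.qq.com/",             // assumed input: trailing slash, no match at all
	} {
		fmt.Printf("%-32s -> %q\n", u, fileNameFor(u))
	}
}
```

For URLs with paths, parsing with `net/url` and using the host would probably be a more robust choice than an end-anchored pattern.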

Here is the console output after running the program:

```
$ ./fetch
0.12s qq.com.txt
0.20s 163.com.txt
0.27s sina.com.txt
0.59s elapsed
```

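The output shows that the total time equals the sum of the three fetches. This approach is slow and does not make full use of the machine's resources, so next we rework the program to run the three fetches concurrently: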

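```go
// fetch.go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
	"time"
)

// RE matches the trailing "name.tld" part at the end of a URL.
var RE = regexp.MustCompile(`\w+\.\w+$`)

func main() {
	urls := []string{
		"http://www.qq.com",
		"http://www.163.com",
		"http://www.sina.com",
	}

	// Create the channel.
	ch := make(chan string)

	// Start time of the whole run.
	start := time.Now()

	for _, url := range urls {
		// Start one goroutine per URL.
		go fetch(url, ch)
	}

	for range urls {
		// Print each message received from the channel.
		fmt.Println(<-ch)
	}

	// Total time taken.
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.2fs elapsed\n", elapsed)
}

// fetch downloads the resource at url and reports the outcome on ch.
// It sends exactly one message per call, so main never blocks forever.
func fetch(url string, ch chan<- string) {
	start := time.Now()

	// Send the network request.
	res, err := http.Get(url)
	if err != nil {
		// Report the error and return; calling os.Exit here
		// would kill the other goroutines mid-fetch.
		ch <- fmt.Sprint(err)
		return
	}

	// Read the response body, then close it.
	body, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		ch <- fmt.Sprintf("while reading %s: %v", url, err)
		return
	}

	// Write the body to a file.
	if err := ioutil.WriteFile(getFileName(url), body, 0644); err != nil {
		ch <- fmt.Sprintf("while writing %s: %v", url, err)
		return
	}

	// Report the time taken by this single URL.
	elapsed := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %s", elapsed, url)
}

// getFileName builds the output file name from the URL.
func getFileName(url string) string {
	// Match the trailing part of the domain name in the URL.
	return RE.FindString(url) + ".txt"
}
```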

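In the code above, we first create a channel and then start one goroutine per fetch. When a fetch finishes, it sends a message over the channel to the main goroutine, which then handles it (here, by printing it). The details of how this works will be covered in a later post.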

Running the program gives the following result:

```
$ ./fetch
0.10s http://www.qq.com
0.19s http://www.163.com
0.29s http://www.sina.com
0.29s elapsed
```

The result shows that the total time is about the same as the slowest single fetch, so the performance gain from concurrency is considerable.
