Implementing a Simple Email Address Crawler (Python)

Last updated: 2014-08-11. Source: Internet


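I often get questions about email crawlers, which suggests that people who want to scrape contact information from web pages are very interested in the topic. In this article I want to show how to implement a simple email crawler in Python. The crawler is simple, but you can learn a lot from the example, especially if you plan to write a crawler of your own.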

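I deliberately simplified the code to express the main idea as clearly as possible, so that you can add your own features when you need them. Simple as it is, it fully implements scraping email addresses from the web. Note that the code in this article is written in Python 3.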

Let's go through it step by step. I will build the crawler piece by piece, with comments, and post the complete code at the end.

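First, import all the necessary libraries. In this example, BeautifulSoup and Requests are third-party libraries, while urllib, collections, and re are built in.

BeautifulSoup makes it easier to search an HTML document, and Requests makes it easier to perform web requests.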

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

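Next I define a collection that holds the addresses of the pages to crawl, for example http://www.huazeming.com/ . You can also start from pages that clearly contain email addresses, and you can list as many as you like. A plain list would be the obvious choice in Python, but I chose a deque because it fits our needs better: URLs are taken from the front and new ones are appended to the back.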

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
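To illustrate why a deque suits this role, here is a minimal sketch of its first-in, first-out behaviour. The URLs are placeholders made up for the illustration and do not come from the article.

```python
from collections import deque

# Hypothetical URLs, used only to illustrate queue behaviour.
queue = deque(["http://example.com/a"])
queue.append("http://example.com/b")   # a newly discovered link goes to the back
queue.append("http://example.com/c")

first = queue.popleft()                # the oldest URL comes out first (FIFO)
remaining = list(queue)

print(first)
print(remaining)
```

Removing from the front of a deque takes constant time, whereas `list.pop(0)` has to shift every remaining element, which matters once the queue grows large.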

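Next, we need to remember the URLs we have already processed so that we do not handle them twice. I chose a set, because a set guarantees that its elements are unique.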

# a set of urls that we have already crawled
processed_urls = set()

Define a set to store the collected email addresses:

# a set of crawled emails
emails = set()

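Now let's start crawling. A loop keeps taking addresses from the queue and processing them until the queue is empty. As soon as we take an address out, we add it to the set of processed addresses so that it is never queued again.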

# process urls one by one until we exhaust the queue
while new_urls:
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
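The interplay between the queue and the processed set can be tried without any network access. The following sketch runs the same loop over a small in-memory link graph; `LINKS` and `crawl_order` are hypothetical names introduced only for this illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "C"],   # links back to A, which must not be crawled twice
    "C": ["D"],
    "D": [],
}

def crawl_order(start):
    """Return pages in the order the queue/set loop would visit them."""
    new_urls = deque([start])
    processed_urls = set()
    order = []
    while new_urls:
        url = new_urls.popleft()
        processed_urls.add(url)
        order.append(url)
        for link in LINKS.get(url, []):
            # enqueue only links that are neither waiting nor already done
            if link not in new_urls and link not in processed_urls:
                new_urls.append(link)
    return order

visited = crawl_order("A")
print(visited)
```

Each page appears exactly once in the result even though the graph contains a cycle, which is the property the processed set is there to guarantee.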

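Then, still inside the loop, we extract the base address from the current URL, so that when we find relative links in the document we can convert them to absolute ones.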

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url
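To make the two values concrete, here is the same extraction applied to a single URL. The URL is an example chosen for illustration only.

```python
from urllib.parse import urlsplit

# Example URL chosen for illustration only.
url = "http://www.example.com/contact/team.html"

parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/') + 1] if '/' in parts.path else url

print(base_url)  # http://www.example.com
print(path)      # http://www.example.com/contact/
```

Note that `rfind` searches the whole URL string, so a slash inside a query string would move the cut point. For the simple page addresses used here that does not arise.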

Next we fetch the page content from the web. If an error occurs, we skip the page and move on to the next one.

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

Once we have the page content, we find all the email addresses in it and add them to the set. We use a regular expression to extract the addresses:

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
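The pattern can be tried on a plain string before pointing it at real pages. In this sketch the sample text and its addresses are made up, and `EMAIL_RE` is a name introduced here for convenience.

```python
import re

# Same pattern as in the article; re.I makes it case-insensitive.
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

# Made-up sample text; the addresses are placeholders.
sample = "Write to info@example.com or sales@example.org. Again: info@example.com"

# Wrapping the matches in a set drops the repeated address.
found = set(EMAIL_RE.findall(sample))
print(sorted(found))
```

The pattern is intentionally loose: it does not validate addresses, it only picks out strings that look like them, so some false positives are to be expected on real pages.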

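After extracting the email addresses from the current page, we look for links to other pages and add them to the queue of addresses waiting to be processed. Here we use the BeautifulSoup library to parse the page's HTML.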

    # create a beautiful soup for the html document
    # (naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup = BeautifulSoup(response.text, "html.parser")

The library's find_all method extracts elements by HTML tag name.

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):

Some a tags on a page may not carry a URL at all, and we need to allow for that.

        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''

If the address starts with a slash, we treat it as a relative link and prepend the base address:

        # add base url to relative links
        if link.startswith('/'):
            link = base_url + link

At this point we have a valid address (one that starts with http). If it is not already in the queue and has not been processed before, we add it to the queue:

        # add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)
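The steps above resolve root-relative links by string concatenation. As an alternative sketch, not what the article's code does, the standard library's `urllib.parse.urljoin` resolves root-relative and path-relative links alike. The URLs below are made up for the illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for resolution.
page = "http://www.example.com/contact/index.html"

root_relative = urljoin(page, "/about")      # resolved against the site root
path_relative = urljoin(page, "team.html")   # resolved against the current directory
absolute = urljoin(page, "http://other.example.org/x")  # already absolute: unchanged

print(root_relative)
print(path_relative)
print(absolute)
```

Whether to adopt this depends on how closely you want to follow the original code; the concatenation approach is easier to read, while `urljoin` covers more link forms in one call.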

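That's it. Here is the complete code. Note that the full listing also resolves links that start with neither a slash nor http against the current page's path, so it drops the http check from the final condition.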

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while new_urls:
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a beautiful soup for the html document
    # (naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)

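This crawler is fairly simple and leaves out some features (such as saving the email addresses to a file), but it shows the basic principles of writing an email crawler. You can try improving the program yourself.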
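As one possible improvement, the omitted file-saving step could look like the sketch below. The helper `save_emails`, the output filename, and the sample addresses are all assumptions made for this illustration; in the crawler you would pass the `emails` set after the loop finishes.

```python
import os
import tempfile

def save_emails(emails, filename):
    """Write one address per line, sorted so repeated runs give stable output."""
    with open(filename, "w", encoding="utf-8") as f:
        for email in sorted(emails):
            f.write(email + "\n")

# Placeholder data standing in for the crawler's `emails` set.
emails = {"sales@example.org", "info@example.com"}
out_path = os.path.join(tempfile.gettempdir(), "emails_example.txt")
save_emails(emails, out_path)

# Read the file back to check what was written.
with open(out_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
print(saved)
```

Sorting before writing is a design choice rather than a requirement: a set has no defined order, so sorting keeps the file easy to compare between runs.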

Of course, if you have any questions or suggestions, corrections are welcome!

Original English article: A Simple Email Crawler in Python
