Implement a simple email address crawler (Python)


I often get questions about email crawlers; clearly, people who want to collect contact information from the Web are interested in this problem. In this article, I want to show how to implement a simple email address crawler in Python. The crawler is simple, but you can learn a lot from the example (especially if you are new to writing crawlers).

I have deliberately simplified the code as much as possible so that the main ideas come across clearly. That way you can add your own functionality when you need it. Although it is very simple, it is a complete implementation of crawling email addresses from the Web. Note that the code in this article is written for Python 3.

Good, let's get into it. I will walk through the implementation bit by bit and add comments along the way. The complete code is given at the end.

First, we import all the necessary libraries. In this example, BeautifulSoup and requests are third-party libraries, while urllib, collections, and re are built-in libraries.

BeautifulSoup makes it easier to parse HTML documents, and requests makes it easy to perform Web requests.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

Below I define a collection to hold the Web addresses to crawl, such as http://www.huazeming.com/; of course, you can also use any page that obviously contains email addresses as a starting address, and add as many as you like. Although this collection could be a plain list, I chose the deque type because it better fits the way we use it (taking addresses from the front and adding new ones to the back).

# a queue of URLs to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
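As a quick standalone illustration (not part of the crawler itself) of why deque fits this use: newly discovered URLs are appended on the right, and the next URL to process is popped from the left in constant time.

from collections import deque

queue = deque(['http://example.com/a'])
queue.append('http://example.com/b')   # enqueue a newly discovered URL at the back
print(queue.popleft())                 # take the oldest URL from the front -> http://example.com/a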

Next, we need to record the processed URLs to avoid processing the same page twice. I chose the set type because a set guarantees that its elements are unique.

# a set of URLs that we have already crawled
processed_urls = set()
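To illustrate that guarantee with a tiny standalone example (the URL is made up), adding the same address twice leaves only one copy in the set:

processed = set()
processed.add('http://example.com/contact')
processed.add('http://example.com/contact')   # the duplicate is silently ignored
print(len(processed))                          # 1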

Define an email collection to store the collected addresses:

# a set of crawled emails
emails = set()

Let's start crawling! We use a loop that keeps taking an address off the queue and processing it until the queue is empty. As soon as we take an address out, we add it to the set of processed addresses so that we do not forget it later.

# process URLs one by one until we exhaust the queue
while len(new_urls):
    # move next URL from the queue to the set of processed URLs
    url = new_urls.popleft()
    processed_urls.add(url)

Then we extract the base address from the current address, so that when we find a relative address in the document we can convert it into an absolute one.

    # extract base URL and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
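To make this concrete, here is a small standalone illustration (the URL is invented for the example, not part of the crawler) of what urlsplit returns and what base_url and path end up holding:

from urllib.parse import urlsplit

url = 'http://www.example.com/contact/index.php'
parts = urlsplit(url)                               # scheme='http', netloc='www.example.com', path='/contact/index.php'
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/') + 1] if '/' in parts.path else url
print(base_url)   # http://www.example.com
print(path)       # http://www.example.com/contact/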

Next we fetch the page content from the Internet; if we run into an error, we skip that page and move on to the next one.

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

Once we have the page content, we find all the email addresses in it and add them to the result set. We use a regular expression to extract the email addresses:

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
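As a standalone illustration of what that regular expression matches (the sample text and addresses are made up), you can run it against a string directly:

import re

sample = "Write to info@example.com or press@example.co.uk for details."
found = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample, re.I)
print(found)   # ['info@example.com', 'press@example.co.uk']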

After extracting the email addresses from the current page, we find the addresses of other pages linked from it and add them to the queue for processing. Here we use the BeautifulSoup library to parse the page HTML.

    # create a beautiful soup for the HTML document
    soup = BeautifulSoup(response.text, "html.parser")

The library's find_all method extracts elements by HTML tag name.

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
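If find_all is unfamiliar, here is a tiny self-contained illustration (the HTML snippet is made up) of how it returns anchor elements and how their attributes can be inspected:

from bs4 import BeautifulSoup

html = '<a href="/about">About</a> <a name="top">Top</a>'
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a"):
    print(anchor.attrs)   # {'href': '/about'}, then {'name': 'top'}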

However, some anchor tags on a page may not contain a URL at all, and we need to take that into account.

        # extract link URL from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''

If the address starts with a slash, we treat it as a relative address and prepend the base address to it:

        # add base URL to relative links
        if link.startswith('/'):
            link = base_url + link

Now that we have a valid address (one starting with http), if it is not already in the queue and has not been processed before, we add it to the queue:

        # add the new URL to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)

Okay, that's it. The following is the complete code:

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of URLs to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of URLs that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process URLs one by one until we exhaust the queue
while len(new_urls):
    # move next URL from the queue to the set of processed URLs
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base URL and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a beautiful soup for the HTML document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link URL from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new URL to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)

This crawler is fairly simple and leaves out some features (such as saving the email addresses to a file), but it shows the basic principles of writing an email crawler. You can try improving this program yourself.
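For example, one improvement mentioned above is saving the collected addresses to a file. A minimal sketch, assuming the crawl loop above has already filled the emails set and using an arbitrary file name, could look like this:

# write every collected address to a text file, one per line
with open('emails.txt', 'w') as f:
    for email in sorted(emails):
        f.write(email + '\n')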

Of course, if you have any questions or suggestions, please let me know!

English original: A Simple Email Crawler in Python
