I often get questions about email crawlers: there is clearly interest in the problem of collecting contact information from the Web. In this article I want to show how to implement a simple email crawler in Python. The crawler is simple, but you can learn a lot from the example, especially if you are new to web crawling.
I have deliberately kept the code as simple as possible so that the main ideas come through clearly; this way you can add your own functionality on top of it when you need to. Although it is very simple, it completely implements crawling email addresses from the Web. Note that the code in this article is written for Python 3.
Good, let's get into it. I will go through the implementation step by step and add comments along the way; the complete code is pasted at the end.
First, all the necessary libraries are imported. In this example, BeautifulSoup and requests are third-party libraries, while urllib, collections, and re are part of the standard library.
BeautifulSoup makes it easier to pick apart HTML documents, and requests makes it easy to perform web requests.
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
Below I define a queue to hold the web addresses to crawl, such as http://www.huazeming.com/; of course, you can also use any page that obviously contains email addresses as a starting address, and add as many as you like. Although conceptually this collection is just a list, I chose the deque type because it better matches our needs: we append newly discovered addresses at the back and pop the next address to process from the front.
# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
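As a quick illustration of why deque fits here (this snippet is purely for demonstration and is not part of the crawler), append() adds to the back and popleft() removes from the front, giving cheap first-in-first-out behavior:

from collections import deque

queue = deque(['page1'])
queue.append('page2')      # discovered later, goes to the back
queue.append('page3')
print(queue.popleft())     # 'page1' -- the oldest entry is processed first
print(list(queue))         # ['page2', 'page3']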
Next, we need to keep track of the URLs that have already been processed, so that we do not handle the same page twice. I chose the set type because a set guarantees that its elements are unique.
# a set of urls that we have already crawled
processed_urls = set()
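If you have not used sets before, a tiny example (purely illustrative, not part of the crawler) shows why they fit here:

visited = set()
visited.add('http://example.com/')
visited.add('http://example.com/')   # adding the same value again has no effect
print(len(visited))                  # 1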
We also define a set to store the collected email addresses:
# a set of crawled emails
emails = set()
Let's start crawling! We run a loop that keeps taking the next address out of the queue and processing it until the queue is empty. As soon as we take an address out, we add it to the set of processed addresses so that we do not forget about it and visit it again later.
# process urls one by one until we exhaust the queue
while len(new_urls):

    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
Then we need to extract the base URL (and path) from the current address, so that when we find relative links in the document we can convert them to absolute addresses.
    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
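To make the two expressions concrete, here is a small standalone illustration of what they produce for the starting URL used above (the printed values are just what these expressions evaluate to):

from urllib.parse import urlsplit

url = 'http://www.themoscowtimes.com/contact_us/index.php'
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/') + 1] if '/' in parts.path else url

print(base_url)  # http://www.themoscowtimes.com
print(path)      # http://www.themoscowtimes.com/contact_us/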
Next we fetch the content of the page from the Internet; if we hit an error, we skip the page and continue with the next one.
    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue
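One easy improvement you may want here (not part of the original code, just a sketch of a drop-in replacement for the block above) is to give the request a timeout, so that a single slow page cannot stall the whole crawl; requests raises requests.exceptions.Timeout in that case:

    try:
        # wait at most 10 seconds for a response (the value is just an example)
        response = requests.get(url, timeout=10)
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout):
        # ignore pages with errors or pages that take too long
        continue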
Once we have the page content, we find all the email addresses in it and add them to the result set. We use a regular expression to extract the email addresses:
    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
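To see what the pattern matches, here is a tiny standalone example (the sample text and addresses are made up for illustration):

import re

sample = "Contact us at info@example.com or SALES@example.co.uk for details."
found = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample, re.I))
print(found)  # both addresses are found; set order may vary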
After extracting the email addresses from the current page, we look for links to other pages and add them to the queue of addresses still to process. Here we use the BeautifulSoup library to parse the page's HTML.
    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text)
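Note: recent versions of BeautifulSoup expect the parser to be named explicitly and emit a warning otherwise, so you may prefer something like the following (html.parser is the built-in parser; lxml is a faster third-party alternative):

    soup = BeautifulSoup(response.text, "html.parser")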
The library's find_all method extracts elements according to their HTML tag name.
    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
However, some of the anchor tags on a page may not contain a URL at all (no href attribute), and we need to take that into account.
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
If the address starts with a slash, we treat it as a relative link and prepend the base URL to it:
        # add base url to relative links
        if link.startswith('/'):
            link = base_url + link
Now, if we have a valid absolute address (one beginning with http) that is not already in the queue and has not been processed before, we add it to the queue:
        # add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)
Okay, that's it. The following is the complete code:
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):

    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text)

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
This crawler is relatively simple and leaves out some features (such as saving the email addresses to a file), but it demonstrates the basic principles of writing an email crawler. You can try to improve the program yourself.
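As one possible improvement, here is a minimal sketch of how the collected addresses could be written to a file once the crawl finishes (the filename emails.txt is just an example):

# write every collected address to a text file, one per line
with open('emails.txt', 'w') as f:
    for email in sorted(emails):
        f.write(email + '\n')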
Of course, if you have any questions or suggestions, feel free to point them out!
English Original: A simple e-mail Crawler in Python