1 Research the target website's background: 1.1 check robots.txt; 1.2 check the site map; 1.3 estimate the site size; 1.4 identify all technology used; 1.5 find the site owner. 2 First web crawler: 2.1 download a web page (retrying downloads, setting the user agent); 2.2 crawl the site map; 2.3 traverse each page's database ID; 2.4 follow web links. Advanced features: parsing robots.txt, proxy support, download speed limiting, avoiding crawler traps, and the final version of the crawler.
1 Research the target website's background
1.1 Check robots.txt
http://example.webscraping.com/robots.txt
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
Section 1: crawlers with the user agent BadCrawler are forbidden from crawling any part of the site; of course, a truly malicious crawler would simply ignore this. Section 2: all other user agents should wait 5 seconds between download requests (Crawl-delay), and the /trap link is used to catch malicious crawlers: any visitor that follows it is banned for 1 minute. Section 3: the Sitemap directive points to a sitemap file, discussed in the next subsection.
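Since a polite crawler should honour the Crawl-delay value, it can be read straight out of the robots.txt text. Below is a minimal sketch, written against the file shown above; it is an assumption of mine rather than part of the original example, it simply takes the first Crawl-delay directive it finds and ignores which user-agent section it belongs to (the robotparser module used later in this chapter does not expose this value).
import re
import urllib2

robots_txt = urllib2.urlopen('http://example.webscraping.com/robots.txt').read()
# look for a "Crawl-delay: N" line (case-insensitive); default to 0 if absent
match = re.search(r'crawl-delay:\s*(\d+)', robots_txt, re.IGNORECASE)
crawl_delay = int(match.group(1)) if match else 0
print 'crawl delay:', crawl_delay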
1.2 Check site map
The sitemap lists links to all web pages: http://example.webscraping.com/sitemap.xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.webscraping.com/view/Afghanistan-1</loc>
  </url>
  <url>
    <loc>http://example.webscraping.com/view/Aland-Islands-2</loc>
  </url>
  ...
  <url>
    <loc>http://example.webscraping.com/view/Zimbabwe-252</loc>
  </url>
</urlset>
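Because the sitemap lists every country page, counting its <loc> entries gives a quick programmatic size estimate, and a useful cross-check for the search-engine estimate in the next subsection. A minimal sketch, using the same regular-expression approach as section 2.2:
import re
import urllib2

sitemap = urllib2.urlopen('http://example.webscraping.com/sitemap.xml').read()
# count the <loc> entries, i.e. the number of listed pages
links = re.findall('<loc>(.*?)</loc>', sitemap)
print 'sitemap lists %d pages' % len(links)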
1.3 Estimate the site size
Advanced search parameters: http://www.google.com/advanced_search
Google search: site:http://example.webscraping.com/ has 202 pages
Google search: site:http://example.webscraping.com/view has 117 web pages
1.4 Identify all technology
Use the builtwith module to check the kind of technology a web site is built with.
Installation: pip install builtwith
>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
 u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
The example web site uses the Python web2py framework and a few common JavaScript libraries, so its content is most likely embedded directly in the HTML. This makes it relatively easy to crawl. Other kinds of construction behave differently:
- AngularJS: content is loaded dynamically;
- ASP.NET: crawling requires session management and form submission (covered in chapters 5 and 6).
1.5 Find the site owner
We can use the WHOIS protocol to query who a domain name is registered to.
Documentation: https://pypi.python.org/pypi/python-whois
Installation: pip install python-whois
>>> import whois
>>> print whois.whois('appspot.com')
{
  ...
  "name_servers": [
    "NS1.GOOGLE.COM",
    ...
    "ns2.google.com",
    "ns1.google.com"
  ],
  "org": "Google Inc.",
  "creation_date": [
    "2005-03-10 00:00:00",
    "2005-03-09T18:27:55-0800"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}
This domain belongs to Google, and the site runs on the Google App Engine service. Note: Google often blocks web crawlers, so crawl it carefully.
2 First web crawler
There are many ways to crawl a web site; which method is most appropriate depends on the structure of the target site.
This section first shows how to download a web page safely (2.1), then introduces three ways to crawl a web site:
- 2.2 crawl the site map;
- 2.3 traverse the database ID of each page;
- 2.4 follow web links.
2.1 Download a web page
1.4.1download1.py
# -*- coding: utf-8 -*-
import urllib2

def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()

if __name__ == '__main__':
    print download1('https://www.baidu.com')
1.4.1download2.py
def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
1. Retry the download
When the server is overloaded and returns a 503 Service Unavailable error, it is worth retrying the download. If the error is 404 Not Found, the web page simply does not exist, so there is no point in retrying the request.
1.4.1download3.py
def download3(url, num_retries=2):
    """Download function that also retries 5xx errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download3(url, num_retries - 1)
    return html

download = download3

if __name__ == '__main__':
    print download('http://httpstat.us/500')
The Internet Engineering Task Force defines the complete list of HTTP status codes in RFC 7231: https://tools.ietf.org/html/rfc7231#section-6
- 4xx: errors caused by a problem with the request;
- 5xx: errors caused by a problem on the server side.
2. Set the user agent (user_agent)
By default, urllib2 downloads content with Python-urllib/2.7 as its user agent, where 2.7 is the Python version number. Some web sites ban this default user agent, because poorly written Python crawlers (such as the code above) can overload a server. For example, when accessing https://www.meetup.com/ with the default Python user agent, we get:
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/1.Introduction to web crawler$ python 1.4.1download4.py
Downloading: https://www.meetup.com/
Download error: [Errno] Connection reset by peer
None
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/1.Introduction to web crawler$ python 1.4.1download4.py
Downloading: https://www.meetup.com/
Download error: Forbidden
None
To make downloads more reliable, we need to control the user agent ourselves. The following code sets a custom user agent named wu_being.
def download4(url, user_agent='wu_being', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download4(url, user_agent, num_retries - 1)
    return html
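With the custom user agent in place, the same page can be requested again. A small usage sketch; whether the site now serves the page depends on its policies:
if __name__ == '__main__':
    print download4('https://www.meetup.com/')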
2.2 Crawl Site Map
Now we download all the pages listed in the sitemap, sitemap.xml, which we found in the robots.txt file of the example site. To parse the sitemap, we use a simple regular expression to extract the URLs inside the <loc> tags. In the next chapter we introduce a more robust parsing method: CSS selectors.
# -*- coding: utf-8 -*-
import re
from common import download

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    #> Downloading: http://example.webscraping.com/sitemap.xml
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
        #> Downloading: http://example.webscraping.com/view/Afghanistan-1
        #> Downloading: http://example.webscraping.com/view/Aland-Islands-2
        #> Downloading: http://example.webscraping.com/view/Albania-3
        #> ...

if __name__ == '__main__':
    crawl_sitemap('http://example.webscraping.com/sitemap.xml')
2.3 Traverse each page's database ID
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/China-47
http://example.webscraping.com/view/Zimbabwe-252
These URLs differ only in their suffix. Moreover, entering http://example.webscraping.com/view/47 also displays the China page correctly, so we can simply traverse the database IDs to download every country page.
import itertools
from common import download

def iteration():
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        # url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage,
            # so assume we have reached the last country ID and can stop downloading
            break
        else:
            # success - can scrape the result
            # ...
            pass
If some IDs are missing, this crawler exits as soon as it hits the first gap. We can modify it so that the traversal only stops after 5 consecutive download errors.
def iteration():
    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of errors in a row, so exit;
                # assume we have reached the last country ID and can stop downloading
                break
        else:
            # success - can scrape the result
            # ...
            num_errors = 0
Some sites do not use sequential IDs, or do not use numeric IDs at all; on such sites this method is of little use.
2.4 Follow web links
We want the crawler to behave more like an ordinary user, following links to reach the content it is interested in. Done naively, however, this downloads lots of pages we do not need. For example, when crawling the user account detail pages of a forum we do not care about the other pages, so we use a regular expression to decide which pages to download. A first attempt at such a link crawler produces the following error:
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  File "1.4.4link_crawler1.py", line ..., in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1.4.4link_crawler1.py", line ..., in link_crawler
    html = download(url)
  ...
  File "/usr/lib/python2.7/urllib2.py", line 283, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1
The problem is that /index/1 is a relative link. A browser knows which page it is viewing and can resolve it, but urllib2 has no such context, so we use the urlparse module to convert relative links into absolute ones.
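For example, in an interactive session (illustrative):
>>> import urlparse
>>> urlparse.urljoin('http://example.webscraping.com', '/index/1')
'http://example.webscraping.com/index/1'
The link crawler below applies this conversion to every matched link.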
def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):  # match the regular expression
                # convert the relative link to an absolute link
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)
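The get_links() helper called above is not shown in this excerpt. A minimal sketch, assuming links are pulled out of the page with a regular expression over the href attributes of <a> tags (the CSS selectors mentioned for the next chapter would be a more robust choice):
def get_links(html):
    """Return a list of links found in the html"""
    # match the href attribute of every <a> tag
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)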
This link crawler still has a problem: pages on the site link to each other, Australia links to Antarctica and Antarctica links back to Australia, so the crawler keeps downloading the same content in a loop. To avoid duplicate downloads, we modify the function to keep track of the URLs it has already discovered.
def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # keep track of which URLs have been seen before
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
Advanced Features
1. Parse robots.txt
The robotparser module first loads the robots.txt file, then its can_fetch() function tells us whether a given user agent is allowed to access a web page.
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
>>> user_agent = 'wu_being'
>>> rp.can_fetch(user_agent, url)
True
In order to integrate this functionality into the crawler, we need to add the check in the crawl loop.
while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url
2. Proxy support
Sometimes we need to access a web site through a proxy. Netflix, for example, blocks most countries outside the United States. Supporting proxies with urllib2 is not as easy as it could be (you may want to try the friendlier Python HTTP module requests for this; documentation: http://docs.python-requests.org). The following code adds proxy support to the urllib2 downloader.
def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html
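As suggested above, the same thing is shorter with requests. A minimal sketch of that alternative, not part of the original code; the proxy address used in the comment is a made-up placeholder:
import requests

def download_with_requests(url, user_agent='wswp', proxy=None):
    """Sketch of a proxy-aware downloader using requests"""
    headers = {'User-Agent': user_agent}
    # requests expects a mapping of URL scheme to proxy URL
    proxies = {'http': proxy, 'https': proxy} if proxy else None
    try:
        response = requests.get(url, headers=headers, proxies=proxies)
        return response.text
    except requests.RequestException as e:
        print 'Download error:', e
        return None

# html = download_with_requests('http://example.webscraping.com', proxy='http://10.10.1.10:3128')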
3. Download speed limit
If we crawl a site too quickly, we risk being banned or overloading its server. To reduce these risks, we can add a delay between consecutive downloads to throttle the crawler.
import time
import urlparse
from datetime import datetime

class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
The Throttle class records the time of the last access to each domain and sleeps if the time elapsed since that access is shorter than the specified delay. We call throttle.wait() before each download to rate-limit the crawler.
throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, headers, proxy=proxy, num_retries=num_retries)
4. Avoid crawler traps
One simple way to avoid getting stuck in a crawler trap is to record the depth of the current page, that is, how many links were followed to reach it. When the maximum depth is reached, links found on that page are no longer added to the queue. To do this we change the seen variable from a set to a dictionary that also records the depth at which each page was discovered. To disable this feature, simply set max_depth to a negative number.
def link_crawler(..., max_depth=2):
    seen = {seed_url: 0}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
5. Final version
1.4.4link_crawler4_ultimateversion.py
# coding: utf-8
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue

def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URLs that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URLs that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URLs have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):  # if get_robots(seed_url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
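The listing above is cut off after the download step. Assuming the remainder follows the pieces developed earlier in this chapter (link extraction, depth tracking and queue management), a sketch of the get_robots() helper it relies on and a typical invocation might look like this; get_robots() simply wraps the robotparser usage from the previous section, and the regular expression and user agent names are the ones used earlier:
def get_robots(url):
    """Initialize a robots.txt parser for this domain (sketch)"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp

if __name__ == '__main__':
    # this call would be refused, since BadCrawler is disallowed by robots.txt
    link_crawler('http://example.webscraping.com', '/(index|view)',
                 user_agent='BadCrawler')
    # this call politely crawls the index and country view pages one level deep
    link_crawler('http://example.webscraping.com', '/(index|view)',
                 delay=5, num_retries=1, max_depth=1, user_agent='GoodCrawler')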