In the previous section we downloaded and ran the breadth-first crawler; in this section we will look specifically at how it works.
First, let's look at the source code of html.py.

The first function:
# Imports used throughout html.py (not shown in the original excerpt):
import os
import re
import requests
from urllib.parse import urlparse, urljoin

def get_html(url):
    try:
        par = urlparse(url)
        default_header = {
            'X-Requested-With': 'XMLHttpRequest',
            'Referer': par[0] + '://' + par[1],
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36',
            'Host': par[1]
        }
        html = requests.get(url, headers=default_header, timeout=10)
        if html.status_code != 200:
            return None
        return html.content
    except Exception as e:
        print(e)
        return None
This function fetches the content of the URL (binary content, which can be fed directly into BeautifulSoup for parsing). It looks more complicated than a bare request because of the exception handling, which makes it more reliable, and because of some anti-crawler countermeasures: the Referer and User-Agent headers try to make the request look like it comes from a real browser.
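To see where those header values come from, here is a small sketch (the example URL is hypothetical) showing which fields of the urlparse result feed the Referer and Host headers:

```python
from urllib.parse import urlparse

# Hypothetical example URL, to show which urlparse fields feed the headers.
url = 'http://www.cnblogs.com/itlqs/p/6810721.html'
par = urlparse(url)

# par[0] is the scheme and par[1] the network location, so Referer is the
# site root and Host is the bare domain -- mimicking what a browser sends.
referer = par[0] + '://' + par[1]
host = par[1]
print(referer)  # http://www.cnblogs.com
print(host)     # www.cnblogs.com
```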
The second function:
def full_link(url1, url2, flag_site=True):
    try:
        if url2[0] == '#':
            return None
        filepat = re.compile(r'(.*?)\.(.*?)')
        htmpat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
        u1 = urlparse(url1)
        if filepat.match(u1.path) and not htmpat.match(u1.path):
            return None
        if url1[-1] == '/':
            url1 = url1 + "index.html"
        elif filepat.match(u1.path) is None:
            url1 = url1 + "/index.html"
        url2 = urljoin(url1, url2)
        u2 = urlparse(url2)
        if u1.netloc != u2.netloc and flag_site:
            return None
        return url2
    except Exception as e:
        print(e)
        return None
This function is actually very important: for breadth-first search to keep the while loop running, there has to be a uniform way to process every element taken from the queue, and this function provides it. Given a known page url1 containing an <a> tag whose href attribute is url2, it returns the true full link for url2. If url2 is already a complete link, it is returned as-is; if it is only a relative path, it is resolved first. For example, from the page http://www.cnblogs.com/itlqs, a link to ./p/6810721.html is resolved by this function to http://www.cnblogs.com/itlqs/p/6810721.html. Python's standard library actually ships a urljoin function for this, but testing showed that on its own it does not handle these cases perfectly, so it is wrapped with some extra normalization here. None is returned when an exception occurs or when the link does not meet the requirements (for example an in-page #anchor, or an off-site link when flag_site is set).
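The imperfection mentioned above can be demonstrated directly. A bare urljoin treats a base URL without a trailing file name as a file, so a relative link resolves one directory too high; full_link avoids this by first normalizing the base page to .../index.html (the URLs below are the article's example):

```python
from urllib.parse import urljoin

# Without a file name at the end, urljoin treats 'itlqs' as a file and
# resolves the relative link against the parent directory:
print(urljoin('http://www.cnblogs.com/itlqs', './p/6810721.html'))
# -> http://www.cnblogs.com/p/6810721.html  (one level too high)

# After full_link appends index.html to the base page, urljoin resolves
# the relative link against the intended directory:
print(urljoin('http://www.cnblogs.com/itlqs/index.html', './p/6810721.html'))
# -> http://www.cnblogs.com/itlqs/p/6810721.html
```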
The third function:
def premake(url):  # build the local directory structure required for the URL
    if url[-1] == '/':
        url = url[:-1]
    up = urlparse(url)
    pat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
    path = up.path.split('/')
    name = 'index.html'
    if pat.match(up.path) is not None:
        name = path[-1]
        path = path[:-1]
    dirn = '/'.join(path)
    if up.query != '':
        name = up.query + '-' + name
    os.makedirs(up.netloc + dirn, exist_ok=True)
    return up.netloc + dirn + '/' + name
The purpose of this function is to create a local folder for the URL, so that the site's original directory structure is preserved on disk; any query string is also reflected in the file name.
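The URL-to-path mapping is easier to see with the side effect removed. Below is a sketch of the same logic as a pure function (local_path is a hypothetical name; it mirrors premake but creates no directories):

```python
import re
from urllib.parse import urlparse

def local_path(url):
    """Sketch of premake's URL -> local path mapping, without os.makedirs."""
    if url[-1] == '/':
        url = url[:-1]
    up = urlparse(url)
    pat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
    path = up.path.split('/')
    name = 'index.html'
    if pat.match(up.path) is not None:
        name = path[-1]        # the path already names a page file
        path = path[:-1]
    dirn = '/'.join(path)
    if up.query != '':
        name = up.query + '-' + name   # keep query info in the file name
    return up.netloc + dirn + '/' + name

print(local_path('http://www.cnblogs.com/itlqs/p/6810721.html'))
# -> www.cnblogs.com/itlqs/p/6810721.html
print(local_path('http://www.cnblogs.com/itlqs/'))
# -> www.cnblogs.com/itlqs/index.html
```

A directory URL ending in '/' thus maps to an index.html inside the corresponding local folder.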
The fourth function:
def save(url):
    url = url.replace('\n', '')
    name = premake(url)
    html = get_html(url)
    if html is not None:
        with open(name, 'wb') as f:
            f.write(html)
    return html
It fetches a link and saves it locally, building on the previous three functions.
That is all of html.py. Now let's look at crawler.py. We will skip the parameter setup at the beginning and go straight to the core breadth-first search code.
now = 0
while not q.empty():
    try:
        front = q.get()
        link = front[0]
        depth = front[1]
        print('Crawling:', link)
        page = html.save(link)   # 'html' here is the html.py module
        if page is None:
            continue
        soup = BeautifulSoup(page, 'html.parser', from_encoding='GB18030')
        for a in soup.find_all('a'):
            try:
                url2 = a['href']
                fl = html.full_link(link, url2, flag_site)
                if fl is None:
                    continue
                if (fl not in pool) and (depth + 1 <= flag_depth):
                    pool.add(fl)
                    q.put((fl, depth + 1))
                    print('In queue:', fl)
            except Exception as e:
                print(e)
        now += 1
        if now >= flag_most:
            break
    except Exception as e:
        print(e)
In fact, with the four functions above as a basis, the loop itself is very simple. Each iteration takes a link from the head of the queue, fetches and saves it, then extracts every href on that page, turns each into a full link with full_link, and checks whether it has been seen before; if not (and the depth limit allows), it joins the queue.
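The queue/pool/depth mechanics can be shown in isolation. Here is a minimal, self-contained sketch of the same breadth-first loop over a fake in-memory "site" (the site dict and the crawl function are hypothetical stand-ins for what save() and full_link() would discover on the real web):

```python
from collections import deque

# Fake in-memory site: page -> links found on that page (hypothetical data).
site = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': ['d', 'e'],
    'd': [],
    'e': ['a'],
}

def crawl(start, max_depth):
    """Breadth-first traversal with a visited pool and a depth limit."""
    q = deque([(start, 0)])
    pool = {start}                    # every URL ever enqueued, to avoid repeats
    order = []
    while q:
        link, depth = q.popleft()     # take a link from the head of the queue
        order.append(link)            # "fetch and save" the page
        for nxt in site[link]:        # extract the page's links
            if nxt not in pool and depth + 1 <= max_depth:
                pool.add(nxt)
                q.append((nxt, depth + 1))
    return order

print(crawl('a', 1))  # -> ['a', 'b', 'c']
print(crawl('a', 2))  # -> ['a', 'b', 'c', 'd', 'e']
```

Note that pages are marked visited when enqueued, not when fetched; this is what prevents the same link from entering the queue twice, exactly as the pool set does in crawler.py.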
That is how the program works. The remaining implementation details are best understood by studying the code. The code is certainly not perfect, but it is adequate for simple crawling needs.
2.3 The principle of a breadth-first-search web crawler