2.3 The principle of a web crawler based on breadth-first search


In the previous section we downloaded and ran the breadth-first crawler; in this section we will look at how it actually works.

First, let's look at the source code of html.py.

The first function:

# imports used by the functions in html.py
import os
import re
import requests
from urllib.parse import urlparse, urljoin

def get_html(url):
    try:
        par = urlparse(url)
        # Try to look like a real browser: Referer, User-Agent and Host headers.
        default_header = {'X-Requested-With': 'XMLHttpRequest',
                          'Referer': par[0] + '://' + par[1],
                          'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36',
                          'Host': par[1]}
        html = requests.get(url, headers=default_header, timeout=10)
        if html.status_code != 200:
            return None
        return html.content
    except Exception as e:
        print(e)
        return None

This function fetches the content of the URL (as binary content that can be handed directly to BeautifulSoup for parsing). It looks a little more involved because of the exception handling, which makes it more reliable, and because of some anti-crawling considerations: it tries to simulate a real browser (the Referer and User-Agent headers).
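
As a quick illustration, a minimal usage sketch might look like the following (assuming the functions are saved in html.py as above and imported from there; the URL is just an example):

import html as crawler_html   # the html.py shown in this section
from bs4 import BeautifulSoup

content = crawler_html.get_html('http://www.cnblogs.com/itlqs')
if content is not None:
    soup = BeautifulSoup(content, 'html.parser')
    print(soup.title)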

The second function:

def full_link(url1, url2, flag_site=True):
    try:
        if url2[0] == '#':
            return None
        filepat = re.compile(r'(.*?)\.(.*?)')
        htmpat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
        u1 = urlparse(url1)
        # Skip pages whose path looks like a non-HTML file.
        if filepat.match(u1.path) and not htmpat.match(u1.path):
            return None
        # Append index.html so that urljoin resolves relative links correctly.
        if url1[-1] == '/':
            url1 = url1 + 'index.html'
        elif filepat.match(u1.path) is None:
            url1 = url1 + '/index.html'
        url2 = urljoin(url1, url2)
        u2 = urlparse(url2)
        # Optionally stay on the same site.
        if u1.netloc != u2.netloc and flag_site:
            return None
        return url2
    except Exception as e:
        print(e)
        return None

This function is actually a very critical one. Because the breadth-first search drives the while loop over a queue, every element in the queue has to be handled in a uniform way, and that is why this function matters so much. Its job is: given a page url1 that contains an <a> tag whose href attribute is url2, return the true full link of url2. If url2 is already a complete link it is returned as is; if it is only a relative path it is resolved and then returned. For example, on the page http://www.cnblogs.com/itlqs, a link to ./p/6810721.html becomes http://www.cnblogs.com/itlqs/p/6810721.html after going through this function. Python's built-in urljoin can do this, but testing showed it is not quite perfect for this purpose, so it is adjusted a little here. None is returned when an exception is encountered, or when the link does not meet the requirements.
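
To see the kind of imperfection that is presumably being worked around, here is a small sketch of the plain standard-library urljoin (not the full_link wrapper); the URLs are only examples:

from urllib.parse import urljoin

# Without a trailing slash, urljoin drops the last path component of the base URL:
print(urljoin('http://www.cnblogs.com/itlqs', './p/6810721.html'))
# -> http://www.cnblogs.com/p/6810721.html

# After appending index.html, as full_link does, the link resolves as intended:
print(urljoin('http://www.cnblogs.com/itlqs/index.html', './p/6810721.html'))
# -> http://www.cnblogs.com/itlqs/p/6810721.html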

The third function:

def premake(url):  # create the local directory structure needed for this URL
    if url[-1] == '/':
        url = url[:-1]
    up = urlparse(url)
    pat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
    path = up.path.split('/')
    name = 'index.html'
    if pat.match(up.path) is not None:
        name = path[-1]
        path = path[:-1]
    dirn = '/'.join(path)
    # Keep the query string as part of the file name.
    if up.query != '':
        name = up.query + '-' + name
    os.makedirs(up.netloc + dirn, exist_ok=True)
    return up.netloc + dirn + '/' + name

The purpose of this function is to create a local directory for the URL and work out the local file name, mainly so that the site's original directory structure is preserved locally. The query string, if any, is also reflected in the file name.
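
A rough usage sketch, assuming the html.py above (the sample URL is illustrative, and note that premake creates the directories on disk as a side effect):

import html as crawler_html

local_path = crawler_html.premake('http://www.cnblogs.com/itlqs/p/6810721.html')
print(local_path)
# expected: www.cnblogs.com/itlqs/p/6810721.html
# the directory www.cnblogs.com/itlqs/p should now exist under the working directory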

The fourth function:

def save(url):
    url = url.replace('\n', '')
    filename = premake(url)   # local path to save to (directories created as needed)
    html = get_html(url)
    if html is not None:
        with open(filename, 'wb') as f:
            f.write(html)
    return html

This function fetches a link and saves it locally. It is written on top of the previous three functions.
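
Putting it together, mirroring a single page might look like this (again only a sketch; the URL is an example and the call needs network access):

import html as crawler_html

page = crawler_html.save('http://www.cnblogs.com/itlqs/')
if page is not None:
    print('saved', len(page), 'bytes')   # written to www.cnblogs.com/itlqs/index.html locally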

That covers html.py. Now let's look at crawler.py. Skipping the parameter setup at the top, here is the core code of the breadth-first search.

now = 0
while not q.empty():
    try:
        front = q.get()
        link = front[0]
        depth = front[1]
        print('Crawling:', link)
        page = html.save(link)   # fetch and save; named page so the html module is not shadowed
        if page is None:
            continue
        soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')
        for a in soup.find_all('a'):
            try:
                url2 = a['href']
                fl = html.full_link(link, url2, flag_site)
                if fl is None:
                    continue
                if fl not in pool and depth + 1 <= flag_depth:
                    pool.add(fl)
                    q.put((fl, depth + 1))
                    print('In queue:', fl)
            except Exception as e:
                print(e)
        now += 1
        if now >= flag_most:
            break
    except Exception as e:
        print(e)

With the four functions above in place, this is actually quite easy. Each iteration takes a link from the head of the queue, fetches it and saves it, then extracts all the href attributes on that page, uses full_link to get the full links, and checks whether each one has been seen before; if not, it joins the queue.
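
The loop relies on a few names that are set up earlier in crawler.py and skipped here; a plausible minimal setup might look like this (the concrete values are assumptions, not necessarily the author's):

import queue
import html                      # the html.py from this section
from bs4 import BeautifulSoup

flag_site = True                 # restrict the crawl to the starting site
flag_depth = 2                   # maximum crawl depth
flag_most = 100                  # maximum number of pages to fetch

start_url = 'http://www.cnblogs.com/itlqs/'
pool = set()                     # links that have already been queued
pool.add(start_url)
q = queue.Queue()                # queue of (link, depth) pairs
q.put((start_url, 0))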

That is how the program works. The implementation details can be worked out from the code. The code is certainly not perfect, but for simple crawling needs it is basically enough.
