2.3 The principle of a web crawler based on breadth-first search


In the previous section we downloaded and ran the breadth-first crawler; in this section we will look at how it actually works.

First, let's look at the source code of html.py.

The first function:

# imports used by the functions in html.py
import os
import re
import requests
from urllib.parse import urlparse, urljoin

def get_html(url):
    try:
        par = urlparse(url)
        # Try to look like a real browser: Referer, User-Agent and Host headers.
        default_header = {'X-Requested-With': 'XMLHttpRequest',
                          'Referer': par[0] + '://' + par[1],
                          'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36',
                          'Host': par[1]}
        html = requests.get(url, headers=default_header, timeout=10)
        if html.status_code != 200:
            return None
        return html.content
    except Exception as e:
        print(e)
        return None

This function fetches the content of the URL (as binary content that can be handed directly to BeautifulSoup for parsing). It looks a little more involved because of the exception handling, which makes it more reliable, and because of some anti-crawling considerations: it tries to simulate a real browser (the Referer and User-Agent headers).
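
As a quick illustration, a minimal usage sketch might look like the following (assuming the functions are saved in html.py as above and imported from there; the URL is just an example):

import html as crawler_html   # the html.py shown in this section
from bs4 import BeautifulSoup

content = crawler_html.get_html('http://www.cnblogs.com/itlqs')
if content is not None:
    soup = BeautifulSoup(content, 'html.parser')
    print(soup.title)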

The second function:

def full_link(url1, url2, flag_site=True):
    try:
        if url2[0] == '#':
            return None
        filepat = re.compile(r'(.*?)\.(.*?)')
        htmpat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
        u1 = urlparse(url1)
        # Skip pages whose path looks like a non-HTML file.
        if filepat.match(u1.path) and not htmpat.match(u1.path):
            return None
        # Append index.html so that urljoin resolves relative links correctly.
        if url1[-1] == '/':
            url1 = url1 + 'index.html'
        elif filepat.match(u1.path) is None:
            url1 = url1 + '/index.html'
        url2 = urljoin(url1, url2)
        u2 = urlparse(url2)
        # Optionally stay on the same site.
        if u1.netloc != u2.netloc and flag_site:
            return None
        return url2
    except Exception as e:
        print(e)
        return None

This function is actually a very critical one. Because the breadth-first search drives the while loop over a queue, every element in the queue has to be handled in a uniform way, and that is why this function matters so much. Its job is: given a page url1 that contains an <a> tag whose href attribute is url2, return the true full link of url2. If url2 is already a complete link it is returned as is; if it is only a relative path it is resolved and then returned. For example, on the page http://www.cnblogs.com/itlqs, a link to ./p/6810721.html becomes http://www.cnblogs.com/itlqs/p/6810721.html after going through this function. Python's built-in urljoin can do this, but testing showed it is not quite perfect for this purpose, so it is adjusted a little here. None is returned when an exception is encountered, or when the link does not meet the requirements.
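
To see the kind of imperfection that is presumably being worked around, here is a small sketch of the plain standard-library urljoin (not the full_link wrapper); the URLs are only examples:

from urllib.parse import urljoin

# Without a trailing slash, urljoin drops the last path component of the base URL:
print(urljoin('http://www.cnblogs.com/itlqs', './p/6810721.html'))
# -> http://www.cnblogs.com/p/6810721.html

# After appending index.html, as full_link does, the link resolves as intended:
print(urljoin('http://www.cnblogs.com/itlqs/index.html', './p/6810721.html'))
# -> http://www.cnblogs.com/itlqs/p/6810721.html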

The third function:

def premake(url):  # create the local directory structure needed for this URL
    if url[-1] == '/':
        url = url[:-1]
    up = urlparse(url)
    pat = re.compile(r'(.*?)\.htm$|(.*?)\.html$|(.*?)\.php$|(.*?)\.aspx$')
    path = up.path.split('/')
    name = 'index.html'
    if pat.match(up.path) is not None:
        name = path[-1]
        path = path[:-1]
    dirn = '/'.join(path)
    # Keep the query string as part of the file name.
    if up.query != '':
        name = up.query + '-' + name
    os.makedirs(up.netloc + dirn, exist_ok=True)
    return up.netloc + dirn + '/' + name

The purpose of this function is to create a local directory for the URL and work out the local file name, mainly so that the site's original directory structure is preserved locally. The query string, if any, is also reflected in the file name.
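
A rough usage sketch, assuming the html.py above (the sample URL is illustrative, and note that premake creates the directories on disk as a side effect):

import html as crawler_html

local_path = crawler_html.premake('http://www.cnblogs.com/itlqs/p/6810721.html')
print(local_path)
# expected: www.cnblogs.com/itlqs/p/6810721.html
# the directory www.cnblogs.com/itlqs/p should now exist under the working directory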

The fourth function:

def save(url):
    url = url.replace('\n', '')
    filename = premake(url)   # local path to save to (directories created as needed)
    html = get_html(url)
    if html is not None:
        with open(filename, 'wb') as f:
            f.write(html)
    return html

This function fetches a link and saves it locally. It is written on top of the previous three functions.
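
Putting it together, mirroring a single page might look like this (again only a sketch; the URL is an example and the call needs network access):

import html as crawler_html

page = crawler_html.save('http://www.cnblogs.com/itlqs/')
if page is not None:
    print('saved', len(page), 'bytes')   # written to www.cnblogs.com/itlqs/index.html locally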

That covers html.py. Now let's look at crawler.py. Skipping the parameter setup at the top, here is the core code of the breadth-first search.

now = 0
while not q.empty():
    try:
        front = q.get()
        link = front[0]
        depth = front[1]
        print('Crawling:', link)
        page = html.save(link)   # fetch and save; named page so the html module is not shadowed
        if page is None:
            continue
        soup = BeautifulSoup(page, 'html.parser', from_encoding='gb18030')
        for a in soup.find_all('a'):
            try:
                url2 = a['href']
                fl = html.full_link(link, url2, flag_site)
                if fl is None:
                    continue
                if fl not in pool and depth + 1 <= flag_depth:
                    pool.add(fl)
                    q.put((fl, depth + 1))
                    print('In queue:', fl)
            except Exception as e:
                print(e)
        now += 1
        if now >= flag_most:
            break
    except Exception as e:
        print(e)

With the four functions above in place, this is actually quite easy. Each iteration takes a link from the head of the queue, fetches it and saves it, then extracts all the href attributes on that page, uses full_link to get the full links, and checks whether each one has been seen before; if not, it joins the queue.
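
The loop relies on a few names that are set up earlier in crawler.py and skipped here; a plausible minimal setup might look like this (the concrete values are assumptions, not necessarily the author's):

import queue
import html                      # the html.py from this section
from bs4 import BeautifulSoup

flag_site = True                 # restrict the crawl to the starting site
flag_depth = 2                   # maximum crawl depth
flag_most = 100                  # maximum number of pages to fetch

start_url = 'http://www.cnblogs.com/itlqs/'
pool = set()                     # links that have already been queued
pool.add(start_url)
q = queue.Queue()                # queue of (link, depth) pairs
q.put((start_url, 0))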

That is how the program works. The implementation details can be worked out from the code. The code is certainly not perfect, but for simple crawling needs it is basically enough.
