Python data acquisition: starting the crawler

One: Traversing a single domain name
A web crawler fetches a target page, extracts the data it needs from it, then follows the links it finds on that page and repeats the process recursively.
Step one: Get all the links on a page
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# print every link address found on the page
for link in bsObj.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Running this, we find some useless entries in the output: some href values are only in-page anchors. We can filter those out with a regular expression and keep only the links that end in .html.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# keep only links whose href ends in .html
for link in bsObj.find_all("a", href=re.compile(r".*\.html")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Step two: Crawl pages recursively
In step one we collected all the link addresses on a single page; in step two we follow those links and pull further links and information out of the linked pages.
For example, starting from the Wikipedia entry for Python, we collect links to its related entries. Not every link on the page is one we care about, so a regular expression filters part of them out. And since links lead to ever more links that we cannot exhaust, the crawler simply picks one entry at random each time and continues from there.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

# seed the random number generator
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # only internal /wiki/ links that contain no colon (i.e. article pages)
    return bsObj.find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Python")

while len(links) > 0:
    # print(links)
    # randomly pick one link and continue crawling from it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
Run results: it produced about 150 entries in one minute, and unless stopped manually it will keep crawling indefinitely.
Two: Crawling an entire site
Here we collect all the links of an entire site. Of course, for sites as large as Wikipedia, collecting all of the data is basically impossible.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find_all("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("The page is missing some attributes, but don't worry!")
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # we ran into a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Run results
The principle of recursive crawling: keep a global set of pages already visited, and for each new page print its title, first paragraph and edit link, then recurse into every /wiki/ link that has not been seen before.
Three: Collecting data with Scrapy
Tall buildings are stacked up brick by brick from the simplest materials, and writing a web crawler likewise means repeating many simple operations: find the key information on a page, find the outbound links, and loop. The Scrapy library greatly reduces the work of looking up page links (no need for piles of filters and regular expressions) and also lowers the complexity of identifying the content you want.
Reference: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html
The first step is to create the Scrapy project
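The linked tutorial uses "tutorial" as the project name (which matches the project created below), so the command looks like this:

scrapy startproject tutorial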
Error: Scrapy is not installed, so open cmd and run pip install scrapy.
Another error: Microsoft Visual C++ 14 is not installed.
After installing it and reinstalling Scrapy successfully, execute the command again and the tutorial project is created.
After successful creation, the directory structure is as follows
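For reference, a freshly created project typically looks roughly like this (the exact set of files varies a little between Scrapy versions):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py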
The second step is to define the data to collect by modifying the item definition (see the official tutorial).
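As a sketch based on the linked tutorial (which collects a title, a link and a description for each entry), items.py would be modified to something like this:

import scrapy

class DmozItem(scrapy.Item):
    # the fields we want to collect for every scraped entry
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()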
The third step is to create a spider class (again following the official tutorial).
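A spider along the lines of the tutorial's first example goes into the spiders/ directory; the class name, spider name and start URLs below are the tutorial's, so treat this as a sketch rather than my exact code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # the name used later with "scrapy crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # simplest possible parse: save each downloaded page to a local file
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)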
The fourth step is to enter the project directory and run the crawler.
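Assuming the spider is named "dmoz" as in the sketch above, the commands are:

cd tutorial
scrapy crawl dmoz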
This fails with an error about a missing win32 library; install it with pip install pywin32.
Running the crawler again succeeds.
The first Scrapy "hello world" is basically complete. The process is roughly as follows:
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method to each request as its callback.
The requests are scheduled and executed, and the resulting scrapy.http.Response objects are sent back to the spider's parse() method.
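To tie this back to the recursive crawling above, here is a minimal sketch of how parse() can yield both scraped data and follow-up requests; the spider name and selectors are my own illustration, not from the tutorial, and Scrapy takes care of scheduling and duplicate filtering:

import scrapy

class WikiFollowSpider(scrapy.Spider):
    name = "wiki_follow"
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # scraped data is simply yielded as a dict (or an Item)
        yield {"title": response.css("title::text").get()}
        # yielding a Request queues another page with the same callback;
        # Scrapy skips URLs it has already seen
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse)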
If it turns out to be useful, I will continue to study Scrapy further.