Python data acquisition: getting started with web crawlers


One: Traversing a single domain name

A web crawler fetches a target page, extracts the data it needs, and then follows the links it finds to keep traversing, repeating the same process on each new page.

Step one: Get all of the links on a page

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Get every link address on the page
for link in bsObj.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Run

The output contains some useless entries: some href values are only in-page anchors used to jump between page sections. We can filter those out with a regular expression and keep only the links that end in .html.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# Only keep links whose href ends in .html
for link in bsObj.find_all("a", href=re.compile(r".*\.html")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Step two: Crawl the linked pages recursively

In step one we collected essentially all of the link addresses on a single page; in step two we obviously want to follow those links and collect the links and information on the linked pages as well.

For example, starting from the Python entry on Wikipedia, we collect the links to related entries. Not every link is one we care about, so a regular expression is needed to filter out a portion of them; and since links lead to more links without end, we cannot exhaust them all, so we simply pick entries at random to continue crawling.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

# Seed the random generator with the current time
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # Only internal /wiki/ links that contain no colon (i.e. real entry pages)
    return bsObj.find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Python")

while len(links) > 0:
    # Randomly pick one of the collected links and keep crawling from there
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Run results (roughly 150 entries per minute; unless it is stopped manually, the crawl does not stop on its own)

Two: Capturing an entire site

This collects all of the links on an entire site. Of course, for a site as large as Wikipedia, collecting everything is basically impossible.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find_all("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("The page is missing some attributes, but don't worry!")
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

Run results

Principle of the recursive crawl: every /wiki/ link found on a page is added to the global pages set, and getLinks is called recursively on each link that has not been seen before, until no new links turn up.

Three: Collecting data with Scrapy

Tall buildings are stacked up from the simplest bricks, and writing a web crawler is likewise a lot of simple, repetitive operations: find the key information on a page and its outgoing links, then loop. The Scrapy library can greatly reduce the work of looking up page links (no need for lots of filters and regular expressions) and also reduces the complexity of identifying page content.

Usage reference: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html

The first step is to create the Scrapy project

Error: Scrapy is not installed. From the command line, run pip install scrapy.

Error: Microsoft Visual C++ 14 is not installed.

After reinstalling successfully, execute the command again.

The project is named tutorial.

After successful creation, the directory structure is as follows
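The original screenshot is not reproduced here; a project created with scrapy startproject tutorial typically looks roughly like the layout below (taken from the official tutorial; individual files may differ by Scrapy version):

tutorial/
    scrapy.cfg            # deployment configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py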

The second step is to define the data to extract by modifying the Item class (see the official tutorial).
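As a sketch, following the dmoz example in the linked tutorial (the field names title, link and desc are the tutorial's examples, not required names):

# tutorial/items.py
import scrapy

class DmozItem(scrapy.Item):
    # Define one Field per piece of data you want to collect
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()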

The third step is to create a spider class (see the official tutorial).
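Again as a sketch based on the linked tutorial's dmoz spider (the name, domains and start URLs are the tutorial's examples):

# tutorial/spiders/dmoz_spider.py
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # unique name used when running the crawler
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # Called with the downloaded response of each start URL;
        # here it simply saves the page body to a local file
        filename = response.url.split("/")[-2] + ".html"
        with open(filename, "wb") as f:
            f.write(response.body)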

The fourth step is to go into the project directory and run the crawler.
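Assuming the spider is named dmoz as in the sketch above, the commands look roughly like this:

cd tutorial
scrapy crawl dmoz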

Error: the win32 library is missing.

pip install pywin32

Running it again succeeds.

The first Scrapy "Hello World" is basically complete, and the process is roughly as follows:

Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method to the request as its callback.

The request objects are scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the spider's parse() method.
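In other words, relying on start_urls is roughly equivalent to building the requests yourself. A sketch of the explicit form, reusing the hypothetical DmozSpider from above:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"

    def start_requests(self):
        # Build a Request per URL and attach parse() as the callback explicitly
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Receives the scrapy.http.Response generated for each request
        self.log("Got response from %s" % response.url)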

If it proves useful, Scrapy is worth studying further.
