Python data acquisition: starting the crawler

One: Traversing a single domain name
A web crawler fetches a target page, extracts the data it needs from it, then follows the links it finds on that page and repeats the process recursively.
Step one: Get all the links on a page
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# print every link address found on the page
for link in bsObj.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Running this, we find some useless entries in the output: some href values are only in-page anchors. We can filter those out with a regular expression and keep only the links that end in .html.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://www.yahoo.com/")
html_str = html.read().decode('utf-8')
# print(html_str)
bsObj = BeautifulSoup(html_str, "html.parser")
# keep only links whose href ends in .html
for link in bsObj.find_all("a", href=re.compile(r".*\.html")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Step two: Crawl pages recursively
In step one we collected all the link addresses on a single page; in step two we follow those links and pull further links and information out of the linked pages.
For example, starting from the Wikipedia entry for Python, we collect links to its related entries. Not every link on the page is one we care about, so a regular expression filters part of them out. And since links lead to ever more links that we cannot exhaust, the crawler simply picks one entry at random each time and continues from there.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random

# seed the random number generator
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    # only internal /wiki/ links that contain no colon (i.e. article pages)
    return bsObj.find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Python")

while len(links) > 0:
    # print(links)
    # randomly pick one link and continue crawling from it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
Run results: it produced about 150 entries in one minute, and unless stopped manually it will keep crawling indefinitely.
Two: Crawling an entire site
Here we collect all the links of an entire site. Of course, for sites as large as Wikipedia, collecting all of the data is basically impossible.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find_all("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("The page is missing some attributes, but don't worry!")
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # we ran into a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Run results
The principle of recursive crawling: keep a global set of pages already visited, and for each new page print its title, first paragraph and edit link, then recurse into every /wiki/ link that has not been seen before.
Three: Collecting data with Scrapy
Tall buildings are stacked up brick by brick from the simplest materials, and writing a web crawler likewise means repeating many simple operations: find the key information on a page, find the outbound links, and loop. The Scrapy library greatly reduces the work of looking up page links (no need for piles of filters and regular expressions) and also lowers the complexity of identifying the content you want.
Reference: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html
The first step is to create the Scrapy project
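The linked tutorial uses "tutorial" as the project name (which matches the project created below), so the command looks like this:

scrapy startproject tutorial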
Error: Scrapy is not installed, so open cmd and run pip install scrapy.
Another error: Microsoft Visual C++ 14 is not installed.
After installing it and reinstalling Scrapy successfully, execute the command again and the tutorial project is created.
After successful creation, the directory structure is as follows
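For reference, a freshly created project typically looks roughly like this (the exact set of files varies a little between Scrapy versions):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py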
The second step is to define the data to collect by modifying the item definition (see the official tutorial).
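As a sketch based on the linked tutorial (which collects a title, a link and a description for each entry), items.py would be modified to something like this:

import scrapy

class DmozItem(scrapy.Item):
    # the fields we want to collect for every scraped entry
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()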
The third step is to create a spider class (again following the official tutorial).
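A spider along the lines of the tutorial's first example goes into the spiders/ directory; the class name, spider name and start URLs below are the tutorial's, so treat this as a sketch rather than my exact code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                      # the name used later with "scrapy crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # simplest possible parse: save each downloaded page to a local file
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)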
The fourth step is to enter the project directory and run the crawler.
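Assuming the spider is named "dmoz" as in the sketch above, the commands are:

cd tutorial
scrapy crawl dmoz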
This fails with an error about a missing win32 library; install it with pip install pywin32.
Running the crawler again succeeds.
The first Scrapy "hello world" is basically complete. The process is roughly as follows:
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method to each request as its callback.
The requests are scheduled and executed, and the resulting scrapy.http.Response objects are sent back to the spider's parse() method.
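To tie this back to the recursive crawling above, here is a minimal sketch of how parse() can yield both scraped data and follow-up requests; the spider name and selectors are my own illustration, not from the tutorial, and Scrapy takes care of scheduling and duplicate filtering:

import scrapy

class WikiFollowSpider(scrapy.Spider):
    name = "wiki_follow"
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # scraped data is simply yielded as a dict (or an Item)
        yield {"title": response.css("title::text").get()}
        # yielding a Request queues another page with the same callback;
        # Scrapy skips URLs it has already seen
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/wiki/"):
                yield response.follow(href, callback=self.parse)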
If it turns out to be useful, I will continue to study Scrapy further.