Crawling dynamic pages with Python Scrapy

Source: Internet
Author: User

Preface: Besides my recent study work, a friend asked me to crawl dynamic web pages: enter a keyword and crawl the patent descriptions that a patent website returns for that keyword. In the past, plain Python urllib2 was enough, but that only works for static pages; for dynamic pages generated with JS and the like, it apparently does not work (I never tried). After looking around online, I found that Scrapy combined with the selenium package seems able to do it. (The reason for the hedging is that I have not actually pulled it off yet; I am just recording it here first.)

#===================== My understanding of the brief introduction on the official website =====================

First, install the two packages Scrapy and selenium:

I am on Ubuntu and already have Anaconda, pip, and easy_install, so each package can be installed in a single step with pip (or easy_install):

pip install -U selenium
pip install scrapy
easy_install -U selenium
easy_install scrapy

Next, create a new project with Scrapy by running the following command in the terminal:

scrapy startproject tutorial
The following folder structure is generated automatically:

Figure 1: Folder structure after creating a new project
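
Since the figure is just a directory listing, here is roughly what scrapy startproject generates (newer Scrapy versions add a few more files, such as middlewares.py):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py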

Next, start writing the project:

First, define some fields in the items.py file:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Then create a new dmoz_spider.py file under the tutorial/spiders folder:


import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
In this file, you need to define three things: the name and start_urls variables and the parse method.

Finally, run the following in the terminal from the outermost project folder:

scrapy crawl dmoz

Here dmoz is the value of name, one of the important variables in the DmozSpider class defined in the new file under the tutorial/spiders folder.

With that, the crawl can start.

#=============================================

A fellow blogger's post: http://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/

It covers crawling dynamic sites and shares the full project code on GitHub. (Download the code first, then read the documentation inside it.)

Its gouwu.sogou.com task is to crawl dynamically generated page information.

For my own task, I modified its gouwu.sogou.com/etao/lstdata.py file: the lst list variable inside the LstData class holds the search keywords, which are passed into the spider.py file to compose the URLs before crawling starts, as sketched below.
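
A minimal sketch of that idea, assuming a simple query-string URL pattern (the domain, spider name, and keywords here are placeholders I made up, not the project's actual values):

# -*- coding: utf-8 -*-
import scrapy

lst = [u'keyword1', u'keyword2']   # search keywords, like the lst variable described above

class KeywordSpider(scrapy.Spider):
    name = 'keyword_demo'          # hypothetical spider name
    # compose one start URL per keyword
    start_urls = ['http://example.com/search?q=%s' % kw for kw in lst]

    def parse(self, response):
        # the real project extracts items from each keyword's result page here
        pass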

Analyzing the blogger's code, I could not find where the crawled dynamic page information is actually handled, so I added my own code to the spider.py file:

    def parse(self, response):
        # crawl all display pages
        for link in self.link_extractor['page_down'].extract_links(response):
            yield Request(url=link.url, callback=self.parse)

        # browser
        self.browser.get(response.url)
        time.sleep(5)

        # get the data and write it to scrapy items
        etaoItem_loader = ItemLoader(item=EtaoItem(), response=response)
        url = str(response.url)
        etaoItem_loader.add_value('url', url)
        etaoItem_loader.add_xpath('title', self._x_query['title'])
        etaoItem_loader.add_xpath('name', self._x_query['name'])
        etaoItem_loader.add_xpath('price', self._x_query['price'])

        # ====================================
        # for link in self.link_extractor['page_down'].extract_links(response):
        #     yield Request(url=link.url, callback=self.parse_detail)
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link2 = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            for i in title:
                print i,
            for j in link2:
                print j, "+++++++++++++"
        # ====================================

        yield etaoItem_loader.load_item()

This can already extract some things, but not enough; I still need to keep analyzing the source of the returned dynamic pages and adjust the link extractor and the selectors (not started yet) to get the results I want. So far the selenium package seems to go unused (see the sketch below for how it could be wired in).
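
For reference, a minimal sketch (my own assumption, not the blogger's code) of one way selenium could actually contribute: fetch the page in the browser, wait for the JS to render, and build a Scrapy Selector from browser.page_source instead of parsing the raw response:

# minimal sketch, assuming Firefox and its WebDriver are installed; the XPath is a placeholder
import time
from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Firefox()

def parse_rendered(url):
    browser.get(url)
    time.sleep(5)                     # crude wait for the JS-generated content to load
    sel = Selector(text=browser.page_source)
    return sel.xpath('//ul/li/a/text()').extract()

For example, parse_rendered('http://example.com/') would return the link texts that only exist after the JavaScript has run.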

#=============================================

Some references I consulted:

Scrapy official site: http://doc.scrapy.org/en/latest/intro/tutorial.html

Selenium official site: http://selenium-python.readthedocs.org/

Scrapy selectors: http://doc.scrapy.org/en/0.24/topics/selectors.html#topics-selectors

Fellow blogger's post: http://chenqx.github.io/2014/12/23/Spider-Advanced-for-Dynamic-Website-Crawling/
Fellow blogger's post: http://chenqx.github.io/2014/11/09/Scrapy-Tutorial-for-BBSSpider/

Fellow blogger's post (Selenium): http://www.cnblogs.com/fnng/archive/2013/05/29/3106515.html

Access problem: http://my.oschina.net/HappyRoad/blog/173510

