The birth of a website 02--crawl data with Scrapy

Source: Internet
Author: User
Tags: xpath


If you want to collect data from other websites, you need a crawler; in the industry it is called a crawler or a spider.


Open-source crawlers exist in many languages: C++, Java, PHP and so on. Search GitHub with "Spider C++" as the keyword and you get 245 open-source crawlers; with "Spider Java", 48. And Python? 156 of them.


Crawler technology is already very mature in the industry, and there are many open-source frameworks. With their help you can write a crawler quickly; a few hours is enough to produce something usable. Crawler technology can also get very complex: if you want distributed crawling and full-text retrieval, Nutch is generally the choice.


The best-known Python crawler framework is Scrapy. The name is also interesting; presumably it takes a few characters each from "Spider", "crawler" and "Python". Its official homepage is at http://scrapy.org/.


1. Installing Scrapy
Installing Scrapy is simple. On a Linux system one command is enough: "sudo pip install Scrapy". Python is usually installed by default on Linux distributions. Pip is a Python package management tool; after you run the install command it downloads and installs the latest version of Scrapy from the web. After installation, typing "scrapy version" on the command line prints the Scrapy version, which shows that the installation succeeded.


2. Build the project directory
If you want to crawl Dianping (the Chinese restaurant review site), you first need to create a project. Choose an empty directory, for example /tmp/scrapy-test on my machine, and enter "scrapy startproject crawdp" on the command line; startproject is one of Scrapy's commands. After it runs, a new directory crawdp appears under /tmp/scrapy-test, containing everything the crawdp project needs, such as source code and configuration files. Run "tree crawdp" to see the whole directory structure, roughly as shown below.
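For reference, this is approximately what "tree crawdp" prints for a freshly created project; the exact file list may differ slightly between Scrapy versions:
-----------------------------------------
crawdp
|-- scrapy.cfg          (project configuration file)
`-- crawdp              (the project's Python module)
    |-- __init__.py
    |-- items.py        (item definitions)
    |-- pipelines.py    (item pipelines)
    |-- settings.py     (project settings)
    `-- spiders         (directory where spiders live)
        `-- __init__.py
-----------------------------------------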


So what is really going on behind the grand-sounding words "startproject" and "create project"? It is not complicated: Scrapy creates a directory named after the project along with its subdirectories, copies a few predefined files into them, and then replaces a few specific strings in those files with the project name. This code lives in the startproject.py file under the scrapy/command/commands directory of the Scrapy source, and the files seen in "tree crawdp" have their prototypes in the scrapy/templates/project/module directory. For example, take the settings.py.tmpl file, replace the string "$project_name" inside it with "crawdp", then rename the file to settings.py, and you get crawdp's settings.py.


Files such as settings.py.tmpl are also called template files, and replacing the variables in a template with strings is called rendering (render). The same technique is used in web server development: front-end engineers design HTML files containing variables such as "$project_name" and save them as template files; when the server is running, it replaces the variables in the template with strings generated at runtime and sends the substituted content to the browser. That is the page we see, which is why it is called a dynamic page.
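To make the rendering idea concrete, here is a minimal sketch using Python's standard string.Template. It is not Scrapy's actual code, just the same substitution idea applied to a one-line stand-in for a file like settings.py.tmpl:
-----------------------------------------
from string import Template

# a tiny stand-in for a template file such as settings.py.tmpl
tmpl = Template("BOT_NAME = '$project_name'\n")

# rendering: replace the $project_name variable with the real project name
rendered = tmpl.substitute(project_name="crawdp")

print rendered        # BOT_NAME = 'crawdp'
-----------------------------------------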


3. The first spider
Let's write the first spider. It is a very simple one, only 13 lines of code.
Under the /tmp/scrapy-test/crawdp/crawdp/spiders directory, create a file cityid_spider.py with the following content:
-----------------------------------------
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class CityidSpider(BaseSpider):
    name = "cityid_spider"
    start_urls = ["http://www.dianping.com/shanghai/food"]

    def parse(self, response):
        # wrap the downloaded page in an XPath selector
        hxs = HtmlXPathSelector(response)
        # all hrefs under <ul class="nc_list"><li><a ...>
        xs = hxs.select('//ul[@class="nc_list"]/li/a/@href')
        x = xs[0]
        # e.g. "/search/category/1/10/r801" -> "1"
        cityid = x.extract().split('/')[3]
        print "\n\n\ncityid = %s\n\n\n" % cityid
-----------------------------------------
Return to the /tmp/scrapy-test/crawdp/ directory and execute "scrapy crawl cityid_spider" there. You will see "cityid = 1" in the output. With the default configuration a lot of log messages are printed as well, which is why the print statement adds a few line breaks around the result, to make it easier to spot.


Note several key points here:
A. The file cityid_spider.py must be placed in the /tmp/scrapy-test/crawdp/crawdp/spiders directory; put it anywhere else and Scrapy cannot find it.
B. In the source code, name = "cityid_spider" is required; it specifies the spider's name, and Scrapy tells spiders apart by name.
C. "scrapy crawl cityid_spider" must be executed under /tmp/scrapy-test/crawdp/, not in any other directory.
Everything else is flexible: you can change the class name CityidSpider to something else, and you can rename the file cityid_spider.py; neither affects the result.


This spider is the basic template for all the spiders to come; the more complex spiders are built on top of it by adding more functionality.


4. What does this spider do?
On Dianping, every city has a numeric code: Shanghai is 1, Beijing is 2, Shenzhen is 7. The company is headquartered in Shanghai, so Shanghai gets number 1. If you want to crawl restaurants in more cities, you need to know the city codes. Shanghai's city code can be extracted from http://www.dianping.com/shanghai/food. Open http://www.dianping.com/shanghai/food in a browser; on the left side, below "business districts", there is an entry "Lujiazui". Right-click it and choose "Inspect element". If you are using Chrome, the browser opens a panel and highlights the HTML element corresponding to Lujiazui. That element contains a hyperlink, "/search/category/1/10/r801", so the cityid of Shanghai is the "1" that follows "category" in the link. The link is nested inside a fixed hierarchy of HTML elements, the structure described by the expression '//ul[@class="nc_list"]/li/a/@href'; such an expression is called an XPath.
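If you want to try that XPath outside of Scrapy, here is a small sketch using lxml (assumptions: lxml is installed, and the HTML fragment below is a simplified stand-in for the real Dianping page, not its actual markup):
-----------------------------------------
from lxml import html

# simplified stand-in for the relevant part of the page
fragment = """
<ul class="nc_list">
  <li><a href="/search/category/1/10/r801">Lujiazui</a></li>
  <li><a href="/search/category/1/10/r802">another district</a></li>
</ul>
"""

tree = html.fromstring(fragment)
# same XPath as in the spider: all hrefs under <ul class="nc_list">
hrefs = tree.xpath('//ul[@class="nc_list"]/li/a/@href')
print hrefs[0]        # /search/category/1/10/r801
-----------------------------------------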


When performing "Scrapy crawl Cityid_spider", the spider starts running, checking the spider's source code for "Start_urls = [" Http://www.dianping.com/shanghai/food " ] ", according to this URL, from the public comments on the Web crawl HTML file content, there is response, and then call the parse function, from the response with XPath parse out all conform to"//ul[@class =\ "nc_list\"]/li/a/@ href "HTML element, take the first one, get a string similar to"/search/category/1/10/r801 ", and then divide it with '/' for the interval, take 3rd, that is," 1 ", is Cityid, This is what the first four statements of the parse function do.


This section is a bit dry and uses a lot of jargon. If you want to do crawling, you need at least a little HTML, CSS, web server and XPath knowledge; the basics are enough. Those topics are not covered here; find an introductory book if you need them.
