-- Installing the Scrapy framework
First become root, check whether pyOpenSSL is already importable, then install the build dependencies and Scrapy itself:

    $ su
    # python
    >>> import OpenSSL
    >>> quit()
    # apt-get install python-dev
    # apt-get install libevent-dev
    # apt-get install python-pip
    # pip install Scrapy

-- Error "Error: caught exception reading instance data Traceback (most recent call last)": add the following to settings.py to disable the S3 download handler:

    DOWNLOAD_HANDLERS = {'s3': None,}

-- Prompt "no active project": cd into the Scrapy project you created before running the command.

-- "Error while obtaining start requests": add the http:// prefix to the URLs in start_urls.

-- Error "name 'Selector' is not defined": add the import

    from scrapy.selector import Selector

-- Official tutorial URL: http://scrapy-chs.readthedocs.org/zh_CN/latest/intro/overview.html

-- Verifying a Python Scrapy installation: start python and try import lxml and import OpenSSL. If either fails, run the commands again:

    sudo apt-get install python-dev
    sudo apt-get install libevent-dev
    apt-get install python-pip
    pip install scrapy

-- Get help:

    scrapy fetch --help

-- Start the Scrapy shell:

    scrapy shell 'http://scrapy.org' --nolog

-- XPath in the shell:

    sel.xpath("//h3/text()").extract()[2]   # [2] is the third element of the list
    sel.xpath("//title/text()").extract()

-- Garbled print output: print the text as a unicode string, e.g.

    print u'\u4e0d\u662f\u4e71\u7801'   # displays the Chinese for "not garbled"

-- File operations.

-- hxs and xxs are missing from the scrapy shell: this is a Scrapy version issue; newer versions replace them with the unified sel selector.

-- Browse a new URL from within the shell:

    fetch("http://www.dmoz.org/Computers/Programming/Languages/Python/Books/")

-- Browse the response in Firefox: run view(response), then right-click on the page Firefox opens and choose "Inspect Element" to inspect elements.

-- Quit the Scrapy shell with Ctrl+D.

-- Scrapy settings control when the crawl terminates.

-- Scrapy settings control the crawling rules.

-- Scrapy: the first program

1. Create a new project:

    scrapy startproject tutorial

2. Define the target: modify items.py and add your own class. The header

    from scrapy.item import Item, Field

is fixed. An item class is declared from scrapy.item.Item, and its attributes are scrapy.item.Field objects:

    class DmozItem(Item):   # rename the class to DmozItem
        title = Field()     # the title field
        link = Field()
        desc = Field()
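A Scrapy Item behaves like a dict restricted to its declared Fields. The following stdlib-only sketch mimics that behavior for illustration; the Item and DmozItem classes here are stand-ins, not the real scrapy.item classes (which implement the same restriction via a metaclass):

```python
# Stdlib-only stand-in: a dict whose keys are limited to declared fields,
# mirroring how a real Scrapy Item rejects undeclared keys.
class Item(dict):
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        dict.__setitem__(self, key, value)

class DmozItem(Item):
    fields = ("title", "link", "desc")

item = DmozItem()
item["title"] = "Python Books"   # allowed: "title" is declared
try:
    item["author"] = "unknown"   # rejected: "author" is not declared
except KeyError as e:
    print(e)
```

This is why a typo in a field name fails loudly in a spider instead of silently storing data under the wrong key.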
3. Make the spider: first crawl, then extract. Create dmoz_spider.py under the spiders folder, saved in the tutorial/spiders directory, with the following code:

    from scrapy.spiders import Spider
    from scrapy.selector import Selector
    from tutorial.items import DmozItem   # import DmozItem from items.py in the tutorial folder

    class DmozSpider(Spider):
        name = "dmoz"                      # spider name, must be unique
        allowed_domains = ["dmoz.org"]     # allowed domains: the spider's range of activity
        start_urls = [                     # the spider's starting URLs
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath('//ul/li')
            items = []
            for site in sites:
                item = DmozItem()
                item['title'] = site.xpath('a/text()').extract()
                item['link'] = site.xpath('a/@href').extract()
                item['desc'] = site.xpath('text()').extract()
                items.append(item)
            return items

Save, then run the spider from the console:

    scrapy crawl dmoz   # start the spider

The log line [scrapy] INFO: Spider closed (finished) indicates a successful run.

-- Creating a Scrapy program: scrapy startproject xxx automatically creates the xxx folder and generates inside it:

    scrapy.cfg      project configuration
    spiders/        the spider cage: every spider lives in this directory
    items.py        defines the data structures to extract; the project "workers" that carry the information brought back by the spiders
    pipelines.py    pipeline definitions; each pipeline further processes the data extracted into items
    settings.py     crawler configuration

-- Creating a spider: after scrapy startproject xxx, run scrapy genspider example example.com, then cd in and modify the generated .py file.

-- Starting a finished crawler project: inside the created folder, run scrapy crawl <name> (the name as defined in the spider).

-- Error ImportError: No module named cryptography.hazmat.bindings._openssl:

    su
    apt-get install libffi-dev
    pip install pyopenssl cryptography

-- Crawler file: to get the whole page back, print the response body in parse:

    def parse(self, response):
        print response.body
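The extraction pattern in parse() above can be sketched offline without Scrapy. This stdlib-only version uses xml.etree.ElementTree in place of scrapy.selector.Selector, over a hypothetical HTML fragment; the real spider applies the same per-row logic to response.body:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE = """
<ul>
  <li><a href="http://example.com/a">Book A</a> first desc</li>
  <li><a href="http://example.com/b">Book B</a> second desc</li>
</ul>
"""

def parse(body):
    root = ET.fromstring(body)
    items = []
    for li in root.findall("li"):            # same role as sel.xpath('//ul/li')
        a = li.find("a")
        items.append({
            "title": a.text,                 # cf. site.xpath('a/text()')
            "link": a.get("href"),           # cf. site.xpath('a/@href')
            "desc": (a.tail or "").strip(),  # cf. site.xpath('text()')
        })
    return items

print(parse(SAMPLE))
```

ElementTree only supports a small XPath subset, so attribute values are read with .get() instead of an @href path step; the structure of the loop, however, is the same as in the spider.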