Everyone says that Scrapy does not work with Python 3 on Windows. Here is a solution.
1, Introduction
"Scrapy of the structure of the first" article on the Scrapy architecture, this article on the actual installation run Scrapy crawler. This article takes the official website tutorial as the example, the complete code may download in the GitHub.
2, Operating environment configuration
The test environment is: Windows 10, Python 3.4.3 (32-bit).
Install Scrapy: $ pip install scrapy  # during the actual installation, the download aborted halfway several times due to unstable server connections
3, Writing and running the first Scrapy crawler
3.1. Create a new project: tutorial
$ scrapy startproject tutorial
The generated project directory structure (shown as a screenshot in the original post) follows the standard Scrapy layout: a scrapy.cfg file plus a tutorial/ package containing items.py, pipelines.py, settings.py, and a spiders/ directory.
3.2. Define the item to crawl
# -*- coding: utf-8 -*-

# Define the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
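To see what these field declarations buy you, here is a minimal stdlib-only sketch. The Item stand-in class below is hypothetical, written only to mimic how scrapy.Item behaves like a dict while rejecting undeclared keys; the real scrapy.Item does considerably more (field metadata, copying, serialization).

```python
# Hypothetical stand-in mimicking scrapy.Item's dict-like behavior.
class Item(dict):
    fields = ()

    def __setitem__(self, key, value):
        # Only keys declared in `fields` may be assigned,
        # just as scrapy.Item only accepts declared scrapy.Field names.
        if key not in self.fields:
            raise KeyError("%r is not a declared field" % key)
        dict.__setitem__(self, key, value)


class DmozItem(Item):
    fields = ("title", "link", "desc")


item = DmozItem()
item["title"] = ["About"]
item["link"] = ["/docs/en/about.html"]
print(item)            # reads like a plain dict
# item["price"] = 10   # would raise KeyError: undeclared field
```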
3.3. Define the spider
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
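The extraction logic in parse can be tried out without running Scrapy at all. The sketch below reproduces it with the standard library's xml.etree.ElementTree, whose limited XPath support covers these expressions. The HTML fragment is made up for illustration, and using the a element's tail text for desc is only a rough stand-in for Scrapy's text() selector; Scrapy's own selectors are far more tolerant of real-world HTML.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for a dmoz.org listing page.
html = """
<div>
  <ul>
    <li><a href="/docs/en/about.html">About</a> a description</li>
    <li><a href="/docs/en/add.html">Suggest a Site</a> another description</li>
  </ul>
</div>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('.//ul/li'):        # same idea as response.xpath('//ul/li')
    a = li.find('a')
    items.append({
        'title': a.text,                   # like sel.xpath('a/text()')
        'link': a.get('href'),             # like sel.xpath('a/@href')
        'desc': (a.tail or '').strip(),    # rough stand-in for sel.xpath('text()')
    })

print(items)
```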
3.4. Run
$ scrapy crawl dmoz -o items.json
1) Errors on the first run:
a) ImportError: cannot import name '_win32stdio'
b) ImportError: No module named 'win32api'
2) Troubleshooting: according to the official FAQ and StackOverflow, Scrapy has not been tested thoroughly on Python 3, so small problems remain.
3) Solution:
a) Manually download _win32stdio and _pollingfile from twisted/internet and place them in Lib\site-packages\twisted\internet under the Python directory.
b) Download and install pywin32.
Run it again: success! You can see Scrapy's output on the console. After it finishes and exits, open the result file items.json in the project directory to see the crawl results stored in JSON format.
[
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []}
]
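As the output shows, many of the page's ul/li elements are navigation or share widgets, which yield items with empty titles and links. A small post-processing step can filter those out. The sketch below is stdlib-only and works on illustrative rows in the same shape as Scrapy's JSON export; in practice you would json.load the exported result file instead of the inline string.

```python
import json

# Illustrative rows in the same shape as the crawl output above (abbreviated).
raw = '''
[
  {"title": ["About"], "desc": [","], "link": ["/docs/en/about.html"]},
  {"title": [], "desc": ["", "Share via Facebook"], "link": []},
  {"title": ["Login"], "desc": [","], "link": ["/editors/"]}
]
'''

def clean(items):
    # Keep only items where both a title and a link were extracted;
    # the empty ones come from navigation/share-widget <li> elements.
    return [
        {"title": it["title"][0], "link": it["link"][0]}
        for it in items
        if it["title"] and it["link"]
    ]

cleaned = clean(json.loads(raw))
print(cleaned)
# [{'title': 'About', 'link': '/docs/en/about.html'},
#  {'title': 'Login', 'link': '/editors/'}]
```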
The first run of the Scrapy test was successful.
4, Next steps
Next, I will use the Gooseeker API to implement the web crawler, eliminating the effort of generating and testing an XPath for each item. There are currently two plans:
Encapsulate a method in Gsextractor that automatically extracts the XPath for each item from the XSLT content
Automatically extract the result of each item from the Gsextractor extraction results
Which option to choose will be determined in the next experiment, and then published in a new version of Gsextractor.