Everyone says that Scrapy does not work with Python 3 on Windows. Here is a solution.
1, Introduction
"Scrapy of the structure of the first" article on the Scrapy architecture, this article on the actual installation run Scrapy crawler. This article takes the official website tutorial as the example, the complete code may download in the GitHub.
2, Operating environment configuration
The test environment is: Windows 10, Python 3.4.3 (32-bit).
Install Scrapy: $ pip install scrapy  # during the actual installation, the download aborted halfway several times due to unstable server connections
3, Writing and running the first Scrapy crawler
3.1. Create a new project: tutorial
$ scrapy startproject tutorial
The generated project directory structure (shown as a screenshot in the original post) follows the standard Scrapy layout: a scrapy.cfg file plus a tutorial/ package containing items.py, pipelines.py, settings.py, and a spiders/ directory.
3.2. Define the item to crawl
# -*- coding: utf-8 -*-

# Define the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
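To see what these field declarations buy you, here is a minimal stdlib-only sketch. The Item stand-in class below is hypothetical, written only to mimic how scrapy.Item behaves like a dict while rejecting undeclared keys; the real scrapy.Item does considerably more (field metadata, copying, serialization).

```python
# Hypothetical stand-in mimicking scrapy.Item's dict-like behavior.
class Item(dict):
    fields = ()

    def __setitem__(self, key, value):
        # Only keys declared in `fields` may be assigned,
        # just as scrapy.Item only accepts declared scrapy.Field names.
        if key not in self.fields:
            raise KeyError("%r is not a declared field" % key)
        dict.__setitem__(self, key, value)


class DmozItem(Item):
    fields = ("title", "link", "desc")


item = DmozItem()
item["title"] = ["About"]
item["link"] = ["/docs/en/about.html"]
print(item)            # reads like a plain dict
# item["price"] = 10   # would raise KeyError: undeclared field
```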
3.3. Define the spider
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
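The extraction logic in parse can be tried out without running Scrapy at all. The sketch below reproduces it with the standard library's xml.etree.ElementTree, whose limited XPath support covers these expressions. The HTML fragment is made up for illustration, and using the a element's tail text for desc is only a rough stand-in for Scrapy's text() selector; Scrapy's own selectors are far more tolerant of real-world HTML.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for a dmoz.org listing page.
html = """
<div>
  <ul>
    <li><a href="/docs/en/about.html">About</a> a description</li>
    <li><a href="/docs/en/add.html">Suggest a Site</a> another description</li>
  </ul>
</div>
"""

root = ET.fromstring(html)
items = []
for li in root.findall('.//ul/li'):        # same idea as response.xpath('//ul/li')
    a = li.find('a')
    items.append({
        'title': a.text,                   # like sel.xpath('a/text()')
        'link': a.get('href'),             # like sel.xpath('a/@href')
        'desc': (a.tail or '').strip(),    # rough stand-in for sel.xpath('text()')
    })

print(items)
```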
3.4. Run
$ scrapy crawl dmoz -o items.json
1) Errors on the first run:
a) ImportError: cannot import name '_win32stdio'
b) ImportError: No module named 'win32api'
2) Troubleshooting: according to the official FAQ and StackOverflow, Scrapy has not been tested thoroughly on Python 3, so small problems remain.
3) Solution:
a) Manually download _win32stdio and _pollingfile from twisted/internet and place them in Lib\site-packages\twisted\internet under the Python directory.
b) Download and install pywin32.
Run it again: success! You can see Scrapy's output on the console. After it finishes and exits, open the result file items.json in the project directory to see the crawl results stored in JSON format.
[
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []}
]
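As the output shows, many of the page's ul/li elements are navigation or share widgets, which yield items with empty titles and links. A small post-processing step can filter those out. The sketch below is stdlib-only and works on illustrative rows in the same shape as Scrapy's JSON export; in practice you would json.load the exported result file instead of the inline string.

```python
import json

# Illustrative rows in the same shape as the crawl output above (abbreviated).
raw = '''
[
  {"title": ["About"], "desc": [","], "link": ["/docs/en/about.html"]},
  {"title": [], "desc": ["", "Share via Facebook"], "link": []},
  {"title": ["Login"], "desc": [","], "link": ["/editors/"]}
]
'''

def clean(items):
    # Keep only items where both a title and a link were extracted;
    # the empty ones come from navigation/share-widget <li> elements.
    return [
        {"title": it["title"][0], "link": it["link"][0]}
        for it in items
        if it["title"] and it["link"]
    ]

cleaned = clean(json.loads(raw))
print(cleaned)
# [{'title': 'About', 'link': '/docs/en/about.html'},
#  {'title': 'Login', 'link': '/editors/'}]
```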
The first run of the Scrapy test was successful.
4, Next steps
Next, I will use the Gooseeker API to implement the web crawler, eliminating the effort of generating and testing an XPath for each item. There are currently two plans:
Encapsulate a method in Gsextractor that automatically extracts the XPath for each item from the XSLT content
Automatically extract the result of each item from the Gsextractor extraction results
Which option to choose will be determined in the next experiment, and then published in a new version of Gsextractor.