Example of Running Scrapy in a Windows 7 Python 3 Environment


It is often said that Scrapy does not support Python 3 on Windows; here is a working solution.

1, Introduction

"Scrapy of the structure of the first" article on the Scrapy architecture, this article on the actual installation run Scrapy crawler. This article takes the official website tutorial as the example, the complete code may download in the GitHub.

2, Operating environment configuration

The test environment is Windows 10 with Python 3.4.3 (32-bit).

Install Scrapy:

$ pip install scrapy  # during the actual install, the server connection was unstable and the download dropped out midway several times
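
Once pip finishes, the install can be confirmed from the command line (scrapy version is a standard Scrapy command):

$ scrapy version  # prints the installed Scrapy version if the install succeeded
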
3, Write and run the first Scrapy crawler

3.1. Create a new project: tutorial


$ scrapy startproject tutorial


The project directory structure is as follows:

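For reference, this is the standard layout that scrapy startproject generates (drawn from Scrapy's own tutorial; the exact files may vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in 3.2)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code goes here (see 3.3)
            __init__.py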

3.2. Define the Items to extract


# -*- coding: utf-8 -*-

# Define the models for your scraped items
#
# Documentation:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
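
Scrapy Items behave like dictionaries, which is how the spider in the next step fills them in; a minimal interactive illustration (not part of the project files):

>>> from tutorial.items import DmozItem
>>> item = DmozItem(title=['About'])
>>> item['title']
['About']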


3.3. Define the spider


import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # each <li> under a <ul> holds one link entry on the listing page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
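
Before relying on these XPath expressions, they can be tested interactively in Scrapy's built-in shell (a quick sketch; the URL is one of the start pages listed above):

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> response.xpath('//ul/li/a/text()').extract()   # candidate titles
>>> response.xpath('//ul/li/a/@href').extract()    # candidate links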


3.4. Run

$ scrapy crawl dmoz -o items.json

1) The run failed with two errors:
a) ImportError: cannot import name '_win32stdio'
b) ImportError: No module named 'win32api'

2) Troubleshooting: the official FAQ and posts on StackOverflow show that, at this point, Scrapy had not been fully tested on Python 3, so small problems like these remained.

3) Solution:
a) Manually download _win32stdio and _pollingfile from twisted/internet and place them under Lib\site-packages\twisted\internet in the Python install directory
b) Download and install pywin32
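
For reference, a condensed sketch of the same fix as commands (paths assume a standard CPython install; at the time pywin32 shipped as a standalone installer, so whether pip can install it depends on your setup):

$ pip install pywin32
# then copy _win32stdio.py and _pollingfile.py from the Twisted source
# tree (twisted/internet/) into:
#   <Python>\Lib\site-packages\twisted\internet\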

Run it again: success! You can see Scrapy's output on the console, and after the crawl finishes and the process exits, open the result file items.json in the project directory to see the crawl results stored in JSON format:


[
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": ["about"], "desc": [","], "link": ["/docs/en/about.html"]},
{"title": ["Become an Editor"], "desc": ["," "]," link ": ["/docs/en/help/become.html "]},
{"title": ["Suggest a Site"], "desc": ["," "]," link ": ["/docs/en/add.html "]},
{"title": ["help"], "desc": [","], "link": ["/docs/en/help/helpmain.html"]},
{"title": ["Login"], "desc": [","], "link": ["/editors/"]},
{"title": [], "desc": ["", "Share via Facebook"], "link": []},
{"title": [], "desc": ["", "Share via Twitter"], "link": []},
{"title": [], "desc": ["", "Share via LinkedIn"], "link": []},
{"title": [], "desc": ["", "Share via e-mail"], "link": []},
{"title": [], "desc": ["", ""], "link": []},
{"title": [], "desc": ["", ""], "link": []}
]
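
As a quick sanity check, the results can be loaded back with Python's standard json module (a minimal sketch; the filename matches the -o option used above):

import json

# Read the crawl results written by `scrapy crawl dmoz -o items.json`
with open('items.json', encoding='utf-8') as f:
    items = json.load(f)

# Skip the empty "share" rows visible in the raw output above
entries = [i for i in items if i['title'] and i['link']]
for entry in entries:
    print(entry['title'][0], '->', entry['link'][0])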

The first Scrapy test run was successful.

4, Next steps

Next, I will use the GooSeeker API to implement web crawlers, eliminating the effort of generating and testing an XPath expression for each item by hand. There are currently two candidate plans:

Encapsulate a method in GsExtractor that automatically extracts the XPath for each item from the XSLT content
Automatically extract the result of each item from GsExtractor's extraction results

Which option to choose will be decided in the next experiment and published in a new version of GsExtractor.
