In this tutorial, we assume that you have already installed Scrapy. If you have not, you can refer to the installation guide.
We will use the Open Directory Project (DMOZ) as the site to crawl in our example.
This tutorial will take you through the following steps:
Create a new Scrapy project
Define the items that you will extract
Write a spider to do the crawling
Reference: http://www.jb51.net/article/57183.htm. I have tidied it up a little and corrected some errors related to the choice of Scrapy version; I am using Python 2.7 + Scrapy 1.1. Also note that the example URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) is often inaccessible, so be aware of this and do not assume the Scrapy code itself is at fault.
A brief introduction to the role of each file:
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; you will import your code from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where the spiders are stored
2. Define the Objective (Item)
In Scrapy, items are containers used to hold the crawled content; they work somewhat like dicts in Python.
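For the DMOZ example used in this tutorial, a minimal items.py might look like the sketch below; the field names title, link and desc are the ones this kind of tutorial typically extracts, so treat them as an assumption rather than a fixed requirement:

import scrapy

class DmozItem(scrapy.Item):
    # fields for the data we plan to extract from each DMOZ entry
    title = scrapy.Field()   # title of the listed resource
    link = scrapy.Field()    # URL of the listed resource
    desc = scrapy.Field()    # short description text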
The crawled results are most commonly exported as JSON, with the following command:
scrapy crawl dmoz -o items.json -t json
-o is followed by the export file name, and -t by the export format.
Then take a look at the exported results by opening the JSON file in a text editor (for easier display, all attributes except title have been removed from the item).
Because this is just a small example, such simple processing is enough.
If
Pass the item together with the request through meta, and let the detail_parse method process it:

        for item in items:
            yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item}, callback=self.detail_parse)

    # data parsing method: get the article title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()')
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()
        # merge the text content of the p tags together
        for content_one in content_list:
            content += content_one
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
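Strings like these are typically collected into a list in settings.py and consumed by a downloader middleware that picks one at random for every request. A minimal sketch is shown below; the names USER_AGENTS and RandomUserAgentMiddleware are my own assumptions, not taken from the original:

import random

# the user-agent strings listed above would go into this list
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]

class RandomUserAgentMiddleware(object):
    # downloader middleware that assigns a random User-Agent to every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Like any downloader middleware, it would also need to be enabled through the DOWNLOADER_MIDDLEWARES setting before Scrapy will use it.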
In the Scrapy project, create a Python module directory at the same level as the spiders directory and add a .py file with the contents below:
# encoding: utf-8
import base64

# proxy server address (mine is ':9010')
proxyServer = "proxy server address"

# proxy tunnel authentication information, obtained when you applied on the provider's website
proxyUser = "username"
proxyPass = "password"
proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)
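The original snippet is cut off at the class definition; a minimal sketch of what such a proxy downloader middleware usually looks like follows, with the class name ProxyMiddleware being an assumption:

class ProxyMiddleware(object):
    # downloader middleware that routes every request through the proxy tunnel
    def process_request(self, request, spider):
        request.meta['proxy'] = proxyServer                 # assumed to be of the form "http://host:port"
        request.headers['Proxy-Authorization'] = proxyAuth  # tunnel authentication header

As with the user-agent example, the middleware also has to be registered in DOWNLOADER_MIDDLEWARES in settings.py.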
This article describes how to handle JavaScript-rendered pages when using Scrapy in Python.
The code is as follows:
from selenium import selenium
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),
    )
The Scrapy shell is an interactive terminal in which you can try out and debug your crawling code without starting the spider. It is intended for testing the data-extraction code, but you can also use it as a normal Python console to test any Python code. The shell lets you test XPath or CSS expressions and see how they work and what data they extract from the crawled pages.
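For example, you might start the shell against the DMOZ page used earlier in this tutorial and try expressions interactively (a sketch; the exact output will differ):

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> response.xpath('//title/text()').extract()
>>> response.css('title::text').extract()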
If you observed the following pip error earlier in the run: Running setup.py clean for Twisted / Failed to build Twisted, it means there was an error installing the Twisted module. We can choose to download the Twisted module manually and install it. Python libraries can be downloaded from many places, and many people download twisted-xx.x.x.tar.bz2, only to find after decompressing and installing it that the same error occurs. At this point, look carefully at the error messages from the earlier installation attempt.
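One workaround that often helps on Windows (my own suggestion; the original text breaks off before giving its solution) is to install a prebuilt Twisted wheel that matches your Python version and architecture, then re-run the Scrapy install:

pip install Twisted-xx.x.x-cp27-cp27m-win_amd64.whl   # placeholder file name; use the wheel matching your interpreter
pip install scrapy                                    # retry once Twisted is in place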
The Scrapyd module is dedicated to deploying Scrapy projects; with it you can deploy and manage Scrapy projects: https://github.com/scrapy/scrapyd. Recommended installation: pip3 install scrapyd. Install the scrapyd module first; after installation, a scrapyd.exe is generated in the Scripts folder of the Python installation directory.
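A typical workflow after installing it (the target and project names below are placeholders, not from the original) is to start the service and then deploy with the separate scrapyd-client package:

scrapyd                               # start the Scrapyd service; it listens on port 6800 by default
pip3 install scrapyd-client           # provides the scrapyd-deploy command
scrapyd-deploy default -p myproject   # deploy the project configured in scrapy.cfg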
Scrapy is a general-purpose crawler framework written in Python. A colleague on a recent project team has been using Scrapy to collect information from several large e-commerce sites for the big-data side. Since I modified the project a little, I am also recording some notes about Scrapy here, writing down as much as I can. The Scrapy source code is hosted on GitHub.
Building a search engine with a Python distributed crawler -- the Scrapy implementation
I recently took a Scrapy crawler course online and found it quite good. Below is its table of contents, which is still being updated; I think it is worth taking careful notes and studying.
Chapter
[Python] [Scrapy framework] Installing Scrapy for Python 3
1. Method (only pip installation is introduced)
PS: If you are not familiar with pip (or easy_install), you can search for it on Baidu or leave me a message.
Cmd command:
(You can run pip directly without changing into the pip.exe directory, because that directory has been added to the PATH environment variable.)
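The command itself is not shown in the excerpt, but for a pip-based install it is presumably just:

pip install scrapy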
Advantages of installing via pip install:
The setup is very simple
Scrapy and its dependency packages are installed together
Consistency of the packages is guaranteed
After the installation succeeds, verify it with:
pip list
# The output is as follows:
cffi (0.8.6)
cryptography (0.6.1)
cssselect (0.9.1)
lxml (3.4.1)
pip (1.5.6)
pycparser (2.10)
pyOpenSSL (0.14)
queuelib (1.2.2)
Scrapy (0.24.4)
setuptools (3.6)
six (1.8.0)
Twisted (14.0.2)
w3lib (1.10.0)
wsgiref (0.1.2)
zope.interface (4.1.1)
For more on working with virtual environments, see my blog.
3. Scrapy Tutorial
Before you start crawling, you must create a new Scrapy project.
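Go into the directory where you want to store your code and run the startproject command; the project name tutorial matches the directory layout shown earlier:

scrapy startproject tutorial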
Course catalogue:
Python in Action - 01. What Scrapy is.mp4
Python in Action - 02. Initial use of Scrapy.mp4
Python in Action - 03. The basic usage steps of Scrapy.mp4
Python in Action - 04. Introduction to basic concepts 1 - Scrapy command-line tools.mp4
Python in Action - 05. Introduction to basic concepts 2 - the important components of Scrapy.mp4
From the previous example, we know that defining an item class is as simple as inheriting from scrapy.Item and then adding several scrapy.Field objects as class attributes, as in the following:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
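A quick usage sketch of this item class (the values are made up for illustration):

product = Product(name='Desktop PC', price=1000)  # create an item much like a dict
print(product['name'])                            # fields are accessed with dict-style indexing
product['price'] = 1200                           # and can be reassigned the same way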