Scrapy is a general-purpose crawler framework written in Python. A colleague on a recent project has been using Scrapy to collect information from some large e-commerce sites for the big data side, and since I modified a small part of that project, I am recording some notes on Scrapy here, writing down as much as I can. The Scrapy source code is hosted on GitHub, and the official website is http://scrapy.org. It i
In this tutorial, we assume that you have already installed Scrapy. If you have not, you can refer to the installation guide.
We will use the Open Directory Project (DMOZ) as the example site to crawl.
This tutorial will take you through the following areas:
Create a new Scrapy project
Define the item that you will extract
Write a spider to crawl
Scrapy is an application framework for crawling web site data and extracting structured data. It can be used in a range of programs, including data mining, information processing and storing historical data. Using a framework to crawl data saves a lot of effort: we do not need to download the pages ourselves, and we do not have to write the data-processing plumbing. We only need to focus on the crawl and extraction rules.
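For example, a complete spider can be just a handful of lines. The sketch below (the spider name, URL and CSS selectors are illustrative, not taken from this article) declares only where to start and which rules extract the data; Scrapy handles the downloading and scheduling:

import scrapy

class QuotesSpider(scrapy.Spider):
    # illustrative example: name, start URL and selectors are placeholders
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # the only code we really write: the extraction rules
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }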
It is often said that Scrapy does not support Python 3 on Windows; here is a workaround.
1. Introduction
"Scrapy of the structure of the first" article on the Scrapy architecture, this article on the actual installation run Scrapy crawler. This article takes the official website tutorial as the
Pass the item along in the request's meta data so that the detail_parse method can continue processing it:

        for item in items:
            yield scrapy.Request(url=item['sonUrls'], meta={'meta_2': item}, callback=self.detail_parse)

    # data parsing method: get the article title and content
    def detail_parse(self, response):
        item = response.meta['meta_2']
        content = ""
        head = response.xpath('//h1[@id="main_title"]/text()')
        content_list = response.xpath('//div[@id="artibody"]/p/text()').extract()
        # merge the text content of the p tags together
        for content_one in content_list:
            content += content_one
Reference: http://www.jb51.net/article/57183.htm. I have also tidied it up a little and fixed some errors; those errors were related to the choice of Scrapy version (I personally use Python 2.7 + Scrapy 1.1). Also note that the example URL (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) is often inaccessible, so do not assume the script itself is at fault. Enough talk, the followin
Export the results in the most commonly used JSON format with the following command:
scrapy crawl dmoz -o items.json -t json
-o is followed by the export file name, and -t by the export type.
Then take a look at the exported results by opening the JSON file in a text editor (for easier display, every attribute except the title has been removed from the item):
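The exported file is a JSON array with one object per scraped item; with only the title field kept it looks roughly like this (the titles shown are made-up placeholders, not real export data):

[
  {"title": ["Free Python books"]},
  {"title": ["Python Programming Tutorials"]}
]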
Because this is just a small example, simple processing like this is enough.
If
(I suggest everyone also read the official website tutorial: see the tutorial address.)
We will use the dmoz.org site to show off what a small crawler can do.
First you have to answer a question.
Q: How many steps does it take to crawl a website with a crawler?
The answer is simple, four steps:
New project (Project): create a new crawler project
Clear goals (Items): define the targets you want to crawl
Make the spider (Spider): write a spider and start crawling the pages
Store the content (Pipeline): design a pipeline to store the crawled content
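In command form, the four steps map onto the generated project files roughly like this (the project name "tutorial" is only an example):

scrapy startproject tutorial        # 1. new project: creates the skeleton
# 2. clear goals: define the fields to crawl in tutorial/tutorial/items.py
# 3. make the spider: add a spider under tutorial/tutorial/spiders/ and write its parse() rules
# 4. store the content: implement process_item() in tutorial/tutorial/pipelines.py
scrapy crawl <spider_name>          # finally, start the crawl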
Example code for several crawling methods of a Scrapy spider
This section describes the Scrapy crawler framework, focusing on the Spider component.
Several crawling methods of a spider:
Crawl 1 page
Build links from a given list to crawl multiple pages
Find the 'next page' link to keep crawling page after page
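A short sketch that combines these three approaches (the URLs and XPath expressions are placeholders, not taken from the article):

import scrapy

class PagingSpider(scrapy.Spider):
    name = "paging_demo"
    # method 2: build the start list from a given set of page URLs
    start_urls = ["http://example.com/list?page=%d" % i for i in range(1, 4)]

    def parse(self, response):
        # method 1: extract data from the current page
        for title in response.xpath("//h2/a/text()").extract():
            yield {"title": title}
        # method 3: follow the 'next page' link and keep crawling
        next_page = response.xpath("//a[@class='next']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)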
") Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.
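A user-agent pool like this is typically consumed by a small downloader middleware that picks a random entry for each request. A sketch of that idea (the class name and the USER_AGENTS setting name are my assumptions, following the usual convention):

import random

class RandomUserAgentMiddleware(object):
    # picks a random User-Agent for every outgoing request
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENTS is assumed to be the list of strings above, placed in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))

The middleware is enabled by adding it to DOWNLOADER_MIDDLEWARES in settings.py.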
reproduced from: http://blog.csdn.net/pleasecallmewhy/article/details/19642329
In the Scrapy project, create a Python package directory (at the same level as the spiders directory) and add a .py file with the contents below:
# encoding: utf-8
import base64

proxyServer = "proxy server address"   # mine is ':9010'
# Proxy tunnel authentication information, obtained when signing up on the proxy provider's website
proxyUser = "username"
proxyPass = "password"
proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through the proxy and attach the auth header
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
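For Scrapy to actually use this middleware, it still has to be registered in settings.py. Assuming the file above was saved as middlewares.py inside a project package called myproject (the module path is illustrative):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 543,
}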
This article describes how to use Scrapy together with Selenium to handle JavaScript-rendered pages in Python.
The code is as follows:
from selenium import selenium
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.jb51.net']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.ht
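The rule's callback (for example a parse_page method, which the truncated fragment above does not show) can then hand the matched page to Selenium, so the JavaScript-generated content is rendered before parsing. A rough sketch using the old Selenium RC API imported above (the host, port, browser string and extraction step are placeholders):

    def __init__(self):
        CrawlSpider.__init__(self)
        # assumes a Selenium RC server is running locally on port 4444
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.jb51.net")
        self.selenium.start()

    def parse_page(self, response):
        # let the browser execute the page's JavaScript, then read the rendered DOM
        sel = self.selenium
        sel.open(response.url)
        sel.wait_for_page_to_load("30000")
        html = sel.get_html_source()
        # ... extract the wanted fields from html here (e.g. with lxml or Scrapy Selectors)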
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field

class DmozItem(Item):
    name = Field()
    description = Field()
    url = Field()

IV. Rewriting pipeline.py
V. Execute in the root directory of the dmoz project
scrapy crawl dmoz -o items.json
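For step IV, a typical pipeline.py that stores the scraped items could look like the sketch below. This is a generic illustration (the class name, output file and JSON-lines format are my assumptions, not the original post's code), and the pipeline still has to be registered in ITEM_PIPELINES in settings.py:

import json

class JsonWriterPipeline(object):
    # writes each scraped item as one JSON line
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item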
can be used to create a spider from a predefined template. You can also create your own spider source files. Example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider -d basic
import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass
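Generating a spider from the basic template then only needs a spider name and a domain, for example (the names here are placeholders):

$ scrapy genspider example example.com

This creates example.py under the project's spiders/ directory, filled in from the basic template.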
1. What can Scrapy do? Scrapy is an application framework written to crawl web site data and extract structured data. It can be applied in a range of programs, including data mining, information processing and storing historical data. It was originally designed for page fetching (more specifically, web crawling), and it can also be used to retrieve data returned by APIs (for
http://doc.scrapy.org/en/latest/intro/install.html
Several concepts to be understood in Scrapy
Spiders
What the Spider class expresses is how to crawl the data of a given website: in start_urls you define which links to fetch, and in the parse() method you define what data to extract. When a spider starts executing, it first issues a request for the first link in start_urls and then
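In effect, the base class does something like the following with start_urls; this is a simplified illustration of the default behaviour, not Scrapy's actual source:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/"]

    # roughly what scrapy.Spider does by default: turn each start URL into a
    # Request whose response is handed to parse()
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # extraction rules go here
        pass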
Crawling your own blog content with Scrapy
The libraries commonly used for writing crawlers in Python include urllib2 and requests. These libraries cover most simple scenarios and are fine for learning purposes. I have previously written an example of how to fetch popular Baidu Music songs using urllib2 + BeautifulSoup. If you are int
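For comparison, that kind of plain-library crawler is just a download step plus a parse step. A minimal Python 2 sketch (the URL and the CSS class used in the selector are placeholders, not from the original example):

import urllib2
from bs4 import BeautifulSoup

# fetch one listing page and print the song titles
html = urllib2.urlopen("http://music.baidu.com/").read()
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("a", class_="song-title"):
    print tag.get_text()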
Save, then run the crawler from the console:
scrapy crawl dmoz   # start the spider
[scrapy] INFO: Spider closed (finished) indicates a successful run.
Creating a Scrapy program: scrapy startproject XXX automatically creates an XXX folder containing the scrapy.cfg project configuration and a nested XXX folder, and under it gener
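For reference, the layout that scrapy startproject generates is the standard one; for a project called XXX it looks like this:

XXX/
    scrapy.cfg            # project configuration file
    XXX/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # put your spiders in this package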