1. Learning experience for this section:
This chapter is mostly about using the Scrapy framework. For those who have already learned Django, some of it will be easier to pick up. Personally I found it fairly simple: every knowledge point was easy to follow on the first pass. The only problem is that time was tight, so I could not watch slowly and will have to come back later and go over it carefully. The hard part of this chapter is the downloader middleware. It is actually similar to Django's middleware, but you still need to listen carefully and practice it against your own code; then it will be well mastered. The other features are much simpler: browse them quickly, learn what they do and how to call them, and when you need them in practice you can always flip back through the notes. If you are short on time, it is better to focus on the key points.
2. Summary of the knowledge points in this section:
1. Installing Scrapy
Linux installation: pip3 install scrapy
Windows installation: pip3 install wheel; download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted; pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl; then pip3 install scrapy
2. Basic use of Scrapy
Create a project:
    scrapy startproject day710             # create a project
    cd day710                              # go into the project directory
    scrapy genspider example example.com   # create a spider file for the site you want to crawl
    scrapy crawl chouti --nolog            # run the spider
Set the start URLs in the spider (the default generated value is usually fine).
response.text is the crawled content, which is then parsed.
from scrapy.selector import Selector  # import the built-in parser
Parsing with XPath:
    tag (selector) objects: xpath('/html/body/ul/li/a/@href')
    list of values:         xpath('/html/body/ul/li/a/@href').extract()
    single value:           xpath('/html/body/ul/li/a/@href').extract_first()
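For reference, here is a minimal sketch that ties the commands and the XPath calls above together in one spider. The spider name, domain, start URL and XPath are assumptions based on the examples in this section, not the course's exact code:

```python
import scrapy


class ChoutiSpider(scrapy.Spider):
    # spider name / domain / start URL are illustrative assumptions
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # response.text holds the raw crawled page content
        links = response.xpath('/html/body/ul/li/a/@href')  # selector (tag) objects
        all_hrefs = links.extract()          # list of values
        first_href = links.extract_first()   # single value (or None)
        self.logger.info('first link: %s', first_href)
```

Placed in the project's spiders/ directory, it would be run with scrapy crawl chouti.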
3. Crawler's architecture
Scrapy mainly includes the following components:
- Engine (Scrapy)
Handles the data flow of the entire system and triggers transactions (the core of the framework)
- Scheduler (Scheduler)
Accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again. You can think of it as a priority queue of URLs (the URLs or links of the pages to crawl); it decides what the next URL to crawl will be and removes duplicate URLs
- Downloader (Downloader)
Downloads web content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
- Spiders
The spiders do the main work: they extract the information they need, the so-called entities (Items), from specific web pages. The user can also extract links from a page so that Scrapy continues crawling the next page
- Item pipeline (Pipeline)
Responsible for processing the entities extracted from web pages. Its main jobs are to persist the entities, verify their validity, and strip out unneeded information. When a page has been parsed by the spider, it is sent to the item pipeline, where the data passes through a few components in a specific order.
- Downloader middleware (Downloader middlewares)
Sits between the Scrapy engine and the downloader and mainly handles the requests and responses passed between the two (see the sketch after this list).
- Crawler middleware (spider middlewares)
Sits between the Scrapy engine and the spiders; its main task is to handle the spiders' response input and request output.
- Scheduler middleware (Scheduler middlewares)
The middleware between the Scrapy engine and the scheduler; it handles the requests and responses sent from the engine to the scheduler.
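As promised above, here is a minimal sketch of a downloader middleware. The class name and the header it sets are illustrative assumptions of mine; only the process_request/process_response hooks are Scrapy's actual interface:

```python
# middlewares.py -- minimal downloader middleware sketch
# (class name and header value are illustrative assumptions)

class CustomHeaderDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request on its way to the downloader;
        # returning None lets Scrapy continue processing it normally
        request.headers['User-Agent'] = 'my-scrapy-bot'
        return None

    def process_response(self, request, response, spider):
        # called for every response coming back from the downloader
        spider.logger.debug('%s -> %s', request.url, response.status)
        return response
```

To enable it, it would be registered in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'day710.middlewares.CustomHeaderDownloaderMiddleware': 543} (day710 being the example project name used earlier).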
The Scrapy run flow is roughly as follows:
- The engine pulls a link (URL) from the scheduler for the next crawl
- The engine wraps the URL in a request (Request) and passes it to the downloader
- The Downloader downloads the resource and encapsulates it as a response packet (Response)
- The spider parses the response
- Entities (Items) that are parsed out are handed to the item pipeline for further processing
- Links (URLs) that are parsed out are handed back to the scheduler to wait for crawling (see the sketch after this list)
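To make the last two steps concrete, here is a minimal sketch of a parse method that yields both items (which go to the item pipeline) and new requests (which go back to the scheduler), plus a pipeline that receives the items. The item field, XPath expressions and pipeline behaviour are illustrative assumptions, not code from the course:

```python
import scrapy


class LinkItem(scrapy.Item):
    # illustrative item with a single field
    href = scrapy.Field()


class FlowSpider(scrapy.Spider):
    name = 'flow_demo'                        # assumed spider name
    start_urls = ['https://dig.chouti.com/']  # assumed start URL

    def parse(self, response):
        # parsed entities (Items) are handed to the item pipeline
        for href in response.xpath('//a/@href').extract():
            yield LinkItem(href=href)
        # parsed links (URLs) are handed back to the scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


# pipelines.py -- receives every yielded item, in the configured order
class PrintPipeline:
    def process_item(self, item, spider):
        spider.logger.info('pipeline got: %s', dict(item))
        return item
```

The pipeline would be switched on in settings.py with ITEM_PIPELINES = {'day710.pipelines.PrintPipeline': 300}.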
4. POST / request headers / cookies, deduplication, middleware
POST / request headers / cookies: automatically logging in to chouti (dig.chouti.com)

First visit a page to get the cookie. Printing the raw cookie:

    print(response.headers.getlist('Set-Cookie'))

Parsing the cookie and logging in with it:

    from scrapy.http import Request
    from scrapy.http.cookies import CookieJar

    cookie_jar = CookieJar()
    cookie_jar.extract_cookies(response, response.request)
    self.cookie_dic = {}
    for k, v in cookie_jar._cookies.items():
        for i, j in v.items():
            for m, n in j.items():
                self.cookie_dic[m] = n.value
    print(self.cookie_dic)
    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'},
        body='phone=8613503851931&password=abc1234&onemonth=1',
        cookies=self.cookie_dic,
        callback=self.parse_check,
    )
    yield req

Using meta={'cookiejar': True} to handle the cookie automatically:

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'},
        body='phone=8613503851931&password=abc1234&onemonth=1',
        meta={'cookiejar': True},
        callback=self.parse_check,
    )
    yield req

Turning cookies off in the config file:

    # COOKIES_ENABLED = False

Avoiding repeated visits: by default Scrapy uses scrapy.dupefilter.RFPDupeFilter for deduplication. The related configuration is:

    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
    DUPEFILTER_DEBUG = False
    JOBDIR = '/root/'   # path where the record of seen requests is saved; the final path is /root/requests.seen

(A minimal sketch of a custom deduplication filter follows below.)
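Because DUPEFILTER_CLASS can point at your own class, here is a minimal sketch of a custom deduplication filter. It assumes a Scrapy version where the base class lives in scrapy.dupefilters and request_fingerprint is still available; the module and class names are illustrative:

```python
# dupefilters.py -- minimal custom deduplication filter sketch
# (module/class names are assumptions; wire it up via DUPEFILTER_CLASS)
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class SeenRequestsDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        # returning True tells the scheduler to drop the request as a duplicate
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

It would then be selected with DUPEFILTER_CLASS = 'day710.dupefilters.SeenRequestsDupeFilter' in settings.py.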
That is all for now; I will update the rest later.

Luffy Python Crawler Training, Chapter 3