1. Learning experience for this section:
This chapter is mostly about using the Scrapy framework. For those who have already learned Django, some of it will be easier to pick up. Personally I found it fairly simple: every knowledge point was easy to follow on the first pass. The only problem is that time was tight, so I could not watch slowly and will have to come back later and go over it carefully. The hard part of this chapter is the downloader middleware. It is actually similar to Django's middleware, but you still need to listen carefully and practice it against your own code; then it will be well mastered. The other features are much simpler: browse them quickly, learn what they do and how to call them, and when you need them in practice you can always flip back through the notes. If you are short on time, it is better to focus on the key points.
2. Summary of the knowledge points in this section:
1. Installing Scrapy
Linux installation: pip3 install scrapy
Windows installation: pip3 install wheel; download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted; pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl; then pip3 install scrapy
2. Basic use of Scrapy
Create a project:
    scrapy startproject day710             # create a project
    cd day710                              # go into the project directory
    scrapy genspider example example.com   # create a spider file for the site you want to crawl
    scrapy crawl chouti --nolog            # run the spider
Set the start URLs in the spider (the default generated value is usually fine).
response.text is the crawled content, which is then parsed.
from scrapy.selector import Selector  # import the built-in parser
Parsing with XPath:
    tag (selector) objects: xpath('/html/body/ul/li/a/@href')
    list of values:         xpath('/html/body/ul/li/a/@href').extract()
    single value:           xpath('/html/body/ul/li/a/@href').extract_first()
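For reference, here is a minimal sketch that ties the commands and the XPath calls above together in one spider. The spider name, domain, start URL and XPath are assumptions based on the examples in this section, not the course's exact code:

```python
import scrapy


class ChoutiSpider(scrapy.Spider):
    # spider name / domain / start URL are illustrative assumptions
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # response.text holds the raw crawled page content
        links = response.xpath('/html/body/ul/li/a/@href')  # selector (tag) objects
        all_hrefs = links.extract()          # list of values
        first_href = links.extract_first()   # single value (or None)
        self.logger.info('first link: %s', first_href)
```

Placed in the project's spiders/ directory, it would be run with scrapy crawl chouti.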
3. Crawler's architecture
Scrapy mainly includes the following components:
- Engine (Scrapy)
Handles the data flow of the entire system and triggers transactions (the core of the framework)
- Scheduler (Scheduler)
Accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again. You can think of it as a priority queue of URLs (the URLs or links of the pages to crawl); it decides what the next URL to crawl will be and removes duplicate URLs
- Downloader (Downloader)
Downloads web content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
- Spiders
The spiders do the main work: they extract the information they need, the so-called entities (Items), from specific web pages. The user can also extract links from a page so that Scrapy continues crawling the next page
- Item pipeline (Pipeline)
Responsible for processing the entities extracted from web pages. Its main jobs are to persist the entities, verify their validity, and strip out unneeded information. When a page has been parsed by the spider, it is sent to the item pipeline, where the data passes through a few components in a specific order.
- Downloader middleware (Downloader middlewares)
Sits between the Scrapy engine and the downloader and mainly handles the requests and responses passed between the two (see the sketch after this list).
- Crawler middleware (spider middlewares)
Sits between the Scrapy engine and the spiders; its main task is to handle the spiders' response input and request output.
- Scheduler middleware (Scheduler middlewares)
The middleware between the Scrapy engine and the scheduler; it handles the requests and responses sent from the engine to the scheduler.
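As promised above, here is a minimal sketch of a downloader middleware. The class name and the header it sets are illustrative assumptions of mine; only the process_request/process_response hooks are Scrapy's actual interface:

```python
# middlewares.py -- minimal downloader middleware sketch
# (class name and header value are illustrative assumptions)

class CustomHeaderDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request on its way to the downloader;
        # returning None lets Scrapy continue processing it normally
        request.headers['User-Agent'] = 'my-scrapy-bot'
        return None

    def process_response(self, request, response, spider):
        # called for every response coming back from the downloader
        spider.logger.debug('%s -> %s', request.url, response.status)
        return response
```

To enable it, it would be registered in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'day710.middlewares.CustomHeaderDownloaderMiddleware': 543} (day710 being the example project name used earlier).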
The Scrapy run flow is roughly as follows:
- The engine pulls a link (URL) from the scheduler for the next crawl
- The engine wraps the URL in a request (Request) and passes it to the downloader
- The Downloader downloads the resource and encapsulates it as a response packet (Response)
- The spider parses the response
- Entities (Items) that are parsed out are handed to the item pipeline for further processing
- Links (URLs) that are parsed out are handed back to the scheduler to wait for crawling (see the sketch after this list)
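To make the last two steps concrete, here is a minimal sketch of a parse method that yields both items (which go to the item pipeline) and new requests (which go back to the scheduler), plus a pipeline that receives the items. The item field, XPath expressions and pipeline behaviour are illustrative assumptions, not code from the course:

```python
import scrapy


class LinkItem(scrapy.Item):
    # illustrative item with a single field
    href = scrapy.Field()


class FlowSpider(scrapy.Spider):
    name = 'flow_demo'                        # assumed spider name
    start_urls = ['https://dig.chouti.com/']  # assumed start URL

    def parse(self, response):
        # parsed entities (Items) are handed to the item pipeline
        for href in response.xpath('//a/@href').extract():
            yield LinkItem(href=href)
        # parsed links (URLs) are handed back to the scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


# pipelines.py -- receives every yielded item, in the configured order
class PrintPipeline:
    def process_item(self, item, spider):
        spider.logger.info('pipeline got: %s', dict(item))
        return item
```

The pipeline would be switched on in settings.py with ITEM_PIPELINES = {'day710.pipelines.PrintPipeline': 300}.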
4. POST / request headers / cookies, deduplication, middleware
POST / request headers / cookies: automatically logging in to chouti (dig.chouti.com)

First visit a page to get the cookie. Printing the raw cookie:

    print(response.headers.getlist('Set-Cookie'))

Parsing the cookie and logging in with it:

    from scrapy.http import Request
    from scrapy.http.cookies import CookieJar

    cookie_jar = CookieJar()
    cookie_jar.extract_cookies(response, response.request)
    self.cookie_dic = {}
    for k, v in cookie_jar._cookies.items():
        for i, j in v.items():
            for m, n in j.items():
                self.cookie_dic[m] = n.value
    print(self.cookie_dic)
    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'},
        body='phone=8613503851931&password=abc1234&onemonth=1',
        cookies=self.cookie_dic,
        callback=self.parse_check,
    )
    yield req

Using meta={'cookiejar': True} to handle the cookie automatically:

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    req = Request(
        url='http://dig.chouti.com/login',
        method='POST',
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'},
        body='phone=8613503851931&password=abc1234&onemonth=1',
        meta={'cookiejar': True},
        callback=self.parse_check,
    )
    yield req

Turning cookies off in the config file:

    # COOKIES_ENABLED = False

Avoiding repeated visits: by default Scrapy uses scrapy.dupefilter.RFPDupeFilter for deduplication. The related configuration is:

    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
    DUPEFILTER_DEBUG = False
    JOBDIR = '/root/'   # path where the record of seen requests is saved; the final path is /root/requests.seen

(A minimal sketch of a custom deduplication filter follows below.)
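Because DUPEFILTER_CLASS can point at your own class, here is a minimal sketch of a custom deduplication filter. It assumes a Scrapy version where the base class lives in scrapy.dupefilters and request_fingerprint is still available; the module and class names are illustrative:

```python
# dupefilters.py -- minimal custom deduplication filter sketch
# (module/class names are assumptions; wire it up via DUPEFILTER_CLASS)
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class SeenRequestsDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, request):
        # returning True tells the scheduler to drop the request as a duplicate
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

It would then be selected with DUPEFILTER_CLASS = 'day710.dupefilters.SeenRequestsDupeFilter' in settings.py.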
That is all for now; I will update the rest later.

Luffy Python Crawler Training, Chapter 3