Scrapy crawler + practice


I have been meaning to get into crawlers for a long time, but never dug in deep. Today let's actually build one, and the framework of choice is Scrapy.

Getting started guide: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

Installation is straightforward; in most environments a plain pip install is enough:

pip install scrapy

If that fails, you can download the wheel directly from https://pypi.python.org/pypi/Scrapy (e.g. Scrapy-1.4.0-py2.py3-none-any.whl) and install it. If the installation still errors out, first install the dependencies named in the error message.

Twisted installation:

Download a wheel that matches your Python version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and then install it.

PyWin32 installation:

Download the installer that matches your Python version from the official project page: https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/
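As a rough sketch of the order on Windows (the Twisted wheel filename below is only an example; use the one matching your Python version and architecture, and run the PyWin32 installer you just downloaded):

pip install Twisted-17.5.0-cp27-cp27m-win_amd64.whl
pip install scrapy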

The general architecture of Scrapy is shown in the diagram in the original post; its components are described below.

II. Components

1. Scrapy Engine

The Scrapy engine controls the data flow of the entire system and triggers events as processing happens. See the data processing flow below for more detail.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, queues them, and hands them back to the engine when the engine asks for the next request to crawl.

3. Downloader

The downloader's job is to fetch web pages and return the page content to the spiders.

4. Spiders

A spider is a class written by the Scrapy user to parse pages and extract content from the responses of the crawled URLs. Each spider handles one domain or a group of domains; in other words, it defines the crawling and parsing rules for a specific site.

The crawl cycle of a spider looks like this:

First, the spider issues initial requests for its first URLs, each with a callback to invoke when the response comes back. These initial requests come from the start_requests() method, which by default builds a Request for every URL in start_urls, with the parse() method as the callback.

In the callback you parse the response and return (or yield) item objects, Request objects, or an iterable containing both. Any Requests you return carry their own callback; Scrapy downloads them and passes the responses to that callback.

Inside the callback you typically parse the page content with XPath selectors (though you can just as well use BeautifulSoup, lxml, or any other parser you like) and build items from the parsed data.

Finally, the items returned by the spider are normally handed to the item pipeline.
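As a minimal sketch of that cycle (the site, class name, and selectors here are illustrative examples, not the project we build later in this post):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # start_requests() will, by default, build a Request for each of these URLs
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse the response and yield data items...
        for quote in response.xpath('//div[@class="quote"]'):
            yield {'text': quote.xpath('span[@class="text"]/text()').extract_first()}
        # ...and/or yield further Requests, each with its own callback
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)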

5. Item Pipeline

The item pipeline's responsibility is to process the items the spiders extract from web pages; its main tasks are cleaning, validating, and storing the data. After a page has been parsed by a spider, the resulting items are sent to the pipeline and pass through its components in a fixed order. Each pipeline component is a Python class with a simple method: it receives an item, does its work, and decides whether the item continues on to the next pipeline stage or is dropped.

A typical pipeline performs steps such as the following (a minimal sketch follows this list):

Cleaning the HTML data
Validating the parsed data (checking that items contain the required fields)
Checking for duplicates (dropping items that have already been seen)
Storing the parsed data in a database
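Here is a minimal sketch of one such component, combining the validation and de-duplication steps above (the class name is illustrative and the field names reuse the ones we define later in this post):

from scrapy.exceptions import DropItem

class ValidateAndDedupePipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # validation: drop items that are missing a required field
        if not item.get('movie_name'):
            raise DropItem('missing movie_name')
        # de-duplication: drop items whose url we have already seen
        if item.get('url') in self.seen_urls:
            raise DropItem('duplicate item: %s' % item.get('url'))
        self.seen_urls.add(item.get('url'))
        # returning the item passes it on to the next pipeline stage
        return item

To activate it, something like ITEM_PIPELINES = {'baidu.pipelines.ValidateAndDedupePipeline': 300} would go into settings.py.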

6. Downloader Middlewares

Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader. It processes the requests sent to the downloader and the responses it returns, and provides a place to plug in custom code to extend Scrapy. It is a lightweight, low-level system for globally altering requests and responses.
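A minimal sketch of a downloader middleware that gives every outgoing request a default User-Agent (the class name and value are illustrative; it would be enabled through the DOWNLOADER_MIDDLEWARES setting):

class DefaultUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # called for every request before it reaches the downloader
        request.headers.setdefault('User-Agent', 'my-practice-crawler')
        return None  # None means: keep processing this request normally

    def process_response(self, request, response, spider):
        # called for every response on its way back to the engine and spider
        return response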

7. Spider Middlewares

Spider middleware is a hook framework between the Scrapy engine and the spiders. It processes the spider's input (responses) and output (requests and items), and again lets you plug in custom code to extend Scrapy.
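A minimal, pass-through sketch of a spider middleware (illustrative; enabled through the SPIDER_MIDDLEWARES setting):

class PassThroughSpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # called for each response before it is handed to the spider
        return None

    def process_spider_output(self, response, result, spider):
        # called with whatever the spider yields for this response
        for item_or_request in result:
            yield item_or_request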

8. Scheduler Middlewares

Scheduler middleware sits between the Scrapy engine and the scheduler. It handles the requests and responses passed between the two and offers yet another hook for custom code to extend Scrapy.

After installation, we can create a project named baidu.

scrapy startproject baidu

Directory structure:

scrapy.cfg: the project configuration file
baidu/: the project's Python module; this is where you add your code
baidu/items.py: the item definitions for the project
baidu/pipelines.py: the pipelines for the project
baidu/settings.py: the project settings (see the sketch after this list)
baidu/spiders/: the directory that holds the spider code
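A couple of settings.py tweaks that are commonly useful for a practice project like this one (a sketch; whether you need them depends on your Scrapy version and the target site):

# settings.py (excerpt)
ROBOTSTXT_OBEY = False   # newer project templates obey robots.txt by default
DOWNLOAD_DELAY = 1       # be polite and wait between requests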

Let's start building the crawler.

Let's analyze this website https://movie.douban.com/top250.


Looking at the page (the original post includes a screenshot of the HTML here), all the movie entries sit under an ol element whose class is unique on the page, so we can use it to locate the collection of all movies. The XPath is //ol[@class="grid_view"]/li.
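Before writing any spider code you can check a selector interactively with scrapy shell; a quick sketch (Douban may answer with 403 Forbidden unless a browser-like User-Agent is sent, which is exactly the issue we hit later, so one is set up front here):

scrapy shell -s USER_AGENT="Mozilla/5.0" "https://movie.douban.com/top250"
>>> movies = response.xpath('//ol[@class="grid_view"]/li')
>>> len(movies)   # should be 25 entries per page
>>> movies[0].xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()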
Next, let's analyze a single movie entry (a screenshot of the per-movie markup appears in the original post).



 

 


We can see that the elements we want are all easy to reach. So let's decide what to extract and define it in items.py:
import scrapy

class BaiduItem(scrapy.Item):
    ranking = scrapy.Field()      # rank on the list
    movie_name = scrapy.Field()   # movie title
    score = scrapy.Field()        # rating
    score_num = scrapy.Field()    # number of people who rated the movie
    daoyan = scrapy.Field()       # director / cast line
    bieming = scrapy.Field()      # the one-line quote shown on the list page
    url = scrapy.Field()          # link to the movie's detail page
All the fields we want can be extracted from the list page, so let's put the spider code together.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from baidu.items import BaiduItem


class XiaoHuarSpider(scrapy.Spider):
    name = 'douban_movie_top250'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = BaiduItem()
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # the rating count appears on the page as e.g. "123456人评价"
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            item['daoyan'] = movie.xpath('.//div[@class="bd"]/p/text()').extract()[0]
            item['bieming'] = movie.xpath('.//div[@class="bd"]/p[@class="quote"]/span/text()').extract()[0]
            item['url'] = movie.xpath('.//div[@class="hd"]/a/@href').extract()[0]
            yield item
 

 

Let's run our code on the command line.
scrapy crawl douban_movie_top250 --logfile=test.log -o douban_movie_top250.json -t json
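Incidentally, you can also export CSV (the format referenced later in this post) simply by changing the output filename, since Scrapy infers the feed format from the extension:

scrapy crawl douban_movie_top250 -o douban.csv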

When we run it, though, we get no data back, only the log output.

The site is refusing our requests (forbidden), so let's disguise the crawler as a browser by adding request headers:
class XiaoHuarSpider(scrapy.Spider):
    name = 'douban_movie_top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        # send the first request with browser-like headers
        yield Request(url, headers=self.headers)

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = BaiduItem()
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            item['daoyan'] = movie.xpath('.//div[@class="bd"]/p/text()').extract()[0]
            item['bieming'] = movie.xpath('.//div[@class="bd"]/p[@class="quote"]/span/text()').extract()[0]
            item['url'] = movie.xpath('.//div[@class="hd"]/a/@href').extract()[0]
            yield item
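As an alternative sketch (not what the code above does), you can set the User-Agent once for the whole project in settings.py instead of attaching headers to every request:

# settings.py (excerpt)
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36')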
 

Then let's run our code again.

Take a look at our exported douban.csv (a screenshot appears in the original post).

Now let's check the log file:

(Screenshot of the log output in the original post.)

The log is being written normally, but looking at the data we only have the first page, so let's analyze how the site pages its list.

(Screenshot of the page's "next" link in the original post.)

The next-page link is the a element inside the span with class "next", and its href is relative to the list URL. So we can improve the spider by adding the following at the end of parse():
# follow the "next" link at the bottom of the list, if there is one
next_url = response.xpath('//span[@class="next"]/a/@href').extract()
if next_url:
    next_url = 'https://movie.douban.com/top250' + next_url[0]
    yield Request(next_url, headers=self.headers)
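A slightly more robust variant (a sketch, not the original code) lets the response resolve the relative link instead of concatenating strings by hand:

next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
if next_url:
    yield Request(response.urljoin(next_url), headers=self.headers)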
 

Now run the crawl again.

The data file is noticeably bigger this time, so the pagination works; let's take a look at the content.

(Screenshot in the original post.)

The log likewise shows plenty of pages being crawled that were not reached before.

(Screenshot in the original post.)

Crawling really isn't that complicated, as long as you stick with it.




 

On a side note: I also offer paid one-on-one guidance covering Python automation, Python learning, and Python test development. VIP access is valid for life and currently priced at 700; contact me on QQ at 952943386, and I will post the QQ group later. I went step by step from a blank sheet of paper to being productive with Python, and from 15k to 17k+, so I can share ideas and help you go further. This also suits fresh graduates: I can guide you, but this is not a training class and I do not provide job placement; I will only recommend you when an opportunity comes up. There is no end to learning; for now I can mainly teach testing and offer career guidance, and I may be able to bring you more in the future.


 



 
