Recently I wanted to get familiar with Scrapy, a very powerful Python crawler framework. After watching the Geek College course, I implemented my own Scrapy crawler for the Douban Top 250 movies and stored the results in a MySQL database. This post walks through the implementation process.
First, look at the structure of the web page.
The corresponding HTML code is:
As shown above, the goal is to crawl each film's title, brief introduction, Douban rating, and famous quote.
Environment installation:
pip3 install -U scrapy
pip3 install -U pymysql  # for connecting to the MySQL database
Create a new Scrapy project with the scrapy startproject Movie command. The new project structure is shown in the figure below:
Among them, the main Python files serve the following purposes:
1. items.py defines the data to be crawled and processed later;
2. settings.py configures Scrapy: modifying the User-Agent, setting the crawl interval, setting proxies, configuring various middlewares, and so on;
3. pipeline.py holds the post-processing of the crawled data, keeping crawling and processing separate; the code that stores the data into the MySQL database goes here;
4. moviespider.py is the custom spider, which crawls each film's title, brief introduction, Douban rating, and quote.
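To illustrate point 2, here is a minimal settings.py sketch. The User-Agent string, delay value, and pipeline priority are example choices of mine, not taken from the original project:

```python
# Example Scrapy settings for the Movie project (values are illustrative)
BOT_NAME = 'Movie'

SPIDER_MODULES = ['Movie.spiders']
NEWSPIDER_MODULE = 'Movie.spiders'

# Pretend to be a regular browser so the site does not reject the requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Wait between requests to avoid hammering the site
DOWNLOAD_DELAY = 2

# Route crawled items through the MySQL pipeline defined in pipeline.py
ITEM_PIPELINES = {'Movie.pipeline.MoviePipeline': 300}
```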
Crawled data structure definition (items.py)
from scrapy import Item, Field

class MovieItem(Item):
    title = Field()
    movieinfo = Field()
    star = Field()
    quote = Field()
Crawler (moviespider.py)
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from Movie.items import MovieItem

class MovieSpider(Spider):
    name = 'movie'
    url = 'https://movie.douban.com/top250'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        item = MovieItem()
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for movie in movies:
            # The title may be split across several <span> nodes; concatenate them
            title = movie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''
            for each in title:
                fullTitle += each
            movieInfo = movie.xpath('div[@class="bd"]/p/text()').extract()
            star = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            quote = movie.xpath('div[@class="bd"]/p/span/text()').extract()
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieinfo'] = ';'.join(movieInfo).replace(' ', '').replace('\n', '')
            item['star'] = star
            item['quote'] = quote
            yield item
        # Follow the "next page" link until the last page is reached
        nextPage = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextPage:
            nextPage = nextPage[0]
            print(self.url + str(nextPage))
            yield Request(self.url + str(nextPage), callback=self.parse)
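The movieinfo field is built by joining the extracted text nodes and stripping whitespace. That cleaning step can be tried on its own with the standard library; the sample strings below are made up for illustration, mimicking what extract() might return:

```python
# Hypothetical text nodes as Scrapy's extract() might return them
movie_info = ['\n  Director: Frank Darabont ', '\n  1994 / USA / Crime Drama\n  ']

# Same cleaning as in the spider: join with ';', then drop spaces and newlines
cleaned = ';'.join(movie_info).replace(' ', '').replace('\n', '')
print(cleaned)  # → Director:FrankDarabont;1994/USA/CrimeDrama
```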
Storing the data in the MySQL database
First, create a Movie table in the local database. The CREATE TABLE statement is as follows:
CREATE TABLE Movie (
  id INT NOT NULL PRIMARY KEY AUTO_INCREMENT COMMENT 'auto id',
  name VARCHAR(1024) NOT NULL COMMENT 'movie name',
  movieinfo VARCHAR(1024) NOT NULL COMMENT 'movie details',
  star VARCHAR(20) DEFAULT NULL COMMENT 'Douban score',
  quote VARCHAR(1024) DEFAULT NULL COMMENT 'classic lines',
  createtime DATETIME DEFAULT CURRENT_TIMESTAMP COMMENT 'add time'
)
  ENGINE = InnoDB
  DEFAULT CHARSET = utf8;
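The pipeline further down uses pymysql against this table, but the parameterized-INSERT pattern it relies on can be tried without a MySQL server. Here is a sketch of the same idea using Python's built-in sqlite3 module (note sqlite3 uses ? placeholders where pymysql uses %s, and the schema is simplified; the row values are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Movie (name TEXT, movieinfo TEXT, star TEXT, quote TEXT)")

# Parameterized insert: the driver handles quoting and escaping of the values
row = ('The Shawshank Redemption', 'Director:FrankDarabont', '9.7', 'Hope is a good thing.')
cur.execute("INSERT INTO Movie (name, movieinfo, star, quote) VALUES (?, ?, ?, ?)", row)
conn.commit()

cur.execute("SELECT name, star FROM Movie")
print(cur.fetchall())
```

Passing values as a separate tuple rather than formatting them into the SQL string also avoids SQL injection from scraped content.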
Then connect to the database and store the data. This is done in pipeline.py with the following code:
import pymysql

class MoviePipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='***', passwd='***', db='***',
                                    charset='utf8')
        self.cursor = self.conn.cursor()
        # Empty the table before each crawl so the results are not duplicated
        self.cursor.execute("TRUNCATE TABLE Movie")
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("INSERT INTO Movie (name, movieinfo, star, quote) VALUES (%s, %s, %s, %s)",
                                (item['title'], item['movieinfo'], item['star'], item['quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error %s, %s, %s, %s" % (item['title'], item['movieinfo'], item['star'], item['quote']))
        return item
At this point, the crawler is finished. Looking at the results:
There are 250 rows of data, consistent with the number of movies on the page. Then check the content:
The above is the whole process of crawling with the Scrapy framework. The full code can be viewed on my GitHub. My level is limited, so if there are any deficiencies, please point them out.
Note: the table fields are all defined as string types because, when connecting with pymysql in my setup, I could not get integer and floating-point values inserted into the MySQL table; this pitfall took a long time to find.
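One way around the issue described above is to coerce every value to a string before handing it to the driver. A tiny sketch of that idea (the values are made up; mapping None to an empty string is my choice, to avoid the literal text 'None' ending up in the table):

```python
# Values as they might come out of the spider: the rating is numeric, quote missing
raw_values = ('The Shawshank Redemption', 9.7, None)

# Coerce everything to str so only string parameters reach the driver
safe_values = tuple('' if v is None else str(v) for v in raw_values)
print(safe_values)  # → ('The Shawshank Redemption', '9.7', '')
```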