Scrapy Crawler in Practice

Source: Internet
Author: User
Tags: add time, xpath

Recently I wanted to study the Scrapy crawler framework, a very powerful Python crawler framework. After watching the Geek College course, I implemented my own Scrapy crawler for the Douban movie Top 250 and stored the results in a MySQL database. The implementation process is described below.
First, look at the structure of the web page.

The corresponding HTML code is:
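The original screenshot of the markup is not reproduced here; the following is a simplified, hypothetical sketch of one movie entry, reconstructed from the XPath expressions used later in the spider (class names follow the real page, the content is made up):

```html
<div class="info">
  <div class="hd">
    <a href="#">
      <span class="title">Movie Name</span>
    </a>
  </div>
  <div class="bd">
    <p>director / year / genre</p>
    <div class="star">
      <span class="rating_num">9.7</span>
    </div>
    <p><span class="inq">A classic quote.</span></p>
  </div>
</div>
```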

As shown above, the goal is to crawl each film's name, brief introduction, Douban rating, and classic quote.

Environment Installation:

pip3 install -U scrapy
pip3 install -U pymysql  # for connecting to the MySQL database

Create a new Scrapy project with the command scrapy startproject Movie. The new project structure is shown in the following figure:
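For reference, the layout that scrapy startproject Movie generates looks roughly like this (the spider file under spiders/ is added by hand):

```text
Movie/
├── scrapy.cfg              # deploy configuration
└── Movie/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── moviespider.py  # the custom spider
```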

The main Python files have the following roles:
1. items.py defines the data that needs to be crawled and processed later;
2. settings.py configures Scrapy: modifying the User-Agent, setting the crawl time interval, setting a proxy, configuring various middlewares, and so on;
3. pipelines.py holds the post-processing functions, keeping data crawling and data processing separate; the code that stores data in the MySQL database goes in this file;
4. moviespider.py is the custom crawler, which extracts each film's name, brief introduction, Douban rating, and classic quote.
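As a concrete illustration, a minimal settings.py for this project might contain the following (the User-Agent string and delay value are arbitrary examples; the pipeline path assumes the project is named Movie):

```python
# settings.py -- minimal sketch, values are illustrative
BOT_NAME = 'Movie'

SPIDER_MODULES = ['Movie.spiders']
NEWSPIDER_MODULE = 'Movie.spiders'

# Present a browser-like User-Agent instead of the Scrapy default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Crawl time interval (seconds) between requests
DOWNLOAD_DELAY = 2

# Enable the MySQL pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'Movie.pipelines.MoviePipeline': 300,
}
```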

Crawled data structure definition (items.py)

from scrapy import Item, Field


class MovieItem(Item):
    title = Field()
    movieInfo = Field()
    star = Field()
    quote = Field()

Crawler (moviespider.py)

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector

from Movie.items import MovieItem


class MovieSpider(Spider):
    name = 'movie'
    url = 'https://movie.douban.com/top250'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for movie in movies:
            item = MovieItem()
            title = movie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''
            for each in title:
                fullTitle += each
            movieInfo = movie.xpath('div[@class="bd"]/p/text()').extract()
            star = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            quote = movie.xpath('div[@class="bd"]/p/span/text()').extract()
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieInfo'] = ';'.join(movieInfo).replace(' ', '').replace('\n', '')
            item['star'] = star
            item['quote'] = quote
            yield item
        nextPage = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextPage:
            nextPage = nextPage[0]
            print(self.url + str(nextPage))
            yield Request(self.url + str(nextPage), callback=self.parse)

Storing data in the MySQL database
First, create a Movie table in the local database; the create-table statement is as follows:

CREATE TABLE Movie (
  id         INT           NOT NULL PRIMARY KEY AUTO_INCREMENT COMMENT 'auto id',
  name       VARCHAR(1024) NOT NULL COMMENT 'movie name',
  movieInfo  VARCHAR(1024) NOT NULL COMMENT 'movie details',
  star       VARCHAR(20)   DEFAULT NULL COMMENT 'Douban rating',
  quote      VARCHAR(1024) DEFAULT NULL COMMENT 'classic quote',
  createTime DATETIME      DEFAULT CURRENT_TIMESTAMP COMMENT 'add time'
)
  ENGINE = InnoDB
  DEFAULT CHARSET = utf8;

The database connection and storage can then be handled in pipelines.py with the following code:

import pymysql


class MoviePipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='***', passwd='***', db='***',
                                    charset='utf8')
        self.cursor = self.conn.cursor()
        self.cursor.execute("TRUNCATE TABLE Movie")
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute(
                "INSERT INTO Movie (name, movieInfo, star, quote) VALUES (%s, %s, %s, %s)",
                (item['title'], item['movieInfo'], item['star'], item['quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error: %s, %s, %s, %s" % (item['title'], item['movieInfo'], item['star'], item['quote']))
        return item

At this point the crawler is finished; let's look at the results.

There are 250 rows of data, consistent with the number of movies on the site. Now look at the content:

The above is the whole process of crawling with the Scrapy framework. The detailed code can be viewed on my GitHub. The author's level is limited; if there are any deficiencies, please do not hesitate to point them out.

Note: when connecting to the database with pymysql, the table fields are all defined as string types, because pymysql could not insert integer and floating-point values into the MySQL table directly; this pitfall took a long time to track down.
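One way around the issue in the note, without changing the table schema, is to coerce every field to str before handing it to pymysql (a sketch; the helper name is made up):

```python
# Hypothetical helper: convert all values to strings so they bind
# cleanly into the VARCHAR columns defined earlier.
def stringify(values):
    return tuple(str(v) for v in values)

# e.g. a rating scraped as a float and a count scraped as an int
params = stringify((9.7, 250, 'An example quote.'))
print(params)  # ('9.7', '250', 'An example quote.')
```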
