Recently I wanted to get familiar with Scrapy, a very powerful Python crawler framework. After watching the Geek College course, I implemented my own Scrapy crawler for the Douban Top 250 movies and stored the results in a MySQL database. This post walks through the implementation process.
First, look at the structure of the web page.
The corresponding HTML code is:
As shown above, the goal is to crawl each film's title, brief introduction, Douban rating, and famous quote.
Environment installation:
pip3 install -U scrapy
pip3 install -U pymysql  # for connecting to the MySQL database
Create a new Scrapy project with the scrapy startproject Movie command. The new project structure is shown in the figure below:
Among them, the main Python files serve the following purposes:
1. items.py defines the data to be crawled and processed later;
2. settings.py configures Scrapy: modifying the User-Agent, setting the crawl interval, setting proxies, configuring various middlewares, and so on;
3. pipeline.py holds the post-processing of the crawled data, keeping crawling and processing separate; the code that stores the data into the MySQL database goes here;
4. moviespider.py is the custom spider, which crawls each film's title, brief introduction, Douban rating, and quote.
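To illustrate point 2, here is a minimal settings.py sketch. The User-Agent string, delay value, and pipeline priority are example choices of mine, not taken from the original project:

```python
# Example Scrapy settings for the Movie project (values are illustrative)
BOT_NAME = 'Movie'

SPIDER_MODULES = ['Movie.spiders']
NEWSPIDER_MODULE = 'Movie.spiders'

# Pretend to be a regular browser so the site does not reject the requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Wait between requests to avoid hammering the site
DOWNLOAD_DELAY = 2

# Route crawled items through the MySQL pipeline defined in pipeline.py
ITEM_PIPELINES = {'Movie.pipeline.MoviePipeline': 300}
```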
Crawled data structure definition (items.py)
from scrapy import Item, Field

class MovieItem(Item):
    title = Field()
    movieinfo = Field()
    star = Field()
    quote = Field()
Crawler (moviespider.py)
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from Movie.items import MovieItem

class MovieSpider(Spider):
    name = 'movie'
    url = 'https://movie.douban.com/top250'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        item = MovieItem()
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for movie in movies:
            # The title may be split across several <span> nodes; concatenate them
            title = movie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''
            for each in title:
                fullTitle += each
            movieInfo = movie.xpath('div[@class="bd"]/p/text()').extract()
            star = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            quote = movie.xpath('div[@class="bd"]/p/span/text()').extract()
            if quote:
                quote = quote[0]
            else:
                quote = ''
            item['title'] = fullTitle
            item['movieinfo'] = ';'.join(movieInfo).replace(' ', '').replace('\n', '')
            item['star'] = star
            item['quote'] = quote
            yield item
        # Follow the "next page" link until the last page is reached
        nextPage = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextPage:
            nextPage = nextPage[0]
            print(self.url + str(nextPage))
            yield Request(self.url + str(nextPage), callback=self.parse)
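The movieinfo field is built by joining the extracted text nodes and stripping whitespace. That cleaning step can be tried on its own with the standard library; the sample strings below are made up for illustration, mimicking what extract() might return:

```python
# Hypothetical text nodes as Scrapy's extract() might return them
movie_info = ['\n  Director: Frank Darabont ', '\n  1994 / USA / Crime Drama\n  ']

# Same cleaning as in the spider: join with ';', then drop spaces and newlines
cleaned = ';'.join(movie_info).replace(' ', '').replace('\n', '')
print(cleaned)  # → Director:FrankDarabont;1994/USA/CrimeDrama
```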
Storing the data in the MySQL database
First, create a Movie table in the local database. The CREATE TABLE statement is as follows:
CREATE TABLE Movie (
  id INT NOT NULL PRIMARY KEY AUTO_INCREMENT COMMENT 'auto id',
  name VARCHAR(1024) NOT NULL COMMENT 'movie name',
  movieinfo VARCHAR(1024) NOT NULL COMMENT 'movie details',
  star VARCHAR(20) DEFAULT NULL COMMENT 'Douban score',
  quote VARCHAR(1024) DEFAULT NULL COMMENT 'classic lines',
  createtime DATETIME DEFAULT CURRENT_TIMESTAMP COMMENT 'add time'
)
  ENGINE = InnoDB
  DEFAULT CHARSET = utf8;
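The pipeline further down uses pymysql against this table, but the parameterized-INSERT pattern it relies on can be tried without a MySQL server. Here is a sketch of the same idea using Python's built-in sqlite3 module (note sqlite3 uses ? placeholders where pymysql uses %s, and the schema is simplified; the row values are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Movie (name TEXT, movieinfo TEXT, star TEXT, quote TEXT)")

# Parameterized insert: the driver handles quoting and escaping of the values
row = ('The Shawshank Redemption', 'Director:FrankDarabont', '9.7', 'Hope is a good thing.')
cur.execute("INSERT INTO Movie (name, movieinfo, star, quote) VALUES (?, ?, ?, ?)", row)
conn.commit()

cur.execute("SELECT name, star FROM Movie")
print(cur.fetchall())
```

Passing values as a separate tuple rather than formatting them into the SQL string also avoids SQL injection from scraped content.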
Then connect to the database and store the data. This is done in pipeline.py with the following code:
import pymysql

class MoviePipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='***', passwd='***', db='***',
                                    charset='utf8')
        self.cursor = self.conn.cursor()
        # Empty the table before each crawl so the results are not duplicated
        self.cursor.execute("TRUNCATE TABLE Movie")
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("INSERT INTO Movie (name, movieinfo, star, quote) VALUES (%s, %s, %s, %s)",
                                (item['title'], item['movieinfo'], item['star'], item['quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error %s, %s, %s, %s" % (item['title'], item['movieinfo'], item['star'], item['quote']))
        return item
At this point, the crawler is finished. Looking at the results:
There are 250 rows of data, consistent with the number of movies on the page. Then check the content:
The above is the whole process of crawling with the Scrapy framework. The full code can be viewed on my GitHub. My level is limited, so if there are any deficiencies, please point them out.
Note: the table fields are all defined as string types because, when connecting with pymysql in my setup, I could not get integer and floating-point values inserted into the MySQL table; this pitfall took a long time to find.
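One way around the issue described above is to coerce every value to a string before handing it to the driver. A tiny sketch of that idea (the values are made up; mapping None to an empty string is my choice, to avoid the literal text 'None' ending up in the table):

```python
# Values as they might come out of the spider: the rating is numeric, quote missing
raw_values = ('The Shawshank Redemption', 9.7, None)

# Coerce everything to str so only string parameters reach the driver
safe_values = tuple('' if v is None else str(v) for v in raw_values)
print(safe_values)  # → ('The Shawshank Redemption', '9.7', '')
```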