Python crawler: crawling movie information from a website and writing it to a MySQL database
This article shows how to write the crawled movie information to a MySQL database so it is easier to browse.
First, the complete code:
# -*- coding: utf-8 -*-
import requests
import re
import mysql.connector

# changepage generates the URLs of the list pages to crawl
def changepage(url, total_page):
    page_group = [url]
    for i in range(2, total_page + 1):
        link = re.sub('jddy/index', 'jddy/index_' + str(i), url)
        page_group.append(link)
    return page_group

# pagelink extracts the movie detail-page links from one list page
def pagelink(url):
    base_url = 'https://www.dygod.net/html/gndy/jddy/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)  # movie list URLs
    reslist = re.findall(pat, req.text)
    finalurl = []
    for i in range(25):  # each list page carries 25 movies
        xurl = reslist[i][0]
        finalurl.append(base_url + xurl)
    return finalurl  # all movie detail-page addresses on this list page

# getdownurl obtains the download address and the movie details from a detail page
def getdownurl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="ftp(.*?)">ftp', re.S)  # the ftp download link
    reslist = re.findall(pat, req.text)
    if not reslist:  # guard: some pages carry no ftp link
        return '', []
    furl = 'ftp' + reslist[0]
    pat2 = re.compile('<!--Content Start-->(.*?)<!--DuguPlayList Start-->', re.S)  # the movie info block
    reslist2 = re.findall(pat2, req.text)
    reslist3 = re.sub('[<p></p>]', '', reslist2[0])
    fdetail = reslist3.split('◎')  # the site delimits each info field with ◎
    return furl, fdetail

# createtable creates the movies table and defines its structure
def createtable(con, cs):
    cs.execute('create table if not exists movies (film_addr varchar(1000), cover_pic varchar(1000), '
               'name varchar(100) primary key, ori_name varchar(100), prod_year varchar(100), '
               'prod_country varchar(100), category varchar(100), language varchar(100), '
               'subtitle varchar(100), release_date varchar(100), score varchar(100), '
               'file_format varchar(100), video_size varchar(100), file_size varchar(100), '
               'film_length varchar(100), director varchar(100), actors varchar(500), '
               'profile varchar(2000), capt_pic varchar(1000))')
    con.commit()  # commit the transaction

# inserttable inserts one movie's address and details into the table
def inserttable(con, cs, x, y):
    try:
        cs.execute('insert into movies values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, '
                   '%s, %s, %s, %s, %s, %s, %s, %s, %s)',
                   (x, y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8], y[9],
                    y[10], y[11], y[12], y[13], y[14], y[15], y[16], y[17]))
    except Exception:
        pass  # skip rows that do not fit the table (e.g. duplicate names, missing fields)
    finally:
        con.commit()

if __name__ == "__main__":
    html = "https://www.dygod.net/html/gndy/jddy/index.html"
    print('The site you are about to crawl is: https://www.dygod.net/html/gndy/jddy/index.html')
    pages = input('Enter the number of pages to crawl: ')
    p1 = changepage(html, int(pages))
    # open the database
    conn = mysql.connector.connect(user='py', password='unix_1234', database='py_test')
    cursor = conn.cursor()
    createtable(conn, cursor)
    # insert the data
    j = 0
    for p1i in p1:
        j = j + 1
        print('Crawling page %d, the URL is %s ...' % (j, p1i))
        p2 = pagelink(p1i)
        for p2i in p2:
            p3, p4 = getdownurl(p2i)
            if len(p3) == 0:
                pass
            else:
                inserttable(conn, cursor, p3, p4)
    # close the database
    cursor.close()
    conn.close()
    print('Crawling of all pages is complete!')
The most important part is the database operations, so here is a brief introduction to connecting Python to a database.
1. Python needs a MySQL driver. Commonly used modules are the official mysql-connector-python, MySQLdb (Python 2.x), and PyMySQL (Python 3.x). These modules are both drivers and tools: with them you can operate a MySQL database directly by writing SQL statements in Python, for example creating a user table:
cursor.execute('create table user (id int, name varchar(20))')
# the create table statement here is a typical SQL statement
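Every DB-API 2.0 driver follows the same connect/cursor/execute/commit pattern. As a self-contained sketch, the standard-library sqlite3 driver is used below only so the example runs without a MySQL server; with mysql-connector-python the calls look the same (only the connect arguments and the placeholder style differ):

```python
import sqlite3

# sqlite3 stands in for a MySQL driver here: same DB-API pattern, no server needed.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

cursor.execute('create table user (id int, name varchar(20))')
cursor.execute('insert into user values (?, ?)', (1, 'alice'))  # MySQL drivers use %s placeholders
conn.commit()

cursor.execute('select name from user where id = ?', (1,))
name = cursor.fetchone()[0]
print(name)  # alice

cursor.close()
conn.close()
```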
2. In many cases we use an object-relational mapping (ORM) framework to map the programming language's object model onto the relational database model (RDBMS). This lets you operate the database directly through the language's objects instead of writing SQL. Creating a user table then looks like this:
user = Table('user', metadata,
             Column('id', Integer),
             Column('name', String(20))
)
metadata.create_all()
# Notice there is no SQL statement at all, so we can focus on Python code rather than SQL. (Note that an ORM does not include a driver; to use it you must still install one of the drivers mentioned above.)
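The Table/Column/metadata API in the snippet above comes from SQLAlchemy Core. A minimal runnable sketch of the same idea, using an in-memory SQLite engine so no database server is required (for MySQL the URL would name a MySQL driver instead):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

# In-memory SQLite engine so the sketch runs standalone.
engine = create_engine('sqlite://')
metadata = MetaData()

user = Table('user', metadata,
             Column('id', Integer),
             Column('name', String(20)))

metadata.create_all(engine)  # emits the CREATE TABLE statement for us

with engine.connect() as conn:
    conn.execute(user.insert().values(id=1, name='alice'))
    rows = conn.execute(user.select()).fetchall()
print(rows)
```

No SQL string appears anywhere: the table definition and the insert/select are all expressed through Python objects.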
If you are interested, you can study ORMs on your own; they are not the focus of this article. For simplicity, mysql-connector-python is used here.
The regular-expression matching is also very simple, because the source pages are quite regular: the movie details sit between fixed HTML comment markers, and the individual fields are delimited by the ◎ character, so we match the block and split on ◎ directly.
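The parsing steps can be tried on a tiny hand-made snippet. The markup below is a hypothetical stand-in for a real detail page, trimmed to the markers the crawler keys on:

```python
import re

# Hypothetical stand-in for a detail page (the real markers are the two HTML comments).
sample = ('<!--Content Start--><p>◎Title Some Movie◎Year 2015'
          '◎Country USA</p><!--DuguPlayList Start-->')

pat = re.compile('<!--Content Start-->(.*?)<!--DuguPlayList Start-->', re.S)
info = re.findall(pat, sample)[0]
info = re.sub('[<p></p>]', '', info)  # a character class: strips every '<', 'p', '>', '/' character
fields = info.split('◎')              # the site delimits each info field with ◎
print(fields)  # ['', 'Title Some Movie', 'Year 2015', 'Country USA']
```

Note the `[<p></p>]` pattern is a character class, not a tag match: it also removes any stray `<`, `p`, `>`, or `/` characters inside the field text itself.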
After the program runs, data is written to the movies table.
For example, if I want to filter for movies with a Douban score above 7, I can query the movies table directly.
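Since the score column is stored as varchar (the raw text of the ◎Score field), a numeric comparison needs a little parsing first. A minimal sketch, assuming the field carries text like '7.8/10 from 12345 users' (the exact format is an assumption, as are the sample rows):

```python
import re

def score_value(score_text):
    """Pull the leading numeric rating out of a raw varchar score field.

    Returns None when no number is present (empty or malformed field).
    """
    m = re.search(r'\d+(?:\.\d+)?', score_text or '')
    return float(m.group()) if m else None

def filter_by_score(rows, threshold=7.0):
    """Keep (name, score) rows whose parsed score exceeds the threshold."""
    return [(name, s) for name, s in rows
            if score_value(s) is not None and score_value(s) > threshold]

# Hypothetical rows as they might come back from
# cursor.execute('select name, score from movies'):
rows = [('Movie A', '7.8/10 from 12345 users'),
        ('Movie B', '6.4/10 from 999 users'),
        ('Movie C', '')]
print(filter_by_score(rows))  # [('Movie A', '7.8/10 from 12345 users')]
```

Alternatively, if the column held only the bare number, a `where cast(score as decimal(3,1)) > 7` clause in the SQL itself would do.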
Is it easy? Have you got it?