Python crawler: crawling movie information from a website and writing it to a MySQL database
This article shows how to write the crawled movie information to a MySQL database so it is easier to browse.
First, the complete code:
# -*- coding: utf-8 -*-
import requests
import re
import mysql.connector

# changepage generates the URLs of the list pages to crawl
def changepage(url, total_page):
    page_group = [url]
    for i in range(2, total_page + 1):
        link = re.sub('jddy/index', 'jddy/index_' + str(i), url)
        page_group.append(link)
    return page_group

# pagelink extracts the movie detail-page links from one list page
def pagelink(url):
    base_url = 'https://www.dygod.net/html/gndy/jddy/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)  # movie list URLs
    reslist = re.findall(pat, req.text)
    finalurl = []
    for i in range(25):  # each list page carries 25 movies
        xurl = reslist[i][0]
        finalurl.append(base_url + xurl)
    return finalurl  # all movie detail-page addresses on this list page

# getdownurl obtains the download address and the movie details from a detail page
def getdownurl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="ftp(.*?)">ftp', re.S)  # the ftp download link
    reslist = re.findall(pat, req.text)
    if not reslist:  # guard: some pages carry no ftp link
        return '', []
    furl = 'ftp' + reslist[0]
    pat2 = re.compile('<!--Content Start-->(.*?)<!--DuguPlayList Start-->', re.S)  # the movie info block
    reslist2 = re.findall(pat2, req.text)
    reslist3 = re.sub('[<p></p>]', '', reslist2[0])
    fdetail = reslist3.split('◎')  # the site delimits each info field with ◎
    return furl, fdetail

# createtable creates the movies table and defines its structure
def createtable(con, cs):
    cs.execute('create table if not exists movies (film_addr varchar(1000), cover_pic varchar(1000), '
               'name varchar(100) primary key, ori_name varchar(100), prod_year varchar(100), '
               'prod_country varchar(100), category varchar(100), language varchar(100), '
               'subtitle varchar(100), release_date varchar(100), score varchar(100), '
               'file_format varchar(100), video_size varchar(100), file_size varchar(100), '
               'film_length varchar(100), director varchar(100), actors varchar(500), '
               'profile varchar(2000), capt_pic varchar(1000))')
    con.commit()  # commit the transaction

# inserttable inserts one movie's address and details into the table
def inserttable(con, cs, x, y):
    try:
        cs.execute('insert into movies values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, '
                   '%s, %s, %s, %s, %s, %s, %s, %s, %s)',
                   (x, y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8], y[9],
                    y[10], y[11], y[12], y[13], y[14], y[15], y[16], y[17]))
    except Exception:
        pass  # skip rows that do not fit the table (e.g. duplicate names, missing fields)
    finally:
        con.commit()

if __name__ == "__main__":
    html = "https://www.dygod.net/html/gndy/jddy/index.html"
    print('The site you are about to crawl is: https://www.dygod.net/html/gndy/jddy/index.html')
    pages = input('Enter the number of pages to crawl: ')
    p1 = changepage(html, int(pages))
    # open the database
    conn = mysql.connector.connect(user='py', password='unix_1234', database='py_test')
    cursor = conn.cursor()
    createtable(conn, cursor)
    # insert the data
    j = 0
    for p1i in p1:
        j = j + 1
        print('Crawling page %d, the URL is %s ...' % (j, p1i))
        p2 = pagelink(p1i)
        for p2i in p2:
            p3, p4 = getdownurl(p2i)
            if len(p3) == 0:
                pass
            else:
                inserttable(conn, cursor, p3, p4)
    # close the database
    cursor.close()
    conn.close()
    print('Crawling of all pages is complete!')
The most important part is the database operations, so here is a brief introduction to connecting Python to a database.
1. Python needs a MySQL driver. Commonly used modules are the official mysql-connector-python, MySQLdb (Python 2.x), and PyMySQL (Python 3.x). These modules are both drivers and tools: with them you can operate a MySQL database directly by writing SQL statements in Python, for example creating a user table:
cursor.execute('create table user (id int, name varchar(20))')
# the create table statement here is a typical SQL statement
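Every DB-API 2.0 driver follows the same connect/cursor/execute/commit pattern. As a self-contained sketch, the standard-library sqlite3 driver is used below only so the example runs without a MySQL server; with mysql-connector-python the calls look the same (only the connect arguments and the placeholder style differ):

```python
import sqlite3

# sqlite3 stands in for a MySQL driver here: same DB-API pattern, no server needed.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

cursor.execute('create table user (id int, name varchar(20))')
cursor.execute('insert into user values (?, ?)', (1, 'alice'))  # MySQL drivers use %s placeholders
conn.commit()

cursor.execute('select name from user where id = ?', (1,))
name = cursor.fetchone()[0]
print(name)  # alice

cursor.close()
conn.close()
```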
2. In many cases we use an object-relational mapping (ORM) framework to map the programming language's object model onto the relational database model (RDBMS). This lets you operate the database directly through the language's objects instead of writing SQL. Creating a user table then looks like this:
user = Table('user', metadata,
             Column('id', Integer),
             Column('name', String(20))
)
metadata.create_all()
# Notice there is no SQL statement at all, so we can focus on Python code rather than SQL. (Note that an ORM does not include a driver; to use it you must still install one of the drivers mentioned above.)
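The Table/Column/metadata API in the snippet above comes from SQLAlchemy Core. A minimal runnable sketch of the same idea, using an in-memory SQLite engine so no database server is required (for MySQL the URL would name a MySQL driver instead):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

# In-memory SQLite engine so the sketch runs standalone.
engine = create_engine('sqlite://')
metadata = MetaData()

user = Table('user', metadata,
             Column('id', Integer),
             Column('name', String(20)))

metadata.create_all(engine)  # emits the CREATE TABLE statement for us

with engine.connect() as conn:
    conn.execute(user.insert().values(id=1, name='alice'))
    rows = conn.execute(user.select()).fetchall()
print(rows)
```

No SQL string appears anywhere: the table definition and the insert/select are all expressed through Python objects.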
If you are interested, you can study ORMs on your own; they are not the focus of this article. For simplicity, mysql-connector-python is used here.
The regular-expression matching is also very simple, because the source pages are quite regular: the movie details sit between fixed HTML comment markers, and the individual fields are delimited by the ◎ character, so we match the block and split on ◎ directly.
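The parsing steps can be tried on a tiny hand-made snippet. The markup below is a hypothetical stand-in for a real detail page, trimmed to the markers the crawler keys on:

```python
import re

# Hypothetical stand-in for a detail page (the real markers are the two HTML comments).
sample = ('<!--Content Start--><p>◎Title Some Movie◎Year 2015'
          '◎Country USA</p><!--DuguPlayList Start-->')

pat = re.compile('<!--Content Start-->(.*?)<!--DuguPlayList Start-->', re.S)
info = re.findall(pat, sample)[0]
info = re.sub('[<p></p>]', '', info)  # a character class: strips every '<', 'p', '>', '/' character
fields = info.split('◎')              # the site delimits each info field with ◎
print(fields)  # ['', 'Title Some Movie', 'Year 2015', 'Country USA']
```

Note the `[<p></p>]` pattern is a character class, not a tag match: it also removes any stray `<`, `p`, `>`, or `/` characters inside the field text itself.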
After the program runs, data is written to the movies table.
For example, if I want to filter for movies with a Douban score above 7, I can query the movies table directly.
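Since the score column is stored as varchar (the raw text of the ◎Score field), a numeric comparison needs a little parsing first. A minimal sketch, assuming the field carries text like '7.8/10 from 12345 users' (the exact format is an assumption, as are the sample rows):

```python
import re

def score_value(score_text):
    """Pull the leading numeric rating out of a raw varchar score field.

    Returns None when no number is present (empty or malformed field).
    """
    m = re.search(r'\d+(?:\.\d+)?', score_text or '')
    return float(m.group()) if m else None

def filter_by_score(rows, threshold=7.0):
    """Keep (name, score) rows whose parsed score exceeds the threshold."""
    return [(name, s) for name, s in rows
            if score_value(s) is not None and score_value(s) > threshold]

# Hypothetical rows as they might come back from
# cursor.execute('select name, score from movies'):
rows = [('Movie A', '7.8/10 from 12345 users'),
        ('Movie B', '6.4/10 from 999 users'),
        ('Movie C', '')]
print(filter_by_score(rows))  # [('Movie A', '7.8/10 from 12345 users')]
```

Alternatively, if the column held only the bare number, a `where cast(score as decimal(3,1)) > 7` clause in the SQL itself would do.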
Is it easy? Have you got it?