Python crawler: crawl movie information from a website and write it to the MySQL database


This article writes the crawled movie information into a MySQL database so that it is easier to browse.

First, let's go straight to the code:

# -*- coding: utf-8 -*-
import requests
import re
import mysql.connector

# changepage is used to generate the links of the different list pages
def changepage(url, total_page):
    page_group = ['https://www.dygod.net/html/gndy/jddy/index.html']
    for i in range(2, total_page + 1):
        link = re.sub('jddy/index', 'jddy/index_' + str(i), url)
        page_group.append(link)
    return page_group

# pagelink is used to collect the individual movie pages listed on one list page
def pagelink(url):
    base_url = 'https://www.dygod.net/html/gndy/jddy/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)  # obtain the movie list URLs
    reslist = re.findall(pat, req.text)
    finalurl = []
    for i in range(25):
        xurl = reslist[i][0]
        finalurl.append(base_url + xurl)
    return finalurl  # return all movie page addresses on this list page

# getdownurl obtains the download address and the description of one movie page
def getdownurl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="ftp(.*?)">ftp', re.S)  # obtain the ftp download link
    reslist = re.findall(pat, req.text)
    furl = 'ftp' + reslist[0]
    pat2 = re.compile('<!--Content Start-->(.*?)<!--duguPlayList Start-->', re.S)  # obtain the movie description block
    reslist2 = re.findall(pat2, req.text)
    reslist3 = re.sub('[<p></p>]', '', reslist2[0])
    fdetail = reslist3.split('◎')  # every field of the description is prefixed with ◎
    return (furl, fdetail)

# createtable creates the movies table and defines its structure
def createtable(con, cs):
    cs.execute('create table if not exists movies (film_addr varchar(1000), cover_pic varchar(1000), name varchar(100) primary key, \
                ori_name varchar(100), prod_year varchar(100), prod_country varchar(100), category varchar(100), language varchar(100), \
                subtitle varchar(100), release_date varchar(100), score varchar(100), file_format varchar(100), video_size varchar(100), \
                file_size varchar(100), film_length varchar(100), director varchar(100), actors varchar(500), profile varchar(2000), capt_pic varchar(1000))')
    con.commit()  # commit the transaction

# inserttable inserts one movie address and its description fields into the table
def inserttable(con, cs, x, y):
    try:
        # 19 columns: the download address plus 18 description fields
        cs.execute('insert into movies values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                   (x, y[0], y[1], y[2], y[3], y[4], y[5], y[6], y[7], y[8], y[9], y[10],
                    y[11], y[12], y[13], y[14], y[15], y[16], y[17]))
    except Exception:
        pass  # skip duplicates and malformed rows
    finally:
        con.commit()

if __name__ == "__main__":
    html = "https://www.dygod.net/html/gndy/jddy/index.html"
    print("The site you are about to crawl is: https://www.dygod.net/html/gndy/jddy/index.html")
    pages = input('Enter the number of pages to crawl: ')
    p1 = changepage(html, int(pages))
    # open the database
    conn = mysql.connector.connect(user='py', password='unix_1234', database='py_test')
    cursor = conn.cursor()
    createtable(conn, cursor)
    # insert data
    j = 0
    for p1i in p1:
        j = j + 1
        print('Crawling page %d, the URL is %s ...' % (j, p1i))
        p2 = pagelink(p1i)
        for p2i in p2:
            p3, p4 = getdownurl(p2i)
            if len(p3) == 0:
                pass
            else:
                inserttable(conn, cursor, p3, p4)
    # close the database
    cursor.close()
    conn.close()
    print('Crawling all page addresses is complete!')

The key new part here is the database operations, so the following briefly introduces how Python connects to MySQL.

1. Python needs a MySQL driver. Commonly used modules include the official mysql-connector-python, MySQLdb (Python 2.x), and PyMySQL (Python 3.x). These modules are both drivers and tools: with them you operate the MySQL database directly by writing SQL statements in Python, for example to create a user table:

cursor.execute('create table user (id int, name varchar(20))')

# The create table statement is a typical SQL statement.
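To make the driver approach concrete, here is a minimal sketch of the full connect / execute / commit / close cycle with mysql-connector-python, reusing the same py / py_test account that the crawler above uses (the inserted row is just toy data):

import mysql.connector

# connect with the same account the crawler uses (adjust host/credentials as needed)
conn = mysql.connector.connect(user='py', password='unix_1234', database='py_test')
cursor = conn.cursor()
cursor.execute('create table if not exists user (id int, name varchar(20))')
cursor.execute('insert into user values (%s, %s)', (1, 'Alice'))
conn.commit()   # commit the transaction so the insert is persisted
cursor.close()
conn.close()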

2. In many cases we instead use an object-relational mapping (ORM) framework, which maps the programming language's object model onto the relational database model (RDBMS). That way you operate the database through the language's own objects rather than through SQL. Creating the same user table with an ORM such as SQLAlchemy looks like this:

from sqlalchemy import Table, Column, Integer, String, MetaData, create_engine

# the engine here reuses the same MySQL account as the crawler above
engine = create_engine('mysql+mysqlconnector://py:unix_1234@localhost/py_test')
metadata = MetaData()
user = Table('user', metadata,
    Column('id', Integer),
    Column('name', String(20))
)
metadata.create_all(engine)
# Note that there is no SQL statement at all here, so we can focus on Python code rather than SQL. (An ORM does not include a driver; to use it you still need to install one of the drivers mentioned above.)

If you are interested, you can explore ORMs on your own; they are not the focus of this article. For simplicity, mysql-connector-python is used here.

The regular expression matching is also straightforward, because the page source is fairly regular: the movie description sits between fixed comment markers, and every field in it is prefixed with the ◎ character, so the whole block can be extracted and then split on ◎ directly.
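As a quick illustration (the HTML fragment below is invented, but it mirrors the structure the regular expressions in getdownurl rely on), the description block between the two comment markers can be pulled out, stripped, and split on ◎ like this:

import re

# invented fragment for illustration; the real page is fetched with requests and decoded as gbk
sample = '<!--Content Start--><p>◎译名 Avatar</p><p>◎年代 2009</p><p>◎产地 USA</p><!--duguPlayList Start-->'

pat2 = re.compile('<!--Content Start-->(.*?)<!--duguPlayList Start-->', re.S)
detail = re.findall(pat2, sample)[0]
detail = re.sub('[<p></p>]', '', detail)   # character class: strips the literal characters <, p, >, /
print(detail.split('◎'))                   # ['', '译名 Avatar', '年代 2009', '产地 USA']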

After the program runs, data is written to the movies table.

For example, if I want to pick out the movies with a Douban score above 7, I can query the movies table with a small script like the one below.
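A minimal sketch of such a query: the score column is a varchar and its exact text depends on the page, so here the first number in the field is extracted and compared in Python rather than in SQL.

import re
import mysql.connector

conn = mysql.connector.connect(user='py', password='unix_1234', database='py_test')
cursor = conn.cursor()
cursor.execute('select name, score, film_addr from movies')
for name, score, film_addr in cursor.fetchall():
    # pull the first number out of the score text, e.g. "7.8" from "... 7.8/10 ..."
    match = re.search(r'\d+(?:\.\d+)?', score or '')
    if match and float(match.group()) > 7:
        print(name, score, film_addr)
cursor.close()
conn.close()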

Is it easy? Have you got it?
