Using Requests to crawl the Maoyan movie board and store the data in a MySQL database

Source: Internet
Author: User

In a previous post we looked at how to operate a database through PyMySQL. This time we write a crawler that extracts movie information and stores the data in a MySQL database.
1. Crawl target

Crawl the Maoyan movie TOP100 list. The information to be extracted includes: movie rank, movie name, release time, and score.

2. Analysis of the HTML source page

You can see that each movie entry is wrapped in a pair of <dd>...</dd> tags, so we only need to extract the above information from each tag pair, using regular expressions.

3. The complete process

This example has 2 key points: writing the regular expression, and processing the data (writing it into the MySQL database).

(1) Writing the regular expression
    pattern = re.compile(
        r'<dd>.*?<i.*?>(\d+)</i>.*?'                    # movie rank
        r'<p class="name"><a.*?data-val=".*?">(.*?)'    # movie name
        r'</a>.*?<p.*?class="releasetime">(.*?)</p>'    # release time
        r'.*?<i.*?"integer">(.*?)</i>'                  # integer part of the score
        r'.*?<i.*?"fraction">(.*?)</i>.*?</dd>', re.S)  # fractional part of the score; re.S lets . match across lines
    m = pattern.findall(html)
    # print(m)
Use the findall() method to match everything that satisfies the pattern; it returns a list. Uncommenting print(m) shows the result for one page.
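To see what findall() returns, here is a minimal, self-contained run of the same pattern against a sample snippet modeled on Maoyan's board markup (the HTML below is a hypothetical example, trimmed to the fields the pattern captures):

```python
import re

# Hypothetical <dd> block modeled on Maoyan's markup, for illustration only.
html = (
    '<dd>'
    '<i class="board-index board-index-1">1</i>'
    '<p class="name"><a href="/films/1203" data-val="{movieId:1203}">霸王别姬</a></p>'
    '<p class="releasetime">上映时间:1993-01-01</p>'
    '<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>'
    '</dd>'
)

pattern = re.compile(
    r'<dd>.*?<i.*?>(\d+)</i>.*?'                    # movie rank
    r'<p class="name"><a.*?data-val=".*?">(.*?)'    # movie name
    r'</a>.*?<p.*?class="releasetime">(.*?)</p>'    # release time
    r'.*?<i.*?"integer">(.*?)</i>'                  # integer part of the score
    r'.*?<i.*?"fraction">(.*?)</i>.*?</dd>', re.S)  # fractional part of the score

m = pattern.findall(html)
print(m)  # [('1', '霸王别姬', '上映时间:1993-01-01', '9.', '6')]
```

Each movie becomes one tuple in the returned list, which is why get_data() below iterates over the result and indexes into each tuple.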

(2) Complete code. Note how the get_data() function processes the data, and how the write_sql() function writes it to the database.
# coding:utf-8
# Author: hmk
import requests
import re
import pymysql


def get_html(url, header):
    response = requests.get(url, headers=header)
    if response.status_code == 200:
        return response.text
    else:
        return None


def get_data(html, list_data):
    pattern = re.compile(
        r'<dd>.*?<i.*?>(\d+)</i>.*?'                    # movie rank
        r'<p class="name"><a.*?data-val=".*?">(.*?)'    # movie name
        r'</a>.*?<p.*?class="releasetime">(.*?)</p>'    # release time
        r'.*?<i.*?"integer">(.*?)</i>'                  # integer part of the score
        r'.*?<i.*?"fraction">(.*?)</i>.*?</dd>', re.S)  # fractional part of the score
    m = pattern.findall(html)
    for i in m:
        # findall returns a list in which each movie is a tuple,
        # so iterate over it and pick out each field.
        ranking = i[0]        # rank
        movie = i[1]          # movie name
        release_time = i[2]   # release time
        score = i[3] + i[4]   # join the integer and fractional parts of the score
        # Put each movie's fields in a small list and append it to the big
        # list, so that in the end the big list holds all the movie information.
        list_data.append([ranking, movie, release_time, score])


def write_sql(data):
    """The data parameter here is the matched and processed big list: it
    contains all the movie information, one small list per movie. Iterate
    over it and write each movie to the database."""
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='123456',
                           db='test',
                           charset='utf8')
    cur = conn.cursor()
    for i in data:
        movie = i  # one movie's data, ready to be inserted
        sql = "insert into maoyan_movie(ranking,movie,release_time,score) values(%s,%s,%s,%s)"  # SQL insert statement
        try:
            cur.execute(sql, movie)  # execute the SQL statement; movie supplies the values to insert
            conn.commit()            # do not forget to commit after the insert
            print('Import succeeded')
        except:
            print('Import failed')
    cur.close()   # close the cursor
    conn.close()  # close the connection


def main():
    start_url = 'http://maoyan.com/board/4'
    depth = 10  # crawl depth (number of pages)
    header = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
              "Accept-Encoding": "gzip, deflate, sdch",
              "Accept-Language": "zh-CN,zh;q=0.8",
              "Cache-Control": "max-age=0",
              "Connection": "keep-alive",
              "Host": "maoyan.com",
              "Referer": "http://maoyan.com/board",
              "Upgrade-Insecure-Requests": "1",
              "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36"}

    for i in range(depth):
        url = start_url + '?offset=' + str(10 * i)
        html = get_html(url, header)
        list_data = []
        get_data(html, list_data)
        write_sql(list_data)
        # print(list_data)


if __name__ == "__main__":
    main()
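The INSERT statement in write_sql() assumes a maoyan_movie table already exists in the test database. The original post does not show its schema, so the following DDL is an assumption reconstructed from the column names in the INSERT (the types and lengths are guesses, not from the source):

```sql
-- Hypothetical schema matching the INSERT in write_sql();
-- column types and lengths are assumptions.
CREATE TABLE IF NOT EXISTS maoyan_movie (
    id INT AUTO_INCREMENT PRIMARY KEY,
    ranking VARCHAR(10) NOT NULL,
    movie VARCHAR(100) NOT NULL,
    release_time VARCHAR(50),
    score VARCHAR(10)
) DEFAULT CHARSET = utf8;
```

The charset matches the charset='utf8' passed to pymysql.connect(), so Chinese movie titles round-trip correctly.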

Note that the headers must be added to the request; without them the crawl fails. Presumably the site restricts plain requests and recognizes that they come from a bot rather than a person. The comments in the code explain the details, so no further description is needed.
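One more detail worth isolating: main() pages through the board by bumping the offset query parameter, 10 movies per page. A minimal sketch of just the URL construction:

```python
start_url = 'http://maoyan.com/board/4'
depth = 10  # number of pages to crawl, as in main() above

# Maoyan's TOP100 board paginates via ?offset=0, 10, 20, ... 90,
# so 10 pages of 10 movies cover the whole list.
urls = [start_url + '?offset=' + str(10 * i) for i in range(depth)]

print(urls[0])   # http://maoyan.com/board/4?offset=0
print(urls[-1])  # http://maoyan.com/board/4?offset=90
```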
