[Repost] Crawling the Douban Movie Top 250 and extracting film genres for data analysis

Source: Internet
Author: User

First, crawl the web page and get the required content

Today we are going to crawl the Douban Movie Top 250.
[Screenshot of the Top 250 page]

What we need is the genre of each movie, and we can find where it sits by looking at the page source. Let's get straight to the point!

Now that we know what we need, we use Python's requests library to fetch the content of the web page. We then parse the page with the lxml library and extract the parts we need for the next steps.
Here is the code that uses requests and lxml:

import requests
from lxml import etree


def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)

    html = requests.get(url).content.decode('utf-8')  # get the page content using the requests library

    selector = etree.HTML(html)  # parse the content using the lxml library
    # Looking at the page, the content we need is in the <div class="info"> part
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)

    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        # print(str(i).split('/'))
        i = str(i).split('/')
        i = i[len(i) - 1]
        # strip and replace remove the surrounding spaces and empty lines
        key = i.strip().replace('\n', '').split(' ')
        print(key)
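The function above fetches a single page for a given start offset. As a minimal sketch of how the offsets for the whole list could be generated, the snippet below builds every page URL without making any network request. The page size of 25 and the helper names page_offsets and page_urls are assumptions for illustration; they do not appear in the original article.

```python
# URL template taken from the article; {} is the start offset.
BASE_URL = "https://movie.douban.com/top250?start={}&filter="


def page_offsets(total=250, page_size=25):
    """Return the start offsets 0, 25, 50, ... (page_size is an assumption)."""
    return list(range(0, total, page_size))


def page_urls():
    """Build one URL per page of the list; no request is made here."""
    return [BASE_URL.format(offset) for offset in page_offsets()]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each offset could then be passed to the page function, for example in a simple for loop over page_offsets().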

Each movie's information line is separated by '/', and we only need the genre part, so we use

i = str(i).split('/')

to split the line into several items. Because the genre is the last item, we use

i = i[len(i) - 1]

to take the last item after the split, which is the genre field we need. One more step remains: a movie usually has more than one genre label, and the labels are separated by spaces, so we split the field again with the following line:

key = i.strip().replace('\n', '').split(' ')
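The three steps above can be tried on a single line of text without touching the network. This is a small sketch: the sample line is made up in the "year / country / genres" shape the article describes, and extract_genres is a hypothetical helper name. It uses split() with no argument instead of split(' '), which also drops the empty strings that repeated spaces would otherwise produce.

```python
def extract_genres(info_line):
    """Return the genre labels from one 'year / country / genres' line."""
    # The genres are the last '/'-separated item on the line.
    last_field = info_line.split('/')[-1]
    # strip() removes surrounding whitespace and newlines;
    # split() then separates the labels on any run of whitespace.
    return last_field.strip().split()


# Made-up sample line, padded the way scraped text often is.
sample = "\n        1994 / USA / Crime Drama\n    "
print(extract_genres(sample))
```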

Second, save the results to the MySQL database

We save the movie genres in a MySQL database for data analysis. Here we use pymysql to connect to MySQL. First we need to create a table in the MySQL database.

We then save the data to the database via pymysql.
First, connect to the database:

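import pymysql

# Connect to the MySQL database
conn = pymysql.connect(host='localhost', user='root', passwd='2014081029', db='mysql', charset='utf8')
# user is the database user name and passwd is the database password; the character set
# is usually set to utf8, otherwise encoding problems are likely when saving to the database
cur = conn.cursor()  # get a cursor for executing statements
cur.execute('use douban')  # switch to the douban database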

Before saving to the database, there is one more thing to do: count how many of the 250 movies fall into each genre. We define a dictionary that holds the count for each genre. The code below is part of the get_page function:

for i in content[1::2]:
    print(str(i).strip().replace('\n\r', ''))
    # print(str(i).split('/'))
    i = str(i).split('/')
    i = i[len(i) - 1]
    key = i.strip().replace('\n', '').split(' ')
    print(key)
    for i in key:
        if i not in douban.keys():
            douban[i] = 1
        else:
            douban[i] += 1
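The counting step can also be sketched with collections.Counter from the standard library. This is an alternative to the hand-written dictionary, not the author's code; the sample data and the count_genres name are made up for illustration.

```python
from collections import Counter


def count_genres(genre_lists):
    """Count how often each genre label appears across all movies."""
    counts = Counter()
    for genres in genre_lists:
        # Skip empty strings, which split(' ') can produce from repeated spaces.
        counts.update(g for g in genres if g)
    return counts


# Made-up sample: one list of genre labels per movie.
sample = [["Crime", "Drama"], ["Drama", "Romance"], ["Drama", ""]]
douban = count_genres(sample)
print(douban.most_common())
```

A Counter behaves like a dictionary, so it can be passed to code that expects the douban dictionary.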

Then define a save function that performs the insert and rolls back if an insert fails. Remember to close the cursor and the database connection with cur.close() and conn.close() once the operation is complete. The code is as follows:

def save_mysql(douban):
    print(douban)  # the douban dictionary defined in the main function
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':
            try:
                # pass the values as parameters so pymysql handles the quoting
                sql = 'insert into douban (category, quantity) values (%s, %s);'
                cur.execute(sql, (key, str(douban[key])))
                conn.commit()
            except Exception:
                print('Insert Failed')
                conn.rollback()
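The insert, commit and rollback pattern can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module with an in-memory database purely so that it runs anywhere; it is not the pymysql code from the article. The table layout (id, category, quantity) is an assumption based on the column positions the article reads later, and save_counts is a hypothetical name. Note that sqlite3 uses '?' as the parameter placeholder, while pymysql uses '%s'.

```python
import sqlite3


def save_counts(conn, counts):
    """Insert each (category, quantity) pair; roll back a row if it fails."""
    cur = conn.cursor()
    try:
        for category, quantity in counts.items():
            if category == "":  # skip empty labels, as the article does
                continue
            try:
                # Values are passed as parameters, never concatenated into SQL.
                cur.execute(
                    "INSERT INTO douban (category, quantity) VALUES (?, ?)",
                    (category, quantity),
                )
                conn.commit()
            except sqlite3.Error:
                print("Insert failed")
                conn.rollback()
    finally:
        cur.close()  # always release the cursor


# In-memory stand-in for the MySQL database; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE douban ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, category TEXT, quantity INTEGER)"
)
save_counts(conn, {"Drama": 3, "Crime": 1, "": 2})
rows = conn.execute("SELECT category, quantity FROM douban ORDER BY id").fetchall()
print(rows)
conn.close()
```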

Third, use matplotlib to visualize the data

First, the genres and the count for each genre are read from the database into lists, and then matplotlib is used to draw the chart, as follows:

def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()  # read all the rows in the table
    count = []  # the number of movies in each category
    category = []  # the category names

    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])

    y_pos = np.arange(len(category))  # y-axis positions, one per category
    plt.barh(y_pos, count, align='center', alpha=0.4)  # alpha is the fill opacity (0~1)
    plt.yticks(y_pos, category)  # label each bar with its category name on the y-axis

    for c, y in zip(count, y_pos):
        # show each category's count as a number at the end of its bar
        plt.text(c, y, str(c), horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(+28.0, -1.0)  # visible range of the y-axis
    plt.title('Douban Movies')  # the title of the chart
    plt.ylabel('Film genre')  # label for the y-axis
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel('Number of occurrences')  # label for the x-axis
    plt.savefig('douban.png')  # save the picture
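The y-axis range in the plotting code is hard-coded, which only fits one particular number of genres. As a sketch of one way to derive it from the data, the helper below sorts the counts and computes the limits; bar_layout and plot_counts are hypothetical names and the sample counts are made up. matplotlib is imported inside plot_counts so that the layout helper can be run on its own.

```python
def bar_layout(counts):
    """Return (categories, values, ylim) for a horizontal bar chart, largest first."""
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    categories = [name for name, _ in items]
    values = [value for _, value in items]
    # Inverted limits (high, low) put the first category at the top of the chart.
    ylim = (len(categories) - 0.5, -0.5)
    return categories, values, ylim


def plot_counts(counts, path="douban.png"):
    """Draw the counts as a horizontal bar chart and save it to path."""
    # Imported here so bar_layout can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, suitable for saving files
    import matplotlib.pyplot as plt

    categories, values, ylim = bar_layout(counts)
    y_pos = list(range(len(categories)))
    fig, ax = plt.subplots()
    ax.barh(y_pos, values, align="center", alpha=0.4)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(categories)
    ax.set_ylim(*ylim)
    for y, value in zip(y_pos, values):
        ax.text(value, y, str(value), va="center")  # count at the end of each bar
    fig.savefig(path)
    plt.close(fig)


# Made-up sample counts; only the layout is computed here.
categories, values, ylim = bar_layout({"Crime": 1, "Drama": 3, "Romance": 2})
print(categories, values, ylim)
```

Calling plot_counts with the same dictionary would write the chart to douban.png in the current directory.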

Here are some simple uses of matplotlib. First we need to import the matplotlib and NumPy packages:

import numpy as np
import matplotlib.pyplot as plt

The chart here is a horizontal bar chart. The definition of the barh() function is given below:

barh()
Main function: draws a horizontal bar chart. Each bar is a rectangle spanning left to left + width horizontally and bottom to bottom + height vertically.
Parameters: barh(bottom, width, height=0.8, left=0, **kwargs)
Return type: matplotlib.patches.Rectangle instances
Note: this is the signature of the older matplotlib release the article was written against; current releases name the first parameter y and default align to 'center'.
Parameter description:

    • bottom: the vertical position of the bottom edge of the bars

    • width: the length of the bars
      Optional parameters:

    • height: the height of the bars

    • left: the x-axis coordinate of the left edge of the bars

    • color: the color of the bars

    • edgecolor: the edge color of the bars

    • linewidth: the width of the bar edges; None means the default width; 0 means no edge is drawn

    • xerr: if not None, horizontal error bars are drawn on the bar chart

    • yerr: if not None, vertical error bars are drawn on the bar chart

    • ecolor: the color of the error bars

    • capsize: the length of the error bar caps

    • align: 'edge' (default) | 'center'; 'edge' aligns the bars by their bottom edge, 'center' centers them on the given y coordinate

    • log: [False | True]; False (default); if True, a logarithmic axis is used

We can then display the resulting chart.

Original address: Crawling the Douban Movie Top 250 and extracting film genres for data analysis

