[Repost] Crawling the Douban Movie Top 250 and extracting movie categories for data analysis

Last Update:2016-07-24 Source: Internet

Author: User


First, crawl the web page and get the required content

Today we are going to crawl the Douban Movie Top 250.

What we need is the category of each movie. Looking at the page source shows where that information sits, so let's get straight to the point!

Now that we know what we need, let's use Python's requests library to fetch the content of the web page. After getting the content, we use the lxml library to parse the page and extract the content we need for the next steps.
First, here is the code that uses the requests library and lxml:

import requests
from lxml import etree

def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    html = requests.get(url).content.decode('utf-8')  # get the page content using the requests library
    selector = etree.HTML(html)  # parse the content using the lxml library
    # Looking at the page, the content we want is in the <div class="info"> part
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)
    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        # print(str(i).split('/'))
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')  # strip and replace remove spaces and empty lines
        print(key)
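The function above handles a single page, selected by the start offset i. A minimal sketch of how the offsets for the whole list could be generated, assuming the Top 250 is split into 10 pages of 25 movies each (the page size and the helper name page_urls are assumptions, not taken from the original code):

```python
BASE_URL = 'https://movie.douban.com/top250?start={}&filter='

def page_urls(total=250, per_page=25):
    """Return one URL per result page.

    per_page=25 is an assumption about how the site paginates the list.
    """
    return [BASE_URL.format(start) for start in range(0, total, per_page)]

# Print every page URL that would be passed to the crawler
for url in page_urls():
    print(url)
```

Each start value (0, 25, 50, ...) could then be passed to get_page in a loop.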

The fields for each movie are separated by '/'. We only need the movie category, so we use

i = str(i).split('/')

to split the text into several items. Because the category is the last item, we use

i = i[len(i) - 1]

to take the last item, which is the category field we need. One more step remains: a movie usually has more than one category label, and the labels are separated by spaces, so the following line splits the field into individual categories:

key = i.strip().replace('\n', '').split(' ')
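To make the three steps concrete, here is a small sketch that applies them to one line of text. The sample string is made up for illustration and only imitates the "year / country / categories" shape described above; the helper name extract_categories is also hypothetical.

```python
def extract_categories(info_line):
    """Split one info line on '/' and return the category labels
    found in the last field."""
    fields = str(info_line).split('/')
    last = fields[len(fields) - 1]  # the category field is the last item
    return last.strip().replace('\n', '').split(' ')

# Made-up sample line, including the stray whitespace a scraped page tends to have
sample = '\n    1994 / USA / Crime Drama\n   '
print(extract_categories(sample))
```

Note that a line without any category text yields a list containing one empty string, which is why empty keys need to be skipped later.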

Second, save the data to a MySQL database

We save the movie categories in a MySQL database for data analysis, using PyMySQL to connect to MySQL. First we need to create a table in the MySQL database.
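The original table definition is not reproduced here. The following is only a sketch of a table that would be compatible with the later code, which inserts (category, quantity) and reads the values back as row[1] and row[2]; the id column, the column types and the helper name create_table are assumptions.

```python
# Hypothetical schema inferred from the later code: the INSERT writes
# (category, quantity) and the SELECT reads them as row[1] and row[2],
# so an auto-increment id is assumed as the first column.
CREATE_TABLE_SQL = (
    'CREATE TABLE IF NOT EXISTS douban ('
    'id INT AUTO_INCREMENT PRIMARY KEY, '
    'category VARCHAR(32) NOT NULL, '
    'quantity INT NOT NULL'
    ') DEFAULT CHARSET=utf8'
)

def create_table(cur):
    """Run the CREATE TABLE statement on a DB-API cursor (e.g. a PyMySQL cursor)."""
    cur.execute(CREATE_TABLE_SQL)
```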

We then save the data to the database through PyMySQL.
First, connect to the database:

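import pymysql

# Connect to the MySQL database
conn = pymysql.connect(host='localhost', user='root', passwd='2014081029', db='mysql', charset='utf8')
# user is the database user name and passwd is that user's password; it is best to set
# the character set to utf8, otherwise storing data can easily run into encoding problems
cur = conn.cursor()  # get a cursor
cur.execute('use douban')  # switch to the douban database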

Before saving to the database, there is one more thing to do: count how many of the 250 movies fall into each category. We define a dictionary, douban, to hold the count for each category. The code below is part of the get_page function:

for i in content[1::2]:
    print(str(i).strip().replace('\n\r', ''))
    # print(str(i).split('/'))
    i = str(i).split('/')
    i = i[len(i) - 1]
    key = i.strip().replace('\n', '').split(' ')
    print(key)
    for i in key:
        if i not in douban.keys():
            douban[i] = 1
        else:
            douban[i] += 1
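The same tally can also be written with collections.Counter from the standard library. This is a minimal sketch rather than the author's code; the function name count_categories and the sample data are made up, and the input is assumed to be a list of key lists shaped like those produced in get_page.

```python
from collections import Counter

def count_categories(key_lists):
    """Count how often each category appears across all movies."""
    douban = Counter()
    for key in key_lists:
        # Skip empty strings left over from lines without category text
        douban.update(k for k in key if k != '')
    return douban

# Made-up sample: two movies plus one empty result
sample_keys = [['Crime', 'Drama'], ['Drama', 'Romance'], ['']]
print(count_categories(sample_keys))
```

A Counter behaves like a dictionary, so it can be passed on to code that expects the douban dictionary.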

Next, define a save function that performs the insert operations and rolls back if an insert fails. Remember to close the cursor and the connection with cur.close() and conn.close() after all operations are complete. The code is as follows:

def save_mysql(douban):
    print(douban)  # the douban dictionary defined in the main function
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':
            try:
                sql = 'insert douban(category, quantity) value(' + "\'" + key + "\'," + "\'" + str(douban[key]) + "\'" + ');'
                cur.execute(sql)
                conn.commit()
            except Exception:
                print('Insert Failed')
                conn.rollback()
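Building the SQL statement by concatenating strings breaks as soon as a category contains a quote character. A hedged alternative is to let the driver substitute the values; PyMySQL uses %s placeholders for this. The function name save_categories and the explicit cur and conn arguments below are illustrative and not part of the original code.

```python
def save_categories(cur, conn, douban):
    """Insert each (category, quantity) pair with a parameterized query.

    cur and conn are expected to be a PyMySQL cursor and connection.
    Returns the number of rows inserted successfully.
    """
    sql = 'INSERT INTO douban (category, quantity) VALUES (%s, %s)'
    inserted = 0
    for key, quantity in douban.items():
        if key == '':
            continue  # skip the empty category left over from blank lines
        try:
            cur.execute(sql, (key, quantity))  # the driver escapes the values
            conn.commit()
            inserted += 1
        except Exception:
            print('Insert failed')
            conn.rollback()
    return inserted
```

Passing the cursor and connection as arguments also makes the function easier to test than relying on module-level globals.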

Third, visualize the data with matplotlib

First, the movie categories and the count for each category are read from the database into lists, and then matplotlib is used to draw the chart, as follows:

def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()  # read all rows in the table
    count = []  # the number of movies in each category
    category = []  # the category names
    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])
    y_pos = np.arange(len(category))  # y-axis positions, one per category
    plt.barh(y_pos, count, align='center', alpha=0.4)  # alpha is the fill opacity (0~1)
    plt.yticks(y_pos, category)  # label each bar with its category name on the y-axis
    for c, y in zip(count, y_pos):
        # show each category's count as a number at the end of its bar
        plt.text(c, y, c, horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(+28.0, -1.0)  # visible range of the y-axis
    plt.title(u'Douban Movies')  # the title of the chart
    plt.ylabel(u'Movie category')  # label for the y-axis
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel(u'Number of occurrences')  # label for the x-axis
    plt.savefig('douban.png')  # save the picture
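The row-unpacking step can be checked separately from the plotting. The sketch below assumes rows shaped like (id, category, quantity), matching the row[1] and row[2] indexing above; the sorting step, the helper name rows_to_series and the sample rows are additions for illustration, not part of the original.

```python
def rows_to_series(rows):
    """Split (id, category, quantity) rows into two parallel lists,
    ordered from the largest count to the smallest."""
    ordered = sorted(rows, key=lambda row: int(row[2]), reverse=True)
    category = [row[1] for row in ordered]
    count = [int(row[2]) for row in ordered]
    return category, count

# Made-up sample rows standing in for cur.fetchall()
rows = [(1, 'Crime', '3'), (2, 'Drama', '7'), (3, 'Romance', '5')]
category, count = rows_to_series(rows)
print(category, count)
```

Sorting before plotting puts the most common category first, which usually makes a bar chart easier to read.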

This uses only a few simple matplotlib features. First, we need to import the matplotlib and NumPy packages:

import numpy as np
import matplotlib.pyplot as plt

The chart here is a horizontal bar chart. The definition of the barh() function is as follows:

barh()
Purpose: draw a horizontal bar chart; each bar is a rectangle bounded by left, left + width, bottom, bottom + height
Signature: barh(bottom, width, height=0.8, left=0, **kwargs)
Returns: a container of matplotlib.patches.Rectangle instances
Parameter description:

bottom: vertical position of the bottom edge of each bar
width: length of each bar
Optional parameters:
height: height of each bar
left: x coordinate of the left edge of each bar
color: bar color
edgecolor: bar edge color
linewidth: bar edge width; None means the default width; 0 means no edge is drawn
xerr: if not None, horizontal error bars are drawn on the bar chart
yerr: if not None, vertical error bars are drawn on the bar chart
ecolor: error bar color
capsize: length of the error bar caps
align: 'edge' (default) | 'center'; 'edge' aligns each bar by its bottom edge, 'center' centers it on the y position
log: [False | True]; False (default); if True, a log scale is used

This describes the matplotlib version current when the article was written. In newer releases the first parameter is named y and align defaults to 'center'.

After that, we can display the chart.

Original article: Crawling the Douban Movie Top 250 and extracting movie categories for data analysis

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
