First, crawl the web page and get the required content
Today we are going to crawl the Douban Movie Top 250.
The page looks like this:
What we need is the movie category of each entry, and we can find where it sits by looking at the page source. Let's get straight to the point!
Now that we know what we need, we use Python's requests library to fetch the content of the web page. After that, we use the lxml library to parse the page and extract the text for the next steps.
Here is the code that uses the requests library and lxml:
import requests
from lxml import etree

def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)

    html = requests.get(url).content.decode('utf-8')  # get the page content using the requests library

    selector = etree.HTML(html)  # parse the content using the lxml library
    # Looking at the page source, the content is in the <div class="info"> part
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)

    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        # print(str(i).split('/'))
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')  # strip and replace remove spaces and empty lines
        print(key)
The description of each movie is separated by '/'. We only need the movie category, so we use
i = str(i).split('/')
to split the text into several items. Because the category is the last item, we use
i = i[len(i) - 1]
to take the last item after the split, which is the category text we need. One step remains: a movie usually has more than one category label, and the labels are separated by spaces, so the following line splits them into individual categories:
key = i.strip().replace('\n', '').split(' ')
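As a quick check of these three steps, here is a minimal sketch that runs them on a single made-up line of text. The sample string and the helper name extract_categories are assumptions for illustration only; the text on the real page may contain different whitespace.

```python
def extract_categories(line):
    """Return the category labels from one 'year / country / categories' line."""
    parts = str(line).split('/')      # separate the items of the description
    last = parts[len(parts) - 1]      # the category item comes last
    # strip() and replace() remove the surrounding spaces and line breaks
    return last.strip().replace('\n', '').split(' ')


# Hypothetical sample in the same shape as the text returned by the XPath query
sample = '\n        1994 / USA / Crime Drama\n    '
categories = extract_categories(sample)
print(categories)
```

Wrapping the steps in a small function makes it easy to try them on a few strings before crawling all the pages.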
Second, save the data to a MySQL database
We save the movie categories in a MySQL database for later analysis. Here we use pymysql to connect to MySQL. First we need to create a table in the MySQL database:
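The table definition itself is not reproduced here. Judging from the later code, which inserts into the columns category and quantity and reads them back as row[1] and row[2], a table along the following lines would fit. The id column, the column types and the helper create_table are assumptions, not the author's original schema.

```python
# Assumed schema: the column names come from the insert statement used later;
# the auto-increment id and the column types are guesses.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS douban (
    id INT AUTO_INCREMENT PRIMARY KEY,
    category VARCHAR(50) NOT NULL,
    quantity INT NOT NULL
) DEFAULT CHARSET = utf8
"""


def create_table(cur):
    """Create the table through any DB-API cursor, e.g. one from pymysql."""
    cur.execute(CREATE_TABLE_SQL)
```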
We then save the data to the database via pymysql.
First, connect to the database:
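import pymysql

# Connect to the MySQL database
# user is the database user name and passwd is the database password; set the
# character set to utf8, otherwise encoding problems are likely when saving data
conn = pymysql.connect(host='localhost', user='root', passwd='2014081029', db='mysql', charset='utf8')
cur = conn.cursor()  # get a cursor for executing statements
cur.execute('use douban')  # switch to the douban database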
Before saving to the database, we have one more thing to do: count how many of the 250 movies fall into each category. We define a dictionary, douban, to hold the count for each category. The code below is part of the get_page function:
for i in content[1::2]:
    print(str(i).strip().replace('\n\r', ''))
    # print(str(i).split('/'))
    i = str(i).split('/')
    i = i[len(i) - 1]
    key = i.strip().replace('\n', '').split(' ')
    print(key)
    for i in key:
        if i not in douban.keys():
            douban[i] = 1
        else:
            douban[i] += 1
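The same tally can also be written with collections.Counter from the standard library. This is an alternative sketch, not the code used in the article; the list parsed_keys is made-up sample data standing in for the key lists produced in the loop.

```python
from collections import Counter

# Hypothetical key lists, one per movie, as produced by the split(' ') step
parsed_keys = [['Crime', 'Drama'], ['Drama', 'Romance'], ['Drama']]

douban = Counter()
for key in parsed_keys:
    douban.update(key)  # add 1 for every category in this movie's list

print(dict(douban))
```

A Counter behaves like a dictionary, so it can be passed to code that expects the plain douban dictionary.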
Then we define a save function that performs the insert and rolls back if an insert fails. Remember to close the cursor and the connection with cur.close() and conn.close() once all operations are complete. The code is as follows:
def save_mysql(douban):
    print(douban)  # the douban dictionary defined in the main function
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':
            try:
                sql = "insert into douban (category, quantity) values ('" + key + "', '" + str(douban[key]) + "');"
                cur.execute(sql)
                conn.commit()
            except Exception:
                print('Insert failed')
                conn.rollback()
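Building the SQL string by concatenation breaks as soon as a category contains a quote character. A hedged alternative is to let the driver fill in the values through %s placeholders, which pymysql supports in execute() and executemany(). The names INSERT_SQL, build_rows and save_mysql_params below are illustrative, and conn is assumed to be an open pymysql connection.

```python
INSERT_SQL = 'INSERT INTO douban (category, quantity) VALUES (%s, %s)'


def build_rows(douban):
    """Turn the category dictionary into (category, quantity) tuples, skipping empty keys."""
    return [(key, count) for key, count in douban.items() if key != '']


def save_mysql_params(conn, douban):
    """Insert all rows in one transaction; roll back if any insert fails."""
    rows = build_rows(douban)
    cur = conn.cursor()
    try:
        cur.executemany(INSERT_SQL, rows)  # the driver escapes each value
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Committing once after all rows means a failure leaves the table unchanged rather than half filled.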
Third, use matplotlib for data visualization
First, we read the movie categories and the count of each category from the database into two lists, and then use matplotlib to plot them, as follows:
def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()  # read all the rows in the table
    count = []  # the number of movies in each category
    category = []  # the category names
    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])
    y_pos = np.arange(len(category))  # one y-axis position per category
    plt.barh(y_pos, count, align='center', alpha=0.4)  # alpha is the fill opacity of the bars (0~1)
    plt.yticks(y_pos, category)  # label each position on the y-axis with its category name
    for c, y in zip(count, y_pos):
        # show the count as a number at the end of each bar
        plt.text(c, y, c, horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(+28.0, -1.0)  # visible range of the y-axis
    plt.title(u'Douban Movies')  # the title of the chart
    plt.ylabel(u'Movie category')  # label for the y-axis
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel(u'Number of occurrences')  # label for the x-axis
    plt.savefig('douban.png')  # save the picture
Here are some simple uses of matplotlib. First we import the matplotlib and NumPy packages:
import numpy as np
import matplotlib.pyplot as plt
This visualization is a horizontal bar chart. The barh() function is described below:
barh()
Main function: draws a horizontal bar chart. Each bar is a rectangle bounded by left, left + width, bottom, bottom + height
Signature: barh(bottom, width, height=0.8, left=0, **kwargs) (in newer matplotlib releases the first parameter is named y and align defaults to 'center')
Return type: matplotlib.patches.Rectangle instances, one per bar
Parameter description:
bottom: the vertical position of the bottom edge of each bar
width: the length of each bar
Optional parameters:
height: the height of each bar
left: the x-axis coordinate of the left edge of each bar
color: the color of the bars
edgecolor: the color of the bar edges
linewidth: the width of the bar edges; None means the default width; 0 means no edge is drawn
xerr: if not None, horizontal error bars are drawn on the bar chart
yerr: if not None, vertical error bars are drawn on the bar chart
ecolor: the color of the error bars
capsize: the length of the caps at the ends of the error bars
align: 'edge' (default) | 'center'; 'edge' aligns each bar by its bottom edge, 'center' centers it on its y-axis position
log: [False | True]; False by default, if True a logarithmic axis is used
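To see a few of these parameters in action without a database, here is a minimal sketch that draws three bars from made-up data. The category names and counts are invented for illustration, and the Agg backend is selected only so that the script runs without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data standing in for the rows read from the database
category = ['Drama', 'Crime', 'Romance']
count = [3, 1, 2]

y_pos = np.arange(len(category))  # vertical position of each bar
fig, ax = plt.subplots()
bars = ax.barh(y_pos, count, height=0.8, align='center', alpha=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(category)
ax.invert_yaxis()  # put the first category at the top

buffer = io.BytesIO()
fig.savefig(buffer, format='png')  # write the PNG into memory instead of a file
```

Passing a file name such as 'douban.png' to savefig instead of the buffer writes the picture to disk, as the article's function does.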
And then we can show the picture.
Original article: Crawling the Douban Movie Top 250 and extracting movie categories for data analysis