Python Final Project

Source: Internet
Author: User

Using Python to crawl Douban movie reviews and generate a word cloud

First, crawl the web data

The first step is to access the web page using Python's urllib library. The code is as follows:

from urllib import request

resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
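The snippet above assumes Douban serves the page to a plain urlopen call. If the request is rejected, a common workaround (not part of the original article) is to attach a browser-like User-Agent header; a minimal sketch, with the header value being an assumption:

from urllib import request

# hypothetical variant: send a browser-like User-Agent in case the default one is blocked
url = 'https://movie.douban.com/nowplaying/hangzhou/'
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # assumed header value
html_data = request.urlopen(req).read().decode('utf-8')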


In the second step, we need to parse the resulting HTML code to get the data we need.

Use the BeautifulSoup library for parsing HTML code in Python.


BeautifulSoup uses the following format:

BeautifulSoup(html,"html.parser")

The first parameter is the HTML from which the data will be extracted, and the second specifies the parser. The contents of the HTML tags are then read with find_all().

from bs4 import BeautifulSoup as bs

soup = bs(html_data, 'html.parser')
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

You can see the ID number of the movie in the data-subject attribute, and the name of the movie in the alt attribute of the img tag, so we'll get the movie ID and name from these two attributes. (Note: the movie ID is needed to open the review page, so it has to be parsed out.) Write the following code:

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)
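With the movie id in hand, the review pages can be requested. This step is not spelled out in the walkthrough, but the full source code at the end of the article does it in a getCommentsById function; condensed, that step looks like this:

def getCommentsById(movieId, pageNum):
    # each page of short comments holds 20 entries
    start = (pageNum - 1) * 20
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    eachCommentList = []
    for item in soup.find_all('div', class_='comment'):
        # keep only comments whose <p> tag actually contains text
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

Calling this for the first ten pages of the first movie produces the eachCommentList used in the next section.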

Second, data cleaning

To make the data easier to clean, we join the comments in the list into a single string. The code is as follows:

comments = ''
for k in range(len(eachCommentList)):
    comments = comments + (str(eachCommentList[k])).strip()
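The concatenated string still contains punctuation, Latin text, and stop words. In the full source code at the end of the article, cleaning continues with a regular expression that keeps only Chinese characters, jieba word segmentation, stop-word removal, and a word-frequency count; in outline (stopwords.txt is assumed to hold one stop word per line):

import re
import jieba
import numpy
import pandas as pd

# keep only runs of Chinese characters, which drops punctuation and Latin text
pattern = re.compile(r'[\u4e00-\u9fa5]+')
cleaned_comments = ''.join(re.findall(pattern, comments))

# segment the cleaned text with jieba and put the tokens in a DataFrame
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

# drop stop words listed in stopwords.txt
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3,
                        sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

# count how often each remaining word occurs, most frequent first
words_stat = (words_df.groupby(by=['segment'])['segment']
              .agg(count=numpy.size)
              .reset_index()
              .sort_values(by=['count'], ascending=False))

The resulting words_stat table is what the word cloud step below draws its frequencies from.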


Third, displaying the results with a word cloud

The code is as follows:

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)  # specify the font type, font size, and background color
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
word_frequence_list = []
for key in word_frequence:
    temp = (key, word_frequence[key])
    word_frequence_list.append(temp)
wordcloud = wordcloud.fit_words(word_frequence_list)
plt.imshow(wordcloud)
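One caveat on the last two lines: depending on the installed version of the wordcloud library, fit_words may expect a dictionary of word frequencies rather than a list of (word, count) tuples. The full source code at the end of this article passes the word_frequence dictionary directly, which is the form newer versions accept.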




Full source code:
# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings("ignore")
import jieba    # Chinese word segmentation package
import numpy    # numpy computation package
import codecs   # codecs provides an open method that specifies the file encoding and converts to unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from urllib import request
from bs4 import BeautifulSoup as bs
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

# parse the "now playing" page
def getNowplayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# crawl one page of comments for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

def main():
    # loop over the first 10 pages of comments for one movie
    commentList = []
    NowPlayingMovie_list = getNowplayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # convert the data in the list into a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # use a regular expression to remove punctuation (keep Chinese characters only)
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # use jieba for Chinese word segmentation
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # remove stop words
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3,
                            sep="\t", names=['stopword'], encoding='utf-8')  # quoting=3: no quoting
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # count word frequencies
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(count=numpy.size)
    words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
    # print(words_stat.head())

    bg_pic = numpy.array(Image.open("alice_mask.png"))

    # display with a word cloud
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80,
                          width=2000, height=1800, mask=bg_pic, mode="RGBA")
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    # print(word_frequence)
    '''
    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
    # print(word_frequence_list)
    '''
    wordcloud = wordcloud.fit_words(word_frequence)
    image_colors = ImageColorGenerator(bg_pic)  # generate the word cloud colors from the mask picture
    plt.imshow(wordcloud)   # show the word cloud image
    plt.axis("off")
    plt.show()
    wordcloud.to_file('show_chinese.png')  # save the word cloud image

main()

  
