Python Final Project

Source: Internet
Author: User

Using Python to crawl Douban movie reviews and generate a word cloud

First, crawl the web data

The first step is to access the web page using Python's urllib library. The code is as follows:

from urllib import request

resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
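The snippet above assumes Douban serves the page to a plain urlopen call. If the request is rejected, a common workaround (not part of the original article) is to attach a browser-like User-Agent header; a minimal sketch, with the header value being an assumption:

from urllib import request

# hypothetical variant: send a browser-like User-Agent in case the default one is blocked
url = 'https://movie.douban.com/nowplaying/hangzhou/'
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # assumed header value
html_data = request.urlopen(req).read().decode('utf-8')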


In the second step, we need to parse the resulting HTML code to get the data we need.

Use the BeautifulSoup library for parsing HTML code in Python.


BeautifulSoup uses the following format:

BeautifulSoup(html,"html.parser")

The first parameter is the HTML from which the data will be extracted, and the second specifies the parser. The contents of the HTML tags are then read with find_all().

from bs4 import BeautifulSoup as bs

soup = bs(html_data, 'html.parser')
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

You can see the ID number of the movie in the data-subject attribute, and the name of the movie in the alt attribute of the img tag, so we'll get the movie ID and name from these two attributes. (Note: the movie ID is needed to open the review page, so it has to be parsed out.) Write the following code:

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)
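With the movie id in hand, the review pages can be requested. This step is not spelled out in the walkthrough, but the full source code at the end of the article does it in a getCommentsById function; condensed, that step looks like this:

def getCommentsById(movieId, pageNum):
    # each page of short comments holds 20 entries
    start = (pageNum - 1) * 20
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    eachCommentList = []
    for item in soup.find_all('div', class_='comment'):
        # keep only comments whose <p> tag actually contains text
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

Calling this for the first ten pages of the first movie produces the eachCommentList used in the next section.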

Second, data cleaning

To make the data easier to clean, we join the comments in the list into a single string. The code is as follows:

comments = ''
for k in range(len(eachCommentList)):
    comments = comments + (str(eachCommentList[k])).strip()
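The concatenated string still contains punctuation, Latin text, and stop words. In the full source code at the end of the article, cleaning continues with a regular expression that keeps only Chinese characters, jieba word segmentation, stop-word removal, and a word-frequency count; in outline (stopwords.txt is assumed to hold one stop word per line):

import re
import jieba
import numpy
import pandas as pd

# keep only runs of Chinese characters, which drops punctuation and Latin text
pattern = re.compile(r'[\u4e00-\u9fa5]+')
cleaned_comments = ''.join(re.findall(pattern, comments))

# segment the cleaned text with jieba and put the tokens in a DataFrame
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

# drop stop words listed in stopwords.txt
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3,
                        sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

# count how often each remaining word occurs, most frequent first
words_stat = (words_df.groupby(by=['segment'])['segment']
              .agg(count=numpy.size)
              .reset_index()
              .sort_values(by=['count'], ascending=False))

The resulting words_stat table is what the word cloud step below draws its frequencies from.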


Third, displaying the results with a word cloud

The code is as follows:

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)  # specify the font type, font size, and background color
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
word_frequence_list = []
for key in word_frequence:
    temp = (key, word_frequence[key])
    word_frequence_list.append(temp)
wordcloud = wordcloud.fit_words(word_frequence_list)
plt.imshow(wordcloud)
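One caveat on the last two lines: depending on the installed version of the wordcloud library, fit_words may expect a dictionary of word frequencies rather than a list of (word, count) tuples. The full source code at the end of this article passes the word_frequence dictionary directly, which is the form newer versions accept.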




Full source code:
# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings("ignore")
import jieba    # Chinese word segmentation package
import numpy    # numpy computation package
import codecs   # codecs provides an open method that specifies the file encoding and converts to unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from urllib import request
from bs4 import BeautifulSoup as bs
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

# parse the "now playing" page
def getNowplayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# crawl one page of comments for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

def main():
    # loop over the first 10 pages of comments for one movie
    commentList = []
    NowPlayingMovie_list = getNowplayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # convert the data in the list into a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # use a regular expression to remove punctuation (keep Chinese characters only)
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # use jieba for Chinese word segmentation
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # remove stop words
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3,
                            sep="\t", names=['stopword'], encoding='utf-8')  # quoting=3: no quoting
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # count word frequencies
    words_stat = words_df.groupby(by=['segment'])['segment'].agg(count=numpy.size)
    words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
    # print(words_stat.head())

    bg_pic = numpy.array(Image.open("alice_mask.png"))

    # display with a word cloud
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80,
                          width=2000, height=1800, mask=bg_pic, mode="RGBA")
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    # print(word_frequence)
    '''
    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
    # print(word_frequence_list)
    '''
    wordcloud = wordcloud.fit_words(word_frequence)
    image_colors = ImageColorGenerator(bg_pic)  # generate the word cloud colors from the mask picture
    plt.imshow(wordcloud)   # show the word cloud image
    plt.axis("off")
    plt.show()
    wordcloud.to_file('show_chinese.png')  # save the word cloud image

main()

  
