"A show" about human nature, using Python to grab the cat's eye nearly 100,000 comments and analysis, together reveal "this play" how exactly?


"A Good Show" (《一出好戏》), Huang Bo's directorial debut, was released nationwide on August 10 and has now been in cinemas for 10 days. It has a strong cast, and I suspect many viewers came for the stars.
So far the film has nearly 600,000 ratings on Maoyan with a score of 8.2, and its box office has passed 1 billion yuan.

The author (Tang) also went to the cinema today to see the film in person. I came out a little disappointed: I had expected a comedy, but there turned out to be few laughs. Judged purely as a comedy, it is not as funny as "Hello Mr. Billionaire" (《西虹市首富》). The film is really a reflection on human nature and should not be watched as a comedy; the relationships between its characters are worth pondering.

Today, join Teacher Tang to find out how good "A Good Show" really is.

We'll use Python to scrape nearly 100,000 reviews from Maoyan and analyze the collected data to see what audiences have to say about the film.

The entire data analysis process is divided into four steps:

    1. Get the data
    2. Process the data
    3. Store the data
    4. Visualize the data

I. Getting the data

1. Overview

The review data comes from the Maoyan app.

Analysis shows that the Maoyan app's review data endpoint is:

http://m.maoyan.com/mmdb/comments/movie/1200486.json?_v_=yes&offset=0&startTime=2018-07-28%2022%3A25%3A03

Inspecting the review data yields the following:

    • The response is JSON

    • 1200486 is the movie's unique id; offset is the paging offset; startTime is the time to start from: comments posted up to that time are returned, newest first

    • cmts holds the comments, 15 per request; offset is the starting index of each request, and 15 comments are returned from that index onward

    • hcmts holds the top 10 popular comments

    • total is the total number of comments
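The URL parameters described above can be assembled programmatically instead of by hand. The following is a minimal sketch, assuming the endpoint keeps the shape shown in this article; `build_comment_url` is a hypothetical helper name, not part of any Maoyan API.

```python
from urllib.parse import quote, urlencode

# Endpoint template as it appears in this article; the movie id is a path segment.
BASE = "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"


def build_comment_url(movie_id, start_time, offset=0):
    """Build the URL for one page of comments (hypothetical helper)."""
    query = urlencode(
        {"_v_": "yes", "offset": offset, "startTime": start_time},
        quote_via=quote,  # encode the space as %20 rather than '+'
    )
    return BASE.format(movie_id=movie_id) + "?" + query


url = build_comment_url(1200486, "2018-07-28 22:25:03")
print(url)
```

Letting `urlencode` handle the escaping avoids hand-writing sequences such as `%20` and `%3A` in the timestamp.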
2. Code implementation

First define a function that fetches data from a given URL. Each call can only retrieve the 15 comments preceding the specified time.

# coding=utf-8
__author__ = '汤小洋'

from urllib import request
import json
import time
from datetime import datetime
from datetime import timedelta


# Fetch data from the given URL
def get_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    }
    req = request.Request(url, headers=headers)
    response = request.urlopen(req)
    if response.getcode() == 200:
        return response.read()
    return None


if __name__ == '__main__':
    html = get_data('http://m.maoyan.com/mmdb/comments/movie/1200486.json?_v_=yes&offset=0&startTime=2018-07-28%2022%3A25%3A03')
    print(html)
II. Processing the data

Parse the fetched JSON string and keep only the fields we need.

# Process the data
def parse_data(html):
    data = json.loads(html)['cmts']  # parse the JSON string into Python objects
    comments = []
    for item in data:
        comment = {
            'id': item['id'],
            'nickName': item['nickName'],
            'cityName': item['cityName'] if 'cityName' in item else '',  # handle a missing cityName
            'content': item['content'].replace('\n', ' ', 10),  # handle line breaks inside the comment text
            'score': item['score'],
            'startTime': item['startTime']
        }
        comments.append(comment)
    return comments


if __name__ == '__main__':
    html = get_data('http://m.maoyan.com/mmdb/comments/movie/1200486.json?_v_=yes&offset=0&startTime=2018-07-28%2022%3A25%3A03')
    comments = parse_data(html)
    print(comments)
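To try the parsing step without touching the network, it can be run against a small hand-written payload. The sketch below assumes the response has the `cmts` layout described earlier; the sample records are made up for illustration and are not real Maoyan data, and `parse_comments` is a hypothetical variant of the function above.

```python
import json

# Made-up payload that mimics the fields described in this article; not real data.
SAMPLE = json.dumps({
    "total": 2,
    "cmts": [
        {"id": 1, "nickName": "user_a", "cityName": "北京",
         "content": "line one\nline two", "score": 4.5,
         "startTime": "2018-08-19 10:00:00"},
        # This record has no cityName, which the parser must tolerate.
        {"id": 2, "nickName": "user_b",
         "content": "ok", "score": 5,
         "startTime": "2018-08-19 09:59:58"},
    ],
})


def parse_comments(raw):
    """Return a list of flat comment dicts from a JSON string or bytes."""
    comments = []
    for item in json.loads(raw)["cmts"]:
        comments.append({
            "id": item["id"],
            "nickName": item["nickName"],
            "cityName": item.get("cityName", ""),  # default when the key is absent
            "content": item["content"].replace("\n", " "),  # flatten line breaks
            "score": item["score"],
            "startTime": item["startTime"],
        })
    return comments


parsed = parse_comments(SAMPLE)
print(parsed)
```

Using `dict.get` with a default is an equivalent, slightly shorter way to handle the missing `cityName` case.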
III. Storing the data

To collect all the comments, start from the current time and work backward: fetch 15 comments per request, take the time of the last comment returned, and continue from that time. Repeat until the film's release date (2018-08-10) is reached, which yields every comment in between.

# Store the data in a text file
def save_to_txt():
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # start from the current time and work backward
    end_time = '2018-08-10 00:00:00'
    while start_time > end_time:
        url = 'http://m.maoyan.com/mmdb/comments/movie/1203084.json?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20')
        html = None
        '''
            Problem: when requests are too frequent, the server refuses the connection; this is the server's anti-crawler policy
            Solution: 1. Wait 0.1 seconds between requests to reduce the chance of being refused
                      2. If a request is refused, retry after 0.5 seconds
        '''
        try:
            html = get_data(url)
        except Exception as e:
            time.sleep(0.5)
            html = get_data(url)
        else:
            time.sleep(0.1)

        comments = parse_data(html)
        print(comments)
        if not comments:  # nothing older is left to fetch
            break
        start_time = comments[-1]['startTime']  # time of the last comment on this page
        start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') + timedelta(seconds=-1)  # convert to datetime and subtract 1 second to avoid duplicates
        start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')  # convert back to str

        with open('comments.txt', 'a', encoding='utf-8') as f:
            for item in comments:
                f.write(str(item['id']) + ',' + item['nickName'] + ',' + item['cityName'] + ',' + item['content'] + ',' + str(item['score']) + ',' + item['startTime'] + '\n')


if __name__ == '__main__':
    # html = get_data('http://m.maoyan.com/mmdb/comments/movie/1200486.json?_v_=yes&offset=0&startTime=2018-07-28%2022%3A25%3A03')
    # comments = parse_data(html)
    # print(comments)
    save_to_txt()
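The backward-paging idea can be checked offline by separating the loop from the network call. This is a sketch under stated assumptions: `crawl_backwards` and `fake_fetch` are hypothetical names, and the in-memory `FAKE` list stands in for the real endpoint, assumed to return comments newest first.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def crawl_backwards(fetch_page, start_time, end_time):
    """Collect comments from start_time back to end_time (both FMT strings).

    fetch_page(start_time) is supplied by the caller and returns a list of
    comment dicts, newest first, each with a 'startTime' field.
    """
    collected = []
    while start_time > end_time:
        page = fetch_page(start_time)
        if not page:  # nothing older is available
            break
        collected.extend(c for c in page if c["startTime"] >= end_time)
        # Continue one second before the oldest comment on this page.
        oldest = datetime.strptime(page[-1]["startTime"], FMT)
        start_time = (oldest - timedelta(seconds=1)).strftime(FMT)
    return collected


# Offline stand-in for the endpoint: 40 fake comments, one minute apart.
_base = datetime(2018, 8, 19, 12, 0, 0)
FAKE = [{"id": i, "startTime": (_base - timedelta(minutes=i)).strftime(FMT)}
        for i in range(40)]


def fake_fetch(start_time, page_size=15):
    """Return up to page_size fake comments posted at or before start_time."""
    older = [c for c in FAKE if c["startTime"] <= start_time]
    return older[:page_size]


result = crawl_backwards(fake_fetch, "2018-08-19 12:00:00", "2018-08-19 11:40:00")
print(len(result))
```

Comparing the timestamps as strings works here because the format is fixed-width and ordered from year down to second.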

There are two points to note:

    1. Servers generally have anti-crawler policies; when requests are too frequent, the server rejects some connections. I work around this by adding a delay between requests. It is only a simple workaround, so please bear with me
    2. The time needed depends on the amount of data. I crawled the reviews posted between 2018-8-19 and 2018-8-10 (release day), about 92,000 in total, which took roughly 2 hours
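The delay-and-retry approach from the first point can be wrapped in a small helper. This is only a sketch: `fetch_with_retry` is a hypothetical name, the growing wait between attempts is my own variation on the fixed 0.5 seconds used in this article, and the `sleep` parameter exists so the example can run without actually waiting.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5, pause=0.1, sleep=time.sleep):
    """Call fetch(url); on an exception wait and try again, up to `retries` times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            result = fetch(url)
        except Exception as exc:
            last_error = exc
            sleep(delay * (attempt + 1))  # wait a little longer after each failure
        else:
            sleep(pause)  # small gap after a successful request
            return result
    raise last_error


# Demonstration with a fake fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refused")
    return b"ok"


waits = []
data = fetch_with_retry(flaky, "http://example.invalid/", sleep=waits.append)
print(data, waits)
```

In real use the first argument would be a function such as `get_data`, and `sleep` would be left at its default.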

IV. Visualizing the data

The charts here are built with pyecharts, a class library for generating ECharts charts that makes it easy to produce visualizations from data in Python.

ECharts is an open-source JavaScript data visualization library from Baidu.

Reference: http://pyecharts.org/

Since v0.3.2, pyecharts no longer bundles map JS files. To use map charts, install the corresponding map packages.

# Install the map packages
pip install echarts-china-provinces-pypkg  # China province, city, county and district maps
pip install echarts-china-cities-pypkg
pip install echarts-china-counties-pypkg
pip install echarts-china-misc-pypkg
pip install echarts-countries-pypkg  # world country maps
pip install echarts-united-kingdom-pypkg
1. Fan location distribution

Code implementation:

# Import the Style class, used to define the chart style
from pyecharts import Style
# Import the Geo component, used to generate geographic coordinate charts
from pyecharts import Geo
import json
# Import the Bar component, used to generate bar charts
from pyecharts import Bar
# Import the Counter class, used to count how often each value occurs
from collections import Counter


# Data visualization
def render():
    # Collect the city of every comment
    cities = []
    with open('comments.txt', mode='r', encoding='utf-8') as f:
        rows = f.readlines()
        for row in rows:
            city = row.split(',')[2]
            if city != '':  # skip empty city names
                cities.append(city)

    # Reconcile the city names in the data with those in the coordinate file
    handle(cities)

    # Count the occurrences of each city
    # data = []
    # for city in set(cities):
    #     data.append((city, cities.count(city)))
    data = Counter(cities).most_common()  # count with Counter and convert to a list of tuples
    # print(data)

    # Define the style
    style = Style(
        title_color='#fff',
        title_pos='center',
        width=1200,
        height=600,
        background_color='#404a59'
    )

    # Generate the geographic coordinate chart from the city data
    geo = Geo('"A Good Show" fan location distribution', 'Data source: Maoyan, collected by Tang Xiaoyang', **style.init_style)
    attr, value = geo.cast(data)
    geo.add('', attr, value, visual_range=[0, 3500],
            visual_text_color='#fff', symbol_size=15,
            is_visualmap=True, is_piecewise=True, visual_split_number=10)
    geo.render('fan-location-geo-map.html')

    # Generate the bar chart from the city data
    data_top20 = Counter(cities).most_common(20)  # the 20 most frequent cities
    bar = Bar('"A Good Show" fan source ranking TOP20', 'Data source: Maoyan, collected by Tang Xiaoyang', title_pos='center', width=1200, height=60)
    attr, value = bar.cast(data_top20)
    bar.add('', attr, value, is_visualmap=True, visual_range=[0, 3500], visual_text_color='#fff',
            is_more_utils=True, is_label_show=True)
    bar.render('fan-source-ranking-bar-chart.html')

Problems encountered:

    • Error: ValueError: No coordinate is specified for xxx (place name)

    • Cause: the place name is missing from pyecharts' coordinate file. In practice this comes from inconsistent naming; for example, the data contains "达州" (Dazhou) while the coordinate file contains "达州市" (Dazhou City).

      Path of the coordinate file: <project>/venv/lib/python3.6/site-packages/pyecharts/datasets/city_coordinates.json

    • Fix: edit the coordinate file: duplicate the existing entry in place, then change the place name on the copy
{
  "达州市": [107.5, 31.22],
  "达州": [107.5, 31.22]
}

However, too many place names need fixing for the manual approach to be practical, so I defined a function that handles place names missing from the coordinate file.

import json

# Path of the pyecharts coordinate file on the author's machine
COORD_FILE = '/Users/wangbo/PycharmProjects/python-spider/venv/lib/python3.6/site-packages/pyecharts/datasets/city_coordinates.json'


# Process the place-name data so that names missing from the coordinate file are resolved
def handle(cities):
    # print(len(cities), len(set(cities)))

    # Load all place names from the coordinate file
    with open(COORD_FILE, mode='r', encoding='utf-8') as f:
        data = json.loads(f.read())  # parse the JSON string into a dict

    data_new = data.copy()  # copy all place-name data
    for city in set(cities):  # use set to deduplicate
        # Drop empty place names
        if city == '':
            while city in cities:
                cities.remove(city)
            continue
        for k in data.keys():
            if k == city:
                break
            if k.startswith(city):  # abbreviated names, e.g. 达州市 shortened to 达州
                # print(k, city)
                data_new[city] = data[k]
                break
            if k.startswith(city[0:-1]) and len(city) >= 3:  # administrative renames, e.g. a county that became a district or a city
                data_new[city] = data[k]
                break
        else:
            # No key matched: drop place names that do not exist in the file
            while city in cities:
                cities.remove(city)

    # print(len(data), len(data_new))

    # Overwrite the coordinate file
    with open(COORD_FILE, mode='w', encoding='utf-8') as f:
        f.write(json.dumps(data_new, ensure_ascii=False))  # serialize the dict back to a JSON string
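The matching rules used above (exact name, abbreviated name, changed suffix) can also be expressed as a pure lookup that leaves the installed coordinate file untouched. This is an alternative sketch, not the author's method: `resolve_city` is a hypothetical helper, and apart from the "达州市" entry quoted earlier, the coordinates below are invented placeholders.

```python
def resolve_city(name, coords):
    """Return the key in `coords` that best matches `name`, or None."""
    if not name:
        return None
    if name in coords:
        return name
    # Abbreviated name: "达州" should match "达州市".
    for key in coords:
        if key.startswith(name):
            return key
    # Changed suffix (e.g. a county that became a district): compare without the last character.
    if len(name) >= 3:
        stem = name[:-1]
        for key in coords:
            if key.startswith(stem):
                return key
    return None


# Tiny illustrative excerpt; "示例县" and its coordinates are placeholders.
coords = {"达州市": [107.5, 31.22], "示例县": [100.0, 30.0]}

for raw in ["达州", "示例区", "不存在", ""]:
    print(repr(raw), "->", resolve_city(raw, coords))
```

City names from the comments could then be mapped through `resolve_city` before counting, with unmatched names (`None`) skipped.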

Visualization results:

Fans are concentrated in the coastal regions

As the map shows, the audience of "A Good Show" is concentrated mainly in coastal regions. These areas are economically more developed, and their cities have large populations, many screens and seats, and dense screening schedules. That makes the film easy to see, so audiences there review it actively and naturally become the main contributors to the box office.

The top 20 cities by number of fans are: Beijing, Shenzhen, Shanghai, Chengdu, Wuhan, Guangzhou, Xi'an, Zhengzhou, Chongqing, Nanjing, Tianjin, Shenyang, Changsha, Dongguan, Harbin, Qingdao, Hangzhou, Hefei, Dalian, Suzhou

Movie spending is part of urban consumption, and to some extent it can serve as an indicator of a city's purchasing power. Most of these cities have ranked near the top in GDP in recent years and have relatively high consumption levels.

2. Word Cloud

jieba is a Python-based word segmentation library with solid support for Chinese word segmentation

pip install jieba

Matplotlib is a Python 2D plotting library that produces high-quality figures and can quickly generate line plots, histograms, power spectra, bar charts, error charts, scatter plots, and more

pip install matplotlib

wordcloud is a Python-based library for generating word clouds

pip install wordcloud

Code implementation:

# coding=utf-8
__author__ = "汤小洋"

# Import the jieba module for Chinese word segmentation
import jieba
# Import matplotlib for generating 2D graphics
import matplotlib.pyplot as plt
# Import wordcloud for generating the word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Collect all comments
comments = []
with open('comments.txt', mode='r', encoding='utf-8') as f:
    rows = f.readlines()
    for row in rows:
        comment = row.split(',')[3]
        if comment != '':
            comments.append(comment)

# Segment the text
comment_after_split = jieba.cut(' '.join(comments), cut_all=False)  # accurate (non-full) mode, cut_all=False
words = ' '.join(comment_after_split)  # join the tokens with spaces
# print(words)

# Set the stop words (Chinese, with English glosses)
stopwords = STOPWORDS.copy()
stopwords.add("电影")  # movie
stopwords.add("一部")  # a (film)
stopwords.add("一个")  # one
stopwords.add("没有")  # no
stopwords.add("什么")  # what
stopwords.add("有点")  # somewhat
stopwords.add("这部")  # this (film)
stopwords.add("这个")  # this
stopwords.add("不是")  # not
stopwords.add("真的")  # really
stopwords.add("感觉")  # feeling
stopwords.add("觉得")  # feel
stopwords.add("还是")  # or / still
stopwords.add("但是")  # but
stopwords.add("就是")  # yes / exactly

# Load the background image
bg_image = plt.imread('bg.jpg')

# Word cloud parameters: canvas width and height, background color, mask image (shape), font, stop words, maximum font size
wc = WordCloud(width=1024, height=768, background_color='white', mask=bg_image, font_path='STKAITI.TTF',
               stopwords=stopwords, max_font_size=400, random_state=50)
# Feed the segmented text into the word cloud
wc.generate_from_text(words)
plt.imshow(wc)
plt.axis('off')  # hide the axes
plt.show()
# Save the result locally
wc.to_file('word_cloud.jpg')

Visualization results:

The overall evaluation is very good

After segmenting the review data, I generated the word cloud.

From the word cloud, you can see:

    • Frequent words such as "decent", "good-looking" and "good" show that audiences rate "A Good Show" highly overall
    • Viewers also gave a lot of recognition to Zhang Yixing's "acting". After watching the film today I feel the same: it shows us a different Zhang Yixing, an actor of real ability
    • As for "Huang Bo" making such a film on his first outing as "director", fans are quite approving, and his name is also a box-office guarantee
    • As for the plot, words such as "reality", "comedy", "funny" and "story" show that this is a feature film reflecting reality that is also a funny comedy
    • As for words such as "so-so" and "disappointed" in the reviews, these fans may be like me: they expected a hilarious comedy with plenty of laughs (after all, in our minds Huang Bo, Wang Baoqiang and the like are comedians) but found far fewer than expected, and that gap led to the letdown

3. Rating Stars

Code implementation:

# coding=utf-8
__author__ = "汤小洋"

# Import the Pie component for generating pie charts
from pyecharts import Pie

# Collect all ratings from the comments
rates = []
with open('comments.txt', mode='r', encoding='utf-8') as f:
    rows = f.readlines()
    for row in rows:
        rates.append(row.split(',')[4])
# print(rates)

# Define the star levels and count the ratings at each level
attr = ["5 stars", "4 stars", "3 stars", "2 stars", "1 star"]
value = [
    rates.count('5') + rates.count('4.5'),
    rates.count('4') + rates.count('3.5'),
    rates.count('3') + rates.count('2.5'),
    rates.count('2') + rates.count('1.5'),
    rates.count('1') + rates.count('0.5')
]
# print(value)

pie = Pie('"A Good Show" rating star distribution', title_pos='center', width=900)
pie.add("7-17", attr, value, center=[75, 50], is_random=True,
        radius=[30, 75], rosetype='area',
        is_legend_show=False, is_label_show=True)
pie.render('ratings.html')
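The grouping used above (5 and 4.5 count as five stars, 4 and 3.5 as four stars, and so on) amounts to rounding each score up to the next whole number. The sketch below computes the share of each star level that way; `star_shares` is a hypothetical helper, the sample scores are made up, and scores are assumed to lie between 0.5 and 5 in steps of 0.5.

```python
import math
from collections import Counter


def star_shares(scores):
    """Map scores (0.5 to 5 in 0.5 steps) to star levels and return percentage shares."""
    stars = Counter(math.ceil(float(s)) for s in scores)  # 4.5 -> 5, 3.5 -> 4, ...
    total = sum(stars.values())
    return {level: round(100 * stars[level] / total, 1) for level in range(5, 0, -1)}


# Made-up sample of score strings, in the form they are read from comments.txt.
sample = ["5", "4.5", "4", "3.5", "3", "1", "5", "5", "0.5", "2"]
shares = star_shares(sample)
print(shares)
```

The resulting dictionary can be checked against the percentages shown on the pie chart.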

Visualization results:

Four- and five-star reviews total 83%

As shown, five-star ratings account for nearly 62% and four-star ratings for 21%, together 83%, so word of mouth is quite good; one-star ratings account for less than 6%

As Huang Bo's directorial debut, "A Good Show" was made under the very strict demands he placed on himself during filming, so this result is only natural.

Attached: my ticket stub from today's screening ^_^

"A show" about human nature, using Python to grab the cat's eye nearly 100,000 comments and analysis, together reveal "this play" how exactly?

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.