Social Network-Based Sentiment Analysis III


By bear flower (http://blog.csdn.net/whiterbear). Please credit the source when reposting, thank you.

Previously, we captured and processed Weibo data in a simple way. This article analyzes the similarity between the schools' Weibo posts.

Weibo Similarity Analysis

Here, we try to calculate the similarity of the Weibo words used by any two schools.

Idea: first, segment each school's Weibo posts into words and traverse the results to build each school's frequently-used word dictionary; merge these dictionaries into a base vector; use the base vector to construct a word vector for each school; finally, apply the TF-IDF algorithm and the cosine function to compute the similarity between the schools' Weibo posts.

NOTE: For the TF-IDF algorithm and the cosine function, refer to my previous blog posts. The numpy module is used for the cosine calculation.
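
As a quick recap of that post: here tf is a word's count divided by the school's total word count, and the final weight is tf multiplied by the word's idf. A minimal sketch of the weighting applied below (the function name is mine, not from the original code):

def tf_idf_weight(count, total_count, idf):
    # tf = count / total_count; weight = tf * idf
    return (float(count) / total_count) * idf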

Get school word dictionary

For each school, first segment the school's Weibo posts into words, then traverse the results to build the school's word dictionary worddict, and save worddict locally with pickle.

The pseudo code is as follows:

word_results = ...  # the school's Weibo posts after word segmentation
# Traverse all the words in two loops and count them in the worddict dictionary
worddict = {}
for r in word_results:
    for w in r[0].split():
        if w not in worddict:
            worddict[w] = 1
        else:
            worddict[w] += 1
# Save the dictionary to a local pickle file
save_to_pickle_file(worddict)
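
In runnable Python, the same counting can be written with collections.Counter. Here word_results is assumed, as in the pseudocode, to be a list of rows whose first element is a space-separated string of segmented words, and the pickle file name is hypothetical:

import pickle
from collections import Counter

worddict = Counter()
for r in word_results:
    worddict.update(r[0].split())  # counts every word in the row

with open('worddict.pkl', 'wb') as f:  # hypothetical file name
    pickle.dump(dict(worddict), f)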
Obtain a dictionary of frequently used words in a school

Traverse the school's word dictionary, extract the words that appear more than 10 times, save them as the school's frequently-used word dictionary, and store it locally with pickle.

The pseudo code is as follows:

# Frequently-used word dictionary
highworddict = {}
for word in worddict:
    if worddict[word] > 10:
        highworddict[word] = worddict[word]
# Save the dictionary to a local pickle file
save_to_pickle_file(highworddict)
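
The save_to_pickle_file helper is not defined in the post; a minimal sketch of what it could look like, with a hypothetical default path:

import pickle

# The filtering loop above is equivalent to the dict comprehension:
# highworddict = {w: c for w, c in worddict.items() if c > 10}

def save_to_pickle_file(obj, path='highworddict.pkl'):  # hypothetical default path
    # Serialize obj to a local pickle file
    with open(path, 'wb') as f:
        pickle.dump(obj, f)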
Build the base vector

Merge the frequently-used word dictionaries of the two given schools into a base vector. The point of the base vector is that every frequently used word of either school gets a fixed position in it, which makes it easy to construct the two schools' word vectors.

The pseudo code is as follows:

# Base vector dictionary
baseworddict = {}
# Add the first school's frequently-used word dictionary to the base vector dictionary
for word in highworddict1:
    if word not in baseworddict:
        baseworddict[word] = 1
# Add the second school's frequently-used word dictionary to the base vector dictionary
for word in highworddict2:
    if word not in baseworddict:
        baseworddict[word] = 1
# Base vector list: converting the dict to a list fixes each word's position
basewordlist = []
for bd in baseworddict:
    basewordlist.append(bd)
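
Since the base vector only needs each word once at a fixed position, the two loops amount to a set union. A compact equivalent, assuming both frequently-used word dictionaries are already loaded; sorting is my addition, to make word positions reproducible across runs:

# Union of the two vocabularies, sorted so positions are stable
basewordlist = sorted(set(highworddict1) | set(highworddict2))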
Construct the school word vectors

Construct two lists of the same length as the base vector. Each word that appears in a school's frequently-used word dictionary contributes its frequency at the corresponding position; each word that does not appear contributes 1 instead, so that no component of the vector is zero.

The pseudo code is as follows:

# School word vectors
school1_list = []
school2_list = []
# Traverse the base vector so both school vectors have the same length as it
for i in basewordlist:
    # Words that appear in the school's frequently-used word dictionary
    # contribute their frequency; absent words contribute 1
    if i in highworddict1:
        school1_list.append(highworddict1[i])
    else:
        school1_list.append(1)
for i in basewordlist:
    if i in highworddict2:
        school2_list.append(highworddict2[i])
    else:
        school2_list.append(1)
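
A tiny worked example with hypothetical data shows the smoothing: words missing from a school's dictionary contribute 1 rather than 0, so every component stays positive:

basewordlist = [u'food', u'exam', u'library']
highworddict1 = {u'food': 12, u'exam': 25}     # no 'library'
highworddict2 = {u'exam': 18, u'library': 40}  # no 'food'
# Following the loops above:
#   school1_list -> [12, 25, 1]
#   school2_list -> [1, 18, 40]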
Calculate the TF-IDF values

Use each school's word vector to compute the term frequency (tf) of every word, then look up each word's idf in the idf dictionary and multiply the two to get the tf*idf values.

The pseudo code is as follows:

# Totals of the school word vectors
sum_school_1, sum_school_2 = sum(school1_list), sum(school2_list)
# Turn the school word vectors into tf (term frequency) lists
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * 1.0 / sum_school_1
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * 1.0 / sum_school_2
# Load the idf file, which has already been processed into an idf dictionary
idfdict = get_idf_dict()
# Build the idf list; words missing from the idf dictionary default to 3
for i, value in enumerate(basewordlist):
    if value.encode('utf-8') in idfdict:
        basewordlist[i] = idfdict[value.encode('utf-8')]
    else:
        basewordlist[i] = 3
# Multiply tf by idf to get the schools' tf*idf lists
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * basewordlist[i]
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * basewordlist[i]
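
get_idf_dict is likewise left undefined in the post. Assuming the idf file stores one "word idf" pair per line (a hypothetical format and file name), it could look like this:

def get_idf_dict(path='idf.txt'):  # hypothetical file name and format
    # Load a whitespace-separated 'word idf' file into a dict
    idfdict = {}
    with open(path) as f:
        for line in f:
            word, idf = line.split()
            idfdict[word] = float(idf)
    return idfdict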
Calculate the cosine value

Finally, the numpy module is used to compute the cosine of the angle between the two schools' tf-idf vectors.

The pseudo code is as follows:

# Convert to numpy arrays
array_1, array_2 = np.array(school1_list), np.array(school2_list)
# Compute the lengths (norms) of the two vectors
len_1, len_2 = np.sqrt(array_1.dot(array_1)), np.sqrt(array_2.dot(array_2))
# Cosine of the angle between the two vectors
cos_angle = array_1.dot(array_2) / (len_1 * len_2)
# Angle in radians
angle = np.arccos(cos_angle)
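
np.arccos returns the angle in radians; since the results below are quoted in degrees, the value is presumably converted with np.degrees (equivalently, multiplied by 180/pi):

# Convert radians to the degrees reported in the results table
angle_in_degrees = np.degrees(angle)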
Result

Each school has between three and four thousand frequently used words (words appearing more than 10 times), and the vocabularies of any two schools differ by about five hundred words.

The similarity (as an angle, in degrees) between any two schools' frequently used Weibo words is as follows:

School Name        Dagong  Tsinghua  Peking University  Nanda  Hua Zheng
Dagong             -       33.35     34.21              28.25  32.37
Tsinghua           33.35   -         24.77              24.46  32.86
Peking University  34.21   24.77     -                  26.16  33.50
Nanda              28.25   24.46     26.16              -      27.36
Hua Zheng          32.37   32.86     33.50              27.36  -

From the table above, the word similarity between the five schools is roughly uniform: the angle between any two schools is around 30 degrees. With the cosine angle restricted to [0, 90], 0 degrees means identical and 90 degrees means completely unrelated, so about 30 degrees can be read as fairly similar.

Related code links
  • GitHub: weibo sentiment analysis
  • CSDN download
Summary

The base vectors here are built pairwise from the frequently used words of two schools, so the ten pairwise comparisons use ten different base vectors. I think constructing the base vector this way is inappropriate; a single vocabulary containing all the schools' words should serve as the base vector when comparing Weibo similarity. Also, my use of TF-IDF here simply follows existing examples, and it is not clear whether the results really represent the word similarity between two schools.
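
A sketch of the fix suggested above, building one shared base vector from all five schools' frequently-used word dictionaries; here highworddicts is an assumed list holding those five dicts:

# One global base vector covering every school's vocabulary
basewordlist = sorted(set().union(*highworddicts))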

Next up: sentiment analysis of the Weibo posts.

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
