Social Network-Based Sentiment Analysis (III): Weibo Similarity Analysis
By whiterbear (http://blog.csdn.net/whiterbear). Please credit the source when reprinting, thank you.
Previously, we crawled and processed the Weibo data in a simple way. This article analyzes the similarity between the schools' Weibo posts.
Weibo Similarity Analysis
Here, we try to calculate the similarity of Weibo words between any two schools.
Idea: first segment each school's Weibo posts into words, traverse the results to build each school's high-frequency word dictionary, merge two schools' dictionaries into a base vector, use that base vector to construct a word vector for each school, and finally compute the similarity between the two schools with the TF-IDF algorithm and the cosine function.
Note: for background on the TF-IDF algorithm and the cosine function, see my earlier posts. The numpy module is used for the cosine calculation.
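As a minimal sketch of the cosine step (toy word counts, not the real school data), the angle between two word-count vectors can be computed with numpy like this:

```python
import numpy as np

# Two toy word-count vectors over the same word base (positions are fixed)
school1 = np.array([4.0, 2.0, 1.0, 3.0])
school2 = np.array([3.0, 1.0, 2.0, 4.0])

# Cosine of the angle between the two vectors
cos_angle = school1.dot(school2) / (np.sqrt(school1.dot(school1)) * np.sqrt(school2.dot(school2)))

# Angle in degrees: 0 means identical direction, 90 means orthogonal
angle = np.degrees(np.arccos(cos_angle))
print(round(float(angle), 2))
```

The same dot-product-over-norms formula is what the full pipeline below applies to the tf-idf vectors.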
Build the school word dictionary
For each school, first segment its Weibo posts into words, then traverse the results to build the school's word dictionary worddict, and save worddict locally with pickle.
The pseudo code is as follows:
```python
word_results = ...  # the school's Weibo posts after word segmentation

# traverse all words in two loops and count them in the worddict dictionary
worddict = {}
for r in word_results:
    for w in r[0].split():
        if w not in worddict:
            worddict[w] = 1
        else:
            worddict[w] += 1

# save the dictionary locally as a pickle file
save_to_pickle_file(worddict)
```
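The `save_to_pickle_file` helper isn't shown in the post; a minimal version (the function names follow the pseudocode, the default path is my assumption) could be:

```python
import pickle

def save_to_pickle_file(obj, path="worddict.pkl"):
    # Serialize the word dictionary to a local pickle file
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_from_pickle_file(path="worddict.pkl"):
    # Read the dictionary back for the later steps
    with open(path, "rb") as f:
        return pickle.load(f)
```

Each later step can then reload the saved dictionaries instead of re-segmenting the Weibo posts.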
Build the school's high-frequency word dictionary
Traverse the school's word dictionary and extract the words that appear more than 10 times; save them as the school's high-frequency word dictionary, again stored with pickle.
The pseudo code is as follows:
```python
# high-frequency word dictionary
highworddict = {}
for word in worddict:
    if worddict[word] > 10:
        highworddict[word] = worddict[word]

# save the dictionary locally as a pickle file
save_to_pickle_file(highworddict)
```
Build the base vector
Merge the high-frequency word dictionaries of the two given schools into a base vector. The point of the base vector is that every high-frequency word of either school has a fixed position in it, which makes it easy to construct the two schools' word vectors.
The pseudo code is as follows:
```python
# base vector dictionary
baseworddict = {}

# add the first school's high-frequency words to the base vector dictionary
for word in highworddict1:
    if word not in baseworddict:
        baseworddict[word] = 1

# add the second school's high-frequency words to the base vector dictionary
for word in highworddict2:
    if word not in baseworddict:
        baseworddict[word] = 1

# base vector list: converting the dict to a list fixes each word's position
basewordlist = []
for bd in baseworddict:
    basewordlist.append(bd)
```
Construct the school word vectors
Construct two lists of the same length as the base vector. Every word that appears in a school's high-frequency dictionary keeps its count as the value; every word that does not appear is assigned the value 1, so that no component of the vector is zero.
The pseudo code is as follows:
```python
# school word vectors
school1_list = []
school2_list = []

# keep each word vector the same length as the base vector
for i in basewordlist:
    # words in the school's high-frequency dictionary get their count; others get 1
    if i in highworddict1:
        school1_list.append(highworddict1[i])
    else:
        school1_list.append(1)
for i in basewordlist:
    if i in highworddict2:
        school2_list.append(highworddict2[i])
    else:
        school2_list.append(1)
```
Calculate TF-IDF values
Use each school's word vector to compute the term frequencies (tf), then use the idf file to compute the tf * idf values.
The pseudo code is as follows:
```python
# total counts of the two school word vectors
sum_school_1, sum_school_2 = sum(school1_list), sum(school2_list)

# turn the counts into tf frequencies
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * 1.0 / sum_school_1
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * 1.0 / sum_school_2

# load the idf file, already processed into an idf dictionary
idfdict = get_idf_dict()

# build the idf list; words missing from the idf dictionary get a default of 3
for i, value in enumerate(basewordlist):
    if value.encode('utf-8') in idfdict:
        basewordlist[i] = idfdict[value.encode('utf-8')]
    else:
        basewordlist[i] = 3

# tf * idf lists for the two schools
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * basewordlist[i]
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * basewordlist[i]
```
Calculate the cosine similarity
Finally, the numpy module is used to compute the cosine value, and the corresponding angle, between the two schools' tf-idf lists.
The pseudo code is as follows:
```python
# convert to numpy arrays
array_1, array_2 = np.array(school1_list), np.array(school2_list)

# lengths (norms) of the two vectors
len_1, len_2 = np.sqrt(array_1.dot(array_1)), np.sqrt(array_2.dot(array_2))

# cosine of the angle between the vectors
cos_angle = array_1.dot(array_2) / (len_1 * len_2)

# angle in radians; np.degrees(angle) gives the degree values reported below
angle = np.arccos(cos_angle)
```
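Putting the steps above together, a hypothetical end-to-end helper (the dictionary names and the default idf of 3 follow the pseudocode; `in`-membership tests replace the Python 2 `has_key`) might look like:

```python
import numpy as np

def weibo_angle(highworddict1, highworddict2, idfdict, default_idf=3.0):
    """Angle in degrees between two schools' high-frequency word vectors.

    `idfdict` maps a word to its idf value; words not in it get `default_idf`.
    """
    # Base vector: union of the two schools' high-frequency words, fixed order
    basewordlist = list(dict.fromkeys(list(highworddict1) + list(highworddict2)))

    # Word vectors; words a school never uses get a smoothing count of 1
    v1 = np.array([float(highworddict1.get(w, 1)) for w in basewordlist])
    v2 = np.array([float(highworddict2.get(w, 1)) for w in basewordlist])

    # tf (counts normalized by the vector total) times idf
    idf = np.array([float(idfdict.get(w, default_idf)) for w in basewordlist])
    v1 = v1 / v1.sum() * idf
    v2 = v2 / v2.sum() * idf

    # cosine of the angle, clipped against floating-point drift, then degrees
    cos_angle = v1.dot(v2) / (np.sqrt(v1.dot(v1)) * np.sqrt(v2.dot(v2)))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

Two identical dictionaries give an angle of 0 degrees, and the further apart the vocabularies, the closer the angle gets to 90.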
Result
Each school has between three and four thousand high-frequency words (appearing more than 10 times), and the high-frequency vocabularies of any two schools differ by about five hundred words.
The angles between the high-frequency word vectors of any two schools' Weibo posts are as follows (in degrees):
| School | Dagong | Tsinghua | Peking University | Nanda | Hua Zheng |
| --- | --- | --- | --- | --- | --- |
| Dagong | | 33.35 | 34.21 | 28.25 | 32.37 |
| Tsinghua | 33.35 | | 24.77 | 24.46 | 32.86 |
| Peking University | 34.21 | 24.77 | | 26.16 | 33.50 |
| Nanda | 28.25 | 24.46 | 26.16 | | 27.36 |
| Hua Zheng | 32.37 | 32.86 | 33.50 | 27.36 | |
From the table above, we can see that the word similarity between the five schools is almost the same: the angle between any two schools is about 30 degrees. On the cosine-angle scale of [0, 90], 0 degrees means identical and 90 degrees means completely unrelated, so around 30 degrees can be considered fairly similar.
Related Code links
- GitHub: weibo sentiment analysis
- CSDN download
Summary
The base vectors here are built from the high-frequency words of each pair of schools, so the ten pairwise comparisons use ten different base vectors. I think constructing base vectors this way is questionable; a single base vector containing the words of all the schools should be used instead when comparing Weibo similarity. Also, my use of TF-IDF here simply copies the standard recipe, and it is not clear whether the results really represent the word similarity between two schools.
Up next: Weibo sentiment analysis.
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.