Sentiment Analysis Based on Social Networks (III), by whiterbear (http://blog.csdn.net/whiterbear). Please credit the source when reposting, thank you.
The previous articles covered crawling the Weibo data and some simple processing; this article analyzes the similarity between the schools' Weibo posts.
Similarity analysis of Weibo
This is an attempt to calculate the word-use similarity between any two schools' Weibo posts.
Idea: first, segment each school's Weibo posts into words; traverse the results to build each school's high-frequency word dictionary; set up a base word vector; use the base vector to construct each school's word vector; and finally apply the TF-IDF algorithm and the cosine function to compute the similarity between two schools' Weibo.
Note: for the TF-IDF algorithm and the cosine function, you can refer to my earlier blog posts. The cosine is computed with the NumPy module.
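For reference, a minimal statement of the two standard formulas used below (not specific to this project): TF is a word's share of a school's total word occurrences, TF-IDF weights it by inverse document frequency, and similarity is read off the angle between the two TF-IDF vectors:

\mathrm{tf}(w) = \frac{\mathrm{count}(w)}{\sum_{w'} \mathrm{count}(w')}, \qquad \mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \mathrm{idf}(w)

\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\lVert\vec{a}\rVert \, \lVert\vec{b}\rVert}, \qquad \theta = \arccos(\cos\theta)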
Get a dictionary of school words
Grouping the posts by school, first segment each school's Weibo posts into words, then traverse the results to build each school's word dictionary worddict, and save worddict locally as a pickle file.
The pseudo code is as follows:
```python
# Pseudocode: fetch the segmented Weibo posts for one school
word_results = get_segmented_school_weibo()
# Two nested loops collect every word into the worddict dictionary
worddict = {}
for r in word_results:
    for w in r[0].split():
        if w not in worddict:
            worddict[w] = 1
        else:
            worddict[w] += 1
# Save the dictionary locally as a pickle file
save_to_pickle_file(worddict)
```
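save_to_pickle_file is referenced but not shown in the original post; here is a minimal sketch of that helper and its loading counterpart (the function names and default path are my assumptions), using the standard pickle module:

```python
import pickle

def save_to_pickle_file(obj, path='worddict.pkl'):
    # Assumed helper: serialize the dictionary to a local pickle file
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_from_pickle_file(path='worddict.pkl'):
    # Assumed helper: load the dictionary back from the pickle file
    with open(path, 'rb') as f:
        return pickle.load(f)
```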
Get a dictionary of each school's high-frequency words
Traverse the school word dictionary built above, extract the words that occur more than 10 times, and save the result as the school's high-frequency word dictionary, again as a pickle file.
The pseudo code is as follows:
```python
# High-frequency word dictionary
highworddict = {}
for word in worddict:
    if worddict[word] > 10:
        highworddict[word] = worddict[word]
# Save the dictionary locally as a pickle file
save_to_pickle_file(highworddict)
```
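For reference, the same filtering can be written as a single dict comprehension:

```python
# Keep only the words that occur more than 10 times
highworddict = {w: c for w, c in worddict.items() if c > 10}
```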
Building a base vector
The base vector is built by merging the high-frequency word dictionaries of the two schools being compared. Its purpose is to give every high-frequency word from either school a fixed position in the vector, which makes it easy to construct each school's word vector afterwards.
The pseudo code is as follows:
```python
# Base word dictionary
baseworddict = {}
# Add the first school's high-frequency words to the base dictionary
for word in highworddict1:
    if word not in baseworddict:
        baseworddict[word] = 1
# Add the second school's high-frequency words to the base dictionary
for word in highworddict2:
    if word not in baseworddict:
        baseworddict[word] = 1
# Base vector list
basewordlist = []
# Convert the dict into a list so each word has a fixed position
for bd in baseworddict:
    basewordlist.append(bd)
```
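Since the base vector is just the union of the two schools' high-frequency vocabularies, an equivalent one-line construction is a set union; the sorting is my addition to make word positions reproducible across runs:

```python
# Union of the two vocabularies, with a fixed (sorted) word order
basewordlist = sorted(set(highworddict1) | set(highworddict2))
```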
Construct the school word vectors
Build two lists of the same length as the base vector: every word that appears in a school's high-frequency dictionary gets its occurrence count as the value, and every word that does not appear gets the value 1 (so no entry is zero).
The pseudo code is as follows:
```python
# School word vectors
school1_list = []
school2_list = []
# Keep each school vector the same length as the base vector:
# words found in the school's high-frequency dictionary contribute
# their occurrence counts, and words not found contribute 1
for i in basewordlist:
    if i in highworddict1:
        school1_list.append(highworddict1[i])
    else:
        school1_list.append(1)
for i in basewordlist:
    if i in highworddict2:
        school2_list.append(highworddict2[i])
    else:
        school2_list.append(1)
```
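The two loops above can also be collapsed with dict.get, which returns the stored count when the word is present and the default 1 otherwise:

```python
# Occurrence count if present in the school's high-frequency dictionary, else 1
school1_list = [highworddict1.get(w, 1) for w in basewordlist]
school2_list = [highworddict2.get(w, 1) for w in basewordlist]
```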
Calculate TF-IDF Value
Use the school word vectors above to compute the term-frequency (TF) values, then combine them with an IDF file to obtain the TF*IDF values.
The pseudo code is as follows:
```python
# Total word counts of the two school vectors
sum_school_1, sum_school_2 = sum(school1_list), sum(school2_list)
# Turn the counts into TF (term frequency) values
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * 1.0 / sum_school_1
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * 1.0 / sum_school_2
# Load the IDF file, already processed into an IDF dictionary
idfdict = get_idf_dict()
# Build the IDF list aligned with the base vector, defaulting to 3
# for words missing from the IDF dictionary
for i, value in enumerate(basewordlist):
    if value in idfdict:
        basewordlist[i] = idfdict[value]
    else:
        basewordlist[i] = 3
# Multiply TF by IDF to get the TF-IDF lists
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * basewordlist[i]
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * basewordlist[i]
```
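get_idf_dict is likewise not shown in the original. Assuming the IDF file is plain text with one "word idf_value" pair per line (the path and file format here are my assumptions), a loader could look like:

```python
def get_idf_dict(path='idf.txt'):
    # Assumed helper: parse "word idf_value" pairs, one per line
    idfdict = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                idfdict[parts[0]] = float(parts[1])
    return idfdict
```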
Calculate cosine value
Finally, use the NumPy module to compute the cosine angle between the two schools' TF-IDF lists.
The pseudo code is as follows:
```python
import numpy as np

# Convert the TF-IDF lists to numpy arrays
array_1, array_2 = np.array(school1_list), np.array(school2_list)
# Compute the length (norm) of each vector
len_1, len_2 = np.sqrt(array_1.dot(array_1)), np.sqrt(array_2.dot(array_2))
# Cosine of the angle between the two vectors
cos_angle = array_1.dot(array_2) / (len_1 * len_2)
# The angle itself, in radians
angle = np.arccos(cos_angle)
```
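Note that np.arccos returns the angle in radians, while the results below are reported in degrees; continuing from the snippet above, one extra line converts it:

```python
# Convert the radian angle to degrees to match the table below
angle_in_degrees = np.degrees(angle)   # e.g. 0.582 rad -> about 33.35 degrees
```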
Results
Each school's high-frequency vocabulary (words occurring more than 10 times) contains roughly 3,000 to 4,000 words, and the high-frequency vocabularies of any two schools differ by about 500 words.
The pairwise similarity results (cosine angles, in degrees) between the schools' high-frequency Weibo words are:
| School Name | Da Gong | Tsinghua | Peking University | Nan Da | Huazhong |
|-------------|---------|----------|-------------------|--------|----------|
| Da Gong | — | 33.35 | 34.21 | 28.25 | 32.37 |
| Tsinghua | 33.35 | — | 24.77 | 24.46 | 32.86 |
| Peking University | 34.21 | 24.77 | — | 26.16 | 33.50 |
| Nan Da | 28.25 | 24.46 | 26.16 | — | 27.36 |
| Huazhong | 32.37 | 32.86 | 33.50 | 27.36 | — |
As the table shows, the five schools' Weibo word usage is broadly similar: the angle between any two schools' word vectors is around 30 degrees. Since all vector entries are positive, the cosine angle falls in [0°, 90°], where 0° means identical word usage and 90° means completely unrelated; about 30° can therefore be considered fairly similar.
Related Code Links
- GitHub: weibo sentiment analysis
- CSDN Download
Summary
The base vectors here are built from the high-frequency words of each pair of schools, so comparing the 10 pairs (from 5 schools) means constructing 10 different base vectors. I suspect this way of constructing base vectors is not quite right: the similarity comparison should probably be made against a single vocabulary containing all the words. Also, my use of TF-IDF here just imitates existing examples, and it is not clear whether the results really represent the word-use similarity between two schools.
Next comes the sentiment analysis of the Weibo posts.