Sentiment Analysis Based on Social Networks III


Sentiment Analysis Based on Social Networks III, by whiterbear (http://blog.csdn.net/whiterbear). Please credit the source when reprinting, thank you.

The previous posts covered crawling and basic preprocessing of the Weibo data; this post analyzes the similarity between the schools' Weibo posts.

Similarity analysis of Weibo

This section attempts to calculate the word-usage similarity between any two schools' Weibo posts.

Idea: first, segment each school's Weibo posts into words; then traverse the results to build each school's high-frequency word dictionary; next, set up a base word vector and use it to construct each school's word vector; finally, apply TF-IDF weighting and the cosine function to compute the similarity between the two schools' Weibo posts.

Note: for background on the TF-IDF algorithm and the cosine function, see my earlier blog posts. The cosine value is computed with the NumPy module.
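For reference, these are the standard textbook definitions used below (stated here for convenience, not copied from those earlier posts). For a word $w$ with count $n_w$, corpus size $N$ documents, and $\mathrm{df}(w)$ the number of documents containing $w$:

$$\mathrm{tf}(w) = \frac{n_w}{\sum_{w'} n_{w'}}, \qquad \mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w)}, \qquad \mathrm{tfidf}(w) = \mathrm{tf}(w)\cdot\mathrm{idf}(w)$$

and for two TF-IDF vectors $\vec a$ and $\vec b$:

$$\cos\theta = \frac{\vec a \cdot \vec b}{\|\vec a\|\,\|\vec b\|}, \qquad \theta = \arccos(\cos\theta).$$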

Get a dictionary of school words

Grouping the posts by school, first segment each school's Weibo posts into words, then traverse the results to build that school's word dictionary worddict, and save worddict locally as a pickle file.

The pseudo code is as follows:

# Pseudocode: word_results holds the segmented Weibo posts for one school
word_results = get_segmented_school_weibo()

# Two nested loops collect every word into the worddict dictionary
worddict = {}
for r in word_results:
    for w in r[0].split():
        if w not in worddict:
            worddict[w] = 1
        else:
            worddict[w] += 1

# Save the dictionary locally as a pickle file
save_to_pickle_file(worddict)
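The helper save_to_pickle_file is only named in the pseudocode. A minimal sketch using the standard pickle module (the default file path is an illustrative assumption, not from the post):

import pickle

def save_to_pickle_file(obj, path='worddict.pkl'):
    # Serialize the object to a local pickle file; the default path is an
    # assumed placeholder, not specified in the original post
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_from_pickle_file(path='worddict.pkl'):
    # Matching loader for reading the saved dictionaries back in later steps
    with open(path, 'rb') as f:
        return pickle.load(f)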
Get a dictionary of high-frequency words in schools

Traverse the school word dictionary built above, extract the words that occur more than 10 times, and save them as the school's high-frequency word dictionary, again as a pickle file.

The pseudo code is as follows:

# High-frequency word dictionary: keep words occurring more than 10 times
highworddict = {}
for word in worddict:
    if worddict[word] > 10:
        highworddict[word] = worddict[word]

# Save the dictionary locally as a pickle file
save_to_pickle_file(highworddict)
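The pairwise steps below assume the two schools' high-frequency dictionaries have been loaded back from disk, for example (the file names here are assumptions, the post does not specify them):

highworddict1 = load_from_pickle_file('school1_highworddict.pkl')
highworddict2 = load_from_pickle_file('school2_highworddict.pkl')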
Building a base vector

The base vector is built by merging the high-frequency word dictionaries of the two schools being compared. Its purpose is to give every high-frequency word from either school a fixed position in the vector, which makes it straightforward to construct each school's word vector.

The pseudo code is as follows:

# Base vector dictionary
baseworddict = {}

# Add the first school's high-frequency words to the base dictionary
for word in highworddict1:
    if word not in baseworddict:
        baseworddict[word] = 1

# Add the second school's high-frequency words to the base dictionary
for word in highworddict2:
    if word not in baseworddict:
        baseworddict[word] = 1

# Base vector list: converting the dict to a list fixes each word's position
basewordlist = []
for bd in baseworddict:
    basewordlist.append(bd)
Construct the school word vectors

Build two lists with the same length as the base vector. Each position holds the word's count from the school's high-frequency dictionary; words absent from that dictionary get the value 1, so that no component is zero.

The pseudo code is as follows:

# School word vectors
school1_list = []
school2_list = []

# Keep each school vector the same length as the base vector
for i in basewordlist:
    # Words in the school's high-frequency dictionary contribute their
    # counts; words missing from it contribute 1
    if i in highworddict1:
        school1_list.append(highworddict1[i])
    else:
        school1_list.append(1)

for i in basewordlist:
    if i in highworddict2:
        school2_list.append(highworddict2[i])
    else:
        school2_list.append(1)
Calculate TF-IDF Value

Use the school word vectors built above to compute the term-frequency (TF) values, then combine them with the IDF file to compute the TF*IDF weights.

The pseudo code is as follows:

# Total word counts of the two school vectors
sum_school_1, sum_school_2 = sum(school1_list), sum(school2_list)

# Turn the counts into TF (term frequency) values
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * 1.0 / sum_school_1
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * 1.0 / sum_school_2

# Load the IDF file, already preprocessed into an IDF dictionary
idfdict = get_idf_dict()

# Build the IDF list aligned with the base vector; words missing from the
# IDF dictionary get a default IDF of 3
for i, value in enumerate(basewordlist):
    if value.encode('utf-8') in idfdict:
        basewordlist[i] = idfdict[value.encode('utf-8')]
    else:
        basewordlist[i] = 3

# Multiply TF by IDF to get the TF*IDF lists
for i, value in enumerate(school1_list):
    school1_list[i] = school1_list[i] * basewordlist[i]
for i, value in enumerate(school2_list):
    school2_list[i] = school2_list[i] * basewordlist[i]
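Likewise, get_idf_dict is only named in the pseudocode. Assuming the preprocessed IDF file stores one "word idf_value" pair per line in UTF-8 (the actual format is not described in the post), a sketch could be:

def get_idf_dict(path='idf.txt'):
    # 'idf.txt' and its line format are assumptions. Reading in binary mode
    # keeps the keys as UTF-8 bytes, matching the value.encode('utf-8')
    # lookups in the pseudocode above.
    idfdict = {}
    with open(path, 'rb') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                idfdict[parts[0]] = float(parts[1].decode('utf-8'))
    return idfdict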
Calculate cosine value

Finally, use the NumPy module to compute the cosine of the angle between the two schools' TF-IDF lists.

The pseudo code is as follows:

import numpy as np

# Convert the lists to numpy arrays
array_1, array_2 = np.array(school1_list), np.array(school2_list)

# Compute the lengths (norms) of the two vectors
len_1, len_2 = np.sqrt(array_1.dot(array_1)), np.sqrt(array_2.dot(array_2))

# Compute the cosine of the angle between the two vectors
cos_angle = array_1.dot(array_2) / (len_1 * len_2)

# Compute the angle in radians
angle = np.arccos(cos_angle)
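The results below are reported in degrees, so the radian value from np.arccos presumably gets converted at some point; with NumPy that is one line:

# Convert the angle from radians to degrees (the results table uses degrees)
angle_degrees = np.degrees(angle)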
Results

Each school's high-frequency dictionary (words occurring more than 10 times) contains roughly 3,000 to 4,000 words, and any two schools differ by about 500 high-frequency words.

The resulting angles (in degrees) between any two schools' high-frequency word vectors are:

School       Da Gong   Tsinghua   Peking   Nan Da   Hua Zhong
Da Gong        --       33.35     34.21    28.25     32.37
Tsinghua      33.35      --       24.77    24.46     32.86
Peking        34.21     24.77      --      26.16     33.50
Nan Da        28.25     24.46     26.16     --       27.36
Hua Zhong     32.37     32.86     33.50    27.36      --

As the table shows, the five schools' Weibo word usage is quite similar: the angle between any two schools' word vectors is around 30 degrees. Since the angle lies in [0, 90] degrees, where 0 degrees means identical word usage and 90 degrees means completely unrelated, an angle of about 30 degrees (cosine ≈ 0.87) can be considered fairly similar.

Related Code Links
    • GitHub: weibo sentiment analysis
    • CSDN download
Summary

The base vectors here are built pairwise from the high-frequency words of two schools, so the 10 pairwise comparisons use 10 different base vectors. I suspect this way of constructing the base vectors is inappropriate; the similarity comparison should probably be made against a single vocabulary containing all schools' words. Also, my use of TF-IDF here simply follows the standard recipe, and it is not clear whether the results really reflect the word-usage similarity between two schools.

Next up: sentiment analysis of the Weibo posts.
