Sentiment Analysis Based on Social Networks (III)
By Bear Flower (http://blog.csdn.net/whiterbear). Please indicate the source when reprinting, thank you.
Previously, we crawled and preprocessed the Weibo data in a simple way. This article analyzes the similarity between schools' Weibo posts.
Weibo Similarity Analysis
Here, we try to calculate the word-level similarity of Weibo posts between any two schools.
Idea: first, perform word segmentation on each school's Weibo posts,
Tags describe information; tagging is the act of a user assigning tags to information.
Killer features:
As our team understands the current project, all of the display interfaces (site login, file upload, file translation, etc.) are built with WPF, i.e. as a desktop client, whereas what we want to achieve is a complete web site.
Peripheral Features:
A good UI design
Scalability: enhance functionality without destroying the underlying structure.
Official documentation: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Term: not a simple key, but a field-scoped key, i.e. a key under a specified field.
Factors that affect scoring:
coord: the number of query terms the document hits (the number of distinct terms, not total occurrences)
term.tf: the term's frequency in the corresponding field
term.idf: derived from the number of documents containing the term
query.boost: the query weight (set at search time)
term.boost: the weight of the term within the query
TF-IDF MapReduce: Java Implementation Ideas
Thursday, February 16, 2017
1. Concept
2. Principles
3. Java code implementation ideas
Dataset:
Three MapReduce jobs
First MapReduce (use the IK tokenizer to segment the words in each post, i.e. the content field of a record). Results of the first MapReduce job:
1. the total number of Weibo posts in the dataset;
2. the TF value of each word in each post.
Mapper end: key: LongWritable (the byte offset of
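The article targets Java/Hadoop; as a hedged stdlib-Python sketch of what this first job computes (tokenization here is a naive whitespace split instead of the IK tokenizer, and the key layout and sample records are illustrative):

```python
from collections import Counter

def mapper(record):
    """Emit (word_weiboId, 1) for every token, plus a special "count" key
    used to total the number of posts (hypothetical key layout)."""
    weibo_id, content = record.split("\t", 1)
    for word in content.split():
        yield ("{}_{}".format(word, weibo_id), 1)
    yield ("count", 1)

def reducer(pairs):
    """Sum the 1s per key: per-post term frequencies plus the post total."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = [
    "3823890314914825\ttoday is a good day today",
    "3823890314914826\tgood morning",
]
pairs = [kv for r in records for kv in mapper(r)]
result = reducer(pairs)
# result["count"] gives 2 posts; result["today_3823890314914825"] gives TF 2
```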
high-dimensional space, and the most commonly used mapping function is TF*IDF, which takes into account a word's occurrences both in the document and in the document collection. A basic TF*IDF formula is as follows:

w_i(d) = tf_i(d) * log(N / df_i)    (2-1)

where N is the number of documents in the collection, tf_i(d) is the term frequency, i.e. the number of occurrences of term i in document d, and df_i is the number of documents containing term i.
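As a minimal sketch, formula (2-1) can be computed directly; the function name and the sample numbers below are illustrative, not from the article:

```python
import math

def tfidf_weight(tf_i, df_i, n_docs):
    """Formula (2-1): w = tf_i(d) * log(N / df_i)."""
    return tf_i * math.log(n_docs / df_i)

# a term occurring 3 times in a post and appearing in 10 of 1000 posts:
w = tfidf_weight(3, 10, 1000)  # 3 * log(100) ~= 13.82
```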
First, make sure the Git client is installed; if not, download it from https://git-scm.com/download
Step 1: Get the build toolchain. Windows has no built-in "make" environment, so you need a GNU-compatible environment to run the toolchain; we use the MSYS2 environment to provide this. You don't have to use this environment all the time: you can program with front-end software such as Eclipse or Arduino, while the toolchain actually runs in the background. The quick
Directory: Application of TF-IDF and Cosine Similarity (II): Finding Similar Articles; Calculating the similarity between two strings
This article is reproduced from Cscmaker.
(1) Cosine similarity. The similarity between two vectors is measured by the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is no greater than 1, and its minimum
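A minimal stdlib sketch of cosine similarity between two strings over word-frequency vectors (the function name is mine, and a real implementation would first segment Chinese text):

```python
import math
from collections import Counter

def cosine_similarity(s1, s2):
    """Cosine of the angle between the word-frequency vectors of two strings."""
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

same = cosine_similarity("i like this phone", "i like this phone")  # 1.0
disjoint = cosine_similarity("apple banana", "car door")            # 0.0
```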
Data sample: word_3823890314914825 2. For processing, split on "\t", then on "_", and output context.write(today, 1). Note that this counts the number of posts containing the word "today", so the Weibo ID is not kept.
Reducer end: key: w, value: {1,1,1}. Data sample: key=today, value={1,1,1,1,1}; each 1 means one post in the dataset contains the word "today". Step one: the data after the shuffle process is consolidated (values with the same key form one group, and
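A minimal sketch of the parsing and reduction described here (the sample key and Weibo ID are illustrative; the real job runs on Hadoop):

```python
# Mapper side: split the record on "\t", then split the key on "_";
# the Weibo ID is dropped because we count posts, not occurrences.
line = "today_3823890314914825\t2"   # illustrative sample record
key, value = line.split("\t")
word, weibo_id = key.split("_")

def df_reducer(word, ones):
    """Each 1 marks one post containing `word`; their sum is the word's
    document frequency."""
    return word, sum(ones)

word_df = df_reducer(word, [1, 1, 1, 1, 1])  # ("today", 5)
```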
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique in information retrieval and text mining. The main idea of TF-IDF is that if a word or phrase appears in an article with a high frequency (TF) and is seldom seen in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification. TF-IDF is simply TF * IDF, where TF stands for term frequency
1. First, use the SURF algorithm to generate the feature points and descriptors of each picture in the image library.
2. Use the k-means algorithm to cluster the feature points of the image library and generate the cluster centroids (the "class hearts").
3. Generate a BoF for each image. The method: determine which centroid each feature point of the image is nearest to; this produces a frequency table over centroids, i.e. the initial BoF weights.
4. Add weights to the frequency table
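A toy sketch of steps 2 and 3 (nearest-centroid assignment and the BoF frequency table); the 2-D points stand in for SURF descriptors and the function names are mine:

```python
def nearest_centroid(feature, centroids):
    """Index of the closest cluster centre (the "class heart")."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def bof_histogram(features, centroids):
    """Frequency table of nearest-centroid assignments: the initial BoF."""
    hist = [0] * len(centroids)
    for feat in features:
        hist[nearest_centroid(feat, centroids)] += 1
    return hist

centroids = [(0.0, 0.0), (10.0, 10.0)]             # from k-means
features = [(0.1, 0.2), (9.8, 10.1), (0.3, -0.1)]  # one image's descriptors
hist = bof_histogram(features, centroids)          # [2, 1]
```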
HDCS FTTD from Rosenberger, a German European brand, offers a holistic approach to the profession.
Body
In current mainstream building cabling systems, the backbone of network data transmission is composed of optical cable; whatever horizontal cabling scheme is used, there is no substantial difference in the trunk. For an FTTD cabling scheme, the main difference lies in the horizontal cabling system, that is, from the floor distributor between the
the pages containing the word "automobile", while a page that actually contains the word "car" may be what the user needs. Here is an example from the original LDA paper [1]: a term-document matrix in which "x" means the word appears in the corresponding file and an asterisk means the word appears in the query. When the user enters the query "IDF in computer-based information look up", the user is looking for pages related to
Feature Extraction: TF-IDF
TF-IDF is generally used in text mining to reflect the importance of a feature term. Let the feature term be t, the document be d, and the document set be D. The term frequency tf(t,d) is the number of times t appears in document d. The document frequency df(t,D) is the number of documents containing the feature term t. If you
np.random.choice(len(utterances), 10, replace=False)

# Evaluate the random predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x, 1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.0937632
Recall @ (2, 10): 0.194503
Recall @ (5, 10): 0.49297
Recall @ (10, 10): 1
Very good. The result is what we expected. Of course, we are not satisfied with a random predictor.
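The evaluate_recall used above is not shown in this excerpt; a plausible stdlib sketch of a recall@k metric, assuming y_pred holds ranked candidate indices per example and y_true the index of the correct response, is:

```python
def evaluate_recall(y_pred, y_true, k=1):
    """Recall@k: the fraction of examples whose correct answer index
    appears among the top-k ranked candidates."""
    hits = sum(1 for preds, label in zip(y_pred, y_true) if label in preds[:k])
    return hits / len(y_true)

# ranked candidate indices per example; index 0 is the true response here
y_pred = [[0, 2, 1], [2, 0, 1]]
y_true = [0, 0]
r1 = evaluate_recall(y_pred, y_true, 1)  # 0.5
r2 = evaluate_recall(y_pred, y_true, 2)  # 1.0
```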
The text similarity is computed using sklearn, and the similarity matrix between the texts is saved to a file. This extracts the TF-IDF feature values of the texts to calculate their similarity.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy
import os
import sys
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
reload(sys)
#sys.
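The sklearn snippet above is truncated; as a dependency-free illustration of roughly what that pipeline computes (raw-count TF times a log IDF, then a cosine-similarity matrix written to CSV; the toy corpus and names are mine, and sklearn additionally smooths the IDF and L2-normalizes rows):

```python
import csv
import math
from collections import Counter

def tfidf_matrix(docs):
    """Raw-count TF times (log(N/df) + 1) IDF; a simplification of what
    TfidfVectorizer produces."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    rows = []
    for d in docs:
        tf = Counter(d.split())
        rows.append([tf[w] * (math.log(n / df[w]) + 1.0) for w in vocab])
    return rows

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["tf idf text mining", "text mining with tf idf", "cosine similarity"]
m = tfidf_matrix(docs)
sim = [[cosine(r1, r2) for r2 in m] for r1 in m]
with open("similarity.csv", "w") as f:
    csv.writer(f).writerows(sim)   # persist the similarity matrix
```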
, power center, raw-material warehouse, auxiliary-materials warehouse, sewage station, garbage station and other areas are linked up (names differ slightly between enterprises; these names are for reference only), finally forming the tobacco enterprise's complete network system (figure omitted: Untitled-8.jpg).
You can use the searcher.explain(Query query, int doc) method to view the specific composition of a document's score.
In Lucene, the score is calculated as TF * IDF * boost * lengthNorm.
TF: the square root of the number of times the query word appears in the document.
IDF: the inverse document frequency; if all documents contain the term, it provides no discrimination and contributes nothing to ranking.
Boost: the boost factor, which can be set through
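Putting the factors together, here is a rough sketch of the product described above, using the formulas documented for Lucene's classic TFIDFSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(field length)); the function name and numbers are illustrative:

```python
import math

def lucene_style_score(freq, num_docs, doc_freq, boost, field_len):
    """Illustrative tf * idf * boost * lengthNorm product using the
    classic TFIDFSimilarity formulas."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    length_norm = 1.0 / math.sqrt(field_len)
    return tf * idf * boost * length_norm

score = lucene_style_score(freq=4, num_docs=1000, doc_freq=9, boost=1.0,
                           field_len=100)  # 2 * 5.605 * 1.0 * 0.1 ~= 1.121
```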
1. Misunderstandings of TF-IDF
TF-IDF can effectively assess how important a word is to a document within a collection or corpus, because it combines the word's importance within the document with its discriminative power across documents. However, simply using TF-IDF in text classification is not enough to judge whether a feature is discriminative.
1) It does n
after the algorithm finishes, and the efficiency is not very high. So I hand-rolled a keyword-matching method of my own.
Preparations:
1. Prepare a word-segmentation library. shotseg 1.0 is used here; its results are not outstanding, but it is usable.
2. Take a look at the concept of TF-IDF. (TF-IDF is a statistical method used to evaluate the importance of a word to a document in a collection or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with its frequency in the corpus.)
Predecessors plant trees; later generations enjoy the shade. The source code (CMakeLists included) is on GitHub and can be compiled directly. The Bubble Robot blog has a very detailed analysis; combined with a discussion of loop detection using the bag-of-words model and Gao Xiang's loop-closure detection application, the pieces can basically be strung together. The concept of TF-IDF has no unique formulation; the definition used here: TF indicates