Reprinted from http://www.ruanyifeng.com/blog/
Last time, I used the TF-IDF algorithm to automatically extract keywords.
Today, let's look at another issue. Sometimes, in addition to finding keywords, we also hope to find other articles similar to the original article. For example, Google News provides similar news under the main news.
Cosine similarity is used to identify similar articles. The following is an example of cosine similarity.
For the s
One problem requires implementing the TF-IDF algorithm in pure MySQL. The original input is an articles table with 100 columns and one word per column. The core difficulty is how to iterate over these 100 words and compare each one with a specified word such as 'apple'. The brute-force approach is to list all the column names, such as Word1, Word2 ... But that code would be very ugly, and if there are 1000 columns wha
import com.elex.utils.dataclean;
import com.google.common.io.Closeables;
public class Tfidf_5 {
public static String Hdfsurl = "hdfs://namenode:8020";
public static String FileURL = "/tmp/usercount";
public static class Tfmap extends Mapper ... Counter ct = tfjob.getCounters().findCounter("org.apache.hadoop.mapreduce.TaskCounter", "MAP_INPUT_RECORDS");
System.out.println(ct.getValue());
... iterable ...
Originally, a separate job was used to calculate the number of documents, followed by the company's predeces
Here, a and b are two vectors. We need to calculate the angle θ between them. The cosine theorem tells us that we can use the following formula:
If vector a is [x1, y1] and vector b is [x2, y2], the cosine theorem can be rewritten in the following form:
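The formula itself did not survive in this excerpt. Assuming the standard two-dimensional form of cosine similarity (dot product divided by the product of the two vector lengths), it can be written as:

```latex
\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}
```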
Mathematicians have proved that this way of calculating the cosine also holds for n-dimensional vectors. Assume that A and B are two n-dimensional vectors, where A is [A1, A2, ..., An] and B is [B1, B2, ..., Bn]; then the cosine of the angle θ between A a
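As a minimal sketch of the n-dimensional case, assuming the usual definition (the sum of the products Ai * Bi divided by the product of the two vector lengths), the calculation could look like this in Python. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are illustrative assumptions, not part of the original article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Numerator: dot product of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: product of the two Euclidean lengths.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption made for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

A value close to 1 means the two vectors point in nearly the same direction; a value of 0 means they are orthogonal. For word-frequency vectors, which have no negative components, the result always lies between 0 and 1.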
() + g.toLowerCase();if(R.substr((++d) * 0x3, 0x6) == G.concat("Easy") C.test(a)) {d = String(0x1) + String(A.length)}}};if(A.substr(0x4, 0x1) != String.fromCharCode(d) || A.substr(0x4, 0x1) == "Z") {alert("Well, think again.")} else {alert("Congratulations, congratulations!")}</script>
Analyzing the code shows that variable A is the flag we are after. After B.replace(/7/ig, ++d).replace(/8/ig, d * 0x2), the variable B becomes f3313e36c611150119f5d04ff1225b3e, and MD5 is decrypted after
Last time, I used the TF-IDF algorithm to automatically extract keywords.
Today, we are going to look at a related issue. Sometimes, in addition to finding keywords, we also want to find other articles similar to the original one. For example, Google News lists a number of similar stories under each main news item.
To find similar articles, we need "cosine similarity". Now, let me give you an exam
This headline seems very complicated, but in fact I want to talk about a very simple question.
Given a very long article, I want to use a computer to extract its keywords (automatic keyphrase extraction) with no manual intervention at all. How can I do this correctly?
This problem involves data mining, text processing, information retrieval and many other frontier areas of computing, but, unexpectedly, there is a very simple classical algorithm that can give a very satisfactory resul
Using Java to implement TF-IDF feature extraction
(1) The formula for calculating inverse document frequency (IDF) is as follows:
(2) The formula for calculating TF-IDF is as follows:
TF-IDF = TF * IDF
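The formulas referred to above are missing from this excerpt, so the following is only a sketch under stated assumptions: TF is taken as a term's count divided by the document's length, and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants exist, for example adding 1 to the denominator. It is written in Python rather than Java for brevity, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Assumed definitions: TF = count / total, IDF = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

With these definitions, a term that occurs in every document gets a score of 0, while a term that is frequent in one document and rare elsewhere gets the highest score.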
(3) Java code implementation
package com.panguoyuan.datamining.first;
import java.io.BufferedReader;
import ja
Key points of knowledge:
Boolean model
TF/IDF
Vector space Model
First, the Boolean model
When ES scores search results, the initial filtering is done with the Boolean model: logical operators such as AND first filter out the docs that contain the specified term. This covers the must / must not / should cases (must contain, must not contain, may contain). This step does not score the individual docs, only
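The filtering step can be illustrated with a small sketch. This is not the ES API; it is a plain Python illustration in which `boolean_filter`, the `docs` mapping and the clause names are hypothetical. It also assumes that "should" terms only restrict the result when no "must" clause is given, which is a simplification.

```python
def boolean_filter(docs, must=(), must_not=(), should=()):
    """Return the ids of docs that pass the must / must_not / should clauses.

    docs maps a doc id to the set of terms it contains.
    No scores are computed here; docs are only kept or dropped.
    """
    matched = []
    for doc_id, terms in docs.items():
        # must: every listed term has to be present.
        if not all(t in terms for t in must):
            continue
        # must_not: none of the listed terms may be present.
        if any(t in terms for t in must_not):
            continue
        # Assumption: "should" restricts results only when "must" is empty.
        if should and not must and not any(t in terms for t in should):
            continue
        matched.append(doc_id)
    return matched
```

Only the docs that survive this filter would then be passed on to a scoring step such as TF/IDF.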
Back then, a chrysanthemum was just a flower, 2B was just the pencil used in exams, a cucumber was only a vegetable, and information retrieval technology was used only in libraries, databases and similar places.
It was also back then that TF-IDF was a very popular ranking technique in information retrieval.
Perhaps at this moment you very much want to ask: what is TF-IDF? We
Statement
The following code is just a basic implementation of the TF-IDF idea, so many places still need improvement. In summary:
1. Logic: words in special positions, such as the beginning of a paragraph, or nouns (relative to verbs), should carry greater weight;
2. The text should undergo basic processing before word segmentation: remove punctuation and call the word segmentation interface in an appropriate way, so that the te
Jeffrey Magder
I was having the same problem, but I just fixed it. I was getting the same error from the following code:
HMODULE hPowerFunctions = LoadLibrary("Powrprof.dll");
typedef bool (*tSetSuspendStateSig)(BOOL, BOOL, BOOL
Read Catalogue
Topic
Analysis
Topic
Analysis
Have you seen this code? Anyway, I have seen it before. It's actually JavaScript code.
Reference: Principle
Example
Copy the code from the topic, open the Google browser, F12
Sometimes, very simple mathematical methods can accomplish very complex tasks.
The first two parts of this series are good examples. Just by counting word frequencies, you can find keywords and similar articles. Although they are not the best
1. English in the picture
A picture can be opened in many ways. To crack this problem, you need to download the picture. With image editing software, we can adjust brightness, color and so on. We can also open it using 2
Read Catalogue
Topic
Start analysis
Find Attachment Data
Restore Attachment Data
Processing a restored data attachment
Find attachment Data again
Topic
The topic gives a network link; click into the file to
Read Catalogue
Topic
Analysis
Topic
Analysis
Right-click and save the picture to the project root directory on my Kali Linux machine. Now for the analysis!
According to the problem statement, the image is probably