From web page relevance (TF-IDF) to news classification with the cosine theorem: a program implementation

Source: Internet
Author: User
Tags: idf

Premise: The TF-IDF model is an information retrieval model widely used in real applications such as search engines, but there have always been questions about it. The reference below presents a box-and-ball model based on conditional probability. Its core idea is to recast "the degree of match between query string Q and document D" as "the conditional probability that query string Q comes from document D". From a probabilistic perspective, it gives the information retrieval problem a clearer goal than the matching degree expressed by the TF-IDF model. The model subsumes TF-IDF: on the one hand it explains why TF-IDF is reasonable, and on the other it exposes its imperfections. The model also explains the meaning of PageRank, and why PageRank weights and TF-IDF weights are combined as a product. ---> Reference: http://baike.baidu.com/view/1228847.htm?fr=aladdin
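For reference, the TF-IDF quantities that the code in this post computes can be written compactly. The symbols are notation chosen here for illustration: n_w is the number of occurrences of word w in a text, N the total number of words in that text, D the total number of pages, and D_w the number of pages containing w.

```latex
\mathrm{TF}_{w} = \frac{n_{w}}{N}, \qquad
\mathrm{IDF}_{w} = \log\frac{D}{D_{w}}, \qquad
\mathrm{Relevance}(Q) = \sum_{w \in Q} \mathrm{TF}_{w}\,\mathrm{IDF}_{w}
```

A word that appears on every page has D_w = D, so its IDF is log 1 = 0 and it contributes nothing to the relevance score.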

September 19, 2014 23:49:07: implemented TF-IDF. Only about 30% done. What a pit! How to set the threshold in the later stage still needs study ~

package com.lean;

import java.util.ArrayList;
import java.util.Arrays;

/*
 * 1. How to measure the relevance of web pages to a query (information retrieval)
 *    TF-IDF (term frequency / inverse document frequency) algorithm:
 *    TF  = number of occurrences of the word / total number of words in the text
 *    IDF = log(D/Dw) = log(total number of pages / number of pages containing the word)
 *          Why log()? The explanation in "The Beauty of Mathematics" is "the cross-entropy
 *          of the probability distribution of a keyword under a given condition".
 *    Relevance = tf1*idf1 + tf2*idf2 + tf3*idf3 + ...
 *
 * 2. How to use the cosine theorem for automatic news classification
 *    - Compute the IDF of every word in each news item's vocabulary.
 *    - Use these IDF values as the feature vector of the news item, implemented as an
 *      array in which each index corresponds to one specific word.
 *    - Use the cosine formula to compute the angle a between every pair of news items
 *      (the larger the angle, the lower the correlation; parallel vectors are the most related).
 *    - Repeatedly merge items with a < T (T is a threshold chosen from experience) into small classes.
 *    - Recompute the feature vector of each small class and keep merging until only 1 large class is left.
 *    - Choose the granularity of the news categories according to the merge level.
 *
 * 3. Reducing the algorithm's complexity
 *    - Cache the numerator and denominator terms of the cosine formula in memory.
 *    - Delete function words (the, is, and, some conjunctions, adverbs, prepositions)
 *      and keep only content words (the "aXb" pattern).
 *    - Weight by position (title keywords > body keywords; within the body,
 *      keywords at the beginning and end > keywords in the middle).
 *
 * The simulation starts here.
 */
public class Newcate {
    // Each array represents one piece of news; words are replaced by numbers.
    static int news1[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 1, 2, 3, 3, 3};
    static int news2[] = {12, 13, 19, 11, 12, 13, 14, 15, 11, 12, 13, 13, 3};
    static int news3[] = {10, 11, 12, 13, 24, 24, 25, 21, 22, 23, 23, 13};
    // 3 keywords, assuming word segmentation has already been done.
    static int KEYS[] = {3, 2, 8};

    /*
     * Compute the term frequency of each keyword, summed over all news items.
     */
    private double[] getTf(ArrayList<int[]> newslist, int keys[]) {
        int n = newslist.size(), m = keys.length;
        double tf[] = new double[m]; // Java initializes the entries to 0.0
        // Could be optimized to O(n log n); left as a simple triple loop.
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                int news[] = newslist.get(j);
                int len = news.length; // was "k", which clashed with the loop variable
                for (int k = 0; k < len; k++) {
                    if (keys[i] == news[k]) {
                        tf[i] += 1.0 / len;
                    }
                }
            }
        }
        return tf;
    }

    /*
     * Compute the weight of each keyword:
     * IDF = log(D/Dw) = log(total pages / pages containing the word)
     */
    private double[] getIdf(ArrayList<int[]> al, int keys[]) {
        int d = al.size();
        double idf[] = new double[keys.length];
        for (int i = 0; i < idf.length; i++) {
            int dw = getDw(al, keys[i]);
            // Cast to double: the original integer division D/Dw truncated the ratio.
            // A keyword that appears nowhere (dw == 0) keeps weight 0 to avoid dividing by zero.
            if (dw > 0) {
                idf[i] = Math.log((double) d / dw);
            }
        }
        return idf;
    }

    /*
     * Count the number of news items in which the keyword appears.
     */
    private int getDw(ArrayList<int[]> newslist, int key) {
        int cnt = 0;
        for (int j = 0; j < newslist.size(); j++) {
            int news[] = newslist.get(j);
            for (int k = 0; k < news.length; k++) {
                if (news[k] == key) {
                    cnt++;
                    break; // count each news item at most once
                }
            }
        }
        return cnt;
    }

    /*
     * Combine term frequencies and weights:
     * Relevance = tf1*idf1 + tf2*idf2 + tf3*idf3 + ...
     */
    private double getRelate(double tf[], double idf[]) {
        int n = tf.length;
        double ans = 0.0;
        for (int i = 0; i < n; i++) {
            ans += tf[i] * idf[i];
        }
        return ans;
    }

    public static void main(String[] args) {
        Newcate nc = new Newcate();
        //--------------- TF-IDF ------------
        // Collection of news pages
        ArrayList<int[]> newslist = new ArrayList<int[]>();
        newslist.add(news1);
        newslist.add(news2);
        newslist.add(news3);
        newslist.add(news3);
        newslist.add(news3);
        newslist.add(news3);
        double tf[] = nc.getTf(newslist, KEYS);
        double idf[] = nc.getIdf(newslist, KEYS);
        System.out.println("keyword frequency set = " + Arrays.toString(tf));
        System.out.println("keyword contribution set = " + Arrays.toString(idf));
        double relate = nc.getRelate(tf, idf);
        System.out.println("Relevance of keywords = " + relate);
        //----------- News_cate -----------------------
    }
}
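The comment block in the code describes comparing news items by the angle between their feature vectors, but the program stops before that step. The following is a minimal sketch of the cosine computation, assuming each news item has already been turned into a double[] of per-word weights indexed by the same vocabulary. CosineSketch, cosine, and angle are hypothetical names used for illustration, and returning 0 for an all-zero vector is a convention chosen here, not something the original specifies.

```java
/** Hypothetical helper; not part of the original Newcate class. */
class CosineSketch {

    /**
     * Returns the cosine of the angle between two feature vectors.
     * Assumption: a vector of all zeros is treated as unrelated (cosine 0).
     */
    static double cosine(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];   // numerator of the cosine formula
            normX += x[i] * x[i]; // squared length of x
            normY += y[i] * y[i]; // squared length of y
        }
        if (normX == 0.0 || normY == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    /** Returns the angle in radians; the cosine is clamped to [-1, 1] against rounding error. */
    static double angle(double[] x, double[] y) {
        double c = Math.max(-1.0, Math.min(1.0, cosine(x, y)));
        return Math.acos(c);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 0.0};
        double[] b = {2.0, 4.0, 0.0}; // parallel to a: most related
        double[] c = {0.0, 0.0, 3.0}; // shares no words with a: unrelated
        System.out.println("cos(a, b) = " + cosine(a, b));
        System.out.println("cos(a, c) = " + cosine(a, c));
    }
}
```

The squared lengths normX and normY depend on one vector only, so they are the natural terms to cache when many pairs are compared, which is the memory optimization mentioned in point 3 of the comment block.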

I'll write the rest tomorrow morning. Good night.
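The unfinished News_cate step, merging news items whose angle is below a threshold T, might look roughly like the sketch below. It is only one possible reading of the plan in the code comments: MergeSketch and mergePass are hypothetical names, the choice to merge the closest pair first and to represent a class by the element-wise sum of its members' vectors are assumptions, and the threshold value is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of threshold-based merging; not the author's implementation. */
class MergeSketch {

    /** Angle in radians between two vectors; all-zero vectors are treated as orthogonal. */
    static double angle(double[] x, double[] y) {
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        if (nx == 0.0 || ny == 0.0) {
            return Math.PI / 2;
        }
        double c = dot / (Math.sqrt(nx) * Math.sqrt(ny));
        return Math.acos(Math.max(-1.0, Math.min(1.0, c)));
    }

    /**
     * Repeatedly merges the closest pair of classes while their angle is below
     * the threshold. Returns, for each resulting class, the indices of its members.
     */
    static List<List<Integer>> mergePass(double[][] vectors, double threshold) {
        List<List<Integer>> members = new ArrayList<>();
        List<double[]> sums = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            members.add(single);
            sums.add(vectors[i].clone());
        }
        while (members.size() > 1) {
            int bestI = -1, bestJ = -1;
            double bestAngle = threshold;
            for (int i = 0; i < sums.size(); i++) {
                for (int j = i + 1; j < sums.size(); j++) {
                    double a = angle(sums.get(i), sums.get(j));
                    if (a < bestAngle) {
                        bestAngle = a;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0) {
                break; // no pair is closer than the threshold
            }
            // Fold class bestJ into class bestI (bestJ > bestI, so bestI stays valid).
            double[] target = sums.get(bestI);
            double[] source = sums.get(bestJ);
            for (int k = 0; k < target.length; k++) {
                target[k] += source[k];
            }
            members.get(bestI).addAll(members.get(bestJ));
            members.remove(bestJ);
            sums.remove(bestJ);
        }
        return members;
    }

    public static void main(String[] args) {
        double[][] news = {{1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}, {0.1, 0.9}};
        System.out.println(mergePass(news, 0.5));
    }
}
```

Running the pass again with a larger threshold merges the small classes into bigger ones, which matches the idea of picking the category granularity from the merge level.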

