TF-IDF, Logistic Regression, and SVM on Spark

Last Update:2018-07-25 Source: Internet

Author: User



1, TF-IDF

The main idea of IDF is this: the fewer documents contain a term t (that is, the smaller n is), the larger the IDF and the better the term distinguishes between classes. Suppose m documents of a class C contain the term t, and k documents in the other classes contain it. The total number of documents containing t is then n = m + k. When m is large, n is also large, so the IDF formula yields a small value, which suggests that t discriminates poorly between classes. In fact, though, a term that appears frequently in the documents of one class represents that class well; it should receive a higher weight and be chosen as a feature word that distinguishes the class from the others. This is a shortcoming of IDF.

In a given document, the term frequency (TF) is how often a given term appears in that document. The count is normalized by the number of words in the document (the term count) so that it does not favor long documents: the same word is likely to have a higher raw count in a long document than in a short one, regardless of how important it is.
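The paragraph above refers to "the IDF formula" without stating it. A common textbook form is sketched below; the symbol N (the total number of documents in the collection) and the numbers in the worked example are assumptions introduced here for illustration, not values from the original article.

```latex
% N = total number of documents (assumed symbol)
% n = number of documents containing the term t, with n = m + k
\mathrm{IDF}(t) = \log\frac{N}{n}, \qquad n = m + k

% Worked example with natural logarithms and N = 1000:
%   m = 90,  k = 10  =>  n = 100,  IDF = \log(1000/100) = \log 10 \approx 2.303
%   m = 490, k = 10  =>  n = 500,  IDF = \log(1000/500) = \log 2  \approx 0.693
```

In the second case the term is even more concentrated in class C (490 of the 500 documents that contain it), yet its IDF drops. This is the shortcoming described above.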

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect how important a term is to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. The term frequency TF(t, d) is the number of times term t appears in document d, and the document frequency DF(t, D) is the number of documents that contain term t. If we used only term frequency to measure importance, it would be easy to overemphasize terms that appear very often but carry little information, such as "a", "the", and "of". If a term appears very often across the corpus, it carries no special information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides, and TF-IDF shows how strongly a term relates to a particular document. After building the term frequency vectors, you can use IDF to compute the inverse document frequencies and then multiply them by the term frequencies to obtain TF-IDF. Example:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// `spark` is an existing SparkSession (for example, the one provided by spark-shell)
val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into lowercase words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each word into a fixed-size term frequency vector.
// The argument of setNumFeatures is missing in the source; 20 is an assumed placeholder.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// Alternatively, CountVectorizer can also be used to get term frequency vectors

// Fit the IDF model on the corpus, then rescale the term frequencies
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").take(3).foreach(println)
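To make the computation performed by the Spark example above more concrete, here is a minimal plain-Scala sketch of the same two steps (hashed term frequencies, then IDF rescaling) without Spark. It is an illustration under stated assumptions, not Spark's implementation: the object name TfIdfSketch and its helper functions are hypothetical, buckets are chosen with String.hashCode rather than the hash function Spark uses, and the smoothed formula log((|D| + 1) / (DF + 1)) is the one documented for Spark's IDF.

```scala
object TfIdfSketch {
  // Hashing trick: map each word to a bucket index in [0, numFeatures)
  // and count how many words of the document fall into each bucket.
  // Assumption: String.hashCode stands in for the hash Spark actually uses.
  def hashingTF(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words
      .groupBy(w => Math.floorMod(w.hashCode, numFeatures))
      .map { case (bucket, ws) => bucket -> ws.size.toDouble }

  // Smoothed inverse document frequency: log((|D| + 1) / (DF + 1)).
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // TF-IDF for every document: term frequency times the IDF of the bucket.
  def tfIdf(docs: Seq[Seq[String]], numFeatures: Int): Seq[Map[Int, Double]] = {
    val tfs = docs.map(hashingTF(_, numFeatures))
    // Document frequency per bucket: number of documents in which it is non-empty.
    val df = tfs.flatMap(_.keys).groupBy(identity).map { case (b, bs) => b -> bs.size }
    tfs.map(_.map { case (b, tf) => b -> tf * idf(docs.size, df(b)) })
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      "hi i heard about spark",
      "i wish java could use case classes",
      "logistic regression models are neat"
    ).map(_.split(" ").toSeq)
    // Print one sparse bucket -> weight map per document
    tfIdf(docs, numFeatures = 20).foreach(println)
  }
}
```

With a small numFeatures, different words can collide in the same bucket, which is the trade-off of the hashing approach: no vocabulary has to be stored, at the cost of occasionally merging unrelated terms.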

2, Logistic regression

Example of sentiment classification:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// `sc` is an existing SparkContext
val negative = sc.textFile("Lvyou_comment_negitive.txt")
val normal = sc.textFile("Lvyou_comment_passive.txt")

// Create a HashingTF instance to map the review text to a vector of 10,000 features
val tf = new HashingTF(numFeatures = 10000)

// Split each review into words; each word is mapped to one feature
val negativeFeatures = negative.map(comment => tf.transform(comment.split(" ")))
val normalFeatures = normal.map(comment => tf.transform(comment.split(" ")))

// Create LabeledPoint datasets for the two kinds of reviews.
// Note: the class being detected (label 1, "positive" in the classifier's sense) is the negative reviews.
val positiveExamples = negativeFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Logistic regression is an iterative algorithm, so cache the training data

// Run logistic regression using the SGD algorithm
// (LogisticRegressionWithSGD is deprecated since Spark 2.0)
val model = new LogisticRegressionWithSGD().run(trainingData)

// Test on one example of each kind
val posTest = tf.transform("0 M G GET cheap stuff ...".split(" "))
val negTest = tf.transform("".split(" "))

model.predict(posTest)

3, SVM

package classification

import com.huaban.analysis.jieba.JiebaSegmenter
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.collection.JavaConversions._

object SVMWithSGDForComment {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SVMWithSGDExample")
    val sc = new SparkContext(conf)

    if (args.length < 4) {
      println("Please input 4 args: datafile numIterations train_percent(0.6)!")
      System.exit(1)
    }
    val dataFile = args.head.toString
    val numIterations = Integer.parseInt(args(1))
    val train_percent = args(2).toDouble
    val test_percent = 1.0 - train_percent
    val model_file = args(3)

    // Data preprocessing
    // Load the data into Spark as an RDD
    val originData = sc.textFile(dataFile)
    // Deduplicate the data with distinct
    val originDistinctData = originData.distinct()
    // Split each line into tab-separated fields and keep only lines with more than 2 fields
    val rateDocument = originDistinctData.map(line => line.split('\t')).filter(line => line.length > 2)

    // 5-star reviews are clearly positive. Because people score differently,
    // 4- and 3-star reviews cannot be reliably called good or bad and are not used.
    // Reviews below 3 stars are treated as negative.
    val fiveRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("5"))
    System.out.println("************************5 score num:" + fiveRateDocument.count())
    val fourRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("4"))
    val threeRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("3"))
    val twoRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("2"))
    val oneRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("1"))

    // Combine the negative samples: 1 and 2 stars
    val negRateDocument = oneRateDocument.union(twoRateDocument)
    negRateDocument.repartition(1)

    // Build the training data: take as many positive samples as there are negative ones
    val posRateDocument = sc.parallelize(fiveRateDocument.take(negRateDocument.count().toInt)).repartition(1)
    val allRateDocument = negRateDocument.union(posRateDocument)
    allRateDocument.repartition(1)
    val rate = allRateDocument.map(s => reduceRate(s(0)))
    val document = allRateDocument.map(s => s(1))

    // Text vectorization and feature extraction: segment each comment into words
    val words = document.map(sentence => cut_for_calc(sentence)).map(line => line.split("/").toSeq)
    // Joins the words of each comment into one string; the result is not used afterwards
    words.foreach(seq => {
      val arr = seq.toList
      val line = new StringBuilder
      arr.foreach(item => {
        line ++= (item + ' ')
      })
    })

    // Compute the term frequency matrix
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(words)
    tf.cache()

    // Compute the TF-IDF matrix
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)

    // Generate the training and test sets
    val zipped = rate.zip(tfidf)
    val data = zipped.map(tuple => LabeledPoint(tuple._1, tuple._2))
    val splits = data.randomSplit(Array(train_percent, test_percent), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    val model = SVMWithSGD.train(training, numIterations)
    // Clear the threshold so that predict returns raw scores instead of 0/1 labels
    model.clearThreshold()

    // Compute raw scores on the test set.
    val topicsArray = new mutable.MutableList[String]
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    scoreAndLabels.coalesce(1).saveAsTextFile("file:///data/1/usr/local/services/spark/helh/comment_test_predic/")

    // Get evaluation metrics.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }

  // Map a star rating to a binary label: 5 stars -> 1, anything else -> 0
  def reduceRate(rate_str: String): Int = {
    if (rate_str.toInt > 4) return 1
    else return 0
  }

  // Segment a sentence with jieba and join the resulting words with "/"
  def cut_for_calc(str: String): String = {
    val jieba = new JiebaSegmenter()
    val lword_info = jieba.process(str, SegMode.SEARCH)
    lword_info.map(item => item.word).mkString("/")
  }
}
// scalastyle:on println

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
