1. TF-IDF
The main idea of IDF is that the fewer documents contain a term t (that is, the smaller n is), the larger its IDF, and the better the term distinguishes between classes. Suppose the number of documents in class C that contain term t is m, and the number of documents in all other classes that contain t is k; then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF formula yields a small value, suggesting that t is a weak discriminator. In fact, however, if a term appears frequently in the documents of one class, it is a good representative of that class's texts; it should be given a higher weight and selected as a feature word to distinguish that class from other documents. This is a deficiency of IDF. Within a given document, the term frequency (TF) is how often a given term appears in that document. This count is usually normalized by the total number of terms in the document to prevent a bias toward long documents (the same word may have a higher raw count in a long document than in a short one, regardless of how important it actually is).
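To make this concrete, here is a minimal sketch of the commonly used smoothed TF and IDF formulas; the counts below are hypothetical and only illustrate how a large document frequency drives the IDF, and hence the TF-IDF weight, down:

import scala.math.log

// Hypothetical counts: the corpus has n documents, df of them contain term t,
// and t occurs 3 times in a document of 50 words.
val n = 1000
val df = 800
val tf = 3.0 / 50                        // term frequency, normalized by document length
val idf = log((n + 1.0) / (df + 1.0))    // smoothed inverse document frequency
val tfidf = tf * idf                     // small here, because t appears in most documents
println(s"idf = $idf, tf-idf = $tfidf")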
Term frequency-inverse document frequency (TF-IDF) is widely used in text mining to generate feature vectors that reflect how important a word is to a document in a corpus. Let t denote a term, d a document, and D the document corpus. The term frequency TF(t, d) is the number of times term t appears in document d, and the document frequency DF(t, D) is the number of documents that contain t. If we measured importance by term frequency alone, it would be easy to overemphasize words that appear very frequently but carry little information, such as "a", "the", and "of": if a word appears very frequently across the whole corpus, it carries no special information about any particular document. The inverse document frequency IDF(t, D) is a numerical measure of how much information a word carries, and the product TF(t, d) × IDF(t, D) indicates how relevant a word is to a particular document. After building the term frequency vectors, you can use IDF to compute the inverse document frequencies and multiply them by the term frequencies to obtain the TF-IDF values. An example:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)  // alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").take(3).foreach(println)
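As the comment above notes, CountVectorizer can be used instead of HashingTF to build an explicit vocabulary and produce term frequency vectors. A minimal sketch on the same wordsData (the vocabSize and minDF settings are illustrative):

import org.apache.spark.ml.feature.CountVectorizer

// Fit a vocabulary on the tokenized data and produce term-frequency vectors
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setVocabSize(10000)   // illustrative vocabulary size
  .setMinDF(1)           // keep terms appearing in at least one document
  .fit(wordsData)
val countFeaturizedData = cvModel.transform(wordsData)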
2. Logistic regression
An example of sentiment classification:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

val negative = sc.textFile("Lvyou_comment_negitive.txt")
val normal = sc.textFile("Lvyou_comment_passive.txt")

// Create a HashingTF instance to map the review text to vectors of 10,000 features
val tf = new HashingTF(numFeatures = 10000)

// Split each review into words and map each word to a feature
val negativeFeatures = negative.map(comment => tf.transform(comment.split(" ")))
val normalFeatures = normal.map(comment => tf.transform(comment.split(" ")))

// Create LabeledPoint datasets for the negative (label 1) and normal (label 0) reviews
val positiveExamples = negativeFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // logistic regression is an iterative algorithm, so cache the training data

// Run logistic regression using the SGD algorithm
val model = new LogisticRegressionWithSGD().run(trainingData)

// Test with a positive and a negative example
val posTest = tf.transform("O M G GET cheap stuff ...".split(" "))
val negTest = tf.transform("".split(" "))
model.predict(posTest)
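Note that newer Spark releases recommend LogisticRegressionWithLBFGS (or the DataFrame-based API) over LogisticRegressionWithSGD. A minimal sketch of the same training step with that class, assuming the trainingData RDD built above:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// L-BFGS usually converges in fewer iterations than plain SGD
val lbfgsModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)
lbfgsModel.predict(posTest)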
3. SVM

package classification

import com.huaban.analysis.jieba.JiebaSegmenter
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.collection.JavaConversions._

object SVMWithSGDForComment {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SVMWithSGDExample")
    val sc = new SparkContext(conf)
    if (args.length < 4) {
      println("Please input 4 args: datafile numIterations train_percent (0.6) model_file!")
      System.exit(1)
    }
    val dataFile = args.head.toString
    val numIterations = Integer.parseInt(args(1))
    val train_percent = args(2).toDouble
    val test_percent = 1.0 - train_percent
    val model_file = args(3)

    // Data preprocessing: load the data into Spark as an RDD
    val originData = sc.textFile(dataFile)
    // deduplicate with distinct
    val originDistinctData = originData.distinct()
    // split each line into fields and keep only rows with more than 2 fields
    val rateDocument = originDistinctData.map(line => line.split('\t')).filter(line => line.length > 2)

    // Five stars is undoubtedly a positive review. Because people score differently, it is
    // unclear whether 4- and 3-star reviews are positive or negative, so they are not used;
    // reviews below 3 stars are treated as negative.
    val fiveRateDocument = rateDocument.filter(arrLine => arrLine(0).equalsIgnoreCase("5"))
    System.out.println("************************5 score num:" + fiveRateDocument.count())
    val fourRateDocument = rateDocument.filter(arrLine => arrLine(0).equalsIgnoreCase("4"))
    val threeRateDocument = rateDocument.filter(arrLine => arrLine(0).equalsIgnoreCase("3"))
    val twoRateDocument = rateDocument.filter(arrLine => arrLine(0).equalsIgnoreCase("2"))
    val oneRateDocument = rateDocument.filter(arrLine => arrLine(0).equalsIgnoreCase("1"))

    // Combine the 1- and 2-star reviews as negative samples
    val negRateDocument = oneRateDocument.union(twoRateDocument)
    negRateDocument.repartition(1)

    // Build the positive samples, taking as many 5-star reviews as there are negative samples
    val posRateDocument = sc.parallelize(fiveRateDocument.take(negRateDocument.count().toInt)).repartition(1)
    val allRateDocument = negRateDocument.union(posRateDocument)
    allRateDocument.repartition(1)

    val rate = allRateDocument.map(s => reduceRate(s(0)))
    val document = allRateDocument.map(s => s(1))

    // Text vectorization and feature extraction: segment each comment into words
    val words = document.map(sentence => cut_for_calc(sentence)).map(line => line.split("/").toSeq)
    words.foreach(seq => {
      val arr = seq.toList
      val line = new StringBuilder
      arr.foreach(item => {
        line ++= (item + ' ')
      })
    })

    // Build the term-frequency matrix
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(words)
    tf.cache()

    // Compute the TF-IDF matrix
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)

    // Generate the training and test sets
    val zipped = rate.zip(tfidf)
    val data = zipped.map(tuple => LabeledPoint(tuple._1, tuple._2))
    val splits = data.randomSplit(Array(train_percent, test_percent), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    val model = SVMWithSGD.train(training, numIterations)
    model.clearThreshold()

    // Compute raw scores on the test set
    val topicsArray = new mutable.MutableList[String]
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    scoreAndLabels.coalesce(1).saveAsTextFile("file:///data/1/usr/local/services/spark/helh/comment_test_predic/")

    // Get evaluation metrics
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }

  def reduceRate(rate_str: String): Int = {
    if (rate_str.toInt > 4) return 1
    else return 0
  }

  def cut_for_calc(str: String): String = {
    val jieba = new JiebaSegmenter()
    val lword_info = jieba.process(str, SegMode.SEARCH)
    lword_info.map(item => item.word).mkString("/")
  }
}
// scalastyle:on println
class SVMWithSGDForComment
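The model_file argument is parsed above but never used. A minimal sketch of how it could persist and reload the trained SVM (this would go inside main; the use of that path is an assumption, not part of the original program):

import org.apache.spark.mllib.classification.SVMModel

// Save the trained model to the path passed as the fourth argument ...
model.save(sc, model_file)
// ... and load it back in a later job
val sameModel = SVMModel.load(sc, model_file)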