Spark Big Data Chinese Word Segmentation Statistics (3): Implementing Word Segmentation Statistics in Scala


The Java version of the Spark big data Chinese word segmentation statistics program was finished earlier; after a week of effort, the Scala version is now done as well. I am sharing it here with friends who want to learn Spark.

Below is a screenshot of the program's final run; it differs little from the Java version:

Below is the Scala project structure:

Right-click the project's main class file WordCounter.scala and select Run As > Scala Application:

Then select the Tang and Song ci anthology for word segmentation statistics, and the segmentation results will appear before you.

The project code has been uploaded to CSDN: http://download.csdn.net/detail/yangdanbo1975/9608632.

The overall project structure is very simple: the text package, as in the Java project, contains the built-in text files. The whole project references a class library similar to the Java project's, with just a little more Scala content.

Note that the class libraries Scala references by default differ between Scala versions. For example, when you select Scala 2.10.6 in Eclipse, the Swing class library is referenced automatically, as shown in the following illustration:

However, if you choose a different Scala version, such as the recently downloaded and installed 2.11.8, the Swing class library has to be added manually:

You can switch the Scala library version by editing the library under the project's Properties > Java Build Path > Scala Library Container:

The entire project consists of four Scala classes, GuiUtils.scala, SparkWordCount.scala, TextPane.scala, and WordCounter.scala, plus one Java class, JavaUtil.java.
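From that class list and the file paths that appear later in the code, the layout is roughly as follows (the exact folder arrangement is an inference, not taken from the screenshot):

    src/
      com/magicstudio/spark/
        GuiUtils.scala
        JavaUtil.java
        SparkWordCount.scala
        TextPane.scala
        WordCounter.scala
        text/
          Tang poetry 300.txt
          ... (other built-in text files)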

GuiUtils.scala is copied entirely from code on the web; it implements functionality similar to the OptionPane message dialog in Java Swing.

TextPane.scala is copied from the ScalaSwing2 project on GitHub, which ports JTextPane to Scala. As of version 2.11.8 the standard Scala Swing library still implements only TextArea, not TextPane, and our project displays segmentation results in the style of the Java version's JTextPane, so we copied this Scala version.
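The core idea of the port is small enough to sketch. The following is a minimal illustration of the wrapping pattern, assuming the scala.swing conventions of that era; the real TextPane.scala in the project carries more of the API:

    import javax.swing.JTextPane
    import scala.swing.TextComponent

    // Minimal sketch: expose the Java Swing JTextPane as a scala.swing component
    // by overriding the peer. The actual ported TextPane.scala is more complete.
    class TextPane extends TextComponent {
      override lazy val peer: JTextPane = new JTextPane()
    }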

The SparkWordCount.scala class implements the core Spark Chinese word segmentation statistics functionality; it is rewritten from teacher Wang Jialin's SparkWordCount code from DT Big Data Dream Factory.

First, the main functional steps were moved from the companion object's main method into the SparkWordCount class and split into multiple methods, so that both the companion object's main method and the later GUI interface can call them:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._

class SparkWordCount {
  var sc: SparkContext = null

  def initSpark(appName: String) {
    /**
     * Step 1: Create the Spark configuration object SparkConf and set the Spark program's
     * runtime configuration. For example, setMaster sets the URL of the master of the Spark
     * cluster the program will connect to; setting it to "local" means the Spark program runs
     * locally, which particularly suits beginners with very limited machines (e.g. only 1 GB
     * of memory).
     */
    val conf = new SparkConf() // create the SparkConf object
    conf.setAppName(appName)   // set the application name, visible in the monitoring UI while the program runs
    conf.setMaster("local")    // run locally; no Spark cluster installation is needed

    /**
     * Step 2: Create the SparkContext object.
     * SparkContext is the sole entry point to all Spark functionality; whether in Scala,
     * Java, Python, or R, there must be a SparkContext.
     * SparkContext's core role is to initialize the core components a Spark application
     * needs to run, including DAGScheduler, TaskScheduler, and SchedulerBackend; it is also
     * responsible for registering the Spark program with the master, etc.
     * SparkContext is one of the most important objects in the entire Spark application.
     */
    sc = new SparkContext(conf) // create the SparkContext; the SparkConf instance passed in customizes the parameters and configuration of the Spark run
  }

  def wordCount(doc: String, wordLength: Int): RDD[(String, Int)] = {
    /**
     * Step 3: Create an RDD through the SparkContext from a concrete data source (HDFS,
     * HBase, local FS, DB, S3, etc.).
     * There are basically three ways to create an RDD: from an external data source (such
     * as HDFS), from a Scala collection, or by operating on other RDDs.
     * The RDD's data is divided into a series of partitions; the data assigned to each
     * partition is the processing scope of one task.
     */
    //val lines = sc.textFile("e://text//Tang poetry 300.txt", 1) // read a local file as one partition
    //val lines = sc.textFile("src/com/magicstudio/spark/text/Tang poetry 300.txt", 1)
    val lines = sc.textFile(doc, 1)

    /**
     * Step 4: Apply transformation-level processing to the initial RDD, programming with
     * higher-order functions such as map and filter to perform the actual computation.
     * Step 4.1: Split each line's string into individual words.
     */
    //val words = lines.flatMap { line => line.split(" ") } // split each line into words; flatMap merges all lines' results into one large word collection
    val words = lines.flatMap { line => JavaUtil.getSplitWords(line, wordLength).asScala }

    /**
     * Step 4.2: On the basis of the word split, count each word instance as 1,
     * i.e. word => (word, 1).
     */
    val pairs = words.map { word => (word, 1) }

    /**
     * Step 4.3: On the basis of each instance counting as 1, count the total number of
     * occurrences of each word in the file.
     */
    val wordCounts = pairs.reduceByKey(_ + _) // for identical keys, accumulate the values (reduced both locally and at the reducer level)

    // Added by Dumbbell Yang at 2016-07-24
    wordCounts.sortBy(x => x._2, false, wordCounts.partitions.size)
  }

  def outputResult(wordCounts: RDD[(String, Int)]) {
    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))
  }

  def closeSpark() {
    sc.stop()
  }
}
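To make the shape of steps 3 and 4 concrete, here is a toy run on an in-memory collection instead of a file. This is an illustration only: it assumes a live SparkContext sc, and the two-character sliding window merely stands in for the real JavaUtil segmentation.

    // Toy pipeline: naive overlapping two-character windows instead of real Chinese segmentation.
    val lines = sc.parallelize(Seq("春眠不觉晓", "处处闻啼鸟"))
    val words = lines.flatMap(line => line.sliding(2).toSeq)          // step 4.1 stand-in
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)      // steps 4.2 and 4.3
    counts.sortBy(_._2, ascending = false).collect().foreach(println) // frequency-sorted output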

Second, in the wordCount method, the original step 3, which read a fixed file, was changed to take a parameter, which can be either a file path relative to the src directory (chosen from a dropdown box in the GUI) or an absolute path on the local disk (selected through the file navigation dialog):

    //val lines = sc.textFile("e://text//Tang poetry 300.txt", 1) // read a local file as one partition
    //val lines = sc.textFile("src/com/magicstudio/spark/text/Tang poetry 300.txt", 1)
    val lines = sc.textFile(doc, 1)

Next, in step 4.1, the original simple split was replaced by a call to the Java method in the JavaUtil class, which performs Chinese word segmentation on each line of text:

    //val words = lines.flatMap { line => line.split(" ") } // split each line into words; flatMap merges all lines' results into one large word collection
    val words = lines.flatMap { line => JavaUtil.getSplitWords(line, wordLength).asScala }

It is important to note that because Java functionality is invoked, data has to be passed between Scala and Java, so you must import the library for data type conversion:

    import scala.collection.JavaConverters._

The result returned by JavaUtil's getSplitWords method can then be converted with asScala to meet the requirements of the Scala call site.
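What the conversion does, under the assumption (implied by the call above) that getSplitWords returns a java.util.List<String>:

    import scala.collection.JavaConverters._

    // Assumed shape of the boundary: a java.util.List[String] coming back from Java.
    val fromJava: java.util.List[String] = java.util.Arrays.asList("月落", "乌啼", "霜满天")
    // asScala wraps (does not copy) the Java list as a mutable Buffer[String],
    // which is a Scala Seq and therefore usable directly inside flatMap.
    val words: Seq[String] = fromJava.asScala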

The last change is the newly added ability to sort the statistical results by word frequency:

    // Added by Dumbbell Yang at 2016-07-24
    wordCounts.sortBy(x => x._2, false, wordCounts.partitions.size)

Compare the Java approach to sorting, which tediously swaps keys and values, sorts, and then swaps back; the Scala language is indeed very convenient.
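For comparison, here is a sketch of that swap-sort-swap route, which a key-ordered API forces, next to the one-line sortBy used above:

    // The roundabout route: invert to (count, word), sort by key, invert back.
    val byFrequency = wordCounts
      .map { case (word, count) => (count, word) }
      .sortByKey(ascending = false)
      .map { case (count, word) => (word, count) }

    // The direct route: sort by the second tuple element in place.
    val sorted = wordCounts.sortBy(_._2, ascending = false)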

After the above changes, the Spark Chinese word segmentation statistics functionality can be called from the companion object's main method, as it was originally:

    /**
     * Using Scala to develop a Spark word count program for local testing.
     * @author DT Big Data Dream Factory
     * Sina Weibo: http://weibo.com/ilovepains/
     */
    object SparkWordCount {

      def main(args: Array[String]) {
        val counter = new SparkWordCount

        counter.initSpark("Spark Chinese word segmentation statistics")

        val words = counter.wordCount("src/com/magicstudio/spark/text/Tang poetry 300.txt", 2)

        counter.outputResult(words)

        counter.closeSpark()
      }
    }

It can also be invoked from the GUI program, WordCounter.scala.

The WordCounter.scala class mainly implements the GUI of the Spark Chinese word segmentation statistics program. The code is not complex either; the following points deserve attention:

First, the companion object declaration. In the latest Scala library it is based on SimpleSwingApplication:

    object WordCounter extends SimpleSwingApplication {

But in early Scala libraries this class was named SimpleGUIApplication, so a lot of online code that has not been updated in time needs its class name changed before it will compile and run against the new Scala library.
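For orientation, a minimal skeleton of the modern declaration looks like this. It is illustrative only; WordCounter's real top frame holds the document dropdown, the top-N radio buttons, and the result pane rather than this stub:

    import scala.swing._

    object WordCounter extends SimpleSwingApplication {
      // SimpleSwingApplication requires a `top` frame; this stub stands in for the real GUI.
      def top: Frame = new MainFrame {
        title = "Spark Chinese Word Segmentation Statistics"
        contents = new Label("GUI components go here")
      }
    }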

Second, the return value of a Scala function. It is commonly said that the value of a function's last statement is its return value, but that is imprecise. Testing shows it is more accurately the value of the last statement executed: different conditions run different logic, so the last executed statement is not necessarily the textually last line of the function, as it is in many examples. For instance:

    def getDocPath(): String = {
      if (docField.text.isEmpty()) {
        "src/com/magicstudio/spark/text/" + cboDoc.selection.item + ".txt"
      }
      else {
        docField.text
      }
    }

Another example:

    def getTopN(): Int = {
      if (top50.selected) {
        50
      }
      else if (top100.selected) {
        100
      }
      else if (top500.selected) {
        500
      }
      else if (top1000.selected) {
        1000
      }
      else if (topAll.selected) {
        0
      }
      else {
        0
      }
    }

Moreover, the return value does not require a return keyword; the bare expression suffices, which fully reflects the conciseness of the Scala language.

Finally, the mutual invocation between Scala and Java is worth mentioning; it is of great significance for reusing the large body of existing application functionality developed in Java.

In a Scala project you can add Java classes, reference existing Java classes, implement much of the functionality in Java methods, and then invoke those methods from Scala classes.

For example, in this project the Chinese word segmentation functionality is implemented in a Java method: the JavaUtil class references the IKAnalyzer component, and the Scala classes call it. Likewise, other methods in JavaUtil, such as:

    public static void showRddWordCount(JavaRDD<Tuple2<String, Int>> wordCount,
        int countLimit, String curDoc, JTextPane resultPane, JCheckBox chkClear)

are also rewritten from source code in the original Java project and referenced from the Scala classes, completing the function of displaying the segmentation results in the GUI.

Of course, to be callable from Scala, some changes were made to the parameters. Where the original did not pass interface controls, the peer of each Scala interface component (the corresponding Java Swing component) is now passed in; and the original word count pair Tuple2<String, Integer> was changed to Tuple2<String, Int>, the Scala Int type replacing Java's Integer, because Scala's rdd.toJavaRDD() method generates an RDD of <String, Int>, and Java can use the Scala Int type directly (Tuple2 is itself a Scala type). In a nutshell, mutual invocation between Scala and Java is very powerful and convenient.
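Putting those pieces together, the call from Scala into the Java helper plausibly looks like the sketch below. The field names resultPane and chkClear are assumptions (the latter borrowed from the parameter name above), but .peer and toJavaRDD() are the real mechanisms involved:

    // Sketch of the Scala -> Java call. resultPane (the ported TextPane) and chkClear
    // (a scala.swing.CheckBox) expose their wrapped Swing components through .peer.
    JavaUtil.showRddWordCount(
      wordCounts.toJavaRDD(), // RDD[(String, Int)] becomes JavaRDD<Tuple2<String, Int>>
      getTopN(),              // result limit chosen in the GUI
      getDocPath(),           // current document path
      resultPane.peer,        // the underlying javax.swing.JTextPane
      chkClear.peer)          // the underlying javax.swing.JCheckBox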

The above is my small summary of implementing Spark Chinese word segmentation statistics in the Scala language. If I have time later, I will continue to try Spark Streaming, Spark SQL, and other related Spark technologies, striving to master Spark fully.






