Spark Big Data Chinese Word Segmentation Statistics (III): Implementing Segmentation Statistics in Scala


Having finished the Java version of the Spark big data Chinese word segmentation statistics program, and after another week of effort, I have now completed the Scala version as well. I share it here with friends who want to learn Spark.

Below is the final interface of the program, which differs little from the Java version:

Below is the Scala project structure:

Right-click the project's main class file WordCounter.scala and select Run As > Scala Application:

Then choose a Tang poems or Song lyrics text for word segmentation statistics, and the familiar segmentation results appear.

The project code has been uploaded to CSDN: http://download.csdn.net/detail/yangdanbo1975/9608632.

The overall project structure is simple. As in the Java project, the text package contains the built-in text files, and the libraries the project references are similar to the Java project's, with some additional Scala content.

Note that the class libraries Scala references depend on the Scala version. For example, when you choose Scala 2.10.6 in Eclipse, the Swing class library is referenced automatically, as shown below:

With a different Scala version, however, such as 2.11.8 from the latest download, the Swing class library has to be added manually:

You can switch the Scala library version via Edit Library on the Scala Library Container under Project Properties > Java Build Path:

The project consists of four Scala files, GuiUtils.scala, SparkWordCount.scala, TextPane.scala and WordCounter.scala, plus one Java class, JavaUtil.java.

GuiUtils.scala is copied wholesale from code found online; it implements message boxes similar to JOptionPane in Java Swing.
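The copied code is not shown here, but a minimal sketch of the kind of helper such a file provides (an assumed shape, not the actual copied source) wraps scala.swing.Dialog much the way Java code uses JOptionPane:

import scala.swing.Dialog

// Assumed sketch of a GuiUtils-style helper: scala.swing.Dialog plays the
// role that JOptionPane plays in Java Swing.
object GuiUtils {
  def showErrorDialog(message: String): Unit =
    Dialog.showMessage(null, message, "Error", Dialog.Message.Error)
}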

TextPane.scala is copied from the scala-swing2 project on GitHub, which ports JTextPane to Scala. Up to version 2.11.8 the standard Scala library implements only TextArea, with no TextPane; since the Java version of our project displays the segmentation results in a JTextPane, we copied this Scala port.
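The heart of that port is small. The essential wrapper pattern (a simplified sketch, not the full scala-swing2 source) is to extend TextComponent and override its peer with the underlying Swing class:

import scala.swing.TextComponent

// Simplified sketch of the wrapper pattern used by the ported TextPane:
// the scala.swing component simply exposes a JTextPane as its peer.
class TextPane extends TextComponent {
  override lazy val peer: javax.swing.JTextPane = new javax.swing.JTextPane() with SuperMixin
}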

The SparkWordCount.scala class implements the core Spark Chinese word segmentation statistics functionality. It is rewritten on the basis of teacher Liaoliang's SparkWordCount code from DT Big Data Dream Factory.

First, the main processing steps were moved from the companion object's main method into the SparkWordCount class and split into several methods, so that both the companion object's main method and, later, the GUI can call them:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import collection.JavaConverters._

class SparkWordCount {
  var sc: SparkContext = null

  def initSpark(appName: String) {
    /**
     * Step 1: Create the Spark configuration object SparkConf and set the runtime
     * configuration of the Spark program. For example, setMaster sets the URL of
     * the master of the Spark cluster the program connects to; if set to "local",
     * the program runs locally, which is especially suitable for beginners on
     * machines with very poor configuration (e.g. only 1 GB of memory).
     */
    val conf = new SparkConf() // create the SparkConf object
    conf.setAppName(appName)   // set the application name, visible in the monitoring UI while the program runs
    conf.setMaster("local")    // run locally, no Spark cluster installation required

    /**
     * Step 2: Create the SparkContext object.
     * SparkContext is the sole entry point to all Spark functionality; whether in
     * Scala, Java, Python or R, there must be a SparkContext.
     * Its core role is to initialize the components the Spark application needs to
     * run, including DAGScheduler, TaskScheduler and SchedulerBackend, and it also
     * registers the program with the master. SparkContext is among the most
     * critical objects in the entire Spark application.
     */
    sc = new SparkContext(conf) // customize Spark's runtime parameters and configuration via the SparkConf instance
  }

  def wordCount(doc: String, wordLength: Int): RDD[(String, Int)] = {
    /**
     * Step 3: Create an RDD via SparkContext from a concrete data source
     * (HDFS, HBase, local FS, DB, S3, etc.).
     * There are three ways to create an RDD: from an external data source (such as
     * HDFS), from a Scala collection, or by transforming another RDD.
     * The RDD divides the data into a series of partitions; the data assigned to
     * each partition is handled by one task.
     */
    // val lines = sc.textFile("e://text//Tang 300", 1) // read a local file as one partition
    // val lines = sc.textFile("src/com/magicstudio/spark/text/300 tang.txt", 1)
    val lines = sc.textFile(doc, 1)

    /**
     * Step 4: Apply transformation-level processing to the initial RDD, using
     * higher-order functions such as map and filter to perform the actual computation.
     * Step 4.1: Split each line's string into individual words.
     */
    // val words = lines.flatMap { line => line.split(" ") } // split each line and flatten all lines' results into one big word collection
    val words = lines.flatMap { line => JavaUtil.getSplitWords(line, wordLength).asScala }

    /**
     * Step 4.2: On the basis of the word split, count each word instance as 1,
     * i.e. word => (word, 1).
     */
    val pairs = words.map { word => (word, 1) }

    /**
     * Step 4.3: On the basis of counting each instance as 1, total the occurrences
     * of each word in the file.
     */
    val wordCounts = pairs.reduceByKey(_ + _) // accumulate the values of identical keys (reduces both locally and at the reducer level)

    // Added by dumbbell Yang at 2016-07-24
    wordCounts.sortBy(x => x._2, false, wordCounts.partitions.size)
  }

  def outputResult(wordCounts: RDD[(String, Int)]) {
    wordCounts.foreach(wordNumberPair => println(wordNumberPair._1 + " : " + wordNumberPair._2))
  }

  def closeSpark() {
    sc.stop()
  }
}

Second, in the wordCount method, step 3 was changed from reading a fixed file to taking the file as a parameter, which can be either a path relative to the src directory (chosen from the drop-down on the GUI) or an absolute path on the local disk (chosen via the file browser):

// val lines = sc.textFile("e://text//Tang 300", 1) // read a local file as one partition
// val lines = sc.textFile("src/com/magicstudio/spark/text/300 tang.txt", 1)
val lines = sc.textFile(doc, 1)

Then, in step 4.1, the Chinese word segmentation function is implemented by calling the Java method in the JavaUtil class, replacing the original simple split, so that each line of text is segmented into Chinese words:

// val words = lines.flatMap { line => line.split(" ") } // split each line and flatten all lines' results into one big word collection
val words = lines.flatMap { line => JavaUtil.getSplitWords(line, wordLength).asScala }
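The article does not list JavaUtil's implementation. For illustration only, a minimal sketch of what an IKAnalyzer-based getSplitWords might look like (written here in Scala; the wordLength filter is an assumption about its semantics) is:

import java.io.StringReader
import java.util.{ArrayList, List => JList}
import org.wltea.analyzer.core.{IKSegmenter, Lexeme}

// Hypothetical reconstruction of JavaUtil.getSplitWords: segment one line of
// Chinese text with IKAnalyzer and keep words of the requested length.
def getSplitWords(line: String, wordLength: Int): JList[String] = {
  val words = new ArrayList[String]()
  val seg = new IKSegmenter(new StringReader(line), true) // true = smart segmentation mode
  var lex: Lexeme = seg.next()
  while (lex != null) {
    val text = lex.getLexemeText
    if (wordLength <= 0 || text.length == wordLength) words.add(text)
    lex = seg.next()
  }
  words
}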

Note that because data must pass between Scala and Java when invoking the Java functionality, you must import the collection-conversion library:

import collection.JavaConverters._

The result returned by JavaUtil's getSplitWords method can then be converted with asScala to meet the requirements of the Scala method call.
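For example, with the import in place the Java list converts directly (a small sketch; note that asScala wraps a java.util.List as a Scala mutable Buffer rather than copying it):

import collection.JavaConverters._

// Segment one line of Tang poetry into two-character words, then convert.
val javaWords: java.util.List[String] = JavaUtil.getSplitWords("白日依山尽", 2)
val scalaWords: scala.collection.mutable.Buffer[String] = javaWords.asScala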

One final change adds a feature that sorts the segmentation statistics by word frequency:

// Added by dumbbell Yang at 2016-07-24
wordCounts.sortBy(x => x._2, false, wordCounts.partitions.size)

Compare the Java way of achieving this sort, which tediously swaps key and value, sorts, and then swaps back; the Scala language really is much handier.
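A sketch of the contrast, with the Java-style route expressed in Scala for brevity (wordCounts is the RDD[(String, Int)] from above):

// Java-style: swap (word, count) to (count, word), sort by key, swap back.
val sortedTheHardWay = wordCounts
  .map(pair => (pair._2, pair._1))
  .sortByKey(false)
  .map(pair => (pair._2, pair._1))

// Scala: sort directly on the second tuple element, in one call.
val sortedTheEasyWay = wordCounts.sortBy(x => x._2, false, wordCounts.partitions.size)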

After the above changes, the Spark Chinese word segmentation statistics can be invoked from a main method, for example the original call in the companion object:

/**
 * A Spark WordCount program developed in Scala and tested locally
 * @author DT Big Data Dream Factory
 * Sina Weibo: http://weibo.com/ilovepains/
 */
object SparkWordCount {

  def main(args: Array[String]) {
    val counter = new SparkWordCount

    counter.initSpark("Spark Chinese word count")

    val words = counter.wordCount("src/com/magicstudio/spark/text/300.txt", 2)

    counter.outputResult(words)

    counter.closeSpark()
  }
}

It can also be called from the GUI program in WordCounter.scala.

The WordCounter.scala class implements the GUI of the Spark Chinese word segmentation statistics program. The code is not complex, but a few points deserve attention.

First, the declaration of the application object: in the newest Scala library it is based on SimpleSwingApplication:

object WordCounter extends SimpleSwingApplication {

In early Scala libraries, however, this class was named SimpleGUIApplication, so much of the code found online has not been updated; under the new library the class name must be changed before the code will compile and run.
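For reference, the minimal skeleton such an object needs (a generic sketch; the title and contents here are placeholders, not this project's actual layout) is:

import scala.swing._

// Minimal SimpleSwingApplication skeleton: `top` must return the main window.
object WordCounterSkeleton extends SimpleSwingApplication {
  def top = new MainFrame {
    title = "Spark Chinese Word Count"          // placeholder title
    contents = new Label("GUI components here") // placeholder contents
  }
}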

Second, regarding Scala function return values: the documentation simply says that a function returns the value of its last statement, but that is not quite precise. Program testing shows it is more accurate to say a function returns the value of the last statement executed; different conditions execute different logic, so the last executed statement is often not the last line of the function. For example:

def getDocPath(): String = {
  if (docField.text.isEmpty()) {
    "src/com/magicstudio/spark/text/" + cboDoc.selection.item + ".txt"
  }
  else {
    docField.text
  }
}

And another example:

def getTopN(): Int = {
  if (top50.selected) {
    50
  }
  else if (top100.selected) {
    100
  }
  else if (top500.selected) {
    500
  }
  else if (top1000.selected) {
    1000
  }
  else if (topAll.selected) {
    0
  }
  else {
    0
  }
}

Moreover, there is no need to write return; the bare expression suffices, which nicely embodies Scala's pursuit of concision.

Finally, it is worth mentioning that Scala and Java can invoke each other, which is far-reaching for reusing the vast amount of functionality already developed in Java.

In a Scala project you can add Java classes or reference existing ones, implement much of the functionality in Java methods, and then invoke them from Scala classes.

For example, in this project the Chinese word segmentation function is implemented in Java: the JavaUtil methods reference the IKAnalyzer component and are called from the Scala classes. Other methods in JavaUtil, such as:

public static void showRDDWordCount(JavaRDD<Tuple2<String, Int>> wordCount,
    int countLimit, String curDoc, JTextPane resultPane, JCheckBox chkClear)

was also rewritten from the original Java project's source and is referenced in the Scala class to complete the display of segmentation results on the GUI.

Of course, the parameters were changed so they can be passed from Scala: instead of the original interface controls, we now pass each Scala component's peer (the corresponding Java Swing component), and the segmentation tuple changed from Tuple2<String, Integer> to Tuple2<String, Int>, replacing Java's Integer with Scala's Int, because the RDD produced by Scala's rdd.toJavaRDD() method holds <String, Int>. Java can reference Scala's Int type without difficulty (Tuple2 is itself a Scala type). All in all, the mutual-invocation capability between Scala and Java is powerful and convenient.
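Putting it together, the call from the Scala GUI plausibly looks like the following sketch (resultPane and chkClear are assumed component names; the real ones live in WordCounter.scala):

// Hypothetical call site in the Scala GUI: pass the converted RDD plus the
// Swing peers of the Scala components to the Java helper.
JavaUtil.showRDDWordCount(
  wordCounts.toJavaRDD(), // RDD[(String, Int)] becomes JavaRDD[Tuple2[String, Int]]
  getTopN(),              // how many top words to show (0 = all)
  getDocPath(),           // the document being counted
  resultPane.peer,        // scala.swing.TextPane -> javax.swing.JTextPane
  chkClear.peer)          // scala.swing.CheckBox -> javax.swing.JCheckBox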

The above is a small summary of the Scala implementation of Spark Chinese word segmentation statistics. If time allows, I will go on to try Spark Streaming, Spark SQL and other related Spark technologies, toward a complete mastery of Spark.






