To get the top N of each group, first group the data, then sort each group and take its top N.
1. The source data is as follows; the task is to take the top three scores for each class.
class1 98
class2 90
class2 92
class1 96
class1 100
class2 89
class2 68
class1 81
class2 90
2. The implementation process
package basic

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by TG on 10/25/16.
 */
object GroupTopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupTopN").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the data from HDFS
    val lines = sc.textFile("hdfs:
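The excerpt above stops at reading the input, so here is a minimal sketch of how the grouped top-N can be completed, assuming each input line is a "class score" pair as in the test data above; the HDFS path, object name, and sorting details are illustrative, not the original author's code.

import org.apache.spark.{SparkConf, SparkContext}

object GroupTopNSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupTopNSketch").setMaster("local"))

    // Placeholder input path (the original article's HDFS path is not shown here)
    val lines = sc.textFile("hdfs://localhost:9000/input/scores.txt")

    // Each line looks like "class1 98": split it into a (class, score) pair
    val pairs = lines.map { line =>
      val fields = line.split(" ")
      (fields(0), fields(1).toInt)
    }

    // Group by class, sort each group's scores in descending order, keep the top 3
    val top3 = pairs.groupByKey().map { case (clazz, scores) =>
      (clazz, scores.toList.sortWith(_ > _).take(3))
    }

    top3.collect().foreach { case (clazz, scores) =>
      println(clazz + ": " + scores.mkString(", "))
    }

    sc.stop()
  }
}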
Spark
As an alternative to MapReduce, Spark is a data processing engine. It claims to be up to 100 times faster than MapReduce when working in memory, and up to 10 times faster when working on disk. It can be used together with Hadoop and Apache Mesos, or run standalone. Supported operating systems: Windows, Linux, and OS X.
"Note" This series of articles and the use of the installation package/test data can be in the "big gift--spark Getting Started Combat series" Get 1, compile sparkSpark can be compiled in SBT and maven two ways, and then the deployment package is generated through the make-distribution.sh script. SBT compilation requires the installation of Git tools, and MAVEN installation requires MAVEN tools, both of which need to be carried out under the network,
"Note" This series of articles and the use of the installation package/test data can be in the "big gift--spark Getting Started Combat series" Get 1, compile sparkSpark can be compiled in SBT and maven two ways, and then the deployment package is generated through the make-distribution.sh script. SBT compilation requires the installation of Git tools, and MAVEN installation requires MAVEN tools, both of which need to be carried out under the network,
the corresponding task. In addition, SparkClient obtains the job's running state through the ApplicationMaster.
Appendix: basic components in the Spark architecture.
ClusterManager: in standalone mode this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.
Worker: the slave node, responsible for controlling the compute node and starting the Executor or Driver; in YARN mode, the NodeManager takes on this compute-node control.
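To make the ClusterManager distinction concrete, an application selects its cluster manager through the master URL. A minimal sketch (the standalone host and port are placeholders; older Spark versions used "yarn-client"/"yarn-cluster" instead of "yarn"):

import org.apache.spark.SparkConf

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // Standalone mode: the Master node is the cluster manager
    val standaloneConf = new SparkConf()
      .setAppName("MasterUrlSketch")
      .setMaster("spark://master-host:7077")

    // YARN mode: the ResourceManager plays that role
    val yarnConf = new SparkConf()
      .setAppName("MasterUrlSketch")
      .setMaster("yarn")

    println(standaloneConf.get("spark.master"))
    println(yarnConf.get("spark.master"))
  }
}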
"Note" This series of articles, as well as the use of the installation package/test data can be in the "big gift –spark Getting Started Combat series" get1 Spark Streaming Introduction1.1 OverviewSpark Streaming is an extension of the Spark core API that enables the processing of high-throughput, fault-tolerant real-time streaming data. Support for obtaining data
GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a wide variety of graph operators, as well as a library of common graph algorithms.
Cluster managers. At the lowest layer, Spark can scale efficiently from one compute node to hundreds of nodes. To achieve this goal while maximizing flexibility,
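Returning to the GraphX description above: a minimal sketch of building a small property graph and applying a couple of the graph operators it provides (the vertex and edge data are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXSketch").setMaster("local"))

    // Vertices carry a name property; edges carry a relationship label
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)

    // Two of the built-in graph operators
    println(graph.numEdges)
    graph.inDegrees.collect().foreach(println)

    sc.stop()
  }
}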
Stage: each job is split into many tasks; each group of tasks is called a Stage (also known as a TaskSet), so a job is divided into several stages;
Task: a unit of work that is sent to an Executor (see the sketch after this list).
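As a rough illustration of jobs, stages, and tasks (a minimal word-count sketch; the input path is hypothetical): the shuffle required by reduceByKey splits the job triggered by collect() into two stages, and each stage runs one task per partition.

import org.apache.spark.{SparkConf, SparkContext}

object StageTaskSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageTaskSketch").setMaster("local"))

    // Hypothetical input file
    val words = sc.textFile("data/words.txt").flatMap(_.split(" "))

    // reduceByKey introduces a shuffle, so the job triggered by collect()
    // is split into two stages: the map side before the shuffle and the
    // reduce side after it; each stage runs one task per partition.
    val counts = words.map((_, 1)).reduceByKey(_ + _).collect()

    counts.foreach(println)
    sc.stop()
  }
}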
1.2 The basic running process of Spark
The basic running process of Spark is shown in the schematic below.
1. Build the Spark application's running environment.
3. In-depth RDD
The RDD itself is an abstract class with many concrete subclass implementations:
The RDD is computed on a per-partition basis:
The default partitioner is as follows:
The documentation for HashPartitioner is described below:
Another common partitioner is RangePartitioner:
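The partitioner code itself is not reproduced in this excerpt, so here is a minimal sketch of how the two partitioners are typically applied to a key-value RDD (the data and partition counts are illustrative):

import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}

object PartitionerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionerSketch").setMaster("local"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)))

    // HashPartitioner maps a key to a partition by the key's hash code modulo the partition count
    val hashed = pairs.partitionBy(new HashPartitioner(2))

    // RangePartitioner samples the keys and splits them into roughly equal, ordered ranges
    val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))

    println(hashed.partitioner)
    println(ranged.partitioner)

    sc.stop()
  }
}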
When persisting an RDD, the storage (memory) policy needs to be considered:
Spark offers many StorageLevel options.
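A minimal sketch of persisting an RDD with an explicit StorageLevel (the choice of MEMORY_AND_DISK here is just one example of the available levels):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistSketch").setMaster("local"))

    val doubled = sc.parallelize(1 to 1000000).map(_ * 2)

    // Keep the data in memory and spill to disk when it does not fit
    doubled.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes and caches; later actions reuse the cached partitions
    println(doubled.count())
    println(doubled.sum())

    sc.stop()
  }
}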
1. Introduction
The spark-submit script in Spark's bin directory is used to launch applications on a cluster. Through a unified interface it can use all of Spark's supported cluster managers, so you do not have to configure your application specially for each one.
Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+

(2) Calculate the total number of commits

scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show
+-----------------+
|TotalCommitNumber|
+-----------------+
|            13507|
+-----------------+

(3) Sort authors by number of commits in descending order

scala> sqlContext.sql("SELECT author, count(*) as CountNumber FROM commitlog
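For context, a minimal sketch of how such queries can be run with the Spark 1.x SQLContext API; how the commit log was actually loaded and registered as the commitlog table is not shown in this excerpt, so the DataFrame below uses made-up rows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CommitLogSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CommitLogSqlSketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up stand-in for the real commit log data
    val commits = sc.parallelize(Seq(
      ("alice", "2015-11-06", "HOTFIX ..."),
      ("bob", "2015-11-05", "another commit")
    )).toDF("author", "date", "message")

    // Register the DataFrame so it can be queried with SQL, as in the excerpt above
    commits.registerTempTable("commitlog")

    sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show()
    sqlContext.sql("SELECT author, count(*) as CountNumber FROM commitlog " +
      "GROUP BY author ORDER BY CountNumber DESC").show()

    sc.stop()
  }
}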
Christopher remarked that it is the strength of the Spark community that allowed Adatao to achieve its current accomplishments in such a short time, and he promised to contribute the code back to the community in the future.
Databricks co-founder Patrick Wendell: Understanding the performance of Spark applications
For Spark programmers, this talk is a must-see. Patrick fr
1. First, download the image locally from https://hub.docker.com/r/gettyimages/spark/
~$ docker pull gettyimages/spark
2. Download the docker-compose.yml file describing the Spark cluster from https://github.com/gettyimages/docker-spark/blob/master/docker-compose.yml
Start it:
$ docker-compose up
Creating spark_master_1
Creating spark_worker_1
Attaching to Sp
Step 1: Test Spark through the Spark shell
Step 1: Start the Spark cluster. This is described in detail in the third part. After the Spark cluster is started, the WebUI looks as follows:
Step 2: Start the Spark shell:
At this point, you can see the shell session in the Web console:
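Once the shell is up, a quick smoke test can be run directly at the prompt (a minimal sketch; README.md is simply a convenient file in the Spark home directory, and the counts will depend on your version):

scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
scala> textFile.filter(_.contains("Spark")).count()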