Come with Me into Data Mining: Getting Started with Spark

Source: Internet
Author: User
Tags: hadoop, mapreduce

About Spark

Spark is an open-source parallel computing framework from UC Berkeley's AMP Lab, in the same class as Hadoop MapReduce. It has the benefits of Hadoop MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, which eliminates the need to read and write HDFS between steps. Spark is therefore better suited to MapReduce-style algorithms that need to iterate, such as those used in data mining and machine learning.
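To make that concrete, here is a minimal Scala sketch (the file path and iteration count are illustrative assumptions, not from the original article): the input is cached once, so each later iteration reads it from memory instead of re-reading HDFS.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeDemo"))
    // Load once and cache: later iterations reuse the in-memory copy
    // instead of re-reading HDFS on every pass.
    val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble).cache()
    var weight = 0.0
    for (_ <- 1 to 10) { // each iteration scans the cached RDD
      weight += points.map(x => x * 0.01).sum()
    }
    println(s"final weight: $weight")
    sc.stop()
  }
}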

Spark Benefits

Spark is one of the most popular open-source parallel computing frameworks and a next-generation successor to Hadoop. Built around in-memory computing, it is especially strong at interactive queries, stream computing, graph computation, and more.
Spark has an unparalleled advantage in machine learning, especially for algorithms that require many iterations of computation, and its fault-tolerance and scheduling mechanisms keep the system running stably. Spark's current development philosophy is to combine SQL, machine learning, graph computation, stream computing, and many other functions in a single computing framework, which gives it very good ease of use. Spark has already built out an entire big data processing ecosystem, with its own technology for stream processing, graph processing, machine learning, NoSQL-style queries, and so on, and it is an Apache top-level project; explosive growth in both the community and commercial applications can be expected in the second half of 2014. Spark's biggest advantage is speed: it can be up to 100 times faster than Hadoop for iterative computation. Its other irreplaceable advantage is "one stack to rule them all": Spark uses a unified technology stack to solve the core problems of cloud-computing big data, which directly lays the groundwork for its unification of the big data domain.

[Figure: running time comparison using the logistic regression algorithm]

Spark currently supports Scala, Python, and Java programming.

As Spark's native language, Scala is the first choice for developing Spark applications; its elegant, concise code makes writing a Spark program feel like heaven compared with writing the equivalent MapReduce code.
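For a feel of that conciseness, here is the classic word count, which takes dozens of lines as a MapReduce job, written as a few Spark transformations. This is a sketch for the spark-shell, where sc already exists; the paths are placeholders, not from the original article.

// Word count: the whole MapReduce job in three transformations.
val counts = sc.textFile("hdfs:///input/text")    // placeholder input path
  .flatMap(_.split("\\s+"))                       // split lines into words
  .map(word => (word, 1))                         // emit (word, 1) pairs
  .reduceByKey(_ + _)                             // sum the counts per word
counts.saveAsTextFile("hdfs:///output/wordcount") // placeholder output path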

Spark can be architected on top of Hadoop, and can read data from HDFS and HBase.
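A sketch of both cases follows (to run in the spark-shell; the namenode address and table name are placeholders, and the HBase client jars must be on the classpath):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Read a plain file from HDFS.
val lines = sc.textFile("hdfs://namenode:9000/user/hadoop/data.txt")

// Read an HBase table through the standard TableInputFormat.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name
val rows = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println(rows.count())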

How Spark is deployed

1. Standalone mode, i.e. independent mode: Spark ships with a complete set of services and can be deployed on a cluster by itself, without relying on any other resource management system.

2. Spark on Mesos mode. This is the model used by many companies, and it is officially recommended (one reason, of course, being the close kinship between the Spark and Mesos projects).

3. Spark on YARN mode. This is one of the most promising deployment modes. The sketch after this list shows how each mode is selected when an application is submitted.
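In practice, the mode is chosen by the master URL passed to spark-submit or set on the SparkConf. A small sketch, with placeholder hostnames and ports:

import org.apache.spark.SparkConf

// The master URL selects the deployment mode:
val standalone = new SparkConf().setMaster("spark://master-host:7077") // standalone cluster
val onMesos    = new SparkConf().setMaster("mesos://mesos-host:5050")  // Spark on Mesos
val onYarn     = new SparkConf().setMaster("yarn-client")              // Spark on YARN (client mode)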

Spark Native Installation

Process: enter Linux -> install JDK -> install Scala -> install Spark.

Installation and configuration of the JDK (omitted here).

To install Scala, download it from http://www.scala-lang.org/download/.

Unpack the archive after downloading:

tar zxvf scala-2.11.6.tgz
# rename
mv scala-2.11.6 scala
# set up the configuration (e.g., in /etc/profile)
export SCALA_HOME=/home/hadoop/software/scala
export PATH=$SCALA_HOME/bin:$PATH

source /etc/profile

Run scala -version to verify; you should see output like:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

If so, Scala has been set up successfully.

Download Spark from http://spark.apache.org/downloads.html and install it.

Unpack the archive after downloading.

Go to $SPARK_HOME/bin and run:

./run-example SparkPi

The run produces output like the following:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/14 23:41:40 INFO SparkContext: Running Spark version 1.3.0
15/03/14 23:41:40 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.126.147 instead (on interface eth0)
15/03/14 23:41:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/03/14 23:41:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/14 23:41:41 INFO SecurityManager: Changing view acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: Changing modify acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/14 23:41:42 INFO Slf4jLogger: Slf4jLogger started
15/03/14 23:41:42 INFO Remoting: Starting remoting
15/03/14 23:41:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.126.147:60926]
15/03/14 23:41:42 INFO Utils: Successfully started service 'sparkDriver' on port 60926.
15/03/14 23:41:42 INFO SparkEnv: Registering MapOutputTracker
15/03/14 23:41:43 INFO SparkEnv: Registering BlockManagerMaster
15/03/14 23:41:43 INFO DiskBlockManager: Created local directory at /tmp/spark-285a6144-217c-442c-bfde-4b282378ac1e/blockmgr-f6cb0d15-d68d-4079-a0fe-9ec0bf8297a4
15/03/14 23:41:43 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/03/14 23:41:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-96b3f754-9cad-4ef8-9da7-2a2c5029c42a/httpd-b28f3f6d-73f7-46d7-9078-7ba7ea84ca5b
15/03/14 23:41:43 INFO HttpServer: Starting HTTP Server
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started SocketConnector@0.0.0.0:42548
15/03/14 23:41:43 INFO Utils: Successfully started service 'HTTP file server' on port 42548.
15/03/14 23:41:43 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/03/14 23:41:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/14 23:41:43 INFO SparkUI: Started SparkUI at http://192.168.126.147:4040
15/03/14 23:41:44 INFO SparkContext: Added JAR file:/home/hadoop/software/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:44 INFO Executor: Starting executor ID <driver> on host localhost
15/03/14 23:41:44 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.126.147:60926/user/HeartbeatReceiver
15/03/14 23:41:44 INFO NettyBlockTransferService: Server created on 39408
15/03/14 23:41:44 INFO BlockManagerMaster: Trying to register BlockManager
15/03/14 23:41:44 INFO BlockManagerMasterActor: Registering block manager localhost:39408 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 39408)
15/03/14 23:41:44 INFO BlockManagerMaster: Registered BlockManager
15/03/14 23:41:45 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/03/14 23:41:45 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 2 output partitions (allowLocal=false)
15/03/14 23:41:45 INFO DAGScheduler: Final stage: Stage 0 (reduce at SparkPi.scala:35)
15/03/14 23:41:45 INFO DAGScheduler: Parents of final stage: List()
15/03/14 23:41:45 INFO DAGScheduler: Missing parents: List()
15/03/14 23:41:45 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1848) called with curMem=0, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1848.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1296) called with curMem=1848, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1296.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39408 (size: 1296.0 B, free: 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/14 23:41:45 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/03/14 23:41:45 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/03/14 23:41:45 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/14 23:41:45 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/03/14 23:41:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/14 23:41:45 INFO Executor: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:45 INFO Utils: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar to /tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/fetchFileTemp5600234414438914634.tmp
15/03/14 23:41:46 INFO Executor: Adding file:/tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/spark-examples-1.3.0-hadoop2.4.0.jar to class loader
15/03/14 23:41:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 736 bytes result sent to driver
15/03/14 23:41:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 736 bytes result sent to driver
15/03/14 23:41:47 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1560 ms on localhost (1/2)
15/03/14 23:41:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1540 ms on localhost (2/2)
15/03/14 23:41:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/14 23:41:47 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 1.578 s
15/03/14 23:41:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 2.099817 s
Pi is roughly 3.14438
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/03/14 23:41:47 INFO SparkUI: Stopped Spark web UI at http://192.168.126.147:4040
15/03/14 23:41:47 INFO DAGScheduler: Stopping DAGScheduler
15/03/14 23:41:47 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/03/14 23:41:47 INFO MemoryStore: MemoryStore cleared
15/03/14 23:41:47 INFO BlockManager: BlockManager stopped
15/03/14 23:41:47 INFO BlockManagerMaster: BlockManagerMaster stopped
15/03/14 23:41:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
15/03/14 23:41:47 INFO SparkContext: Successfully stopped SparkContext
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

You can see from the output that Pi is roughly 3.14438.
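For reference, SparkPi estimates Pi by Monte Carlo sampling: it scatters random points over a square and counts how many land inside the inscribed circle. A condensed sketch of the same idea (not the exact shipped source; runnable in the spark-shell):

import scala.math.random

val n = 100000 * 2                      // two partitions' worth of samples
val count = sc.parallelize(1 to n, 2).map { _ =>
  val x = random * 2 - 1                // random point in [-1, 1) x [-1, 1)
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0       // 1 if inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n) // inside/total ratio approximates Pi/4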
