Apache Spark 1.4 reads files on the Hadoop 2.6 file system (HDFS)

Source: Internet
Author: User
Tags: deprecated, log4j

scala> val file = sc.textFile("hdfs://9.125.73.217:9000/user/hadoop/logs")

scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

scala> count.collect()

We take Spark's classic word count as an example to verify that Spark can read from and write to the HDFS file system.
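
The same job can also be packaged as a standalone application rather than typed into the shell. The following is a minimal sketch against the Spark 1.4 Scala API; the object name HdfsWordCount is an illustrative assumption and the master URL is left to spark-submit, while the HDFS URI is the one used throughout this walkthrough:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    // The master URL and executor settings are supplied by spark-submit.
    val conf = new SparkConf().setAppName("HdfsWordCount")
    val sc = new SparkContext(conf)

    // Read from HDFS, split each line on spaces, pair each word with 1,
    // and sum the counts per word across all partitions.
    val file = sc.textFile("hdfs://9.125.73.217:9000/user/hadoop/logs")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}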

1. Start the Spark shell

/root/spark-1.4.0-bin-hadoop2.4/bin/spark-shell

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/12 21:32:05 INFO SecurityManager: Changing view acls to: root
15/07/12 21:32:05 INFO SecurityManager: Changing modify acls to: root
15/07/12 21:32:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/07/12 21:32:05 INFO HttpServer: Starting HTTP Server
15/07/12 21:32:05 INFO Utils: Successfully started service 'HTTP class server' on port 50452.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/12 21:32:09 INFO SparkContext: Running Spark version 1.4.0
15/07/12 21:32:10 INFO SecurityManager: Changing view acls to: root
15/07/12 21:32:10 INFO SecurityManager: Changing modify acls to: root
15/07/12 21:32:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/07/12 21:32:10 INFO Slf4jLogger: Slf4jLogger started
15/07/12 21:32:10 INFO Remoting: Starting remoting
15/07/12 21:32:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@9.125.73.217:35775]
15/07/12 21:32:10 INFO Utils: Successfully started service 'sparkDriver' on port 35775.
15/07/12 21:32:10 INFO SparkEnv: Registering MapOutputTracker
15/07/12 21:32:10 INFO SparkEnv: Registering BlockManagerMaster
15/07/12 21:32:10 INFO DiskBlockManager: Created local directory at /tmp/spark-6bd4dc00-8a04-4b62-8f16-76f4beeba918/blockmgr-b0db297e-f183-4ca5-8cb5-7ee943df509d
15/07/12 21:32:10 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/07/12 21:32:10 INFO HttpFileServer: HTTP File server directory is /tmp/spark-6bd4dc00-8a04-4b62-8f16-76f4beeba918/httpd-b22e2de4-9618-4bba-b25a-a8c1fd28826d
15/07/12 21:32:10 INFO HttpServer: Starting HTTP Server
15/07/12 21:32:10 INFO Utils: Successfully started service 'HTTP file server' on port 55255.
15/07/12 21:32:10 INFO SparkEnv: Registering OutputCommitCoordinator
15/07/12 21:32:11 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/07/12 21:32:11 INFO SparkUI: Started SparkUI at http://9.125.73.217:4040
15/07/12 21:32:11 INFO Executor: Starting executor ID driver on host localhost
15/07/12 21:32:11 INFO Executor: Using REPL class URI: http://9.125.73.217:50452
15/07/12 21:32:11 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60268.
15/07/12 21:32:11 INFO NettyBlockTransferService: Server created on 60268
15/07/12 21:32:11 INFO BlockManagerMaster: Trying to register BlockManager
15/07/12 21:32:11 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60268 with 265.4 MB RAM, BlockManagerId(driver, localhost, 60268)
15/07/12 21:32:11 INFO BlockManagerMaster: Registered BlockManager
15/07/12 21:32:11 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/07/12 21:32:12 INFO HiveContext: Initializing execution hive, version 0.13.1
15/07/12 21:32:12 INFO HiveMetaStore: 0: Opening raw store with implemenation class: org.apache.hadoop.hive.metastore.ObjectStore
15/07/12 21:32:12 INFO ObjectStore: ObjectStore, initialize called
15/07/12 21:32:13 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/07/12 21:32:13 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/07/12 21:32:13 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/12 21:32:13 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/12 21:32:14 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/07/12 21:32:15 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/07/12 21:32:15 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/12 21:32:15 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/12 21:32:17 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/12 21:32:17 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/12 21:32:17 INFO ObjectStore: Initialized ObjectStore
15/07/12 21:32:17 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
15/07/12 21:32:17 INFO HiveMetaStore: Added admin role in metastore
15/07/12 21:32:17 INFO HiveMetaStore: Added public role in metastore
15/07/12 21:32:17 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/07/12 21:32:18 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/07/12 21:32:18 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala>
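
Before reading from HDFS, it is worth confirming that the shell really created the contexts the startup log promises. A few quick checks using standard SparkContext members (the values printed will of course depend on your deployment):

scala> sc.version      // should report 1.4.0
scala> sc.master       // local[*] unless a --master URL was passed to spark-shell
scala> sqlContext      // the Hive-enabled SQL context announced above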

2. Read a file from HDFS

scala> val file = sc.textFile("hdfs://9.125.73.217:9000/hbase/hbase.version")
15/07/12 21:34:50 INFO MemoryStore: ensureFreeSpace(80368) called with curMem=0, maxMem=278302556
15/07/12 21:34:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 78.5 KB, free 265.3 MB)
15/07/12 21:34:50 INFO MemoryStore: ensureFreeSpace(17237) called with curMem=80368, maxMem=278302556
15/07/12 21:34:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.8 KB, free 265.3 MB)
15/07/12 21:34:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60268 (size: 16.8 KB, free: 265.4 MB)
15/07/12 21:34:50 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
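
Note that textFile is lazy: so far Spark has only recorded the lineage and broadcast the Hadoop configuration (that is what broadcast 0 above is); no data has been read yet. To force a read and peek at the file before running the full word count (output will depend on the file's contents):

scala> file.count()    // number of lines; triggers an actual HDFS read
scala> file.first()    // the first line of the file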

3. Count the words

scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
15/07/12 21:38:43 INFO FileInputFormat: Total input paths to process : 1
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:23

scala>
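
Like textFile, reduceByKey is a transformation, so defining count launches no job; the FileInputFormat line above merely comes from computing the input splits. Before triggering the job you can print the lineage that the scheduler will turn into stages. The output below is indicative only; the top RDD numbers match the DAGScheduler log that follows:

scala> count.toDebugString
(3) ShuffledRDD[8] at reduceByKey at <console>:23
 +-(3) MapPartitionsRDD[7] at map at <console>:23
    |  MapPartitionsRDD[6] at flatMap at <console>:23
    |  MapPartitionsRDD[1] at textFile at <console>:21
    |  HadoopRDD[0] at textFile at <console>:21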

scala> count.collect()
15/07/12 21:39:25 INFO SparkContext: Starting job: collect at <console>:26
15/07/12 21:39:25 INFO DAGScheduler: Registering RDD 7 (map at <console>:23)
15/07/12 21:39:25 INFO DAGScheduler: Got job 0 (collect at <console>:26) with 3 output partitions (allowLocal=false)
15/07/12 21:39:25 INFO DAGScheduler: Final stage: ResultStage 1 (collect at <console>:26)
15/07/12 21:39:25 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/07/12 21:39:25 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/07/12 21:39:25 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at map at <console>:23), which has no missing parents
15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(4128) called with curMem=297554, maxMem=278302556
15/07/12 21:39:25 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.0 KB, free 265.1 MB)
15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(2305) called with curMem=301682, maxMem=278302556
15/07/12 21:39:25 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 265.1 MB)
15/07/12 21:39:25 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60268 (size: 2.3 KB, free: 265.4 MB)
15/07/12 21:39:25 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/07/12 21:39:25 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at map at <console>:23)
15/07/12 21:39:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
15/07/12 21:39:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1406 bytes)
15/07/12 21:39:25 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1406 bytes)
15/07/12 21:39:25 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/07/12 21:39:25 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:0+3
15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:3+3
15/07/12 21:39:25 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/12 21:39:25 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/12 21:39:25 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/12 21:39:25 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/12 21:39:25 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/12 21:39:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2003 bytes result sent to driver
15/07/12 21:39:25 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2003 bytes result sent to driver
15/07/12 21:39:25 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, ANY, 1406 bytes)
15/07/12 21:39:25 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/07/12 21:39:25 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 162 ms on localhost (1/3)
15/07/12 21:39:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 179 ms on localhost (2/3)
15/07/12 21:39:25 INFO HadoopRDD: Input split: hdfs://9.125.73.217:9000/hbase/hbase.version:6+1
15/07/12 21:39:25 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2003 bytes result sent to driver
15/07/12 21:39:25 INFO DAGScheduler: ShuffleMapStage 0 (map at <console>:23) finished in 0.205 s
15/07/12 21:39:25 INFO DAGScheduler: looking for newly runnable stages
15/07/12 21:39:25 INFO DAGScheduler: running: Set()
15/07/12 21:39:25 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/07/12 21:39:25 INFO DAGScheduler: failed: Set()
15/07/12 21:39:25 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/07/12 21:39:25 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[8] at reduceByKey at <console>:23), which is now runnable
15/07/12 21:39:25 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in ms on localhost (3/3)
15/07/12 21:39:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(2288) called with curMem=303987, maxMem=278302556
15/07/12 21:39:25 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 265.1 MB)
15/07/12 21:39:25 INFO MemoryStore: ensureFreeSpace(1377) called with curMem=306275, maxMem=278302556
15/07/12 21:39:25 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1377.0 B, free 265.1 MB)
15/07/12 21:39:25 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:60268 (size: 1377.0 B, free: 265.4 MB)
15/07/12 21:39:25 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
15/07/12 21:39:25 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ShuffledRDD[8] at reduceByKey at <console>:23)
15/07/12 21:39:25 INFO TaskSchedulerImpl: Adding task set 1.0 with 3 tasks
15/07/12 21:39:25 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/12 21:39:25 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/12 21:39:25 INFO Executor: Running task 0.0 in stage 1.0 (TID 3)
15/07/12 21:39:25 INFO Executor: Running task 1.0 in stage 1.0 (TID 4)
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 3 blocks
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 8 ms
15/07/12 21:39:25 INFO Executor: Finished task 1.0 in stage 1.0 (TID 4). 1031 bytes result sent to driver
15/07/12 21:39:25 INFO Executor: Finished task 0.0 in stage 1.0 (TID 3). 1029 bytes result sent to driver
15/07/12 21:39:25 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 1165 bytes)
15/07/12 21:39:25 INFO Executor: Running task 2.0 in stage 1.0 (TID 5)
15/07/12 21:39:25 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in ms on localhost (1/3)
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 3 blocks
15/07/12 21:39:25 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/07/12 21:39:25 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in ms on localhost (2/3)
15/07/12 21:39:25 INFO Executor: Finished task 2.0 in stage 1.0 (TID 5). 882 bytes result sent to driver
15/07/12 21:39:25 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 6 ms on localhost (3/3)
15/07/12 21:39:25 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/12 21:39:25 INFO DAGScheduler: ResultStage 1 (collect at <console>:26) finished in 0.043 s
15/07/12 21:39:25 INFO DAGScheduler: Job 0 finished: collect at <console>:26, took 0.352074 s
res1: Array[(String, Int)] = Array((?8,1), (PBUF,1))

scala>
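
The introduction promised a write as well as a read. To verify the write path, persist the result back to HDFS; the output path below is hypothetical, and saveAsTextFile fails if the target directory already exists, so choose a fresh one:

scala> count.saveAsTextFile("hdfs://9.125.73.217:9000/user/hadoop/wordcount-out")

scala> sc.textFile("hdfs://9.125.73.217:9000/user/hadoop/wordcount-out").collect()

Each partition is written as its own part-NNNNN file under the output directory, so reading the directory back should return the same (word, count) pairs as res1 above.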
