1. Environment
- OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
- Hadoop: 2.4.1
- Hive: 0.11.0
- JDK: 1.7.0_60
- Spark: 1.1.0 (with built-in Spark SQL)
- Scala: 2.11.2
2. Spark Cluster Planning
- Account: ebupt
- Master: eb174
- Slaves: eb174, eb175, eb176 (see the sketch after this list)
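A minimal sketch of how this plan maps onto a standalone deployment, assuming the usual $SPARK_HOME/conf/slaves file and the stock start-all.sh script:

```
# $SPARK_HOME/conf/slaves on the master node (eb174) lists one worker host per line:
#   eb174
#   eb175
#   eb176
# The standalone cluster is then started from the master under the ebupt account:
$SPARK_HOME/sbin/start-all.sh
```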
3. Spark SQL Development History
Spark 1.1.0 was released on September 11, 2014. Spark has shipped Spark SQL since version 1.0, and the main changes in 1.1.0 are in Spark SQL and MLlib; see the release notes for details.
The predecessor of Spark SQL was Shark. Because of Shark's own limitations, Reynold Xin announced on June 1, 2014 that development of Shark would stop. Spark SQL discarded the original Shark code, absorbed some of Shark's strengths, such as in-memory columnar storage and Hive compatibility, and was redeveloped from scratch.
4. Configuration
- Install and configure Spark the same way as Spark 0.9.1 (see the blog post: Spark/Shark cluster installation, deployment, and troubleshooting).
- Copy the $HIVE_HOME/conf/hive-site.xml configuration file to the $SPARK_HOME/conf directory.
- Copy the $HADOOP_HOME/etc/hadoop/hdfs-site.xml configuration file to the $SPARK_HOME/conf directory (example commands follow this list).
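A minimal sketch of the two copy steps, assuming $HIVE_HOME, $HADOOP_HOME, and $SPARK_HOME point at the corresponding installation directories:

```
# hive-site.xml tells Spark SQL where the Hive metastore lives
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/

# hdfs-site.xml lets Spark resolve the HDFS nameservice (see problem ① below)
cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/
```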
5. Running
- Start the Spark cluster
- Start the Spark SQL client: ./spark/bin/spark-sql --master spark://eb174:7077 --executor-memory 3g
- Run SQL against a Hive table: spark-sql> select count(*) from test.t1;
14/10/08 20:46:04 INFO ParseDriver: Parsing command: select count(*) from test.t1
14/10/08 20:46:05 INFO ParseDriver: Parse Completed
14/10/08 20:46:05 INFO metastore: Trying to connect to metastore with URI thrift://eb170:9083
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb174:54967 with 265.4 MB RAM
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb176:60783 with 265.4 MB RAM
14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb175:35197 with 265.4 MB RAM
14/10/08 20:46:06 INFO metastore: Connected to metastore.
14/10/08 20:46:07 INFO SparkContext: Starting job: collect at HiveContext.scala:415
14/10/08 20:46:08 INFO FileInputFormat: Total input paths to process: 1
14/10/08 20:46:08 INFO DAGScheduler: Got job 0 (collect at HiveContext.scala:415) with 1 output partitions
14/10/08 20:46:08 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[5] at mapPartitions at Exchange.scala:86)
14/10/08 20:46:10 INFO DAGScheduler: Stage 1 (mapPartitions at Exchange.scala:86) finished in 2.680 s
14/10/08 20:46:11 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[9] at map at HiveContext.scala:360)
14/10/08 20:46:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 1428 ms on eb175 (1/1)
14/10/08 20:46:12 INFO DAGScheduler: Stage 0 (collect at HiveContext.scala:415) finished in 1.432 s
14/10/08 20:46:12 INFO SparkContext: Job finished: collect at HiveContext.scala:415, took 4.787407158 s
... (per-task statistics from StatsReportListener omitted) ...
5078
Time taken: 7.581 seconds
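The same query can also be run non-interactively; a minimal sketch, assuming the Hive-CLI-style -e and -f options accepted by the Spark SQL CLI:

```
# Run a single statement and exit
./spark/bin/spark-sql --master spark://eb174:7077 --executor-memory 3g -e "select count(*) from test.t1;"

# Run statements from a file (path is illustrative)
./spark/bin/spark-sql --master spark://eb174:7077 -f /path/to/query.sql
```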
Attention:
- When spark-sql is started without specifying a master, it runs in local mode; the master can be either a standalone cluster address or yarn.
- When the master is set to yarn (spark-sql --master yarn), job execution can be monitored on the http://$master:8088 page.
- If spark.master is set to spark://eb174:7077 in $SPARK_HOME/conf/spark-defaults.conf, spark-sql also runs on the standalone cluster even when no master is specified at startup (see the sketch after this list).
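A minimal sketch of the three ways to choose the master, using the host names from this cluster; adjust for your own environment:

```
# 1) No --master flag: local mode, unless spark.master is set in spark-defaults.conf
./spark/bin/spark-sql

# 2) Explicit standalone master
./spark/bin/spark-sql --master spark://eb174:7077 --executor-memory 3g

# 3) Run on YARN; monitor the job at http://$master:8088
./spark/bin/spark-sql --master yarn

# Default master in $SPARK_HOME/conf/spark-defaults.conf (then no --master flag is needed):
#   spark.master   spark://eb174:7077
```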
6. Problems encountered and solutions
① Running a SQL statement in the spark-sql client command-line interface fails with UnknownHostException: ebcloud (ebcloud is the Hadoop dfs.nameservices name)
14/10/08 20:42:44 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, eb174): java.lang.IllegalArgumentException: java.net.UnknownHostException: ebcloud
        org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
        org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:240)
        org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:144)
        org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:579)
        org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:524)
        org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
        org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
Cause: Spark cannot resolve the HDFS nameservice address. The fix is to copy Hadoop's HDFS configuration file hdfs-site.xml into the $SPARK_HOME/conf directory.
② Running a SQL statement fails because the connection to the HDFS NameNode (eb171:8020) is refused
14/10/08 20:26:46 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/10/08 20:26:46 INFO SparkContext: Starting job: collect at HiveContext.scala:415
14/10/08 20:29:19 WARN RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo over eb171/10.1.69.171:8020. Not retrying because failovers exceeded maximum allowed. java.net.ConnectException: Call From eb174/10.1.69.174 to eb171:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
Cause: the HDFS connection failed because hdfs-site.xml had not been synchronized to all of the slave nodes (one way to distribute the file is sketched below).
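A minimal sketch of pushing the configuration files to every node, assuming passwordless SSH for the ebupt account and identical installation paths on all hosts:

```
# Distribute hive-site.xml and hdfs-site.xml to the Spark conf directory on each node
for host in eb174 eb175 eb176; do
  scp $SPARK_HOME/conf/hive-site.xml $SPARK_HOME/conf/hdfs-site.xml $host:$SPARK_HOME/conf/
done
```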
7. References
- SparkSQL 1.1 Introduction, Part VI: Basic Applications of Spark SQL
- Spark SQL CLI Description