Preliminary Notes
Hive on Spark is Hive running with Spark as its execution engine instead of MapReduce, analogous to Hive on Tez.
Starting with Hive 1.1, Hive on Spark has been part of the Hive codebase. Development takes place on the spark branch (https://github.com/apache/hive/tree/spark) and is periodically merged into the master branch.
The discussion and progress of Hive on Spark can be followed here: https://issues.apache.org/jira/browse/HIVE-7292.
Hive on Spark design document: https://issues.apache.org/jira/secure/attachment/12652517/Hive-on-Spark.pdf
Source download
git clone https://github.com/apache/hive.git hive_on_spark
Compile
cd hive_on_spark/
git branch -r
  origin/HEAD -> origin/master
  origin/HIVE-4115
  origin/HIVE-8065
  origin/beeline-cli
  origin/branch-0.10
  origin/branch-0.11
  origin/branch-0.12
  origin/branch-0.13
  origin/branch-0.14
  origin/branch-0.2
  origin/branch-0.3
  origin/branch-0.4
  origin/branch-0.5
  origin/branch-0.6
  origin/branch-0.7
  origin/branch-0.8
  origin/branch-0.8-r2
  origin/branch-0.9
  origin/branch-1
  origin/branch-1.0
  origin/branch-1.0.1
  origin/branch-1.1
  origin/branch-1.1.1
  origin/branch-1.2
  origin/cbo
  origin/hbase-metastore
  origin/llap
  origin/master
  origin/maven
  origin/next
  origin/parquet
  origin/ptf-windowing
  origin/release-1.1
  origin/spark
  origin/spark-new
  origin/spark2
  origin/tez
  origin/vectorization
git checkout origin/spark
git branch
* (detached from origin/spark)
  master
Modify hive_on_spark/pom.xml
Change the Spark version to 1.4.1:
<spark.version>1.4.1</spark.version>
Change the Hadoop version to 2.3.0-cdh5.1.0, as sketched below.
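A minimal sketch of the corresponding pom.xml edit; the property name hadoop-23.version is an assumption based on the Hive 1.x pom layout for the hadoop-2 profile, so check your checkout for the exact property:

<hadoop-23.version>2.3.0-cdh5.1.0</hadoop-23.version>

Note that CDH-versioned artifacts come from Cloudera's Maven repository, which may need to be added to the pom's <repositories> section if it is not already there.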
Compile command
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2 -DskipTests
Add Spark's dependency to Hive
Spark home: /home/cluster/apps/spark/spark-1.4.1
Hive home: /home/cluster/apps/hive_on_spark
1. Set the property 'spark.home' to point to the Spark installation:
hive> set spark.home=/home/cluster/apps/spark/spark-1.4.1;
2. Define the SPARK_HOME environment variable before starting the Hive CLI/HiveServer2:
export SPARK_HOME=/home/cluster/apps/spark/spark-1.4.1
3. Set the spark-assembly jar on the Hive auxpath:
hive --auxpath /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar
4. Add the spark-assembly jar for the current user session:
hive> add jar /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar;
5. Link the spark-assembly jar into $HIVE_HOME/lib, as sketched below.
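A sketch of option 5; the exact assembly jar filename (here assumed to be the Spark 1.4.1 build against Hadoop 2.3.0) depends on how your Spark was built:

ln -s /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-1.4.1-hadoop2.3.0.jar /home/cluster/apps/hive_on_spark/lib/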
Errors that may occur during the start of Hive:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
    at jline.TerminalFactory.create(TerminalFactory.java:101)
    at jline.TerminalFactory.get(TerminalFactory.java:158)
    at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
    at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
    at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
    at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Workaround: export HADOOP_USER_CLASSPATH_FIRST=true
For resolutions to other errors, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
The spark.eventLog.dir parameter needs to be set, for example:
hive> set spark.eventLog.dir=hdfs://master:8020/directory;
Otherwise queries will keep failing with an error that a folder such as /tmp/spark-event does not exist. A sketch of preparing this directory follows.
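Assuming the hdfs://master:8020/directory path from the example above (the path itself is only illustrative), the event-log directory can be created up front and the related properties set together:

# create the event log directory on HDFS before running queries
hadoop fs -mkdir -p hdfs://master:8020/directory

hive> set spark.eventLog.enabled=true;
hive> set spark.eventLog.dir=hdfs://master:8020/directory;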
After starting Hive, set the execution engine to Spark:
hive> set hive.execution.engine=spark;
Set Spark's run mode, either standalone:
hive> set spark.master=spark://master:7077;
or YARN: spark.master=yarn (see the sketch below).
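As an illustration, a complete session against YARN might look like the following; note that on Spark 1.4 the master is normally given as yarn-client or yarn-cluster rather than plain yarn:

hive> set hive.execution.engine=spark;
hive> set spark.master=yarn-client;
hive> select count(*) from city_info;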
Configure Spark application configs for Hive. These can be set in spark-defaults.conf or in hive-site.xml (a hive-site.xml sketch follows the list below):
spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=...              # Amount of memory to use per executor process.
spark.executor.cores=...               # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...           # The number of executors assigned to each application.
spark.driver.memory=...                # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...   # We recommend 400 (MB).
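For the hive-site.xml route, each spark.* key above becomes an ordinary Hive property. A minimal sketch with illustrative values (the master URL and memory size are just examples, not recommendations):

<property>
  <name>spark.master</name>
  <value>spark://master:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>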
For parameter configuration details, see the document: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
After executing a SQL statement, information such as jobs/stages can be viewed on the monitoring page:
hive (default)> select city_id, count(*) c from city_info group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages:
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822  Stage-0_0: 0(+1)/1  Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845  Stage-0_0: 0(+1)/1  Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861  Stage-0_0: 1/1 Finished  Stage-1_0: 0(+1)/1  Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867  Stage-0_0: 1/1 Finished  Stage-1_0: 1/1 Finished  Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id  c
-1000    22826
-10      17294
-20      10608
-1       6186
         4158
Time taken: 18.417 seconds, Fetched: 5 row(s)