Hive on Spark compilation


Background

Hive on Spark is Hive running on Spark: it uses Spark as the execution engine instead of MapReduce, in the same way that Hive on Tez uses Tez.
Starting with Hive 1.1, Hive on Spark is part of the Hive codebase. Development happens on the spark branch (https://github.com/apache/hive/tree/spark) and is periodically merged into the master branch.
Discussion and progress on Hive on Spark can be followed at https://issues.apache.org/jira/browse/HIVE-7292.
Hive on Spark document: https://issues.apache.org/jira/secure/attachment/12652517/Hive-on-Spark.pdf

Source download

git clone https://github.com/apache/hive.git hive_on_spark

Compile
cd hive_on_spark/
git branch -r
  origin/HEAD -> origin/master
  origin/HIVE-4115
  origin/HIVE-8065
  origin/beeline-cli
  origin/branch-0.10
  origin/branch-0.11
  origin/branch-0.12
  origin/branch-0.13
  origin/branch-0.14
  origin/branch-0.2
  origin/branch-0.3
  origin/branch-0.4
  origin/branch-0.5
  origin/branch-0.6
  origin/branch-0.7
  origin/branch-0.8
  origin/branch-0.8-r2
  origin/branch-0.9
  origin/branch-1
  origin/branch-1.0
  origin/branch-1.0.1
  origin/branch-1.1
  origin/branch-1.1.1
  origin/branch-1.2
  origin/cbo
  origin/hbase-metastore
  origin/llap
  origin/master
  origin/maven
  origin/next
  origin/parquet
  origin/ptf-windowing
  origin/release-1.1
  origin/spark
  origin/spark-new
  origin/spark2
  origin/tez
  origin/vectorization
git checkout origin/spark
git branch
* (detached from origin/spark)
  master

Modify $hive_on_spark/pom.xml.
Change the Spark version to 1.4.1:

<spark.version>1.4.1</spark.version>

Change the Hadoop version to 2.3.0-cdh5.1.0:
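A minimal sketch of the corresponding pom.xml change, assuming the hadoop-2 profile reads its version from the hadoop-23.version property (the property name is an assumption and may differ between Hive versions):

<hadoop-23.version>2.3.0-cdh5.1.0</hadoop-23.version>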

 

Compile command

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2 -DskipTests
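If a deployable binary tarball is also wanted, the dist profile can be added to the same build; a sketch based on the standard Hive Maven build (the exact artifact name depends on the Hive version being built):

mvn clean package -Phadoop-2,dist -DskipTests
ls packaging/target/apache-hive-*-bin.tar.gz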
Ways to add Spark's dependency to Hive

Spark home: /home/cluster/apps/spark/spark-1.4.1
Hive home: /home/cluster/apps/hive_on_spark

1. Set the property 'spark.home' to point to the Spark installation:

hive> set spark.home=/home/cluster/apps/spark/spark-1.4.1;

2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

export SPARK_HOME=/home/cluster/apps/spark/spark-1.4.1

3. Set the spark-assembly jar on the Hive auxpath:

hive --auxpath /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar

4. Add the spark-assembly jar for the current user session:

hive> add jar /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar;

5. Link the spark-assembly jar into $HIVE_HOME/lib (see the sketch below).
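A minimal sketch of option 5, using the example paths above (the exact assembly jar file name depends on how Spark was built):

ln -s /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar /home/cluster/apps/hive_on_spark/lib/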

An error that may occur when starting Hive:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Workaround: export HADOOP_USER_CLASSPATH_FIRST=true

For solutions to other errors, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

The spark.eventLog.dir parameter needs to be set, for example:

set spark.eventLog.dir=hdfs://master:8020/directory;
Otherwise queries will keep failing with an error that a directory such as /tmp/spark-event does not exist.
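The event log directory must already exist on HDFS before queries run; a minimal sketch, assuming the example path above:

hadoop fs -mkdir -p hdfs://master:8020/directory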

After starting Hive, set the execution engine to Spark:
hive> set hive.execution.engine=spark;
Set Spark's run mode (master):
hive> set spark.master=spark://master:7077;

Or YARN: spark.master=yarn.
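With Spark 1.x the YARN mode is normally given as yarn-client or yarn-cluster rather than plain yarn; a sketch (picking yarn-cluster here is just an example, not something prescribed above):

hive> set spark.master=yarn-cluster;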

Configure Spark application configs for Hive

These can be configured in spark-defaults.conf or hive-site.xml:

spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=...              # Amount of memory to use per executor process.
spark.executor.cores=...               # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...           # The number of executors assigned to each application.
spark.driver.memory=...                # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...   # We recommend 400 (MB).
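In hive-site.xml the same settings are written as standard Hadoop properties; a minimal sketch, reusing the example values above:

<property>
  <name>spark.master</name>
  <value>spark://master:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>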

For parameter configuration details, see the documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

After executing a SQL statement, job and stage information can be viewed on the Spark monitoring page:
hive (default)> select city_id, count(*) c from city_info group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages: 1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1      Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
-1000   22826
-10     17294
-20     10608
-1      6186
        4158
Time taken: 18.417 seconds, Fetched: 5 row(s)

