Hive on Spark compilation


Background

Hive on Spark is Hive running on Spark: it uses Spark as the execution engine instead of MapReduce, in the same way that Hive on Tez uses Tez.
Starting with Hive 1.1, Hive on Spark is part of the Hive codebase. Development happens on the spark branch (https://github.com/apache/hive/tree/spark) and is periodically merged into the master branch.
Discussion and progress on Hive on Spark can be followed at https://issues.apache.org/jira/browse/HIVE-7292.
Hive on Spark document: https://issues.apache.org/jira/secure/attachment/12652517/Hive-on-Spark.pdf

Source download

git clone https://github.com/apache/hive.git hive_on_spark

Compile
cd hive_on_spark/
git branch -r
  origin/HEAD -> origin/master
  origin/HIVE-4115
  origin/HIVE-8065
  origin/beeline-cli
  origin/branch-0.10
  origin/branch-0.11
  origin/branch-0.12
  origin/branch-0.13
  origin/branch-0.14
  origin/branch-0.2
  origin/branch-0.3
  origin/branch-0.4
  origin/branch-0.5
  origin/branch-0.6
  origin/branch-0.7
  origin/branch-0.8
  origin/branch-0.8-r2
  origin/branch-0.9
  origin/branch-1
  origin/branch-1.0
  origin/branch-1.0.1
  origin/branch-1.1
  origin/branch-1.1.1
  origin/branch-1.2
  origin/cbo
  origin/hbase-metastore
  origin/llap
  origin/master
  origin/maven
  origin/next
  origin/parquet
  origin/ptf-windowing
  origin/release-1.1
  origin/spark
  origin/spark-new
  origin/spark2
  origin/tez
  origin/vectorization
git checkout origin/spark
git branch
* (detached from origin/spark)
  master

Modify $hive_on_spark/pom.xml.
Change the Spark version to 1.4.1:

<spark.version>1.4.1</spark.version>

Change the Hadoop version to 2.3.0-cdh5.1.0:
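A minimal sketch of the corresponding pom.xml change, assuming the hadoop-2 profile reads its version from the hadoop-23.version property (the property name is an assumption and may differ between Hive versions):

<hadoop-23.version>2.3.0-cdh5.1.0</hadoop-23.version>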

 

Compile command

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2 -DskipTests
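If a deployable binary tarball is also wanted, the dist profile can be added to the same build; a sketch based on the standard Hive Maven build (the exact artifact name depends on the Hive version being built):

mvn clean package -Phadoop-2,dist -DskipTests
ls packaging/target/apache-hive-*-bin.tar.gz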
Ways to add Spark's dependency to Hive

Spark home: /home/cluster/apps/spark/spark-1.4.1
Hive home: /home/cluster/apps/hive_on_spark

1. Set the property 'spark.home' to point to the Spark installation:

hive> set spark.home=/home/cluster/apps/spark/spark-1.4.1;

2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

export SPARK_HOME=/home/cluster/apps/spark/spark-1.4.1

3. Set the spark-assembly jar on the Hive auxpath:

hive --auxpath /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar

4. Add the spark-assembly jar for the current user session:

hive> add jar /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar;

5. Link the spark-assembly jar into $HIVE_HOME/lib (see the sketch below).
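A minimal sketch of option 5, using the example paths above (the exact assembly jar file name depends on how Spark was built):

ln -s /home/cluster/apps/spark/spark-1.4.1/lib/spark-assembly-*.jar /home/cluster/apps/hive_on_spark/lib/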

An error that may occur when starting Hive:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Workaround: export HADOOP_USER_CLASSPATH_FIRST=true

For solutions to other errors, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

The spark.eventLog.dir parameter needs to be set, for example:

set spark.eventLog.dir=hdfs://master:8020/directory;
Otherwise queries will keep failing with an error that a directory such as /tmp/spark-event does not exist.
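The event log directory must already exist on HDFS before queries run; a minimal sketch, assuming the example path above:

hadoop fs -mkdir -p hdfs://master:8020/directory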

After starting Hive, set the execution engine to Spark:
hive> set hive.execution.engine=spark;
Set Spark's run mode (master):
hive> set spark.master=spark://master:7077;

Or YARN: spark.master=yarn.
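With Spark 1.x the YARN mode is normally given as yarn-client or yarn-cluster rather than plain yarn; a sketch (picking yarn-cluster here is just an example, not something prescribed above):

hive> set spark.master=yarn-cluster;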

Configure Spark application configs for Hive

These can be configured in spark-defaults.conf or hive-site.xml:

spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=...              # Amount of memory to use per executor process.
spark.executor.cores=...               # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...           # The number of executors assigned to each application.
spark.driver.memory=...                # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...   # We recommend 400 (MB).
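In hive-site.xml the same settings are written as standard Hadoop properties; a minimal sketch, reusing the example values above:

<property>
  <name>spark.master</name>
  <value>spark://master:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>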

For parameter configuration details, see the documentation: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

After executing a SQL statement, job and stage information can be viewed on the Spark monitoring page:
hive (default)> select city_id, count(*) c from city_info group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages: 1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1      Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
-1000   22826
-10     17294
-20     10608
-1      6186
        4158
Time taken: 18.417 seconds, Fetched: 5 row(s)

