Hive on Spark Configuration summary


Environment configuration

Maven-3.3.3

JDK 7u79

Scala 2.10.6

Hive 2.0.1

Spark 1.5.0 (built from source)

Hadoop 2.6.4


The Hive version has to match the Spark version, so download the Hive source code and check spark.version in its pom.xml to determine which version of Spark to use.
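For example, a quick way to check the expected version (the source directory name here is an assumption):

grep -m1 '<spark.version>' apache-hive-2.0.1-src/pom.xml
# prints something like: <spark.version>1.5.0</spark.version>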

Note that you must use a version of Spark that does not include the Hive jars, i.e. one that was not built with the -Phive profile.

Note: the pre-built spark-2.x packages on the official Spark website are all built with Hive integration, so if you want to use Hive on Spark you have to download the source code and compile it yourself.
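To double-check that a build does not bundle Hive, you can scan the assembly jar for Hive classes (the jar name below is an assumption for a Spark 1.x distribution):

jar tf lib/spark-assembly-1.5.0-hadoop2.6.4.jar | grep -c 'org/apache/hadoop/hive'
# a count of 0 means the assembly is Hive-free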

Recommended pairings: hive-1.2.1 on spark-1.3.1, or hive-2.0.1 on spark-1.5.2.
Compiling Spark

By default, Spark is compiled with Scala 2.10.4.


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m"


mvn -Pyarn -Phadoop-2.6 -DskipTests clean package


./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn


To compile with Scala 2.11.x instead:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
./make-distribution.sh --name xm-spark --tgz -Phadoop-2.6 -Pyarn -Dscala-2.11


The tar package will be generated in the Spark source directory.
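After unpacking the distribution, Hive needs to be able to find the Spark assembly jar; one common approach (the paths below are assumptions) is to link it into Hive's lib directory:

tar -xzf spark-1.5.0-bin-xm-spark.tgz -C /opt
ln -s /opt/spark-1.5.0-bin-xm-spark/lib/spark-assembly-*.jar $HIVE_HOME/lib/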


hive-site.xml configuration

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
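Hive also needs to know where the Spark installation lives; besides exporting SPARK_HOME in the environment, one option is to set spark.home in the same file (the path is an assumption):

<property>
  <name>spark.home</name>
  <value>/opt/spark-1.5.0-bin-xm-spark</value>
</property>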


spark-defaults.conf configuration (the same properties can be set per session in the Hive CLI, as shown below)

set spark.master=<Spark master URL>;    # need not be set by default
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
set spark.executor.instances=X;
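With the engine and Spark properties in place, a simple aggregation is a quick smoke test (the table name is hypothetical); it should launch a Spark job rather than a MapReduce job:

hive> set hive.execution.engine=spark;
hive> select count(*) from test_table;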

Configuration recommended on the Hive website

hive.vectorized.execution.enabled=true
hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
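The hive.* keys above can be set per session with set <name>=<value>; in the Hive CLI, or persisted in hive-site.xml, for example:

<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
</property>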

Summary of Issues

1. Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job

A. -Phive or -Phive-thriftserver was added when compiling Spark

B. Errors caused by a mismatch between the Hive and Spark compilation versions (see the diagnostic sketch below)
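A quick way to check whether (and where) that class is available is to search the Hive jars for it (paths are assumptions):

for j in $HIVE_HOME/lib/*.jar; do
  jar tf "$j" | grep -q 'org/apache/hive/spark/client/Job' && echo "$j"
done
# prints the jar(s) containing org.apache.hive.spark.client.Job; no output means the class is missing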


2. Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException (Failed to create spark client.)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

A. The Hive and Spark versions do not match

B. The client failed to start because of a Scala environment error (install Scala, then restart YARN; see the commands below)
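A minimal restart sequence, assuming the standard Hadoop sbin scripts:

$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh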


3. Errors caused by environment configuration

