Environment configuration
Maven 3.3.3
JDK 7u79
Scala 2.10.6
Hive 2.0.1
Spark 1.5.0 (built from source)
Hadoop 2.6.4
The Hive version must match the Spark version, so download the Hive source code and check spark.version in its pom.xml to determine which version of Spark to use.
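For example, with hive-2.0.1 the expected Spark version can be read from the root pom.xml of the Hive source tree:
grep '<spark.version>' pom.xml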
Note that you must use a version of Spark that does not include the Hive jars, i.e. one that was not built with the Hive profile.
Note: the pre-built spark-2.x packages on the official Spark website are all integrated with Hive, so if you want to use Hive on Spark you have to download the Spark source code and compile it yourself.
Recommended pairings: hive-1.2.1 on spark-1.3.1, or hive-2.0.1 on spark-1.5.2.
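A quick way to check whether an existing Spark package already contains Hive classes (the assembly jar path below is only illustrative and depends on your build):
jar tf lib/spark-assembly-1.5.0-hadoop2.6.4.jar | grep 'org/apache/hadoop/hive' | head
If this prints any entries, the package was built with the Hive profile and cannot be used for Hive on Spark.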
Compiling Spark
By default Spark is compiled with Scala 2.10.4:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.6 -DskipTests clean package
./make-distribution.sh --name Xm-spark --tgz -Phadoop-2.6 -Pyarn
If compiling with Scala 2.11.x:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
./make-distribution.sh --name Xm-spark --tgz -Phadoop-2.6 -Pyarn
The tar package will be generated in the root of the Spark source directory.
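A typical next step is to unpack the generated package and point SPARK_HOME at it (the file name and install path below are only illustrative; the name depends on the --name passed to make-distribution.sh):
tar -zxvf spark-1.5.0-bin-Xm-spark.tgz -C /usr/local/
export SPARK_HOME=/usr/local/spark-1.5.0-bin-Xm-spark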
hive-site.xml Configuration
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
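The engine can also be switched (or verified) per session from the Hive CLI, without editing hive-site.xml:
hive> set hive.execution.engine=spark;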
spark-defaults.conf Configuration
spark.master=<spark master URL>    # can be left unset by default
spark.eventLog.enabled=true
spark.eventLog.dir=<spark event log folder (must exist)>
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.instances=x
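The same Spark properties can also be set per session from the Hive CLI, which is convenient for testing values before writing them into the file, for example:
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;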
Configuration recommended on the Hive website
hive.vectorized.execution.enabled=true
hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch
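If these are placed in hive-site.xml rather than set per session, each entry becomes a <property> block in the same form as hive.execution.engine above, for example:
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>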
Summary of Issues
1. Caused by: java.lang.NoClassDefFoundError: org/apache/hive/spark/client/Job
A. Spark was compiled with -Phive or -Phive-thriftserver (the Spark build used by Hive must not include the Hive jars, see the note above)
B. The Hive version and the Spark version used for compilation do not match
2. Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException (Failed to create spark client.)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
A. The Hive and Spark versions do not match
B. The Spark client failed to start because of a Scala environment problem (install Scala, then restart YARN)
3. Errors caused by incorrect environment configuration (some quick checks are shown below)
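A few commands that help rule out the environment-related causes above (these are standard Scala, Hadoop and YARN commands; which checks matter depends on your cluster):
scala -version
hadoop version
yarn node -list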