This article covers several issues recently encountered when using Spark SQL.
1. In Spark 2.0.1, when starting the Thrift server or spark-sql, if you want Spark SQL to store its warehouse on HDFS, you need to add the parameter "--conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse".
For example, to start the Thrift server:
sbin/start-thriftserver.sh --master spark://hostname:7077 --conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse --driver-memory 2g --executor-memory 35g
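The same warehouse setting applies when launching the spark-sql shell instead of the Thrift server; a minimal sketch, reusing the hostname and warehouse path from the example above:
bin/spark-sql --master spark://hostname:7077 --conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse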
Notes:
If spark.sql.warehouse.dir is not specified, Spark SQL automatically creates a spark-warehouse directory under SPARK_HOME to hold the corresponding data.
The --driver-memory parameter sets the amount of memory available to the driver of the running application.
The --executor-memory parameter sets how much memory the application occupies on each worker in the Spark cluster.
If the number of cores is not specified, all CPU cores on each worker are used by default.
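The Thrift server accepts JDBC connections from Beeline, which the next two points rely on; a minimal connection sketch, assuming the default Thrift port 10000 (your_user is a placeholder for the operating-system user):
bin/beeline -u jdbc:hive2://hostname:10000 -n your_user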
2. If Spark SQL runs on HDFS, an HDFS permission exception can occur when a user writes data to tables from Beeline.
The workaround is to disable HDFS permission checking. In Hadoop 2.7.3, the check is turned off with the following parameter in hdfs-site.xml:
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>
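For the change to take effect, HDFS must be restarted after editing hdfs-site.xml on the NameNode; a minimal sketch, assuming a standard Hadoop 2.7.3 installation managed by its sbin scripts:
sbin/stop-dfs.sh
sbin/start-dfs.sh
Keep in mind that setting dfs.permissions.enabled to false disables permission checks for the entire HDFS instance, so it is only suitable for test or single-team clusters.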
3. In Beeline, Spark SQL can execute SQL statements directly to create tables stored in the Parquet format, for example:
CREATE TABLE parquettable (name string) USING org.apache.spark.sql.parquet OPTIONS (path "examples/src/main/resources/users.parquet");
or
CREATE TABLE parquettable (name string) USING org.apache.spark.sql.parquet;
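A quick way to verify the table from the same Beeline session is to list and query it; a minimal sketch (the LIMIT value is arbitrary):
SHOW TABLES;
SELECT * FROM parquettable LIMIT 10;
With the first form the rows come from the referenced users.parquet file; with the second form the table starts empty, and data written to it lands under spark.sql.warehouse.dir, which is exactly where the permission error from point 2 appears if HDFS permission checks are still enabled.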
In addition:
If, after running sbin/stop-all.sh, some Worker or Master processes in the cluster still fail to exit, the environment is usually in a dirty state; killing them with kill -15 <pid> is sufficient.
Likewise, if after running sbin/start-all.sh the user finds stray Worker or Master processes in the Spark cluster, this is also caused by a dirty environment, and kill -15 <pid> will deal with them as well.
To resolve this situation completely, users should first stop the Spark cluster:
sbin/stop-all.sh
then use kill -15 to terminate any Spark processes that cannot be stopped, and finally manually delete all /tmp/spark* files on every node in the cluster to make sure the environment is clean. A consolidated sketch of these steps follows.
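A consolidated sketch of the cleanup (run stop-all.sh from the master, the remaining commands on each node); the use of jps to locate leftover daemons and the <pid> placeholder are assumptions, so substitute whatever process lookup you prefer:
sbin/stop-all.sh       # stop the cluster from the master node
jps                    # list any leftover Master or Worker processes on the current node
kill -15 <pid>         # terminate each leftover Spark process by its PID
rm -rf /tmp/spark*     # remove stale Spark runtime files
After this, sbin/start-all.sh should bring the cluster back up cleanly.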