Spark SQL Optimization Insights

Source: Internet
Author: User

This article covers several issues recently encountered while using Spark SQL.

1. In Spark 2.0.1, when starting the Thrift server or spark-sql, if you want Spark SQL to run on HDFS, you need to add the parameter --conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse

For example, to start the Thrift server:

sbin/start-thriftserver.sh --master spark://hostname:7077 --conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse --driver-memory 2g --executor-memory 35g

Description

If spark.sql.warehouse.dir is not specified, Spark SQL automatically creates a spark-warehouse directory under SPARK_HOME to hold the corresponding data.

The --driver-memory parameter sets the amount of memory available to the driver of the running application.

The --executor-memory parameter sets how much memory the application's executors occupy on each worker in the Spark cluster.

If the number of executor cores is not specified, all CPU cores on each worker are used by default. A combined example is shown below.
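As a minimal sketch, the same parameters can also be passed to spark-sql directly; the hostname, port, memory sizes, and core count below are placeholder values, not settings from the original setup:

# Start spark-sql against a standalone cluster with an explicit HDFS warehouse.
# hostname, memory sizes, and core count are example values -- adjust to your cluster.
bin/spark-sql \
  --master spark://hostname:7077 \
  --conf spark.sql.warehouse.dir=hdfs://hostname:9000/user/hive/warehouse \
  --driver-memory 2g \
  --executor-memory 4g \
  --total-executor-cores 8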

2. If Spark SQL runs on HDFS, an HDFS permission exception occurs when a user writes data to other tables through Beeline.

The workaround is to disable the HDFS permission check. In Hadoop 2.7.3, this is controlled by the following parameter:

hdfs-site.xml:

<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>
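For the change to take effect, HDFS has to be restarted. Alternatively, permission checking can be left enabled and the warehouse directory opened up to the user running Spark SQL. A rough sketch follows; the user name and warehouse path are assumptions, adjust them to your deployment:

# Restart HDFS after editing hdfs-site.xml (run from HADOOP_HOME).
sbin/stop-dfs.sh
sbin/start-dfs.sh

# Alternative: keep dfs.permissions.enabled=true and instead grant access
# to the warehouse directory for the user running Spark SQL (user name assumed).
hdfs dfs -chown -R sparkuser:sparkuser /user/hive/warehouse
hdfs dfs -chmod -R 775 /user/hive/warehouse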

3. Spark SQL can execute SQL commands directly in Beeline to create tables stored with Parquet compression:

CREATE TABLE parquettable (name STRING)
USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/users.parquet"
);

Or

CREATE TABLE parquettable (name STRING)
USING org.apache.spark.sql.parquet;
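As a rough sketch of how such statements can be submitted non-interactively, the JDBC URL below assumes the Thrift server is listening on its default port 10000 on a host named hostname:

# Connect Beeline to the Spark Thrift server and run the DDL non-interactively.
bin/beeline -u jdbc:hive2://hostname:10000 \
  -e "CREATE TABLE parquettable (name STRING) USING org.apache.spark.sql.parquet;"

# Verify the table from another connection.
bin/beeline -u jdbc:hive2://hostname:10000 -e "SHOW TABLES; SELECT * FROM parquettable LIMIT 10;"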

In addition:

If, after running the sbin/stop-all.sh command, some worker or master processes in the cluster still cannot exit, this is usually caused by a messy environment; running kill -15 PID on the stuck processes resolves it.

Similarly, if after running sbin/start-all.sh the user finds stray worker or master processes in the Spark cluster, this is also caused by a messy environment, and kill -15 PID works here as well.

To resolve this situation completely, first stop the Spark cluster:

sbin/stop-all.sh

Then use kill -15 to terminate any Spark processes that fail to stop.

Finally, manually delete all /tmp/spark* files on every node in the cluster to ensure the environment is clean.
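Putting the steps together, a minimal cleanup sketch is shown below; the process names and the /tmp/spark* pattern are as described above, and the last two commands should be run on every node:

# 1. Stop the cluster (run on the master, from SPARK_HOME).
sbin/stop-all.sh

# 2. Find any Master/Worker JVMs that refused to exit and terminate them gracefully.
jps | grep -E 'Master|Worker'
kill -15 <pid>          # replace <pid> with the IDs printed above

# 3. Clean up leftover Spark runtime files on every node.
rm -rf /tmp/spark*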
