Spark hardware configuration

Source: Internet
Author: User
Tags: spark, rdd

Storage System

Because most Spark jobs need to read input data from an external storage system (e.g. HDFS or HBase), it is important to place Spark as close to that storage system as possible. We have the following recommendations:

(1) If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark standalone-mode cluster on the same nodes (http://spark.apache.org/docs/latest/spark-standalone.html) and configure Spark's and Hadoop's memory and CPU usage so that they do not interfere with each other. Alternatively, run Hadoop and Spark under the same cluster manager, such as Mesos or Hadoop YARN.

(2) If this is not possible, run Spark on different nodes in the same local-area network as the HDFS nodes.

(3) For low-latency data stores such as HBase, it may be preferable to run the compute jobs on different nodes than the storage system, to avoid interference.

Local Disks

While Spark can perform a large amount of its computation in memory, it still needs local disks to store data that does not fit in RAM and to preserve intermediate output between stages. We recommend 4-8 disks per node, configured without RAID (just as separate mount points). In Linux, mount the disks with the noatime option (http://www.centos.org/docs/5/html/global_file_system/s2-manage-mountnoatime.html) to reduce unnecessary write operations. In Spark, configure the spark.local.dir variable to be a comma-separated list of these local directories (http://spark.apache.org/docs/latest/configuration.html). If you are also running HDFS, it is fine to use the same disks as HDFS.

Memory

In general, Spark can run well on any machine with 8 GB to hundreds of gigabytes of memory. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.
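The disk recommendations above can be sketched as configuration fragments. This is a minimal sketch; the device names, mount points, and directory paths are illustrative assumptions, not taken from the original text:

```shell
# /etc/fstab entries (illustrative device names): mount each data disk
# as its own mount point, no RAID, with noatime to avoid unnecessary
# access-time writes on every read
/dev/sdb1  /mnt/disk1  ext4  defaults,noatime  0 0
/dev/sdc1  /mnt/disk2  ext4  defaults,noatime  0 0

# conf/spark-defaults.conf: point spark.local.dir at a comma-separated
# list of directories, one per disk, so shuffle and spill I/O is spread
# across all the disks
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark
```

Sharing these same mount points with HDFS data directories is fine, per the recommendation above.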
How much memory you need depends on your application. To determine how much memory your application uses for a given dataset size, load part of the dataset into a Spark RDD and use the Storage tab of Spark's monitoring UI (http://<driver-node>:4040) to observe its size in memory. Note that memory usage is greatly affected by the storage level and serialization format; see the tuning guide (http://spark.apache.org/docs/latest/tuning.html) for advice on reducing it. Finally, note that Java VMs do not always behave well with very large amounts of RAM. If you have machines with that much RAM, you can run multiple worker JVMs per node. In Spark's standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.

Network

In our experience, when the data is in memory, many Spark applications are network-bound, and a 10 Gigabit or faster network is the best way to make them run faster. This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and SQL joins. For any given application, you can see how much data Spark shuffles across the network in the application's monitoring UI.

CPU Cores

Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. You should likely provision at least 8-16 cores per machine. Depending on the CPU cost of your workload, you may need more: once data is in memory, most applications are bound by either CPU or network bandwidth.
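Splitting one large-memory node into several smaller worker JVMs, as described above, can be sketched in conf/spark-env.sh for standalone mode. The instance counts and sizes below are illustrative assumptions for a hypothetical machine, not values from the original text:

```shell
# conf/spark-env.sh (standalone mode) -- run several smaller worker JVMs
# on one large-memory node instead of a single huge JVM heap.
# All numbers below are illustrative.

# run 4 worker instances on this node
export SPARK_WORKER_INSTANCES=4
# cores each worker may hand out to executors
export SPARK_WORKER_CORES=8
# memory each worker may hand out; keep the total across all workers
# at roughly 75% of the machine's RAM, leaving the rest for the OS
# and buffer cache
export SPARK_WORKER_MEMORY=24g
```

With this layout, 4 workers x 24 GB keeps each JVM heap modest while still using most of a 128 GB machine.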
