Spark hardware configuration

Source: Internet
Author: User
Tags: spark, rdd

Storage System

Because most Spark jobs need to read input data from an external storage system (e.g. HDFS or HBase), it is important to place Spark as close to that storage system as possible. We have the following recommendations:

(1) If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark standalone-mode cluster on the same nodes (http://spark.apache.org/docs/latest/spark-standalone.html) and configure Spark's and Hadoop's memory and CPU usage so that they do not interfere with each other. Alternatively, run Hadoop and Spark under the same cluster manager, such as Mesos or Hadoop YARN.

(2) If this is not possible, run Spark on different nodes in the same local-area network as the HDFS nodes.

(3) For low-latency data stores such as HBase, it may be preferable to run the compute jobs on different nodes than the storage system, to avoid interference.

Local Disks

While Spark can perform a large amount of its computation in memory, it still needs local disks to store data that does not fit in RAM and to preserve intermediate output between stages. We recommend 4-8 disks per node, configured without RAID (just as separate mount points). In Linux, mount the disks with the noatime option (http://www.centos.org/docs/5/html/global_file_system/s2-manage-mountnoatime.html) to reduce unnecessary write operations. In Spark, configure the spark.local.dir variable to be a comma-separated list of these local directories (http://spark.apache.org/docs/latest/configuration.html). If you are also running HDFS, it is fine to use the same disks as HDFS.

Memory

In general, Spark can run well on any machine with 8 GB to hundreds of gigabytes of memory. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.
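The disk recommendations above can be sketched as configuration fragments. This is a minimal sketch; the device names, mount points, and directory paths are illustrative assumptions, not taken from the original text:

```shell
# /etc/fstab entries (illustrative device names): mount each data disk
# as its own mount point, no RAID, with noatime to avoid unnecessary
# access-time writes on every read
/dev/sdb1  /mnt/disk1  ext4  defaults,noatime  0 0
/dev/sdc1  /mnt/disk2  ext4  defaults,noatime  0 0

# conf/spark-defaults.conf: point spark.local.dir at a comma-separated
# list of directories, one per disk, so shuffle and spill I/O is spread
# across all the disks
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark
```

Sharing these same mount points with HDFS data directories is fine, per the recommendation above.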
How much memory you need depends on your application. To determine how much memory your application uses for a given dataset size, load part of the dataset into a Spark RDD and use the Storage tab of Spark's monitoring UI (http://<driver-node>:4040) to observe its size in memory. Note that memory usage is greatly affected by the storage level and serialization format; see the tuning guide (http://spark.apache.org/docs/latest/tuning.html) for advice on reducing it. Finally, note that Java VMs do not always behave well with very large amounts of RAM. If you have machines with that much RAM, you can run multiple worker JVMs per node. In Spark's standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.

Network

In our experience, when the data is in memory, many Spark applications are network-bound, and a 10 Gigabit or faster network is the best way to make them run faster. This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and SQL joins. For any given application, you can see how much data Spark shuffles across the network in the application's monitoring UI.

CPU Cores

Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. You should likely provision at least 8-16 cores per machine. Depending on the CPU cost of your workload, you may need more: once data is in memory, most applications are bound by either CPU or network bandwidth.
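Splitting one large-memory node into several smaller worker JVMs, as described above, can be sketched in conf/spark-env.sh for standalone mode. The instance counts and sizes below are illustrative assumptions for a hypothetical machine, not values from the original text:

```shell
# conf/spark-env.sh (standalone mode) -- run several smaller worker JVMs
# on one large-memory node instead of a single huge JVM heap.
# All numbers below are illustrative.

# run 4 worker instances on this node
export SPARK_WORKER_INSTANCES=4
# cores each worker may hand out to executors
export SPARK_WORKER_CORES=8
# memory each worker may hand out; keep the total across all workers
# at roughly 75% of the machine's RAM, leaving the rest for the OS
# and buffer cache
export SPARK_WORKER_MEMORY=24g
```

With this layout, 4 workers x 24 GB keeps each JVM heap modest while still using most of a 128 GB machine.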
