Adding a New Node to a Spark Cluster


When a Spark cluster's processing capacity is no longer sufficient, how do you expand it by adding new nodes? This article walks through an example of adding a node to an existing Spark cluster.

1. Cluster environment

The existing Spark cluster consists of 3 machines. The user name is cdahdp and the home directory is /home/ap/cdahdp; each machine is a 2-core, 8 GB virtual machine, and the cluster runs on YARN.

Master: 128.196.54.112 / w118pc01vm01

Slave1: 128.196.54.113 / w118pc02vm01

Slave2: 128.196.54.114 / w118pc03vm01

Related software versions: JDK 1.7, Scala 2.10.4, Hadoop 2.6.0, Spark 1.1.

Now a node needs to be added: 128.196.54.115 / w118pc04vm01, also 2 cores / 8 GB.

Stop the current cluster first: stop Spark, then stop HDFS and YARN.
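For reference, stopping everything from the master typically looks like the following sketch (it assumes the standard sbin scripts shipped with Spark 1.1 and Hadoop 2.6.0, and that $SPARK_HOME and $HADOOP_HOME are set):

# Stop the Spark master and workers
cd $SPARK_HOME/sbin
./stop-all.sh

# Stop YARN, then HDFS
cd $HADOOP_HOME/sbin
./stop-yarn.sh
./stop-dfs.sh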

2. New node requirements

(1) Create the user cdahdp on the new node, with home directory /home/ap/cdahdp, consistent with the cluster's existing machines.

(2) Modify the /etc/hosts file on all nodes, adding the IP/hostname entry for the new node.

(3) Configure SSH so that the new node and the existing nodes in the cluster can SSH to each other without a password.

(4) Install JDK, Scala, Hadoop, and Spark on the new node. The versions, installation directories, and environment variable settings must be consistent with the existing nodes; for example, you can copy them directly from an existing cluster node. A combined sketch of steps (1) through (4) follows this list.
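A minimal combined sketch of steps (1) through (4), run on the new node unless a comment says otherwise. The install locations under /home/ap/cdahdp are assumptions based on this cluster's home directory, not something the original article specifies:

# (1) As root: create the user with the expected home directory
useradd -d /home/ap/cdahdp -m cdahdp

# (2) As root, on every node: make sure /etc/hosts knows the new node
echo "128.196.54.115 w118pc04vm01" >> /etc/hosts

# (3) As cdahdp: create a key pair and exchange it with each node
ssh-keygen -t rsa
ssh-copy-id cdahdp@w118pc01vm01    # repeat for every node, in both directions

# (4) As cdahdp: copy software and environment from an existing node
#     (assumed install paths; adjust to where JDK/Scala/Hadoop/Spark live)
scp -r cdahdp@w118pc01vm01:/home/ap/cdahdp/hadoop /home/ap/cdahdp/
scp -r cdahdp@w118pc01vm01:/home/ap/cdahdp/spark /home/ap/cdahdp/
scp cdahdp@w118pc01vm01:/home/ap/cdahdp/.bash_profile /home/ap/cdahdp/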

3. Configuration file modification

(1) Modify the $HADOOP_HOME/etc/hadoop/slaves file, adding the new node as a slave node.

(2) Modify the $SPARK_HOME/conf/slaves file, adding the new node as a slave node (a sketch covering both slaves edits follows step (3)).

(3) Format the NameNode on the new node:

cd $HADOOP_HOME/bin

./hdfs namenode -format
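For steps (1) and (2), appending the new hostname on the master could look like this sketch; after editing, copy the updated files to every node so the configuration stays consistent:

# Add the new node to the Hadoop and Spark slaves files
echo "w118pc04vm01" >> $HADOOP_HOME/etc/hadoop/slaves
echo "w118pc04vm01" >> $SPARK_HOME/conf/slaves

# Distribute the updated files (repeat for each slave node)
scp $HADOOP_HOME/etc/hadoop/slaves cdahdp@w118pc02vm01:$HADOOP_HOME/etc/hadoop/
scp $SPARK_HOME/conf/slaves cdahdp@w118pc02vm01:$SPARK_HOME/conf/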

4. Start the cluster

Start HDFS, YARN, and Spark:

cd $HADOOP_HOME/sbin

./start-dfs.sh && ./start-yarn.sh

cd $SPARK_HOME/sbin

./start-all.sh
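To confirm the new node actually joined, one quick check (a sketch; jps ships with the JDK) is to list the Java daemons on the new node and look for DataNode, NodeManager, and the Spark Worker:

ssh cdahdp@w118pc04vm01 jps
# Expected to include lines like (PIDs will differ):
#   2345 DataNode
#   2456 NodeManager
#   2567 Worker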

[Screenshots in the original article show the cluster's node list prior to expansion and after expansion.]

5. Cluster load balancing

(1) View basic information about the HDFS cluster: run hadoop dfsadmin -report.

(2) Load balancing: run start-balancer.sh under $HADOOP_HOME/sbin/.

Note: the balancer is a slow operation, so it runs in the background. During balancing, data migrates between nodes at a default bandwidth of 1 MB/s.
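A sketch of the balancing step, including an optional bandwidth increase; the -threshold flag and dfsadmin -setBalancerBandwidth are standard Hadoop 2.x options, and the 10 MB/s value is only an example:

# Inspect how blocks are distributed across DataNodes
hadoop dfsadmin -report

# Optionally raise the balancer bandwidth above the 1 MB/s default
# (the value is in bytes per second; 10485760 = 10 MB/s)
hdfs dfsadmin -setBalancerBandwidth 10485760

# Run the balancer; -threshold 10 allows each DataNode's usage to
# deviate up to 10 percentage points from the cluster average
cd $HADOOP_HOME/sbin
./start-balancer.sh -threshold 10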

[Screenshots in the original article show HDFS usage before load balancing, the balancer being run, and usage after load balancing.]

At this point, the addition of the new node to the Spark cluster is complete.

