Big Data Series (1) -- Hadoop Cluster Environment Configuration


Objective

Big data is without question one of the hottest technology topics right now. The concept and its methodology are spreading everywhere, yet many companies and engineers cannot explain what it actually means, or cannot find a feasible way to put it into practice. Plenty of projects that amount to a few reports and a few T-SQL statements get labeled "big data projects", because putting on the "big data" hat makes a project look impressive and attracts attention from the company and senior management.

First, the concepts and architectures of big data are still contested. At present, the big data solutions that can actually be put into production are the open-source distributed stacks with Hadoop at their core.

Second, this series will not dwell on abstract methodology or concepts. Instead, I will share how a real big data solution is actually implemented, including the related open-source products: Hive, Spark, Sqoop, Hue, ZooKeeper, Kafka, and more.

Finally, every product in the big data ecosystem has a substantial technical foundation behind it, so this series focuses on the concrete points of building and using these systems rather than on vague generalities.

Technical preparation

In this article we look at how to set up a Hadoop cluster environment. Hadoop can be deployed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode. Once you have mastered the fully distributed (cluster) setup, the other two follow naturally; they are generally used only for development and testing. Hadoop's greatest strength is distributed cluster computing, so the fully distributed mode is what you ultimately build in production.

So this article focuses on building the fully distributed Hadoop cluster environment.

Generally, when a company starts to build a Hadoop cluster, the following technical points need to be considered.

First, the choice of hardware

Hardware selection for a Hadoop cluster environment essentially revolves around a few questions:

1. How many nodes does the cluster need?

The point to consider here is how many servers you need, because in a distributed environment each server is a node. The number of nodes should be chosen according to the business scenarios the cluster will serve. Of course, in a distributed cluster, more nodes generally bring better overall performance, but they also mean higher cost.

That said, there is a minimum number of nodes for a Hadoop cluster, offered here for reference.

First, in a Hadoop cluster environment, the NameNode, SecondaryNameNode, and DataNode roles should be assigned to different nodes, so at least three servers are required just for these roles. In addition, once Hadoop jobs have run, another role, the history server, is needed to record how past jobs executed; it is recommended to run that role on a separate server.

So the simplest Hadoop distributed cluster requires at least three servers:

      • The first records the distribution of all data; the process running on it is the NameNode.
      • The second backs up that distribution metadata, so the data can still be recovered if the first server goes down; the process running on it is the SecondaryNameNode.
      • The third stores the actual data; the process running on it is the DataNode.
      • A fourth, optional, server records the history of completed applications; the process running on it is the history server.

2. How should each server in the cluster be configured?

This is a question of hardware configuration: how much memory, what CPU, how much storage, and so on. If the company's budget allows, higher is of course better, but when building a Hadoop environment the following points should be considered.

First, the configuration of each node should be weighted by its role; not every server needs the same specification. In a Hadoop cluster the most important machine is the one running the NameNode, because it schedules and coordinates the whole cluster. One of the most important processes in that role is the ResourceManager, which actually coordinates the work of every node in the cluster. So this server should be configured higher than the others.

Second, while a Hadoop cluster runs, the record of how all data is distributed (the block metadata) is pulled into the NameNode's memory. As the cluster's data grows -- and in big data environments terabytes or petabytes are common -- the metadata grows with it, so more memory is needed. Here is a reference:

As a rule of thumb, 1 GB of NameNode memory can manage about one million block files.

Example: block size 128 MB, replication factor 3, 200 servers, 4 TB of data per server. The NameNode memory required is: 200 (servers) x 4,194,304 MB (4 TB of data) / (128 MB x 3) ≈ 2,184,533 block files ≈ 2.18 million files, so the memory needed is close to 2.2 GB.
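To make the arithmetic explicit, here is a minimal Python sketch of the same estimate. The figures (200 servers, 4 TB each, 128 MB blocks, replication factor 3) are just the example numbers above, not recommendations.

```python
# NameNode memory estimate, following the worked example above.
BLOCK_SIZE_MB = 128                   # HDFS block size
REPLICATION = 3                       # number of replicas per block
SERVERS = 200                         # servers in the example cluster
DATA_PER_SERVER_MB = 4 * 1024 * 1024  # 4 TB per server, expressed in MB

total_data_mb = SERVERS * DATA_PER_SERVER_MB
block_files = total_data_mb / (BLOCK_SIZE_MB * REPLICATION)

# Rule of thumb: about 1 GB of NameNode memory per million block files.
namenode_memory_gb = block_files / 1_000_000

print(f"block files     : {block_files:,.0f}")            # ~2,184,533
print(f"NameNode memory : {namenode_memory_gb:.2f} GB")    # ~2.18 GB
```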

Next, because one machine here acts as the backup, the SecondaryNameNode needs as much memory as the NameNode. Then come the worker (slave) nodes, which also need adequate memory and CPU; here is a reference:

      • First compute the number of virtual cores (vcores) on the node: vcores = number of CPUs x cores per CPU x HT (hyper-threads per core).
      • Then configure memory according to the vcore count: memory = vcores x 2 GB (at least 2 GB per vcore). A small worked sketch follows this list.
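As an illustration of these two rules, here is a quick sketch; the sample hardware (2 physical CPUs, 8 cores each, 2 hyper-threads per core) is hypothetical, not a requirement.

```python
# Worker-node sizing using the two rules of thumb above.
cpus = 2            # physical CPUs per node (hypothetical)
cores_per_cpu = 8   # cores per CPU (hypothetical)
ht_per_core = 2     # hyper-threads per core (hypothetical)

vcores = cpus * cores_per_cpu * ht_per_core   # virtual cores (vcores)
memory_gb = vcores * 2                        # at least 2 GB per vcore

print(f"vcores per node : {vcores}")          # 32
print(f"memory per node : {memory_gb} GB")    # 64
```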

Regarding the CPU: Hadoop performs distributed computation, and its workload is essentially compute-intensive parallel processing, so multi-core CPUs are recommended wherever possible, ideally on every node.

Also, in a large distributed cluster, the nodes communicate and perform IO frequently, which places demands on network bandwidth. Gigabit NICs or better are therefore recommended, and ten-gigabit NICs and switches if the budget allows.

3. How much storage should each node in the cluster have, and is RAID needed?

First, the RAID question. RAID is a storage-layer backup mechanism designed to prevent data loss, and its best use case is a single high-risk server. In a distributed cluster, however, the stored data is already spread across the DataNodes, and Hadoop implements data replication by default, so RAID brings little benefit in a distributed system. The reasoning is simple: backing up data within a single node cannot, by itself, recover the data when that node suffers an unplanned outage.

Now for storage. One thing is clear: the volume of data determines the overall storage size of the cluster, and therefore the scale of the cluster itself.

To give an example:

Suppose we know the existing data volume is 1 TB and that roughly 10 GB of new data arrives each day. Then the storage the cluster will need over the coming year is calculated as:

(1 TB + 10 GB x 365 days) x 3 x 1.3 ≈ 17.8 TB

As you can see, the cluster needs about 18 TB of storage for the year. To explain the formula: the factor of 3 accounts for the redundant copies kept to prevent data loss -- by default HDFS stores three replicas of each block on different servers. The factor of 1.3 reserves roughly 30% of space for the nodes' operating systems and for intermediate computation results.

Next we compute the number of nodes:

Number of data nodes = 18 TB / 2 TB = 9

The division by 2 TB assumes each node has 2 TB of storage. From the cluster's total storage requirement we can thus calculate the number of data-storage nodes: 9.

So the total number of nodes needed is: 9 (data-storage nodes) + 2 (NameNode and SecondaryNameNode) = 11.

That is, 11 servers are needed to run this cluster.
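Putting the sizing exercise together, here is a small Python sketch that reproduces the numbers above; the inputs (1 TB of existing data, 10 GB/day growth, 2 TB usable per DataNode) are the example figures from this section and should be adjusted to your own situation.

```python
import math

# Cluster storage and node-count sizing, following the example above.
EXISTING_TB = 1.0
DAILY_GROWTH_TB = 10 / 1024   # 10 GB per day, in TB
DAYS = 365
REPLICATION = 3               # HDFS keeps three copies by default
OVERHEAD = 1.3                # ~30% headroom for the OS and temporary results
NODE_CAPACITY_TB = 2.0        # usable storage per DataNode (assumed)

raw_tb = EXISTING_TB + DAILY_GROWTH_TB * DAYS
cluster_tb = raw_tb * REPLICATION * OVERHEAD

datanodes = math.ceil(cluster_tb / NODE_CAPACITY_TB)
total_servers = datanodes + 2  # plus the NameNode and SecondaryNameNode

print(f"storage needed : {cluster_tb:.1f} TB")   # ~17.8 TB
print(f"DataNodes      : {datanodes}")           # 9
print(f"total servers  : {total_servers}")       # 11
```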

Second, the choice of software

Software selection for a Hadoop cluster environment revolves around these products: the operating system, the Hadoop version, the JDK version, the Hive version, the MySQL version, and so on.

1. Which operating system should I choose?

Hadoop is developed in Java, so a Linux operating system is recommended. The reason is simple: it is open source and free, and that alone is enough to beat Microsoft's operating systems, because a cluster environment involves many servers and licensing Windows Server for all of them would cost far more. In fact, you can hardly find Microsoft anywhere among the open-source big data products, so in this area it has fallen well behind.

Among open-source Linux distributions there are many versions to choose from; you can look up the differences and advantages of each yourself. Here I will simply tell you the one I recommend: CentOS.

The following introduction is copied from fellow blogger Shrimp:

CentOS is an enterprise-class Linux distribution built freely from the sources of Red Hat Enterprise Linux. Each version of CentOS is supported for seven years (via security updates). A new version of CentOS is released roughly every two years, and each version is updated periodically (about every six months) to support new hardware. The result is a secure, low-maintenance, stable, highly predictable, and highly reproducible Linux environment.

CentOS Features

    • CentOS can be understood as the Red Hat AS series: it is Red Hat AS with improvements, and there is no difference in operation or use.
    • CentOS is completely free, without the serial-number requirement of Red Hat AS4.
    • CentOS's yum command supports online upgrades and immediate system updates, with no need to pay for support services as with Red Hat.
    • CentOS fixes many Red Hat AS bugs.
    • CentOS release notes: CentOS 3.1 corresponds to Red Hat AS3 Update 1, CentOS 3.4 to Red Hat AS3 Update 4, and CentOS 4.0 to Red Hat AS4.

I believe these reasons are enough to win you over.

2. Which Hadoop version should I choose?

Hadoop has gone through many versions over its history, and interested readers can review the changes themselves. Broadly, I split Hadoop into two generations, which I will call Hadoop 1.0 and Hadoop 2.0. As of the time of writing, Hadoop 2.0 has become quite stable and is being adopted widely in enterprises. I will not introduce the two generations in detail here; you can look them up, or refer to my earlier article comparing the two architectures.

So this series is based on the Hadoop 2.0 line.

As for the JDK, its version must match the Hadoop version, and the same goes for the other related products we will analyze later; you can also check the Hadoop official site, so I will not repeat the details here.

Operating system

To make the demonstration easier, I will use virtual machines throughout. Interested readers can download a virtual machine and build this platform along with me. The virtualization software I chose is VMware.

You can download and install it online; the process is simple and needs no explanation. Your PC should have a reasonable configuration, though: at least 8 GB of RAM, otherwise the virtual machines will barely run.

Once the installation is complete, plenty of information about VMware is available online, so I will not repeat it here.

Next we install the Linux operating system. As noted above, we chose CentOS, so we simply download it from the official CentOS site and install it. Remember: it costs nothing!

Official website and documentation

Official homepage: http://www.centos.org/

Official wiki: http://wiki.centos.org/

Official Chinese documentation: http://wiki.centos.org/zh/Documentation

Installation instructions: http://www.centos.org/docs/

When selecting a CentOS version, remember: unless your company requires otherwise, do not choose the newest release but the most stable one. The reason is simple -- nobody wants to be the guinea pig for a brand-new version.

Then pick a stable version to download; here I recommend the 64-bit CentOS 6.8.

Then find the download package on the site and download it.

Before installing each node, we need to prepare the node configuration information in advance: machine name, IP address, role, super-administrator account, memory allocation, storage, and so on. I have listed it in a table for reference:

Machine name    | IP address    | Role         | OS         | Root account / password | Hadoop account / password
Master.hadoop   | 192.168.1.50  | Master       | CentOS 6.8 | root / password01!      | hadoop / password01!
Slave01.hadoop  | 192.168.1.51  | Slave1       | CentOS 6.8 | root / password01!      | hadoop / password01!
Slave02.hadoop  | 192.168.1.52  | Slave2       | CentOS 6.8 | root / password01!      | hadoop / password01!
Slave03.hadoop  | 192.168.1.53  | Slave3       | CentOS 6.8 | root / password01!      | hadoop / password01!
MySQLServer01   | 192.168.1.100 | MySQL Server | Ubuntu     | root / password01!      | hadoop / password01!
MySQLServer02   | 192.168.1.101 | MySQL Server | Ubuntu     | root / password01!      | hadoop / password01!

As you can see, I plan four servers in advance for the Hadoop cluster, assigning each a machine name and an IP address; the IPs need to be on the same network segment. To build the Hadoop cluster we also need to create a dedicated user on every node; I name it hadoop, and for ease of memorization I set all the passwords to password01!.
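To make the preparation concrete, here is a small Python sketch that generates the /etc/hosts entries and user-creation commands from the plan above. The hostnames and addresses are taken from the table; the script itself is only one possible way to organize this step, not a required procedure.

```python
# Turn the planning table into /etc/hosts entries and user-creation commands.
cluster_plan = {
    "Master.hadoop":  "192.168.1.50",
    "Slave01.hadoop": "192.168.1.51",
    "Slave02.hadoop": "192.168.1.52",
    "Slave03.hadoop": "192.168.1.53",
}

HADOOP_USER = "hadoop"

# /etc/hosts entries so every node can resolve the others by name
print("# append to /etc/hosts on every node")
for hostname, ip in cluster_plan.items():
    print(f"{ip}\t{hostname}")

# commands to create the shared hadoop user (run as root on each node)
print(f"\nuseradd {HADOOP_USER}")
print(f"passwd {HADOOP_USER}")
```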

Memory and storage are also configured in advance, although the virtual machines we use can adjust them dynamically based on actual usage.

In addition, I set up two Ubuntu servers to install MySQL Server separately, in a master-slave configuration. Ubuntu has a friendly interface, and the reason for keeping these machines separate from the Hadoop cluster is that MySQL is relatively memory-hungry, so it gets machines of its own. Note that MySQL is not required by the Hadoop cluster itself -- there is no necessary relationship between the two. The purpose of setting it up is for the later installation of Hive for data analysis, and we can also develop and debug on these machines. A Windows machine would work as well; after all, Windows is the platform most of us know best.

Conclusion

This post is already long enough. Follow-ups in this Hadoop big data series will cover topics such as using ZooKeeper to build a highly available Hadoop platform, MapReduce program development, data analysis with Hive, Spark application development, integrating and operating Hue with the cluster environment, data extraction with Sqoop2, and more; interested readers can watch for them.

This article mainly covered what needs to be prepared, and what to watch out for, when building a Hadoop big data cluster. The open-source big data ecosystem is very large, so we will spend a lot of time on application scenarios, but the underlying supporting framework remains Hadoop's YARN computing model and the HDFS distributed file system.

From this introduction you should also get a sense of the skills you will need along the way: operating a Linux system, using and managing the MySQL relational database, basic Java development knowledge, and several other prerequisites.

Questions are welcome as comments or private messages; I am happy to hear from anyone interested in studying data platforms in depth. Let's learn and improve together.

If you found this post worthwhile, please don't hold back your "recommendation".
