Four scenarios for OpenStack deployment to Hadoop

As companies begin to leverage cloud computing and big data technologies, they should consider how to use these tools together. In doing so, an enterprise gains the best analytical processing capability while taking advantage of a private cloud's rapid elasticity and single-tenancy features. How to make the two work together, and how to deploy them, are the problems this article aims to solve.

Some basic knowledge

The first is OpenStack. As the most popular open source cloud platform today, it includes a controller, compute (Nova), storage (Swift), message queuing (RabbitMQ), and networking (Quantum) components. Figure 1 provides a diagram of these components (not including the Quantum networking component).

▲ Figure 1. OpenStack components

Together, these components provide an environment that allows dynamic provisioning of computing and storage resources. From a hardware standpoint, these services can be spread across many virtual and physical servers. For example, most organizations deploy one physical server as a controller node and another physical server as a compute node. Many organizations also choose to isolate their storage environment on a dedicated physical server, which for OpenStack deployments means using a separate server for the Swift storage environment.

The second is big data. It is generally understood as a collection of data from three kinds of sources: traditional data (structured data), sensed data (log data and metadata), and social (social media) data. Big data is often stored using new technology models such as NoSQL distributed databases. There are four types of non-relational database management systems (NRDBMS): column-based, key-value, graph, and document-based. These NRDBMSs gather the source data together, and the aggregated information is then analyzed with an analytical program such as MapReduce.

A traditional big data environment includes an analytics program, data storage, an extensible file system, a workflow manager, a distributed sorting and hashing solution, and a data-flow programming framework. The data-flow programming framework commonly used in commercial applications is Structured Query Language (SQL); open source applications often use SQL alternatives such as Apache Pig for Hadoop. On the commercial side, Cloudera provides one of the most stable and comprehensive solutions, while Apache Hadoop is the most popular open source distribution of Hadoop.

The third is Apache Hadoop. It includes a variety of components, such as the Hadoop Distributed File System (HDFS, an extensible file system), HBase (database/data store), Pig (analysis language), and MapReduce (distributed sorting and hashing). As shown in Figure 2, Hadoop work is decomposed across several nodes, and MapReduce tasks are distributed to trackers.

▲ Figure 2. Part of the HDFS/MapReduce layer

Figure 3 shows how MapReduce performs a job: it takes the input, runs a series of grouping, sorting, and merging operations, and then produces sorted and hashed output.

▲ Figure 3. Advanced MapReduce Diagram
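To make the grouping, sorting, and merging steps above concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, written in Python. The mapper emits key/value pairs, the framework sorts them by key during the shuffle phase, and the reducer merges the counts; the single-file layout and the "map" command-line switch are purely illustrative.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style word count (illustrative sketch).
# Run this file as the mapper (with a "map" argument) and as the reducer
# (no argument); Hadoop Streaming sorts the mapper output by key before
# it reaches the reducer.
import sys

def mapper():
    # Emit "<word>\t1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

Submitted through the hadoop-streaming JAR with this file as both the mapper and the reducer, the job follows the same data flow sketched in Figures 3 and 4, only at trivial scale.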

Figure 4 illustrates a more complex MapReduce task and its components.

▲ Figure 4. MapReduce Data Flow Diagram

Although Hadoop MapReduce is more complex than traditional analytical environments (such as IBM Cognos and Satori proCube online analytical processing), its deployment is still scalable and cost-effective.

Overall considerations

Big data technologies and a private cloud environment are each useful on their own, but combining the two yields a much greater benefit for the business. Although combining them makes the environment more complex, companies can still see remarkable synergy when an OpenStack private cloud and an Apache Hadoop environment are brought together. How best to do it?

Scenario 1. Swift + Apache Hadoop MapReduce

In a private cloud environment, one of the most common big data deployment models is to deploy OpenStack Swift storage alongside an Apache Hadoop MapReduce cluster that provides the processing capability. The advantage of this architecture is that the enterprise gets scalable storage nodes that can keep up with its ever-accumulating data. According to IDC, data is growing at roughly 60% per year, so this approach meets growing data requirements while allowing the organization to launch a pilot private cloud deployment.

The best use case for this deployment model is an enterprise that wants to try out private cloud technology through a storage pool while already using big data technologies internally. Best practice is to first deploy the big data technology into the production data warehouse environment, and then build and configure the private cloud storage solution. Once Apache Hadoop MapReduce has been successfully integrated into the data warehouse environment and the private cloud storage pool has been built and is running correctly, the data in the private cloud storage can be integrated with the pre-configured Hadoop MapReduce environment.
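As a rough sketch of how a pre-configured Hadoop MapReduce environment might pull its input out of the Swift storage pool, the following Python example uses the python-swiftclient library to stage a container's objects onto the local filesystem before they are loaded into HDFS. The endpoint, credentials, and container name are placeholders, and many deployments instead expose Swift to Hadoop directly through the swift:// filesystem support.

```python
# Sketch: stage objects from an OpenStack Swift container onto the local
# node so an existing Hadoop MapReduce job can ingest them.
# Requires python-swiftclient (assumed installed).
import os
from swiftclient.client import Connection

# Placeholder credentials -- replace with your Keystone endpoint and tenant.
conn = Connection(
    authurl="http://controller:5000/v2.0",
    user="demo",
    key="secret",
    tenant_name="demo",
    auth_version="2",
)

container = "warehouse-data"          # hypothetical container of source files
staging_dir = "/tmp/swift_staging"    # local directory later pushed into HDFS
os.makedirs(staging_dir, exist_ok=True)

# List the container, then download each object to the staging directory.
_, objects = conn.get_container(container)
for obj in objects:
    _, body = conn.get_object(container, obj["name"])
    with open(os.path.join(staging_dir, obj["name"].replace("/", "_")), "wb") as f:
        f.write(body)

print("Staged %d objects from Swift container '%s'" % (len(objects), container))
```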

Scenario 2. Swift + Cloudera's Distribution including Apache Hadoop (CDH)

For businesses that are unwilling to start with big data from scratch, there are big data appliances from solution vendors such as Cloudera. Cloudera's Distribution including Apache Hadoop (CDH) means organizations do not have to recruit or train employees on every nuance of Hadoop, so they can achieve a higher return on investment (ROI) from big data. This is particularly appealing for businesses that have neither big data nor private cloud skill sets and want to integrate the technologies into their portfolio in a slow, incremental fashion.

Big data and cloud computing are both relatively new technologies, and many companies want to use them to achieve cost savings, yet hesitate to adopt them wholesale. By leveraging a vendor-supported distribution of big data software, businesses become more comfortable in this area and can learn how to use these technologies to their advantage. Additionally, if large data sets are analyzed with the big data software and kept on private cloud storage nodes, the enterprise can achieve higher utilization. To best integrate this strategy, first install, configure, and manage CDH to analyze the enterprise's data warehouse environment, and then add the data stored in Swift where it is needed.
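The last step of that sequence, adding the data held in Swift to the CDH environment where it is needed, might look roughly like the following sketch: objects already staged on a local edge node (for example by the Swift download shown in the previous scenario) are copied into HDFS with the standard hdfs command-line tool so the CDH analysis jobs can reach them. The paths are illustrative only.

```python
# Sketch: push locally staged Swift data into HDFS on a CDH cluster.
# Assumes the 'hdfs' CLI from the Hadoop client packages is on the PATH.
import glob
import subprocess

staging_dir = "/tmp/swift_staging"        # where the Swift objects were staged
hdfs_target = "/user/etl/swift_ingest"    # hypothetical HDFS landing directory

# Create the target directory (no error if it already exists) ...
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_target], check=True)

# ... then copy each staged file into it, overwriting any previous run so
# the MapReduce jobs always see the latest snapshot of the Swift data.
for path in glob.glob(staging_dir + "/*"):
    subprocess.run(["hdfs", "dfs", "-put", "-f", path, hdfs_target], check=True)

print("Swift data loaded into HDFS at", hdfs_target)
```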

Scenario 3. Swift, Nova + Apache Hadoop MapReduce

Businesses that want a higher degree of flexibility, scalability, and autonomy in their big data environment can take advantage of the innate capabilities of the open source products offered by Apache and OpenStack. To do so, the enterprise needs to make maximum use of both technology stacks, which requires designing the environment with a different mindset than in the solutions described above.

To achieve a fully scalable and flexible big data environment, it must run in a private cloud environment that provides both storage and compute nodes. To do this, the enterprise must build the private cloud first and then add big data. So in this case Swift, Nova, and RabbitMQ are required, with controller nodes to manage and maintain the environment. The question, however, is whether the enterprise needs to partition the environment for different systems and business units (for example, for non-big-data virtual machines or guest instances). If the enterprise is ready to adopt the private cloud completely, Quantum should be added to segment the different environments at the network level (see Figure 5).

▲ Figure 5. OpenStack Architecture

After the private cloud environment has been set up and tested, the Apache Hadoop components can be merged into it. At that point, Nova instances can be used to house the NoSQL or SQL data stores (yes, they can coexist) as well as the Pig and MapReduce instances, while Hadoop can run on separate, non-Nova machines to provide the processing capability. In the near future, Hadoop is expected to run on Nova instances as well, so that the private cloud is contained entirely within Nova instances.
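As a rough illustration of booting a Nova instance that will host one of those Hadoop roles, the following sketch uses the openstacksdk Python library (a newer client than the Quantum-era tooling described in this article); the endpoint, credentials, image, and flavor names are placeholders.

```python
# Sketch: boot a Nova instance intended to host a Hadoop data store or
# Pig/MapReduce role, using the openstacksdk library (pip install openstacksdk).
import openstack

# Placeholder Keystone endpoint and credentials.
conn = openstack.connect(
    auth_url="http://controller:5000/v3",
    project_name="demo",
    username="demo",
    password="secret",
    user_domain_name="Default",
    project_domain_name="Default",
)

# Hypothetical image and flavor prepared for Hadoop worker nodes.
image = conn.compute.find_image("ubuntu-hadoop-node")
flavor = conn.compute.find_flavor("m1.large")

server = conn.compute.create_server(
    name="hadoop-datanode-01",
    image_id=image.id,
    flavor_id=flavor.id,
)
server = conn.compute.wait_for_server(server)  # block until ACTIVE
print("Booted", server.name, "with status", server.status)
```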

Scenario 4. GFS, Nova, Pig, and MapReduce

From an architectural perspective, there are options beyond using OpenStack Swift for scalable storage. This example uses the Google File System (GFS), Nova components, and Apache Hadoop components, specifically Pig and MapReduce. It allows the enterprise to focus on developing private cloud compute nodes used only for processing, while leveraging Google's public storage cloud as the data store. With this hybrid cloud, the enterprise can concentrate on its core processing capability while a third party is responsible for the storage. The model could instead use other vendors' storage solutions, such as Amazon Simple Storage Service (S3); but before using any external storage, the enterprise should build the solution internally on an extensible file system (XFS), test it accordingly, and only then extend it to the public cloud. In addition, depending on the sensitivity of the data, the organization may need to apply data protection mechanisms such as obfuscation, de-identification, encryption, or hashing.
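As a loose sketch of this hybrid model, the following Python example pulls input data from a public-cloud object store into the private cloud's compute nodes before a MapReduce run. It uses Amazon S3 via boto3, one of the alternative vendor stores the text mentions; the bucket, prefix, and paths are placeholders.

```python
# Sketch: fetch externally stored input data into the private-cloud compute
# nodes ahead of a MapReduce run (hybrid-cloud model).
# Requires boto3; credentials are taken from the environment or ~/.aws.
import os
import boto3

s3 = boto3.client("s3")
bucket = "analytics-landing-zone"     # hypothetical external data store
prefix = "clickstream/"               # subset of objects to process
local_dir = "/data/mapreduce_input"   # staging area on a compute node
os.makedirs(local_dir, exist_ok=True)

# List and download the matching objects; sensitive fields should already be
# obfuscated, encrypted, or hashed before they ever leave the enterprise.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        target = os.path.join(local_dir, os.path.basename(obj["Key"]))
        s3.download_file(bucket, obj["Key"], target)
        print("fetched", obj["Key"])
```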

Tips and Hints

When incorporating cloud computing and big data technologies into the enterprise environment, be sure to build up employees' skill sets on both technology platforms. Once your employees understand these technologies, build a lab to test how the two platforms work together. Because there are many different components, follow the validated paths mentioned earlier during implementation. In addition, organizations may experience setbacks when trying to merge the two models; after several attempts, they should consider other approaches, such as appliances and hybrid clouds.

Obstacles and traps

Because these are relatively new technologies, most businesses need to test with existing resources before committing significant capital expenditures (CAPEX). However, without a reasonable budget and personnel training for applying these technologies in the enterprise, pilot and test efforts will fail. Similarly, if the enterprise does not yet have a complete private cloud deployment, it should implement the big data technology first, before the private cloud.

Finally, businesses need to develop a strategic roadmap for their private cloud and big data plans. A successful deployment requires additional analysis work, which may delay the process. To mitigate this risk, an iterative project-management approach should be used, rolling these technologies out to business units in phases. Businesses also need to identify how applying these new technologies will benefit the company, whether through cost savings or enhanced processing capability.
