VI. Introduction to OpenStack extension topics


This note is an introduction to several OpenStack extension topics.

Learning Goals:

    • Learn about the automated deployment of OpenStack
    • Understand the issues that arise when Hadoop is moved to the cloud
    • Learn about Ceph and its application in OpenStack
    • Learn about OpenStack and Docker

The contents of this note are:

    • OpenStack Automated Deployment
    • Problems with moving Hadoop to the cloud
    • Enabling Hadoop in the cloud with OpenStack
    • About Ceph
    • The application of Ceph in OpenStack
    • OpenStack and Docker
1. OpenStack Automated Deployment

When doing OpenStack-based development work, it is wasteful to spend too much time deploying and debugging the environment, so OpenStack provides some tools for automating deployment.

[Automated deployment is divided into three levels:]

    1. Automated installation of a single node. This is often used to prepare a development environment; even a single laptop can host a complete OpenStack environment. The common tool is DevStack (see the local.conf sketch after this list).
    2. Cluster installation. The most common tool is Puppet. Puppet is not an OpenStack community project, but it works closely with OpenStack, and installing OpenStack with Puppet is very common.
    3. Installation and deployment of multiple clusters. A common tool is Fuel, from Mirantis. This level matters especially to operators and companies hosting cloud businesses, because they may face many different customers and manage many OpenStack clusters.
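
As a concrete illustration of the first level, here is a minimal DevStack configuration sketch; the passwords and host IP below are placeholder assumptions, not values from this note:

    # local.conf -- placed in the devstack checkout before running ./stack.sh
    [[local|localrc]]
    ADMIN_PASSWORD=secret
    DATABASE_PASSWORD=$ADMIN_PASSWORD
    RABBIT_PASSWORD=$ADMIN_PASSWORD
    SERVICE_PASSWORD=$ADMIN_PASSWORD
    HOST_IP=10.0.0.10        # address of the all-in-one node (assumed)

Running ./stack.sh with this file in place produces an all-in-one environment suitable for development.
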
2. Problems with moving Hadoop to the cloud

Sahara, introduced earlier, is the component that supports moving Hadoop to the cloud. So what problems does moving Hadoop to the cloud raise?

Hadoop is a tool for big data processing with a very large ecosystem of its own. Its two most important pieces of software are the HDFS distributed file system and the MapReduce task scheduling framework. HDFS splits files into blocks and replicates each block three times across different nodes, forming a highly reliable file system. MapReduce distributes the data we submit, and the jobs that process it, to different nodes for parallel processing. MapReduce's biggest feature is its ability to dispatch tasks to nodes near the data, reducing data transfer over the network as much as possible, so a data-parallel processing system can be built from common, inexpensive x86 servers and network equipment.
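
To make the MapReduce model concrete, here is a minimal word-count job in the style of Hadoop Streaming; it is an illustrative sketch, not code from this note, and the two sections would live in separate files:

    # --- mapper.py: emits "word<TAB>1" for every word read from stdin ---
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # --- reducer.py: Hadoop delivers mapper output sorted by key, so ---
    # --- the count for each word can be summed in a single pass       ---
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))

Hadoop runs many mapper and reducer instances in parallel on different nodes, preferring nodes that already hold the input blocks.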

Hadoop also has a real-world problem: installing, deploying, and debugging it in a production environment is hard, so we would like to move Hadoop to the cloud and provision a Hadoop cluster with a simple point-and-click, like AWS EMR. What stands in the way of this goal?

OpenStack has its own virtual machine scheduling, and Hadoop has its own task scheduler; these are two different levels of resource scheduling. HDFS's three replicas of a block will live in three virtual machines, but if those virtual machines run on the same physical machine and that machine fails, all three replicas disappear at once, defeating the reliability that replication is meant to provide.

To address this, the Hadoop community has an open-source subproject called Hadoop Virtualization Extensions (HVE), a VMware-led effort whose main components make Hadoop's replica placement and task scheduling aware of the virtualization topology, so that, for example, HDFS replicas are not placed on virtual machines that share a physical host.

3. Enabling Hadoop in the cloud with OpenStack

Sahara has two modes, working on the Hadoop-in-the-cloud problem at two levels: deployment management and the provision of data processing as a service.

[Hadoop nodes use local disks to build the distributed file system]

A virtual machine's local root disk (typically /dev/vda on Linux) is usually ephemeral: its life cycle matches that of the instance. If we build HDFS on root disks when moving Hadoop to the cloud, virtual machines are created when needed and destroyed when not, and the data in HDFS vanishes with them, which is unacceptable. An alternative is to build HDFS not on root disks but on volumes provided by Cinder. But this has its own problem: MapReduce tries to dispatch tasks to nodes near the data to reduce network transfer, while a Cinder volume is usually accessed over the network, so this forfeits Hadoop's locality advantage and instead adds network load. Hadoop typically does its parallel processing on generic, inexpensive commodity network equipment, so this is bound to degrade the performance of the whole cluster.

The contradiction, summed up: with local root disks the data is lost; with Cinder volumes the network comes under heavy pressure and data processing slows down.

[The correct approach is:]

The Hadoop cluster still uses local root disks to build HDFS, but the data lives in Swift: the raw data to be processed and the job to be run are not stored in HDFS but uploaded to Swift. When the Hadoop cluster is provisioned, it reads the data from Swift while running the job and saves the results back to Swift afterwards. The virtual Hadoop cluster can then be destroyed, root disks and all, because the data is already safe in Swift.
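
A minimal sketch of that staging step using python-swiftclient; the Keystone endpoint, credentials, and container names are assumptions for illustration:

    # Stage input data and the job binary in Swift before the transient
    # Hadoop cluster exists; the cluster later reads and writes the same
    # container through Hadoop's swift:// filesystem scheme.
    import swiftclient

    conn = swiftclient.Connection(
        authurl="http://controller:5000/v2.0",  # Keystone endpoint (assumed)
        user="demo", key="secret",              # demo credentials (assumed)
        tenant_name="demo", auth_version="2")

    conn.put_container("job-data")
    with open("input.csv", "rb") as f:
        conn.put_object("job-data", "input/input.csv", contents=f)
    with open("wordcount.jar", "rb") as f:
        conn.put_object("job-data", "binaries/wordcount.jar", contents=f)
    # Inside the cluster the same data is addressed as, for example,
    # swift://job-data.sahara/input/input.csv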

The integration of Hadoop and Swift is now an open-source project within the Hadoop community. Speaking of moving Hadoop to the cloud, VMware's BDE (Big Data Extensions) deserves a mention: it is very much like OpenStack's Sahara, but better in almost every respect, and its integration with HVE is also very good.

4. About Ceph

So much for combining Hadoop with OpenStack. Hadoop is usually thought of in terms of big data processing; the other side of big data is storage. When it comes to storage in OpenStack, besides Swift there is also Ceph.

Ceph is most often used to provide block storage services through RBD devices. Architecturally, Ceph is divided into three levels from top to bottom: the access interfaces (the RADOS Gateway, RBD, and Ceph FS), the librados library, and the RADOS object store itself.

Swift is closer to the traditional notion of a document: we upload a file to Swift and it exists there as a single object, a concept that merely distinguishes it from a traditional file system. Ceph's objects are different: a file is actually split into a number of objects that are placed on the underlying storage devices, which is completely different from what an ordinary file object refers to, and the underlying implementation of the whole architecture differs accordingly.

At the bottom of Ceph is RADOS, an object store that is itself a complete storage system; all data stored in Ceph ultimately lands in this layer. The high reliability, high scalability, high performance, and automation that Ceph claims are all provided by this layer. The underlying hardware consists of a large number of storage nodes, whether x86 servers or ARM machines, and the core of each node is something called an OSD, the Object Storage Daemon. In Ceph, OSD refers to the daemon (a piece of software, a service process); it corresponds closely to the industry definition of an OSD (Object Storage Device) and can be regarded as a software version of it.

[The logical structure of the RADOS system can be divided into the following parts:]

    • OSDs: the cornerstone of the entire system
    • Monitors: monitor the status of the entire system
    • Clients: without the client, reading and writing data would be hard to do

[The addressing process in Ceph:]

    1. The file is split into several objects.
    2. The objects are mapped into placement groups (PGs).
    3. Each placement group is mapped to a set of OSDs by the CRUSH algorithm.

A notable feature is that every step is computed rather than looked up in an intermediate table, which gives Ceph better scalability, as the toy sketch below illustrates.
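
A toy sketch of the three addressing steps; PG_NUM, the OSD list, and the replica count are made-up parameters, and rendezvous hashing stands in for the real CRUSH algorithm:

    import hashlib

    PG_NUM = 64                                # placement groups (assumed)
    OSDS = ["osd.%d" % i for i in range(12)]   # toy cluster map (assumed)
    REPLICAS = 3

    def stable_hash(s):
        # Deterministic across processes, so every client computes the
        # same mapping without consulting any central table
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def object_to_pg(obj_name):
        # Step 2: object -> placement group, a hash modulo PG_NUM
        return stable_hash(obj_name) % PG_NUM

    def pg_to_osds(pg):
        # Step 3: PG -> OSDs; real Ceph runs CRUSH over the cluster map
        ranked = sorted(OSDS, key=lambda osd: stable_hash("%d:%s" % (pg, osd)))
        return ranked[:REPLICAS]

    # Step 1: a file is striped into several objects, each mapped on its own
    for i in range(4):
        obj = "report.dat.%04d" % i
        pg = object_to_pg(obj)
        print(obj, "-> pg", pg, "->", pg_to_osds(pg))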

On top of RADOS sits a library called librados, which provides a set of APIs for accessing the RADOS system directly; note that this accesses RADOS, not the whole of Ceph. So what interfaces does Ceph itself provide? They form the level above, which consists of three parts: the RADOS Gateway, the RBD module, and Ceph FS (the file system).
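
A minimal sketch of direct RADOS access through the Python librados bindings; the pool name and configuration path are assumptions:

    # Write and read one object directly in RADOS, below all Ceph interfaces
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed path
    cluster.connect()
    ioctx = cluster.open_ioctx("testpool")                 # assumed pool name

    ioctx.write_full("greeting", b"hello rados")  # whole-object write
    print(ioctx.read("greeting"))                 # b'hello rados'

    ioctx.close()
    cluster.shutdown()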

The RADOS Gateway encapsulates the underlying object store and exposes Swift-compatible and Amazon S3-compatible object storage interfaces, which are also HTTP-based. The RBD block device mentioned earlier is often used as a Cinder backend to provide volumes to virtual machines. The third part, Ceph FS, wraps a POSIX interface on top of Ceph, forming a file system.
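
Because the gateway speaks the Swift API, an ordinary Swift client can talk to it unchanged; a sketch with an assumed gateway endpoint and demo credentials:

    # Store and fetch an object through the RADOS Gateway's Swift-compatible API
    import swiftclient

    conn = swiftclient.Connection(
        authurl="http://rgw-host:8080/auth/v1.0",  # gateway endpoint (assumed)
        user="test:tester", key="testing")         # demo credentials (assumed)

    conn.put_container("demo")
    conn.put_object("demo", "hello.txt", contents=b"stored via RGW")
    headers, body = conn.get_object("demo", "hello.txt")
    print(body)  # b'stored via RGW'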

That covers Ceph's architecture and its data storage process; take care to distinguish the two kinds of object storage and what "object" means in each context.

5. Ceph's application in OpenStack

[Why Ceph is popular:]

    • Through Cinder, Ceph provides inexpensive block storage services for virtual machines. It builds a distributed block storage system on x86 hardware (block storage is emphasized because it carries strong consistency and high performance requirements that object storage does not have and that HDFS meets only weakly). Cinder supports the RBD devices Ceph provides, so with Cinder and Ceph combined you can build an OpenStack cluster without buying storage appliances: a block storage system built on common x86 servers can hand volumes to virtual machines at low cost (see the cinder.conf sketch after this list). In many cases Ceph even outperforms professional storage devices.
    • Ceph provides a Swift-compatible object storage interface, so an OpenStack deployment can use a single storage system for both block and object storage instead of also installing Swift. Do you still need Swift once you have Ceph? Ceph's first fit is block storage with strong consistency, whereas Swift defaults to eventual consistency, which suits object storage scenarios better. Swift also has advanced object storage features, such as storage policies that let users choose how their data is stored, which many commercial object storage devices lack. Swift is therefore a leader in object storage, built for multi-tenancy from the start, and a perfect fit for OpenStack.
    • All of OpenStack's storage services support Ceph as a backend
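
As an illustration of the Cinder and Ceph combination in the first point, a typical RBD backend section of cinder.conf looks roughly like this; the pool, user, and secret UUID are deployment-specific placeholders:

    # cinder.conf -- RBD backend section (values are placeholders)
    [DEFAULT]
    enabled_backends = ceph

    [ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    volume_backend_name = ceph
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337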

[On choosing between Swift and Ceph:]

Case 1: With a small number of nodes, use Ceph, because Swift requires at least five nodes and cannot be shrunk below that.

Case 2: When providing a cloud disk service, you face a large number of users and want a good user experience, including letting users choose how their data is stored. Here Swift is the better choice: first, its support for large-scale deployment is good; second, it has advanced features such as storage policies; third, it is cheaper than Ceph.

6. OpenStack and Docker

Container Technology

A hypervisor in the traditional sense creates an environment on a host in which we can create virtual machines; a guest operating system (guest OS) is installed in each virtual machine, and our middleware and applications are installed on top of that. There is another kind of virtualization technology that installs no guest OS at all: it isolates runtime environments and partitions resources directly on top of the host operating system, forming containers, and lets applications run inside those containers. Underneath, the containers share one operating system, and may even share common libraries.

Compared with the traditional hypervisor approach, this has the big advantage of saving the overhead of a guest OS. It also has disadvantages: a container can only use the same operating system as the host, so there is no way to run something like Windows on a Linux host, and even a different version of the same operating system can be very difficult, because the underlying mechanism uses the host's kernel directly rather than supporting the installation of another operating system. This is what we call container technology; a well-known container implementation on Linux is simply called Linux Containers (LXC).

Container technology really caught fire with the development of Docker. What is Docker? (A quick web search will tell you.)

[Combinations of OpenStack and Docker:]

    • Using Swift as the storage backend for Docker images
    • Scheduling and managing Docker containers with Nova
    • Deploying Docker containers with Heat
    • Using Heat for Docker orchestration
    • Designing and implementing an SDN solution for Docker based on Neutron
    • Using Glance to manage Docker images (see the sketch after this list)
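
A hedged sketch of the last item, in the style of the old nova-docker workflow: export a local Docker image and register it in Glance with container_format set to docker. The endpoint, token, and image name are assumptions:

    # Register a saved Docker image in Glance (nova-docker style sketch)
    import subprocess
    from glanceclient import Client

    token = "<keystone token obtained beforehand>"  # assumption
    glance = Client("2", endpoint="http://controller:9292", token=token)

    # Export the local Docker image to a tar stream
    tarball = subprocess.run(["docker", "save", "ubuntu:latest"],
                             check=True, capture_output=True).stdout

    image = glance.images.create(name="ubuntu:latest",
                                 disk_format="raw",
                                 container_format="docker")
    glance.images.upload(image.id, tarball)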

End.
