Mesos structure and source code analysis


According to the official introduction, Mesos is the kernel of a distributed operating system. Its goal is to let you "program against your datacenter like it's a single pool of resources", that is, to use an entire data center as if it were a single PC. This goal is shared by every system that claims to be a DCOS. This article analyzes Mesos and its surrounding frameworks at the architecture and source-code level, to see how Mesos pursues this goal and how far it still is from reaching it. Finally, it compares the similarities and differences between Mesos and Kubernetes, both of which are influenced by Google's Borg.

Intended audience: engineers interested in Mesos or distributed systems

Design concept and architecture

A sentence from the Mesos paper captures the design philosophy:

Define a minimal interface that enables efficient resource sharing across frameworks, and otherwise push control of task scheduling and execution to the frameworks

In other words: define a minimal interface that supports resource sharing across frameworks, and delegate all other scheduling and execution decisions to the frameworks themselves.

This shows that Mesos does not try to be a one-stop, full-stack system; instead it aims to enable resource sharing at minimal cost. Let's take a look at the official architecture diagram:

Key components and concepts:

    • ZooKeeper: implements master election, providing high availability for the master.
    • Master: the Mesos primary node; it accepts registrations from slaves and framework schedulers and allocates resources.
    • Slave: a worker node; it receives tasks from the master and launches executors to run them.
    • Framework: a system such as Hadoop or MPI, consisting of two parts, a scheduler and an executor. The scheduler runs as an independent process, registers with the master at startup, and decides whether to accept the resource offers the master sends. The executor is invoked by the slave to run the framework's tasks. Mesos ships with two built-in executors, CommandExecutor (which invokes the shell directly) and DockerExecutor; custom executors must provide a URI from which the slave can download them.
    • Task: the unit of resource allocation. Mesos allocates resources and then asks the scheduler whether it can use them to run a task; the scheduler binds tasks to resources and has them executed on a specified slave. A task may be long-running, or a short-lived batch job.

The official documentation also provides a resource-allocation example:

    1. Slave1 reports to the master that it has 4 CPUs and 4 GB of memory available.
    2. The master sends a resource offer to Framework1 describing how many resources are available on Slave1.
    3. The scheduler in Framework1 replies to the master: I have two tasks to run on Slave1, one needing <2 CPUs, 1 GB memory> and the other needing <1 CPU, 2 GB memory>.
    4. Finally, the master sends these tasks to Slave1. Slave1 then still has 1 CPU and 1 GB of memory unused, so the allocation module can offer those resources to Framework2.
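
To make the flow concrete, here is a minimal sketch of step 3 using the Mesos Java bindings (org.apache.mesos). It is not a complete Scheduler implementation; the task name, task ID, and shell command are illustrative placeholders, and the resources match the first task in the example:

    // A minimal sketch (not a complete Scheduler implementation): on receiving an
    // offer, build a TaskInfo that consumes part of it and ask the driver to launch it.
    import java.util.Arrays;
    import java.util.List;
    import org.apache.mesos.Protos.*;
    import org.apache.mesos.SchedulerDriver;

    public class OfferSketch {
      // Would be the resourceOffers() callback of a Scheduler implementation.
      public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
        for (Offer offer : offers) {
          TaskInfo task = TaskInfo.newBuilder()
              .setName("demo-task")                        // placeholder name
              .setTaskId(TaskID.newBuilder().setValue("demo-task-1"))
              .setSlaveId(offer.getSlaveId())              // bind the task to the offering slave
              .addResources(scalar("cpus", 2))             // <2 CPUs, 1 GB memory>
              .addResources(scalar("mem", 1024))
              .setCommand(CommandInfo.newBuilder()         // built-in CommandExecutor: run a shell command
                  .setValue("echo hello"))
              .build();
          driver.launchTasks(Arrays.asList(offer.getId()), Arrays.asList(task));
        }
      }

      private static Resource scalar(String name, double value) {
        return Resource.newBuilder()
            .setName(name)
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(value))
            .build();
      }
    }

Declining is symmetric: the scheduler calls driver.declineOffer(offer.getId()) for offers it does not want, and the master can then offer those resources to other frameworks.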

As this example shows, the core work of Mesos is deliberately small: resource management and allocation, plus task forwarding. Scheduling is implemented by the framework, as are the definition and execution of a task. Mesos allocates resources at task granularity, but because an executor may run several tasks in the same process, the resource limits act more as a flow-control mechanism than as real enforcement at task granularity.

With the official architecture above, you should now have a rough picture of Mesos. Why exactly Mesos does things this way is analyzed below.

Historical evolution

Let's turn the clock back to 2009, when Mesos was invented. Hadoop was maturing and widely used, devouring everyone's servers; Spark was still brewing. In the server configuration management domain, Puppet had not yet released 1.0 and Chef had just appeared; the idea of "infrastructure as code" was slowly being accepted, and Ansible and Salt did not exist yet. Servers were managed by static partitioning: when servers were purchased, it was already decided which services they would run and how much CPU, memory, and disk to configure, and installation and maintenance relied mostly on shell scripts.

Managing servers statically, whether with shell scripts or a tool like Puppet, wastes resources on the one hand; on the other hand, failure recovery and service migration require manual intervention, because such tools manage a service only at deployment time, and the running service process is not kept under the deployment tool's management afterwards. Mesos saw the drawbacks of this approach and tried to build a resource-sharing platform that improves resource utilization and enables dynamic operations.

Seen this way, the design of Mesos becomes easier to understand: the Mesos master corresponds to the Puppet/Salt master, and the Mesos slave to the Puppet/Salt agent. Both send instructions through the master to the slave/agent for execution. The difference is that Puppet/Salt no longer care once an instruction has executed successfully, whereas Mesos keeps the task's state in the master after execution, so the task can be restarted or migrated when it dies.

At the same time, Mesos saw that Hadoop and other distributed systems had already implemented their own schedulers and executors, so it delegated the concrete implementation of scheduler and executor to third-party frameworks through a framework standard. Mesos itself is only responsible for forwarding tasks to slaves, and the framework's executor is invoked by the slave to run the task.

In other words, given the historical timing, Mesos adopted a conservative strategy of evolution.

Resource conflicts and isolation mechanisms

The primary problem of resource sharing is how to resolve resource conflicts:

    1. CPU/memory: Mesos ships with the Mesos containerizer by default, which can restrict CPU and memory through cgroups and namespaces.
    2. Network ports: by default, Mesos assigns each task a random unused port (several ports can be assigned), and the task is expected to obtain the port from an environment variable (see the sketch after this list). This is a convention, not an enforced restriction.
    3. File system: by default the file system is shared between applications, and each task is assigned a sandbox directory as its working directory. The sandbox's location on the host is derived from the task ID, so sandboxes do not conflict, and their life cycle is bound to the task. For applications that keep persistent data, persistent volumes are also supported, implemented through directory mapping. A persistent volume's life cycle is independent of the task and is decided by the scheduler, which means the conflict between persisted data and dynamic migration is left for the scheduler to handle. Alternatively, the Docker containerizer can be chosen to isolate the file system through Docker.
    4. Service discovery and load balancing: Mesos does not support these by default; users must implement them themselves. The Marathon framework on top of Mesos provides marathon-lb, which achieves dynamic load balancing by watching Marathon's events and modifying HAProxy, but it only serves applications deployed through Marathon.
    5. Container improvements: the container built into Mesos is not image-based, while Docker is, which means some features get implemented twice (disk isolation, for example). Moreover, Docker's own daemon mechanism prevents Mesos from managing the container process directly, so Mesos plans to improve the built-in container to support images, compatible with Docker/appc images, and to stop using Docker as the default container. See https://github.com/apache/mesos/blob/master/docs/container-image.md and MESOS-2840.
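
As promised in item 2, here is a minimal sketch of the port convention. It assumes the framework (Marathon, for example) passes the assigned port in an environment variable named PORT0; the variable name is a framework convention, not something Mesos itself sets:

    // A minimal sketch of the port convention: the task reads its assigned port from an
    // environment variable instead of binding a hard-coded one. PORT0 is the Marathon
    // convention; other frameworks may use a different variable name.
    import java.io.IOException;
    import java.net.ServerSocket;

    public class PortFromEnv {
      public static void main(String[] args) throws IOException {
        String assigned = System.getenv("PORT0");          // set by the framework at launch
        int port = (assigned != null) ? Integer.parseInt(assigned) : 8080; // fallback for local runs
        try (ServerSocket server = new ServerSocket(port)) {
          System.out.println("listening on " + server.getLocalPort());
          server.accept();                                 // handle one connection, then exit
        }
      }
    }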

Source code architecture analysis

The core of Mesos is written in C++, mainly on top of a library called libprocess, a C++ actor-model library. (If the actor model is unfamiliar, see my earlier article "Concurrency pain: thread, goroutine, actor". The library also implements Option, Nothing, Try, Future, lambda, and defer, which let me fully experience the magic of C++.) Libprocess largely follows Erlang's model; its actors are called processes, and each process has a distinct ID. For convenience we will call this abstraction an actor here. The actors collaborate as follows:

On the Mesos master node, each framework and each slave is a remote actor. On a slave node, each executor is an actor; the built-in executors run in the same process as the slave, while custom executors run as independent processes that interact with the slave through inter-process communication (network ports).

Through the actor model, Mesos simplifies distributed calls and the complexity of concurrent programming. Actors communicate asynchronously via messages; each only needs to know the other's ID, not where the other lives or whether both are on the same node. The actor manager encapsulated in libprocess knows whether the receiver is a local or a remote actor and, if remote, forwards the message through the request interface. Libprocess also encapsulates the network layer: the transport uses HTTP, different handlers can be registered for different messages, and HTTP long polling is supported for subscribing to events. To make message delivery and parsing more efficient, messages can be encoded in either JSON or protobuf.

The benefit of this architecture is that Mesos has no dependency on a message queue. Distributed message-distribution systems of this kind usually need a message queue or a central store: Salt, for example, uses ZeroMQ, and Kubernetes uses etcd, whereas Mesos depends on no external service and relies solely on the fault-tolerance mechanisms of the actor model. The drawback is the actor model's inherent weakness: because messages are asynchronous, lost messages and timeouts must be handled. Mesos does not guarantee reliable delivery; the delivery policy is "at-most-once", and actors must resolve message loss through timeout-and-retry. Then again, any distributed system that makes remote calls has to deal with similar problems.
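
To make the at-most-once-plus-retry pattern concrete, here is a small framework-agnostic sketch. This is not libprocess code; send() and its lossy behavior are simulated, and the retry parameters are illustrative:

    // A minimal sketch of timeout-and-retry over an at-most-once transport.
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class AtMostOnceRetry {
      // Simulated transport: the message may be silently lost, in which case
      // no acknowledgement ever arrives and the future never completes.
      static CompletableFuture<String> send(String msg) {
        CompletableFuture<String> ack = new CompletableFuture<>();
        if (Math.random() > 0.5) ack.complete("ack:" + msg);  // delivered
        return ack;                                           // otherwise: lost
      }

      static String sendWithRetry(String msg, int maxAttempts, long timeoutMs) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
          try {
            return send(msg).get(timeoutMs, TimeUnit.MILLISECONDS);  // wait for the ack
          } catch (TimeoutException lost) {
            System.out.println("attempt " + attempt + " timed out, retrying");
          }
        }
        throw new TimeoutException("gave up after " + maxAttempts + " attempts");
      }

      public static void main(String[] args) throws Exception {
        System.out.println(sendWithRetry("hello", 5, 100));
      }
    }

Note that retries can deliver the same message more than once, so a real receiver also has to deduplicate or make its message handling idempotent.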

Implementation analysis of the framework

From the analysis above, it is clear that frameworks play an important role in Mesos. If you are developing a distributed system yourself and want it to run on Mesos, you need to consider implementing a framework of your own.

Mesos provides a framework base library, so third parties only need to implement the Scheduler and Executor interfaces. The base library is implemented in C++; the Java version wraps it via JNI, and the Python version wraps it through Python's native extension mechanism. The Go version is an independent implementation that does not depend on the C++ library; its code structure is rather elegant, and it is worth a look if you want to implement actors in Go. The main job of the base library (the SchedulerDriver and ExecutorDriver implementations) is to realize the actor model described earlier: interacting with the master and slave, and calling back the user-defined scheduler and executor when messages arrive.

Here is the Java Scheduler interface:

    public interface Scheduler {
      void registered(SchedulerDriver driver, FrameworkID frameworkId, MasterInfo masterInfo);

      void reregistered(SchedulerDriver driver, MasterInfo masterInfo);

      // The most important method: when the cluster has idle resources, the master asks the
      // scheduler whether to accept or decline the offer. To accept, the scheduler wraps the
      // tasks that use the offer and calls the driver to launch them.
      void resourceOffers(SchedulerDriver driver, List<Offer> offers);

      // Called when an offer is rescinded or has been consumed by another framework.
      void offerRescinded(SchedulerDriver driver, OfferID offerId);

      // Callback invoked when the state of a task changes.
      void statusUpdate(SchedulerDriver driver, TaskStatus status);

      // Custom framework message.
      void frameworkMessage(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, byte[] data);

      void disconnected(SchedulerDriver driver);

      void slaveLost(SchedulerDriver driver, SlaveID slaveId);

      void executorLost(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, int status);

      void error(SchedulerDriver driver, String message);
    }
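
And a minimal sketch of wiring a Scheduler implementation into the JNI-backed driver; MyScheduler stands for your implementation of the interface above, and the framework name and master address are placeholders:

    // Registering a scheduler with the master via MesosSchedulerDriver.
    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Protos.FrameworkInfo;

    public class Main {
      public static void main(String[] args) {
        FrameworkInfo framework = FrameworkInfo.newBuilder()
            .setUser("")                      // empty string: let Mesos fill in the current user
            .setName("demo-framework")
            .build();
        // MyScheduler: your implementation of the Scheduler interface above.
        MesosSchedulerDriver driver =
            new MesosSchedulerDriver(new MyScheduler(), framework, "zk://localhost:2181/mesos");
        driver.run();                         // blocks, dispatching callbacks to MyScheduler
      }
    }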

A framework is independent of the Mesos system: its concrete deployment method and high availability are the framework's own problems to solve, so implementing a complete, highly available framework is still quite complex. The framework mechanism suits distributed systems that need task distribution and scheduling, such as Hadoop and Jenkins. For other distributed systems, such as the Cassandra database, what the framework does is use its scheduler to deploy and manage Cassandra nodes (including maintenance operations such as backup) through a CassandraExecutor; for details see https://github.com/mesosphere/cassandra-mesos.

In addition, the Mesos master itself has no persistent storage: all data lives in memory and is lost after a restart. When the master restarts, the active frameworks and slaves re-register and send their current state, and the master recovers its data from that. So if a framework needs to persist the execution history of its tasks, it must implement persistent storage itself.

Mesos slaves provide a recovery mechanism for restarts of the slave process. By default, when the slave process restarts, the executors/tasks associated with it are killed; but if the framework enables the checkpoint setting in its configuration, the executor/task information related to that framework is persisted to disk and can be recovered after the restart.
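
For illustration, enabling checkpointing is a single flag on FrameworkInfo when the framework registers; here is a sketch continuing the registration example above:

    // Continuing the registration sketch: enable checkpointing so that this framework's
    // executor/task state survives a slave restart.
    FrameworkInfo framework = FrameworkInfo.newBuilder()
        .setUser("")
        .setName("demo-framework")
        .setCheckpoint(true)   // persist executor/task information to disk on the slave
        .build();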

Marathon

Marathon is officially positioned as a private PaaS based on Mesos. It implements a Mesos framework, supports deploying applications as shell commands or Docker containers, provides a web interface, supports settings such as CPU/memory and instance count, and supports scaling individual applications, but it does not support complex cluster definitions.
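
Applications are submitted through Marathon's REST API. Here is a minimal sketch that POSTs an app definition to the /v2/apps endpoint, assuming Marathon listens on localhost:8080; the app id and command are placeholders, and note how the app reads its dynamically assigned port from $PORT0, matching the port convention described earlier:

    // A minimal sketch of submitting an app through Marathon's /v2/apps REST endpoint.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class MarathonDeploy {
      public static void main(String[] args) throws Exception {
        String app = "{\"id\": \"/hello\", \"cmd\": \"python3 -m http.server $PORT0\","
                   + " \"cpus\": 0.1, \"mem\": 32, \"instances\": 2, \"ports\": [0]}";
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:8080/v2/apps").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
          out.write(app.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode()); // 201 Created on success
      }
    }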

Marathon itself is implemented in Scala, also using the actor model. It provides an event-bus interface, so other applications can watch the bus for dynamic configuration, as the marathon-lb mentioned earlier does.

Because Marathon is JVM-based, distributing it as a system package is a bit of a hassle, so for convenience its developers simply distribute a shell script of more than 70 MB. It startled me at first glance: how many lines would a 70 MB shell script contain? Opening it revealed the answer: the Java binary jar is embedded inside. That is one way to solve the problem.

Aurora

Aurora attempts to define tasks and their ordering relationships on top of Mesos with a custom configuration language, to cope with heterogeneity across environments and reduce the complexity of shell scripting. It can be understood as a glue language for Mesos, similar in status to the YAML of Salt/Ansible, except that Mesos itself does not ship a comparable configuration language, so Aurora supplies one as a framework.

Comparison of Mesos and Kubernetes

Although Mesos and Kubernetes both borrow ideas from Borg and their ultimate goals are similar, their solutions differ. Mesos is a bit like federalism: it acknowledges the sovereignty of the states (the frameworks), while the states cede a part of their common mechanisms to Mesos, maximizing resource sharing and improving utilization; the frameworks and Mesos remain relatively independent. Kubernetes is more like a unitary system: it builds a common platform that provides capabilities as comprehensively as possible (network, disk, memory, CPU) and defines a standard for describing cluster applications, so that any complex application can be defined by that standard and run on the platform with minimal changes; the major changes that are needed mostly stem from wanting to enjoy Kubernetes's dynamic scaling. So Mesos tries to do less, and Kubernetes tries to do more. Mesos positions itself as the kernel of a DCOS, but the debate over what an OS kernel should be responsible for has never stopped.

Relatively speaking, Kubernetes is easier to use, while Mesos is more flexible but requires more custom development. Take the Cassandra case mentioned earlier: cassandra-mesos is very complex, whereas Kubernetes's Cassandra example has only one class, implementing KubernetesSeedProvider, which finds Cassandra seed nodes through Kubernetes's service-discovery mechanism. Of course, on Mesos the Cassandra framework can forward backup and management tasks through Mesos, while Kubernetes provides no task-forwarding facility; users can implement such needs via kubectl exec. This example illustrates the difference between the two.

The purpose of the Kubernetes on Mesos project under Kubernetes is to exploit exactly this feature of Mesos, letting Kubernetes share resources with other frameworks on Mesos. If you want to know more about Kubernetes, see the Kubernetes architecture analysis I wrote last year.

Related reading

    1. Return of the Borg: How Twitter Rebuilt Google's Secret Weapon
    2. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
    3. InfoQ Mesos series
    4. libprocess, the C++ actor-model library
    5. marathon-lb, a Marathon-based load balancer
    6. cassandra-mesos
    7. Kubernetes Cassandra example
    8. Mesos deployment tutorials