(Reproduced because the writing is simply too good not to share.)
This is the first in a series of blog posts that detail some of the inner workings of Kubernetes. If you are simply an operator or user of Kubernetes you don't necessarily need to understand these details. But if you prefer depth-first learning and really want to understand how things work, this is for you.
This article assumes a working knowledge of Kubernetes. I'm not going to define what Kubernetes is or explain its core components (e.g. Pod, Node, Kubelet).
In this article we talk about the core moving parts and how they work with each other to make things happen. The general class of systems like Kubernetes is commonly called container orchestration. But orchestration implies there is a central conductor with an up-front plan. However, this isn't really a great description of Kubernetes. Instead, Kubernetes is more like jazz improv. There is a set of actors that are playing off of each other to coordinate and react.
We'll start by going over the core components and what they do. Then we'll look at a typical flow that schedules and runs a Pod.
Datastore: etcd
etcd is the core state store for Kubernetes. While there are important in-memory caches throughout the system, etcd is considered the system of record.
Quick summary of etcd: etcd is a clustered database that prizes consistency above partition tolerance. Systems of this class (ZooKeeper, parts of Consul) are patterned after a system developed at Google called Chubby. These systems are often called "lock servers" as they can be used to coordinate locking in a distributed system. Personally, I find that name a bit confusing. The data model for etcd (and Chubby) is a simple hierarchy of keys that store simple unstructured values. It actually looks a lot like a file system. Interestingly, at Google, Chubby is most frequently accessed using an abstracted file interface that works across local files, object stores, etc. The highly consistent nature, however, provides for strict ordering of writes and allows clients to do atomic updates of a set of values.
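To make the last two properties concrete, here is a minimal sketch of an etcd-style store: filesystem-like keys holding unstructured values, a strictly ordered revision counter, and an atomic multi-key transaction guarded by a compare step. The names (`KVStore`, `txn`) and the `/registry/...` key layout are illustrative, not etcd's actual API.

```python
class KVStore:
    """Toy etcd-like store: hierarchical keys, ordered revisions, atomic txns."""

    def __init__(self):
        self.data = {}       # key -> (value, revision of last write)
        self.revision = 0    # global, strictly ordered write counter

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision

    def get(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

    def txn(self, compares, puts):
        """Apply all puts only if every (key, expected_revision) compare holds."""
        for key, expected_rev in compares:
            entry = self.data.get(key)
            rev = entry[1] if entry else 0
            if rev != expected_rev:
                return False  # conflict: someone else wrote concurrently
        for key, value in puts:
            self.put(key, value)
        return True

store = KVStore()
rev = store.put("/registry/pods/default/web", '{"phase": "Pending"}')
# Atomic update: succeeds only if the key is unchanged since `rev`.
ok = store.txn(compares=[("/registry/pods/default/web", rev)],
               puts=[("/registry/pods/default/web", '{"phase": "Running"}')])
```

The compare-then-write shape is what lets clients build optimistic concurrency on top of the store: a stale writer's transaction simply fails and can be retried.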
Managing state reliably is one of the more difficult things to do in any system. In a distributed system it is even more difficult, as it brings in many subtle algorithms like Raft or Paxos. By using etcd, Kubernetes itself can concentrate on other parts of the system.
The idea of watch in etcd (and similar systems) is critical for how Kubernetes works. These systems allow clients to perform a lightweight subscription for changes to parts of the key namespace. Clients get notified immediately when something they are watching changes. This can be used as a coordination mechanism between components of the distributed system. One component can write to etcd and other components can immediately react to that change.
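The watch pattern can be sketched in a few lines: clients subscribe to a key prefix and get called back on every write under it. This mirrors the shape of etcd watches, not the real wire protocol.

```python
class WatchableStore:
    """Toy store where writes under a watched prefix trigger callbacks."""

    def __init__(self):
        self.data = {}
        self.watchers = []  # list of (prefix, callback)

    def watch(self, prefix, callback):
        self.watchers.append((prefix, callback))

    def put(self, key, value):
        self.data[key] = value
        for prefix, callback in self.watchers:
            if key.startswith(prefix):
                callback(key, value)  # notify subscribers immediately

store = WatchableStore()
seen = []
# One component subscribes to a slice of the keyspace...
store.watch("/registry/pods/", lambda k, v: seen.append((k, v)))
# ...and reacts as soon as another component writes into it.
store.put("/registry/pods/default/web", "Pending")
```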
One way to think of this is as an inversion of common pubsub mechanisms. In many queue systems, the topics store no real user data and the messages that are published to those topics contain rich data. For systems like etcd, the keys (analogous to topics) store the real data while the messages (notifications of changes) contain no unique rich information. In other words, for queues the topics are simple and the messages rich, while systems like etcd are the opposite.
The common pattern is for clients to mirror a subset of the database in memory and then react to changes to that database. Watches are used as an efficient mechanism to keep that cache up to date. If the watch fails for some reason, the client can fall back to polling at the cost of increased load, network traffic and latency.
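A sketch of that mirror-and-watch pattern, under simplified assumptions: the client keeps an in-memory copy of a key prefix, applies watch events as they arrive, and falls back to a full re-list (polling) if the watch breaks. The event tuples and helper names here are invented for illustration.

```python
def apply_event(cache, event):
    """Apply one watch event to the in-memory mirror."""
    kind, key, value = event
    if kind == "PUT":
        cache[key] = value
    elif kind == "DELETE":
        cache.pop(key, None)

def resync(cache, list_fn):
    """Poll fallback: replace the mirror with a fresh full listing."""
    cache.clear()
    cache.update(list_fn())

cache = {}
apply_event(cache, ("PUT", "/pods/web", "Running"))
apply_event(cache, ("PUT", "/pods/db", "Pending"))
apply_event(cache, ("DELETE", "/pods/db", None))
# Watch connection dropped? Re-list the authoritative state.
resync(cache, lambda: {"/pods/web": "Running", "/pods/api": "Pending"})
```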
Policy Layer: API Server
The heart of Kubernetes is a component that is, creatively, called the API Server. This is the only component in the system that talks to etcd. In fact, etcd is really an implementation detail of the API Server, and it is theoretically possible to back Kubernetes with some other storage system.
The API Server is a policy component that provides filtered access to etcd. Its responsibilities are relatively generic in nature, and it is currently being broken out so that it can be used as a control plane nexus for other types of systems.
The main currency of the API Server is a resource. Resources are exposed via a simple REST API. There is a standard structure to most of these resources that enables some expanded features. The nature and reasoning for that API structure is left as a topic for a future post. Regardless, the API Server allows various components to create, read, write, update and watch for changes to resources.
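The resource-oriented REST surface maps verbs to paths in a regular way. The path shapes below follow the real Kubernetes API for namespaced core resources; the `build_request` helper itself is just an illustration.

```python
def build_request(verb, namespace, resource, name=None, watch=False):
    """Map a logical operation to an HTTP method and a Kubernetes-style path."""
    path = f"/api/v1/namespaces/{namespace}/{resource}"
    if name:
        path += f"/{name}"
    if watch:
        path += "?watch=true"
    method = {"create": "POST", "read": "GET", "update": "PUT",
              "patch": "PATCH", "delete": "DELETE", "watch": "GET"}[verb]
    return method, path

# e.g. creating a Pod in the default namespace:
method, path = build_request("create", "default", "pods")
```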
Let's detail the responsibilities of the API Server:
- Authentication and authorization. Kubernetes has a pluggable auth system. There are some built-in mechanisms for both authenticating users and authorizing those users to access resources. In addition there are methods to call out to external services (potentially self-hosted on Kubernetes) to provide these services. This type of extensibility is core to how Kubernetes is built.
- Next, the API Server runs a set of admission controllers that can reject or modify requests. These allow policy to be applied and default values to be set. This is a critical place for making sure the data entering the system is valid while the API Server client is still waiting for request confirmation. While these admission controllers are currently compiled into the API Server, there is ongoing work to make this another extensibility mechanism.
- The API Server helps with API versioning. A critical problem when versioning APIs is allowing the representation of resources to evolve. Fields will be added, deprecated, re-organized and transformed in other ways. The API Server stores a "true" representation of a resource in etcd and converts/renders that resource depending on the version of the API being satisfied. Planning for versioning and the evolution of APIs has been a key effort for Kubernetes since early in the project. This is part of what allows Kubernetes to offer a decent deprecation policy relatively early in its lifecycle.
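The admission step in particular is easy to sketch: each controller in a chain can mutate the incoming object (defaulting) or reject it outright, all before anything is persisted. The two controllers below are simplified stand-ins, not real Kubernetes plugins.

```python
class AdmissionError(Exception):
    """Raised when an admission controller rejects a request."""

def default_namespace(obj):
    obj.setdefault("namespace", "default")  # mutating: fill in a default
    return obj

def require_name(obj):
    if not obj.get("name"):
        raise AdmissionError("object must have a name")  # validating: reject
    return obj

def admit(obj, chain=(default_namespace, require_name)):
    """Run the object through every controller; only then would it be stored."""
    for controller in chain:
        obj = controller(obj)
    return obj

pod = admit({"name": "web"})
```

The important property is ordering: the client is still waiting on its request while this runs, so a rejection surfaces immediately and invalid data never reaches etcd.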
A critical feature of the API Server is that it also supports the idea of watch. This means that clients of the API Server can employ the same coordination patterns as with etcd. Most coordination in Kubernetes consists of one component writing to an API Server resource that another component is watching. The second component will then react to changes almost immediately.
Business Logic: Controller Manager and Scheduler
The last piece of the puzzle is the code that actually makes the thing work! These are the components that coordinate through the API Server. They are bundled into separate servers called the Controller Manager and the Scheduler. These were broken out so that they couldn't "cheat". If the core parts of the system have to talk to the API Server like every other component, it helps ensure that we are building an extensible system from the start. The fact that there are just these two is an accident of history. They could conceivably be combined into one big binary or broken out into a dozen-plus separate servers.
These components implement all sorts of behavior to make the system work. The Scheduler, specifically, (a) looks for Pods that aren't assigned to a node (unbound Pods), (b) examines the state of the cluster (cached in memory), (c) picks a node that has free space and meets other constraints, and (d) binds the Pod to a node.
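Steps (a) through (d) can be sketched as a single loop over simplified state. The real scheduler filters on many constraints and scores candidate nodes; here "free space" is just a toy slot count.

```python
def schedule(pods, nodes):
    """pods: name -> bound node or None; nodes: name -> free slot count."""
    bindings = {}
    for pod, node in pods.items():
        if node is not None:
            continue  # (a) only consider unbound Pods
        # (b) + (c): consult cached cluster state, pick a node with room
        for candidate, free in nodes.items():
            if free > 0:
                nodes[candidate] -= 1
                bindings[pod] = candidate  # (d) record the binding
                break
    return bindings  # would be written back to the API Server

pods = {"web": None, "db": "node-1", "cache": None}
nodes = {"node-1": 0, "node-2": 2}
bindings = schedule(pods, nodes)
```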
Similarly, there is code (a "controller") in the Controller Manager to implement the behavior of a ReplicaSet. (As a reminder, a ReplicaSet ensures that a set number of replicas of a Pod template are running at any one time.) This controller will watch both the ReplicaSet resource and a set of Pods based on the selector in that resource. It then takes action to create/destroy Pods in order to maintain a stable set of Pods as described in the ReplicaSet. Most controllers follow this type of pattern.
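The reconcile step at the heart of that controller looks roughly like this: compare the desired replica count against the Pods matching the selector, then emit create or delete actions to converge. The data structures are heavily simplified stand-ins for the real resources.

```python
def reconcile(replicaset, pods):
    """Return the create/delete actions needed to reach the desired count."""
    matching = [p for p in pods
                if p["labels"].get("app") == replicaset["selector"]["app"]]
    diff = replicaset["replicas"] - len(matching)
    if diff > 0:
        # Too few: create copies of the Pod template.
        return [("create", replicaset["template"]) for _ in range(diff)]
    if diff < 0:
        # Too many: delete the surplus.
        return [("delete", p) for p in matching[:-diff]]
    return []  # converged; nothing to do

rs = {"replicas": 3, "selector": {"app": "web"},
      "template": {"labels": {"app": "web"}}}
pods = [{"labels": {"app": "web"}}, {"labels": {"app": "db"}}]
actions = reconcile(rs, pods)
```

Note that the controller never talks to a node directly: the actions it computes are simply issued against the API Server, and other components react from there.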
Node Agent: Kubelet
Finally, there is the agent that sits on the node: the Kubelet. It authenticates to the API Server like any other component. It is responsible for watching the set of Pods that are bound to its node and making sure those Pods are running. It then reports status back as things change with respect to those Pods.
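A sketch of that sync loop, under toy assumptions: compare the Pods bound to this node (as seen via watch) with what the container runtime reports, start what's missing, stop what's stale, and compute the status to report back.

```python
def sync_node(node, bound_pods, running):
    """bound_pods: pod -> node; running: set of pods the runtime reports."""
    desired = {p for p, n in bound_pods.items() if n == node}
    to_start = desired - running   # bound here but not yet running
    to_stop = running - desired    # running but no longer bound here
    status = {p: "Running" for p in desired}  # would be reported to the API Server
    return to_start, to_stop, status

bound = {"web": "node-1", "db": "node-2", "cache": "node-1"}
to_start, to_stop, status = sync_node("node-1", bound, running={"web", "old-job"})
```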
A Typical Flow
To help understand how this works, let's work through an example of how things get done in Kubernetes.
This sequence diagram shows how a typical flow works for scheduling a Pod. It shows the (somewhat rare) case where a user creates a Pod directly. More typically, the user will create something like a ReplicaSet, and it will be the ReplicaSet that creates the Pod.
The basic flow:
- The user creates a Pod via the API Server, and the API Server writes it to etcd.
- The Scheduler notices an "unbound" Pod and decides which node to run that Pod on. It writes that binding back to the API Server.
- The Kubelet notices a change in the set of Pods that are bound to its node. It, in turn, runs the container via the container runtime (i.e. Docker).
- The Kubelet monitors the status of the Pod via the container runtime. As things change, the Kubelet will reflect the current status back to the API Server.
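The four steps above can be replayed as a toy script, with a dict standing in for the API Server's stored state. Keep in mind that in the real system each step is a separate component reacting to a watch, not a sequential program.

```python
api = {}  # resource name -> resource, standing in for the API Server + etcd

# 1. User creates a Pod; the API Server persists it (unbound, Pending).
api["pod/web"] = {"node": None, "status": "Pending"}

# 2. Scheduler sees the unbound Pod and writes a binding back.
if api["pod/web"]["node"] is None:
    api["pod/web"]["node"] = "node-1"

# 3. Kubelet on node-1 sees the binding and starts the container.
started = api["pod/web"]["node"] == "node-1"

# 4. Kubelet reports the observed status back through the API Server.
if started:
    api["pod/web"]["status"] = "Running"
```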
Summing up
By using the API Server as a central coordination point, Kubernetes is able to have a set of components interact with each other in a loosely coupled manner. Hopefully this gives you an idea of how Kubernetes is more jazz improv than orchestration.
Give us feedback on this article and suggestions for future "under the covers" type pieces. Hit me up on Twitter at @jbeda or @heptio.
(Original address: https://blog.heptio.com/core-kubernetes-jazz-improv-over-orchestration-a7903ea92ca)
Core Kubernetes: Jazz Improv over Orchestration