Bare-Metal k8s Clusters in Large-Scale Use at the American Fast Food Chain Chick-fil-A

Tags: k8s, kubernetes, deployment

The American fast food chain Chick-fil-A runs Kubernetes at the edge in each of its more than 2,000 restaurants, which means roughly 6,000 devices running Kubernetes at any given time. One of the biggest challenges this raises is how to deploy and manage that many Kubernetes clusters on physical machines in restaurants. In this article, Chick-fil-A's technical team shares its experience with cluster management technology selection and with installing and managing bare-metal Kubernetes clusters.


In most cases, Kubernetes is deployed in the cloud, or installed on physical machines by experienced Kubernetes engineers who at least have remote access. At Chick-fil-A, however, our Kubernetes deployments are performed by installers who focus solely on the initial hardware installation. Because the devices bootstrap themselves, the installers never need to connect directly to a computing device: they plug in Ethernet and power cables and check the status of the cluster in a companion application. The entire replacement process is handled by restaurant owners/operators or their teams, who are not highly technical.


The biggest challenge is that our edge computing deployments do not take place in a data center environment.


Edge computing hardware and a typical installation

Cluster Management: The choices we've considered


To address the cluster management challenge, we did a thorough technical survey and considered the following options:

    • Kubespray - We started by investigating the Ansible-based Kubespray, but found it quite fragile. When things went well, we got a cluster; when they didn't, we created a brick that was hard to turn back into a computer. We also found that bootstrapping a cluster with Kubespray was very slow, typically taking up to 30 minutes on our hardware stack. We believe Kubespray will mature over time, but based on our findings we felt we had to explore other directions.

    • Openshift - OpenShift can create Kubernetes clusters, but we did not want to be tied too tightly to a single vendor's solution for critical infrastructure, or to take on the risk of future vendor lock-in.

    • Kops - We are loyal fans of Kops and use it to deploy our cloud "control plane" Kubernetes clusters. Unfortunately, Kops is not a viable bare-metal solution for our edge computing environment. We look forward to seeing it evolve in the future.

    • Kubeadm - Kubeadm is another nice Kubernetes cluster utility. The kubeadm project looks promising, but we think it is more complex than some of the alternatives, especially in terms of its flexibility, including ...

RKE


For our current needs, RKE was the ultimate winner. RKE is an open-source Kubernetes cluster management engine from Rancher Labs. Although we are not using Rancher 2.0 to manage our clusters for the time being, we do like the simplicity RKE brings to initializing and maintaining a cluster.


To use RKE, you identify a leader node and provide it with a YAML configuration file that describes the cluster, primarily the hostnames of the nodes participating in it.

If a node is added to, removed from, or dies in the cluster, the configuration file must accurately describe the current and intended nodes. If the configuration is not kept up to date, cluster initialization or updates will fail. Although we think a missing node should not cause initialization/update to fail, that is the behavior today.
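
For illustration only, the kind of YAML configuration RKE consumes looks roughly like the sketch below; the hostnames, SSH user, and role assignments are hypothetical examples, not our actual restaurant layout:

    # cluster.yml - consumed by `rke up` on the leader node (illustrative values)
    nodes:
      - address: node-1.restaurant.local
        user: rke
        role: [controlplane, etcd, worker]
      - address: node-2.restaurant.local
        user: rke
        role: [worker]
      - address: node-3.restaurant.local
        user: rke
        role: [worker]

Whenever the set of nodes changes, a file like this has to be regenerated to match reality before `rke up` is run again.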


Installation process


Our in-restaurant installation process is very simple: unpack the devices, plug them into power and the labeled switch ports. The devices power on, bootstrap themselves, and create the cluster automatically. RKE lets non-technical users perform installation and replacement through an incredibly simple process, with no knowledge of Kubernetes or even the overall architecture, but it does require a somewhat more complex boot process behind the scenes.

Nodes that are not yet part of a cluster need to coordinate with each other to determine who will join the cluster, and to elect a master node that will perform cluster creation through RKE.


Highlander


To solve this problem, we developed Highlander. Because there can be only one cluster initiator.

Highlander is part of our base edge image. When each node starts, it broadcasts its presence over UDP and asks whether there is an established leader. It also starts listening itself. If there is no reply after a few seconds, it sends another broadcast declaring itself the leader and asking for objections. If no objection arrives, the node becomes the cluster leader and responds as leader to all future requests.

If another node has already claimed the leader role, the new node acknowledges that claim, and the existing leader runs "rke up" to join the new node to the existing cluster.

Nodes communicate regularly to make sure the leader is still there. If the old leader dies, a new leader is elected through a simple protocol that uses random sleeps and leadership claims. While this is simple and unsophisticated, it is easy to reason about and understand, and it works effectively at our scale.
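
We are not reproducing Highlander itself here, but the following Go sketch illustrates the broadcast-and-claim handshake described above. The port number, timeouts, and message strings are assumptions made for the sake of the example, not our actual implementation:

    // Minimal sketch of a UDP "broadcast and claim" leader handshake.
    // Port, timeouts, and message strings are illustrative assumptions.
    package main

    import (
        "fmt"
        "math/rand"
        "net"
        "time"
    )

    const (
        port     = 4321 // hypothetical coordination port
        askMsg   = "WHO_IS_LEADER"
        claimMsg = "I_AM_LEADER"
    )

    func main() {
        // One socket both broadcasts queries and listens for answers.
        conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4zero, Port: port})
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        bcast := &net.UDPAddr{IP: net.IPv4bcast, Port: port}

        // Ask whether a leader already exists, then wait a few seconds for a claim.
        conn.WriteToUDP([]byte(askMsg), bcast)
        conn.SetReadDeadline(time.Now().Add(3 * time.Second))
        buf := make([]byte, 1024)
        for {
            n, from, err := conn.ReadFromUDP(buf)
            if err != nil {
                break // deadline hit: nobody claimed leadership
            }
            if string(buf[:n]) == claimMsg {
                fmt.Println("joining existing leader at", from)
                return // a real node would now wait to be added via `rke up`
            }
        }

        // No leader answered: sleep a random interval to break ties between
        // nodes that booted at the same moment, then claim leadership.
        time.Sleep(time.Duration(rand.Intn(2000)) * time.Millisecond)
        conn.SetReadDeadline(time.Time{}) // clear the read deadline
        conn.WriteToUDP([]byte(claimMsg), bcast)
        fmt.Println("no leader found, claiming leadership")

        // As leader, answer every future query; the real leader would also
        // regenerate the cluster config and run `rke up` when a new node appears.
        for {
            n, from, err := conn.ReadFromUDP(buf)
            if err == nil && string(buf[:n]) == askMsg {
                conn.WriteToUDP([]byte(claimMsg), from)
            }
        }
    }

The random sleep before claiming leadership is what breaks ties when several freshly imaged nodes boot at the same time; whichever node wakes first claims the role, and the others simply acknowledge it.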

Once the leader election is complete, Highlander also ensures that the cluster is properly configured. In our environment, this includes:

• Switching KubeDNS to CoreDNS

• Creating Istio and other core control plane components

• Setting up OAuth identity authentication


Note: Each of our nodes has its own identity and a short-lived JWT for accessing protected resources. Highlander provisions this identity and delivers the token in the form of a Kubernetes secret.
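
As a rough sketch of that delivery mechanism (the secret name, namespace, and key are assumptions, not our actual conventions), a node identity token stored as a Kubernetes secret could look like:

    # Illustrative only: a node identity token delivered as a Kubernetes secret
    apiVersion: v1
    kind: Secret
    metadata:
      name: node-identity        # hypothetical name
      namespace: kube-system     # hypothetical namespace
    type: Opaque
    stringData:
      token: <short-lived JWT issued by Highlander>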


Integration process


While this article focuses primarily on cluster initialization, the same flow covers the end-to-end process of bringing nodes online live in a restaurant.



(Unavoidable) failures


Finally, we want to share some of our experience with failures. When infrastructure fails, we want to be able to respond resiliently. Node failure can have many causes: device failure, a failed network switch, a power cable accidentally unplugged. In all of these cases we must quickly diagnose what actually caused the failure and which anomalies are unrelated. That process is complex, and we will share it in a future article. That said, when diagnosis fails, our process is to ship a replacement device pre-loaded with our base image (along with visual installation instructions) to the restaurant and have the restaurant owner/operator or their team perform the swap.

In the meantime, our Kubernetes cluster continues to run with fewer nodes and is ready to welcome the replacement node when it arrives.

