CoreOS system upgrade details
Some time ago I answered a question on DockerOne about upgrading CoreOS. On further reflection, the topic has many details worth covering, so this article is offered as a reference for users in China who are already running CoreOS.
System upgrades with CoreOS features
One of the original design goals of CoreOS is "solving the security problems caused when known vulnerabilities are maliciously exploited because common server systems and software on the Internet are not upgraded and patched in time". Its upgrade method is therefore unique among Linux distributions, especially compared with mainstream server-side systems.
Smooth upgrade
On the one hand, common server systems such as RedHat, CentOS, Debian, Ubuntu, and even FreeBSD and Windows Server all have clear version boundaries. Either they cannot be upgraded online directly to a new release, or they can (as with Debian/Ubuntu and Windows versions after Windows 7) but the cross-version upgrade carries compatibility risks, and a failed upgrade often leaves the system stuck in a dilemma.
Some newer Linux distributions, such as Arch Linux, have addressed this problem. They replace the practice of shipping a major release that has accumulated many patches with fast iterative updates on a cycle of a month or less, combined with smooth upgrade and rollback provided by the system itself. In this way, a system can be brought up to the latest security patches at any time and from any version. However, the smooth-upgrade systems represented by Arch still occasionally become unusable after an update; search Baidu for the keyword "Arch upgrade problem" and you will find many such complaints. In fact, Arch's target users are Linux enthusiasts who like to be early adopters, not server administrators or server application architects.
So is the idea of smooth upgrades unworkable on server systems? Not necessarily. A careful analysis of why smooth upgrades go wrong shows that the most critical factor is that the system design can only guarantee a smooth upgrade from a clean system. If you modify core components of the system (for example, upgrade Python 2 to Python 3), the result is outside the control of the operating system's designers. This amounts to an unfair after-sales "overlord clause" (just a metaphor, since these Linux systems are actually free of charge): modify it yourself and the warranty is void.
In the past, anyone using a server system had to install additional service software and programs on it, so modifying the system, intentionally or not, was almost unavoidable. This problem went unsolved until application containers (especially Docker) were widely promoted in recent years. CoreOS uses containers to cleverly keep users from tampering with the system, and proposes a different solution: a read-only system partition, with services run in containers. Admittedly, this simply replaces one overlord clause with another. However, the new "clause" brings additional benefits, and it is a much better fit for servers that demand high stability and security.
"So many things need to be automated anyway, so why not just use containers?" Well, that was happily decided.
Automatic update
On the other hand, apart from major version upgrades, minor patch updates for the system and key software often fail to be applied in time because of administrator negligence. This is another important cause of security incidents (for example, the BrowserStack breach in 2014).
The solution is simple: automatic updates. Of course, this is hardly a new idea. It may be uncommon for operating systems, but it has long been routine for application software. So how can automatic updates be done in a more inventive way? First, let's look at the pitfalls of system upgrades.
At first glance, an operating system faces problems similar to those of ordinary applications during an upgrade. For example, software generally cannot be hot-patched while it is in use, and the same is true of the system (the Linux 4.0 kernel is already working to solve this pain point, so it may become a non-issue in the future). Likewise, running software may have mutual dependencies, and so may the core components of the system, so any upgrade involves version matching. There are also differences. For example, many applications are portable and need no installation, so an upgrade simply replaces the old files with new ones. Making an operating system installation-free requires some special techniques.
The following describes in sequence how CoreOS handles these pitfalls.
Since the system cannot be hot-patched, an upgrade involves a restart, which is quite taboo for server systems. To avoid service interruption during a restart, CoreOS is designed with automatic service migration, provided by its core component Fleet. Of course, this is not a perfect solution, and I believe more inventive approaches will replace it in the future.
At the application level, a good solution to the version matching problem is the container: package all dependencies and deploy them together, and replace the entire container image on every update. The same idea applies to the operating system. Every CoreOS update is a whole-system upgrade: it downloads the complete system image, performs MD5 verification, and on restart replaces the kernel together with its surrounding dependencies. The additional benefit is that each update either succeeds completely or fails completely, and there is never an embarrassing situation where the upgrade is only partially successful.
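The "verify the whole image before using it" step can be illustrated with a minimal Python sketch. It follows the article in using MD5; the image bytes and the published checksum are made-up stand-ins, and this is not the actual CoreOS update code.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(stream, expected_md5):
    """True only if the complete image matches the published checksum."""
    return md5_of_stream(stream) == expected_md5.lower()


# Hypothetical stand-in for a downloaded system image.
image = io.BytesIO(b"pretend this is a complete system image")
published = hashlib.md5(image.getvalue()).hexdigest()
print(image_is_intact(image, published))
```

Because the check covers the entire image, a damaged download is rejected as a whole, which matches the all-or-nothing behavior described above.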
What is wrong with installing the update after a restart, the way an installation-free application would be replaced? Two things. First, application software is hosted and launched by the operating system, so the system can replace its files. The operating system itself, however, is launched by a few lines of boot code in the boot sector. Unpacking an image and replacing the system in that cramped setting is like holding a ceremony inside a snail shell: there is no room to work (remember the waiting screen every time Windows or Mac upgrades the system). Second, a system upgrade must be able to roll back, otherwise how could it be used in production? Even ignoring the time needed to replace files during startup, if the system cannot boot after the update and the original system has already been overwritten, you have laid a mine for yourself. So for a fast and safe upgrade, installing the update after a restart is not the best solution in terms of either startup time or rollback difficulty.
To this end, CoreOS has an unusual trick. Readers who have visited the CoreOS homepage will have seen it: A/B dual system partitions. As the diagram there shows, when CoreOS is installed, two independent system partitions are set aside on the hard disk (roughly 1 GB of space), and only one of them is used as the running system at any time. While the system is running, the new system image downloaded in the background is deployed to the standby partition. On restart, all that is needed is a bit of logic to swap the active and standby roles of the two partitions, and the upgrade completes within minutes. If startup fails, CoreOS automatically detects this and switches back to the partition that was working before. Booting directly from a pre-deployed partition avoids installing updates on the spot after the restart. This change of thinking is indeed quite clever.
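The switching logic can be pictured with a toy Python model. It is only a sketch under stated assumptions: the partition names "A" and "B", the new version number, and the boots_ok health-check callback are all hypothetical, and real CoreOS does this at the bootloader and partition-table level rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class ABSystem:
    """Toy model of two system partitions; only one is active at a time."""
    versions: dict = field(default_factory=lambda: {"A": "607.0.0", "B": None})
    active: str = "A"

    @property
    def standby(self):
        return "B" if self.active == "A" else "A"

    def stage(self, new_version):
        """Deploy a downloaded image to the standby partition while running."""
        self.versions[self.standby] = new_version

    def reboot(self, boots_ok):
        """Try to boot the standby partition; fall back if it fails.

        boots_ok is a hypothetical callback standing in for a boot health
        check. Returns the version that is running after the reboot.
        """
        candidate = self.standby
        staged = self.versions[candidate]
        if staged is not None and boots_ok(staged):
            self.active = candidate  # swap the active and standby roles
        # Otherwise self.active is unchanged: the old partition was never
        # overwritten, so rollback is just booting it again.
        return self.versions[self.active]


system = ABSystem()
system.stage("612.1.0")                 # hypothetical newer version
print(system.reboot(lambda version: True))
```

The point of the design is visible in reboot: nothing is installed at boot time, and a failed boot costs nothing because the previous partition is still intact.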
A digression. I once discussed dual system partitions with other CoreOS fans at a Meetup, and at the time we reached the same conclusion: since a restart is still required, the two partitions bring users no practical benefit. In other words, "smooth upgrade is the selling point, and dual partitions are just a gimmick". I expressed similar views in the CoreOS Practice Guide series. Only later, after thinking carefully about this clever design, did I realize how one-sided that view was.
These methods are easy to describe, but implementing them takes far more than a flash of inspiration. Among open-source Linux systems, CoreOS is the only one that has truly implemented this kind of background update design.
Upgrade parameter configuration
Having covered how CoreOS upgrades itself, let's move on to the related configuration. The options for CoreOS system upgrades are usually passed in when the server starts for the first time, through the coreos.update item of cloud-init. After the system has started, they can be changed in the /etc/coreos/update.conf file. The configurable attributes are the upgrade channel, the upgrade policy, and the upgrade server. All three were mentioned in the DockerOne answer, and here we go a little deeper.
Initialize upgrade configuration
This is the most common way to configure upgrade parameters. When the system starts for the first time, cloud-init completes most of the initialization tasks for the node and the cluster. The three keys under coreos.update relate to CoreOS upgrades. The following is an example:
    coreos:
      update:
        reboot-strategy: best-effort
        group: alpha
        server: https://example.update.core-os.net
Only group is required. It specifies the system upgrade channel. The default value of the upgrade policy reboot-strategy is best-effort, and the default value of the upgrade server is the official upgrade server of CoreOS.
Modify upgrade configurations
For clusters that are already running, you can modify the upgrade parameters in the /etc/coreos/update.conf configuration file. The format is simple and clear. Example:
    GROUP=alpha
    REBOOT_STRATEGY=best-effort
    SERVER=https://example.update.core-os.net
Similarly, in most cases the file contains only the GROUP value, because it is the only required one. The other two lines may be absent, in which case the default values are used.
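To make the defaulting rule concrete, here is a minimal Python sketch that reads KEY=VALUE text in the format shown above. It is an illustration, not the parser CoreOS uses: the handling of blank lines and "#" comments is an assumption, and None is used here only as a placeholder meaning "the official upgrade server".

```python
DEFAULTS = {
    "REBOOT_STRATEGY": "best-effort",
    "SERVER": None,  # placeholder for "use the official upgrade server"
}


def parse_update_conf(text):
    """Parse KEY=VALUE lines in the style of /etc/coreos/update.conf."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: skip blanks/comments
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    if "GROUP" not in settings:
        raise ValueError("GROUP is required")
    return settings


print(parse_update_conf("GROUP=alpha"))
```

A file with only GROUP=alpha therefore yields the best-effort policy and the default server, matching the behavior described above.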
Note that:
- After each modification, you need to run the sudo systemctl restart update-engine command to make the configuration take effect.
- Modifying the configuration of one node does not affect the upgrade configuration of other nodes in the cluster.
- It is best to have all nodes in a cluster use the same upgrade channel for easier management, although mixing channels does not directly cause problems.
- Using cloud-init is preferred. Settle the system parameters at initialization to avoid extra modification work later.
Upgrade channel
The upgrade channel indirectly defines the target version of each CoreOS upgrade. The idea is probably borrowed from the Chrome browser. CoreOS provides three official upgrade channels: Alpha (internal test version), Beta (public test version), and Stable (official release version). For example, if the user configures the Alpha channel, each update moves the system to the latest Alpha version. The Alpha version is similar to the so-called "development version" of Chrome: it receives new features first, and its stability is generally quite good, but it is not suitable for production servers. Its main audience is developers and enthusiasts who want to try new things. The Beta version is somewhat more stable and still receives new features fairly quickly, so it suits project development and test environments. Components in the Stable version are often not the latest, but it has the highest stability and is suitable for production servers. CoreOS currently uses integers for version numbers: the larger the number, the newer the release.
The update frequency of each channel is as follows (see the official blog statement):
- Alpha: released on Thursday every week
- Beta: released every two weeks
- Stable: released once a month
The current system version and built-in component versions of each channel can be viewed on this webpage.
In addition to the three public channels, users who subscribe to the CoreUpdate service can define their own upgrade channels, but this is a paid service. Users of the managed Enterprise Edition of CoreOS also get this feature at no extra charge. The Enterprise Edition starts at $100/month for up to 10 nodes. See this link. A separate on-premises Enterprise Edition service starts at $2100/month for up to 25 nodes. The difference is that it adds human technical support, and human labor is evidently the most expensive part.
Upgrade policy
The upgrade policy mainly governs how the restart is handled after an automatic upgrade. Its value can be best-effort (default), etcd-lock, reboot, or off. They work as follows:
- best-effort: if Etcd is running normally, equivalent to etcd-lock; otherwise, equivalent to reboot.
- etcd-lock: restarts automatically after an automatic upgrade, using the LockSmith service to schedule the restart.
- reboot: restarts the system immediately after an automatic upgrade.
- off: waits for a manual restart after an automatic upgrade.
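The relationship between the four values can be written down as a tiny Python function. This is only a restatement of the list above as code; the function name and the etcd_running flag are illustrative, not part of any CoreOS tool.

```python
VALID_STRATEGIES = ("best-effort", "etcd-lock", "reboot", "off")


def effective_strategy(configured, etcd_running):
    """Resolve the configured policy to the behavior actually applied."""
    if configured not in VALID_STRATEGIES:
        raise ValueError(f"unknown reboot strategy: {configured!r}")
    if configured == "best-effort":
        # best-effort defers to etcd-lock when Etcd is healthy,
        # and falls back to an immediate reboot when it is not.
        return "etcd-lock" if etcd_running else "reboot"
    return configured


print(effective_strategy("best-effort", etcd_running=True))
```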
The default is best-effort, which is usually equivalent to the etcd-lock policy. CoreOS's LockSmith service schedules the restarts during an upgrade. Its main purpose is to prevent too many nodes from restarting at the same time, which could interrupt external services or cause the Etcd leader election to fail. Its working principle is very simple. The coreos.com/updateengine/rebootlock/semaphore path in Etcd shows all of its configuration:
    $ etcdctl get coreos.com/updateengine/rebootlock/semaphore
    {
      "semaphore": 0,
      "max": 1,
      "holders": [
        "010a2e41e747415ba51212fa995801dd"
      ]
    }
A fixed number of locks is configured, and only a host that holds a lock may restart to complete its upgrade. A host without a lock waits and watches for the lock to change. After its upgrade, a node releases the lock it held, which signals the other nodes to start the next round of competition for the upgrade lock.
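The counting behavior of this lock can be sketched in a few lines of Python. The field names mirror the JSON shown in this article, but everything else is an assumption: this is an in-memory toy, whereas a real implementation would have to update the shared value in Etcd atomically so that two nodes cannot take the same slot.

```python
class RebootLock:
    """In-memory sketch of a counting reboot lock (not the LockSmith code)."""

    def __init__(self, max_holders=1):
        self.state = {"semaphore": max_holders, "max": max_holders, "holders": []}

    def acquire(self, machine_id):
        """Take a slot if one is free; return True when the node may reboot."""
        if machine_id in self.state["holders"]:
            return True
        if self.state["semaphore"] <= 0:
            return False  # no free slot: the node must wait and retry later
        self.state["semaphore"] -= 1
        self.state["holders"].append(machine_id)
        return True

    def release(self, machine_id):
        """Give the slot back after the node has rebooted."""
        if machine_id not in self.state["holders"]:
            return False
        self.state["holders"].remove(machine_id)
        self.state["semaphore"] += 1
        return True

    def set_max(self, new_max):
        """Change the total number of slots; return the old total."""
        old_max = self.state["max"]
        self.state["semaphore"] += new_max - old_max
        self.state["max"] = new_max
        return old_max


lock = RebootLock(max_holders=1)
print(lock.acquire("node-a"), lock.acquire("node-b"))  # hypothetical node IDs
```

With one slot, the second node is refused until the first releases, which is how simultaneous restarts are kept from taking the whole cluster down.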
Besides modifying the Etcd content directly, CoreOS also provides the locksmithctl command for viewing the status of the LockSmith service or setting the number of upgrade locks more intuitively.
View the status of the update lock:
    $ locksmithctl status
    Available: 0    <-- number of remaining locks
    Max: 1          <-- total number of locks
    MACHINE ID
    010a2e41e747108ba51212fa995801dd    <-- node holding the lock
The lock holder shown is the Machine ID of a node that has downloaded and deployed the new system version and is waiting to restart or about to restart (depending on the upgrade policy). The locksmithctl set-max command modifies the number of update locks, that is, the number of nodes that may restart at the same time:
    $ locksmithctl set-max 3
    Old: 1
    New: 3
If you check the status again with locksmithctl status, you will see that Max has changed to 3.
In addition, the locksmithctl unlock command releases the update lock held by a node. This command is rarely needed. The exception is when a node that has obtained the lock cannot restart for some special reason (such as a hardware fault like a disk error) and therefore holds the lock indefinitely. In that case the lock must be released manually.
Upgrade server
Many users who want to run CoreOS on an intranet ask whether they can build their own upgrade server there. The answer is yes.
Unfortunately, the CoreOS upgrade server is part of the CoreUpdate service, which means it is a paid feature. However, since most users who build server clusters on their own intranets are enterprise users, the charge is fair.
According to the documentation, the server protocol used by CoreOS is fully compatible with that of Google's ChromeOS, and the two can even stand in for each other. Interestingly, both have open-sourced their operating systems, yet neither has open-sourced its upgrade server implementation. Presumably the reasoning is: if users were allowed to build their own upgrade servers, who would ensure that the images on those servers are up to date? And what would then be the point of the system security that automatic upgrades provide?
By the way, the CoreOS SDK includes a start_devserver tool for testing and deploying CoreOS images you have built yourself (the system is open source, after all). So if you download the official image and feed it to this tool, you should be able to build your own intranet upgrade server. However, the official documentation is vague on this, so I offer it only as a suggestion.
Manual system upgrade
CoreOS always downloads and deploys new system versions automatically in the background, even when the upgrade policy is set to off (that setting only disables the automatic restart). In most cases, therefore, you do not need to trigger a system upgrade manually unless you urgently need to test a version. Still, since there will always be users who are obsessive about running the newest version (in practice mainly for system testing), CoreOS provides a way to update manually.
View the current system version
More often than updating manually, users may simply want to see which version is currently deployed. The method is simple: check the os-release file.
    $ cat /etc/os-release
    NAME=CoreOS
    ID=coreos
    VERSION=607.0.0
    VERSION_ID=607.0.0
    BUILD_ID=
    PRETTY_NAME="CoreOS 607.0.0"
    ANSI_COLOR="1;32"
    HOME_URL="https://coreos.com/"
    BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
This file is actually a symbolic link to the /usr/lib/os-release file on the system partition. Since the latter is part of the read-only partition, you do not have to worry about its content being tampered with.
Automatic upgrade frequency
CoreOS automatically checks for a new system version 10 minutes after startup and every hour after that. If a new version is detected, it is automatically downloaded and deployed to the standby partition, and the upgrade policy described earlier then determines whether the node restarts automatically. It is that simple.
The detailed update check records can be viewed with the journalctl -f -u update-engine command.
Manually trigger upgrade
The following command is for users who are obsessive about upgrading.
The command is very simple: update_engine_client -update
If the message "Update failed" is displayed, it means the current version is already the latest. (Do not ask me why the message is so unfriendly; I do not understand it either.) If a new system version is detected, it is immediately downloaded and deployed to the standby system partition.
    $ update_engine_client -update
    [0404/032058:INFO:update_engine_client.cc(245)] Initiating update check and install.
    [0404/032058:INFO:update_engine_client.cc(250)] Waiting for update to complete.
    LAST_CHECKED_TIME=1428117554
    PROGRESS=0.000000
    CURRENT_OP=UPDATE_STATUS_UPDATE_AVAILABLE
    NEW_VERSION=0.0.0.0
    ... ...
    CURRENT_OP=UPDATE_STATUS_FINALIZING
    NEW_VERSION=0.0.0.0
    NEW_SIZE=129636481
    Broadcast message from locksmithd at 2015-04-04 03:22:56.556697323 +0000 UTC:
    System reboot in 5 minutes!
    LAST_CHECKED_TIME=1428117554
    PROGRESS=0.000000
    CURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT
    NEW_VERSION=0.0.0.0
    NEW_SIZE=129636481
    [0404/032258:INFO:update_engine_client.cc(193)] Update succeeded -- reboot needed.
After the deployment is complete, if your upgrade policy is not off, the system sends a message to all logged-in users: "System reboot in 5 minutes!". Of course, you will also be disconnected from your SSH session five minutes later. When you log back in, you will find that the system is running the new version.
A better upgrade strategy
Looking at the four upgrade policies of CoreOS, readers may have noticed a problem. The first three policies restart the server soon after the new system version has been downloaded and deployed. If this happens during peak traffic, a brief service interruption may still occur even though services are automatically migrated to other nodes. The fourth policy, on the other hand, simply waits for an administrator to restart the system, which introduces manual intervention. If the restart is not done in time, the cluster goes without necessary security updates.
Is there a way to avoid restarting servers during peak hours without leaving them un-updated for a long time? CoreOS provides a recommended solution, which I call the fifth upgrade policy: automatic restart based on scheduled checks.
This upgrade policy is not included in the built-in options. We need to do some additional work:
- Set the upgrade policy to off.
- Add a service that checks whether a new system version has been deployed to the standby partition, and restarts the node if so.
- Add a timer that triggers this service during the cluster's off-peak hours.
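The decision made by the checking service in the steps above can be sketched in Python. This is a hedged illustration rather than the recommended CoreOS setup: it assumes the status text uses the same KEY=VALUE lines as the update_engine_client output shown earlier in this article, and the 03:00 to 05:00 off-peak window is a made-up example. The sketch only decides; actually obtaining the status and triggering the restart are left to the service and timer.

```python
from datetime import time


def parse_status(output):
    """Turn KEY=VALUE lines of update status text into a dict."""
    status = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            status[key] = value
    return status


def should_reboot(status_output, now,
                  window_start=time(3, 0), window_end=time(5, 0)):
    """Reboot only if an update is staged and we are inside the window.

    The window bounds are hypothetical off-peak hours; with a timer that
    already fires only off-peak, the window check is a second safeguard.
    """
    status = parse_status(status_output)
    staged = status.get("CURRENT_OP") == "UPDATE_STATUS_UPDATED_NEED_REBOOT"
    in_window = window_start <= now < window_end
    return staged and in_window


sample = "LAST_CHECKED_TIME=1428117554\nCURRENT_OP=UPDATE_STATUS_UPDATED_NEED_REBOOT\n"
print(should_reboot(sample, time(3, 30)))
```

Keeping the decision in a pure function makes it easy to check the policy separately from the parts that touch the real system.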