ZooKeeper Vulnerability Analysis


For those who do not know it, ZooKeeper is a well-known open-source project that provides highly reliable distributed coordination. It is trusted by many companies around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected by a majority quorum to keep the service consistent.

The leader election and failure detection mechanisms are quite mature and normally work well. So what went wrong? After a long investigation, we uncovered four different vulnerabilities: two in ZooKeeper and two lurking in the Linux kernel. The details follow.

Background: Use of ZooKeeper at PagerDuty

At PagerDuty, we run several different services that drive our alerting pipeline. When events are received, they pass through these services as a series of tasks. Each service uses a ZooKeeper cluster to coordinate which host processes which task. As you can imagine, ZooKeeper's behavior has a very large impact on PagerDuty's reliability.

Part 1: ZooKeeper Vulnerabilities

Excessive client sessions

One day last year, an engineer noticed that part of a ZooKeeper cluster in our test environment had stopped working and that applications depending on it were hitting lock timeouts. We confirmed that the cluster was reachable and passing its monitoring checks, but something was off: each client held tens of active sessions with the relevant ZooKeeper cluster member, where normally there would be only two.

Why were the clients misbehaving? We suspected we were hitting a bug in the ZooKeeper client library. The problem went away after the entire ZooKeeper cluster was restarted, and we could not reproduce it afterwards. Even after studying the library code, we could not find anything that would cause runaway sessions. We had hit a dead end. Worst of all, we did not know whether we could trigger that event in ZooKeeper again.

Vulnerability #1

Less than a week later, it happened again in our test environment. This time it affected a different ZooKeeper cluster, and it happened while we were generating significant load. We realized that we could reproduce the problem by applying synthetic load and waiting an hour or two.

We watched the ZooKeeper nodes and found that the session count was growing linearly. We therefore suspected that, even if this was a client-side problem, it was probably being triggered by something in ZooKeeper itself. So we started to study the ZooKeeper logs in depth, and after some digging we found something valuable:
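One way to watch for this kind of growth is to poll each node for its connection count. The sketch below is not taken from the investigation itself. It assumes the `mntr` four-letter command is available and enabled on the server, and it uses the `zk_num_alive_connections` field as a rough proxy for the session count. The function names and the sample text in the usage note are hypothetical.

```python
import socket


def query_mntr(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the 'mntr' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> dict:
    """Parse tab-separated 'key<TAB>value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def alive_connections(text: str) -> int:
    """Extract the connection count from a 'mntr' reply."""
    return int(parse_mntr(text)["zk_num_alive_connections"])


def is_steadily_growing(samples: list) -> bool:
    """True if every sample is larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

In use, one would call `alive_connections(query_mntr(host))` on a timer for each cluster member and raise an alert when `is_steadily_growing` holds over a long enough window. A strictly increasing series is a deliberately crude test; a real monitor would likely tolerate noise.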

java.lang.OutOfMemoryError: Java heap space
    at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:81)
    at org.apache.zookeeper.data.Id.deserialize(Id.java:54)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.data.ACL.deserialize(ACL.java:56)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.proto.CreateRequest.deserialize(CreateRequest.java:91)
    ...

Following the stack trace, we arrived at this line: scheme = a_.readString("scheme");. The ZooKeeper protocol has a four-byte scheme_len field in front of the scheme string... Perhaps the client was producing a wrong value for that length.
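To see why a bad four-byte length can end in an OutOfMemoryError, consider a simplified model of a length-prefixed string read. This is an illustrative sketch, not ZooKeeper's actual jute code: it assumes a big-endian signed 32-bit length followed by UTF-8 bytes, and `MAX_STRING_LEN` and both function names are hypothetical.

```python
import struct

MAX_STRING_LEN = 1024 * 1024  # hypothetical sanity limit, not a ZooKeeper setting


def read_length(buf: bytes, offset: int = 0) -> int:
    """Decode the four-byte big-endian length prefix at the given offset."""
    (length,) = struct.unpack_from(">i", buf, offset)
    return length


def read_string_checked(buf: bytes, offset: int = 0, max_len: int = MAX_STRING_LEN):
    """Return (string, next_offset), rejecting implausible lengths up front.

    A reader that trusts the prefix would try to allocate `length` bytes
    before noticing anything is wrong; this one validates the prefix first.
    """
    length = read_length(buf, offset)
    start = offset + 4
    if length < 0 or length > max_len or start + length > len(buf):
        raise ValueError(f"implausible string length {length}")
    return buf[start:start + length].decode("utf-8"), start + length


good = struct.pack(">i", 6) + b"digest"
corrupt = b"\x7f\x3a\x91\x0c" + b"digest"  # same payload, length bytes damaged
```

Here `read_length(corrupt)` decodes to a length of roughly 2 GB, so a reader that allocates a buffer of that size before reading would exhaust a typical heap, while `read_string_checked(corrupt)` raises `ValueError` instead.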

Vulnerability #2

It turned out that ZooKeeper does not catch unhandled exceptions in its main threads, which means that ZooKeeper keeps running even after one of those threads has died. Unfortunately, this also means that the system's core mechanisms keep running as if nothing had happened.
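The general failure mode, a worker thread dying from an uncaught error while the rest of the process carries on, is easy to demonstrate. The following is a small Python analogy rather than ZooKeeper code; the thread name and the simulated error are made up for illustration. It uses `threading.excepthook` (Python 3.8 or later) to at least record the death of the thread.

```python
import threading

failures = []


def record_failure(args):
    # Called by the threading module when a thread dies from an uncaught exception.
    failures.append((args.thread.name, args.exc_type.__name__))


threading.excepthook = record_failure


def request_processor():
    # Stand-in for a critical worker that hits a fatal error mid-request.
    raise MemoryError("simulated allocation failure")


worker = threading.Thread(target=request_processor, name="request-processor")
worker.start()
worker.join()

# Execution only reaches this line because the main thread survived the
# worker's death: the process as a whole is still "up".
process_still_running = True
```

Without a hook like this, nothing in the process reacts to the loss of the worker. A stricter design would have the hook shut the whole process down, so that the failure becomes visible to whatever monitors the node.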

Part 2: System Kernel Vulnerabilities

TCP payload corruption

Now we understood how ZooKeeper failed, but many questions remained: how did scheme_len end up with such a value? After decoding the packets, we could see that not only scheme_len was affected; an entire 16-byte block of data appeared to be corrupted. The first corrupted packet we found showed exactly this.

We captured the damaged packets on the ZooKeeper node as they arrived. In other words, they were already corrupted when they reached ZooKeeper. This means that the packets were damaged either when the client sent them or in transit between network devices. In general, if a packet is corrupted by an intermediate network device, it fails checksum verification and the receiving system discards it. In this case, however, the corrupted TCP payload clearly reached ZooKeeper, so the checksum verification must have passed...

IPSec

Before going further, the most important thing to understand is IPSec transport mode.

PagerDuty uses IPSec in transport mode to protect communication between hosts. In this mode the IP payload is encrypted and the IP header is not, so packets can be routed through the network as usual.

With the tooling we used, everything we captured in the experiment had already been decrypted. The TCP header and payload are encrypted by IPSec, but the IP header is not, which means the IP header can be verified independently of IPSec.

Our examination of the checksums showed that the IP checksum was valid but the TCP checksum was not. That leaves only one possibility: the packet was damaged after the TCP segment was formed and before the IP header was generated.
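The two checksums cover different bytes, which is why one can be valid while the other is not: the IPv4 header checksum covers only the IP header, while the TCP checksum covers a pseudo-header, the TCP header, and the payload. The sketch below illustrates this with the standard Internet checksum algorithm. The addresses, ports, and payload are made up, and the ESP header that IPSec would add is left out for simplicity.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src: bytes, dst: bytes, payload_len: int) -> bytes:
    """Minimal 20-byte IPv4 header (protocol 6 = TCP) with checksum filled in."""
    fields = [0x45, 0, 20 + payload_len, 0, 0, 64, 6, 0, src, dst]
    fields[7] = internet_checksum(struct.pack("!BBHHHBBH4s4s", *fields))
    return struct.pack("!BBHHHBBH4s4s", *fields)


def tcp_pseudo_header(src: bytes, dst: bytes, tcp_len: int) -> bytes:
    return struct.pack("!4s4sBBH", src, dst, 0, 6, tcp_len)


def tcp_segment(src: bytes, dst: bytes, payload: bytes) -> bytes:
    """Minimal 20-byte TCP header plus payload, with checksum filled in."""
    def build(checksum: int) -> bytes:
        header = struct.pack("!HHIIHHHH", 40000, 2181, 1, 1,
                             (5 << 12) | 0x18, 65535, checksum, 0)
        return header + payload
    pseudo = tcp_pseudo_header(src, dst, 20 + len(payload))
    return build(internet_checksum(pseudo + build(0)))


def ip_header_ok(header: bytes) -> bool:
    # Summing a header that includes a correct checksum yields zero.
    return internet_checksum(header) == 0


def tcp_ok(src: bytes, dst: bytes, segment: bytes) -> bool:
    return internet_checksum(tcp_pseudo_header(src, dst, len(segment)) + segment) == 0


src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
segment = tcp_segment(src, dst, b"create /tasks/job-1")
header = ipv4_header(src, dst, len(segment))

# Damage one payload byte after both checksums were computed.
damaged = segment[:23] + bytes([segment[23] ^ 0xFF]) + segment[24:]
```

With these values, `ip_header_ok(header)` stays true after the payload is damaged, while `tcp_ok(src, dst, damaged)` becomes false, which matches the combination described above.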

Vulnerability #3 - Fuzzy Behavior

To make sense of this, we went to the most reliable and direct source. The Linux source code contains the following, taken from the master branch:

    /*
     * 2) ignore UDP/TCP checksums in case
     *    of NAT-T in Transport Mode, or
     *    perform other post-processing fixes
     *    as per draft-ietf-ipsec-udp-encaps-06,
     *    section 3.1.2
     */
    if (x->props.mode == XFRM_MODE_TRANSPORT)
        skb->ip_summed = CHECKSUM_UNNECESSARY;
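The effect of that assignment can be pictured with a toy model of the receive path. This is a deliberately simplified illustration, not a rendering of the kernel's logic: the `Packet` class and both functions are hypothetical, and only the two constant names are borrowed from the kernel.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IpSummed(Enum):
    CHECKSUM_NONE = auto()         # the checksum still has to be verified in software
    CHECKSUM_UNNECESSARY = auto()  # an earlier layer vouches for the packet


@dataclass
class Packet:
    payload: bytes
    tcp_checksum_valid: bool  # what a software verification would conclude
    ip_summed: IpSummed = IpSummed.CHECKSUM_NONE


def after_ipsec_input(packet: Packet, transport_mode: bool) -> Packet:
    """Model of the quoted fragment: transport mode marks the check as unnecessary."""
    if transport_mode:
        packet.ip_summed = IpSummed.CHECKSUM_UNNECESSARY
    return packet


def tcp_accepts(packet: Packet) -> bool:
    """Model of the TCP layer: skip verification when told it is unnecessary."""
    if packet.ip_summed is IpSummed.CHECKSUM_UNNECESSARY:
        return True
    return packet.tcp_checksum_valid
```

In this model a packet with `tcp_checksum_valid=False` is rejected on a plain path but accepted once it has passed through `after_ipsec_input(..., transport_mode=True)`, which is how a corrupted payload could be delivered to the application.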

Test

After more research, we found that reproducing the problem was still very difficult, even though we occasionally succeeded. We needed a simple way to detect and analyze it, rather than relying on something as complicated as ZooKeeper. The corrupted TCP payloads seemed to follow a recognizable pattern.
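A simple detector of this kind can be sketched as follows. The source does not describe the actual tool, so everything here is an assumption: the sender and receiver both regenerate the same deterministic data, the bytes travel over an ordinary TCP connection between two hosts (not shown), and the receiver reports which 16-byte blocks differ. The 16-byte block size follows the damaged region described earlier.

```python
import hashlib

BLOCK = 16  # size of the damaged region observed in the captured packets


def make_pattern(n_blocks: int, seed: bytes = b"probe") -> bytes:
    """Deterministic pseudo-random data that both ends can regenerate."""
    return b"".join(
        hashlib.sha256(seed + i.to_bytes(8, "big")).digest()[:BLOCK]
        for i in range(n_blocks)
    )


def damaged_blocks(expected: bytes, received: bytes, block: int = BLOCK) -> list:
    """Return the indexes of the blocks in which the two buffers differ."""
    if len(expected) != len(received):
        raise ValueError("buffers differ in length")
    return [
        i // block
        for i in range(0, len(expected), block)
        if expected[i:i + block] != received[i:i + block]
    ]


sent = make_pattern(8)
received = bytearray(sent)
# Simulate one corrupted block by inverting every byte of block 2.
received[32:48] = bytes(b ^ 0xFF for b in sent[32:48])
```

For this simulated transfer, `damaged_blocks(sent, bytes(received))` reports block 2 only. Run in a loop under load, a check like this turns a rare, application-specific failure into a count that can be compared across hosts and kernel versions.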

Impact

The first thing we noticed was the kernel version. No matter how hard we tried, we could not reproduce the problem on Linux 2.6; we could only reproduce it on Linux 3.0 or later. Even so, being able to trigger it on Linux 3.0 or later did not fully match the pattern of the failures we had seen earlier.

Vulnerability #4 - aesni-intel

The Intel x86 instruction set includes AES instructions (AES-NI) that perform AES computations in hardware. The aesni-intel kernel module uses these instructions to perform AES encryption. Since IPSec encryption uses AES, this module is used to encrypt the traffic sent from Intel hardware.

Part 3: Solution

What we do

Our experiments showed that unloading the module prevents the corruption described above. We evaluated the impact of the vulnerability, and the results showed that it appears under high throughput, though we have a way to mitigate it. We also know that Xen HVM and Linux 2.6 are not affected. With this understanding, we could start drawing up a plan of attack.
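To audit which hosts still have the module loaded, one could inspect /proc/modules. The sketch below is an assumption about how such a check might look, not the procedure used in the investigation. It relies on the module appearing under the name `aesni_intel` (with an underscore) in /proc/modules, and it cannot detect a driver that was compiled directly into the kernel.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names listed in the text of /proc/modules."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def aesni_loaded(proc_modules_text: str) -> bool:
    """True if the aesni_intel module appears in the module list."""
    return "aesni_intel" in loaded_modules(proc_modules_text)


def read_proc_modules(path: str = "/proc/modules") -> str:
    """Read the live module list (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux host, `aesni_loaded(read_proc_modules())` gives the current state. Keeping the parsing separate from the file read makes the check easy to run against text collected from many machines.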
