Kernel privilege escalation: how Kubernetes container isolation impacts privilege escalation attacks
Kamil Potrec
December 3, 2020
During the day, I spend my time analyzing Terraform code, Kubernetes object configuration files, and identifying common security issues. When the sun sets, I put on my hoodie, fire up Linux VMs and debuggers to look under the hood of technologies that make up the cloud native ecosystem.
In this post, we will explore how Kubernetes container isolation impacts privilege escalation attacks. We will use common kernel exploitation techniques to figure out how container abstraction layers can hinder our path to that precious root shell.
What is privilege escalation?
Privilege escalation is a term used to describe the process of obtaining more permissions to a resource. Kernel privilege escalation is a process of obtaining these permissions by exploiting a weakness in one of many kernel entry points, also referred to as attack vectors. An attack vector is simply a path which provides access to the vulnerable code.
We interact with the kernel in many ways: by reading from the file system, opening a device file, issuing system calls, or sending a packet over a network interface. All of these actions require some processing in kernel space. When the kernel performs an action on behalf of a user process, we say that the kernel operates in process context. Each process is represented in the kernel by a struct task_struct. These structures are linked together in a circular doubly linked list, and on the x86-64 architecture the task_struct of the currently running process is reached through a per-CPU variable when execution switches from user space to kernel space.
A task_struct contains a struct cred member, which holds the user identifiers and capabilities associated with the process. This information is used by the kernel to determine if an action can be performed by the process, for example, if it is allowed to execute a specific syscall. The generic goal of the kernel privilege escalation process is to replace or update the credentials structure to gain more permissions.
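The kernel exposes a view of these credentials to user space through /proc/<pid>/status. As a rough illustration, the sketch below parses the Uid and Cap* lines from that file. The sample text and the function name are made up for this example; the values mirror an unprivileged process with UID 1000 and no effective capabilities.

```python
def parse_status(text):
    """Extract the UID quadruple and capability masks from /proc/<pid>/status text."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Uid":
            # The kernel prints real, effective, saved and filesystem UID, in that order.
            real, effective, saved, fs = (int(v) for v in value.split())
            info["uid"] = {"real": real, "effective": effective, "saved": saved, "fs": fs}
        elif key in ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"):
            # Capability sets are printed as hexadecimal bitmasks.
            info[key] = int(value.strip(), 16)
    return info


# Hypothetical sample, shaped like the status file of an unprivileged process.
SAMPLE_STATUS = """Name:\texploit
Uid:\t1000\t1000\t1000\t1000
CapInh:\t0000000000000000
CapPrm:\t0000000000000000
CapEff:\t0000000000000000
CapBnd:\t0000003fffffffff
"""

if __name__ == "__main__":
    print(parse_status(SAMPLE_STATUS))
```

On a Linux host, the same function can be fed the contents of /proc/self/status to inspect the current process.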
How does privilege escalation work?
The most common technique to obtain elevated permissions in kernel space is to chain two kernel functions: [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437)([prepare_kernel_cred(0)](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682)). This is only possible once an exploit has gained control of the instruction pointer (RIP) and has defeated the memory access and randomization controls. [prepare_kernel_cred](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682) can generate a credentials object based on an existing one or, more generously, a default one with full root permissions. [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437) simply updates the task_struct of the current process with the new credentials object.
Kernel exploitation is a very large field, and so for this blog post, we will just explore an oversimplified version of kernel privilege escalation. There are numerous security controls in the kernel which are designed to make exploitation harder. SMEP, SMAP, KASLR, and KPTI are all mechanisms which are implemented in hardware or in the kernel and are turned on or off by the distribution you are using, or by a system administrator. There is no direct way to control these settings from Kubernetes and these are, therefore, out of scope for this post.
We will be using an old issue in the af_packet implementation that received CVE-2017-7308. The vulnerability is exploitable with the CAP_NET_RAW capability, as it requires access to raw sockets. Details of the vulnerability are exhaustively explained here, so we won’t go into that. We can obtain all the capabilities we need in an unprivileged user namespace. On the Ubuntu distribution, access to user namespaces is not restricted by default.
Let’s dig in
First, let's look at the end-to-end process in a non-containerized environment.
We need to connect the GNU Debugger (gdb) to the virtual machine's gdb stub. Once the debugger is attached, we can set a breakpoint at a convenient location. In this case, we are using the mlock system call, which we can manually trigger from the exploit whenever we want to look at the internal state of the running process. Note that gdb will only break if the executing process is named “exploit”. This minimizes the risk of the breakpoint being triggered by some other process on the system. The setup tasks are conveniently scripted in a .gdb command file. We run gdb with the -x flag to perform the setup in a consistent and repeatable way.
Breakpoints are triggered at three points: before the unprivileged user namespace is created, just before we trigger the vulnerability, and after we have obtained root credentials. We implement this behavior by simply issuing the breakpoint syscall from the exploit at each of these points.
Now we can execute the exploit:
Our first breakpoint is triggered as expected. We can examine the cred structure with the gdb helper function $lx_current. As expected, the effective UID of the current process is 1000, and it has no effective capabilities in the current namespace.
The second breakpoint is triggered after the call to unshare, once the new user namespace has been created for the process. Observe how the UID remains unchanged, but the cap_effective and user_ns attributes have changed. Capabilities are stored as a bitmask, which is more readable in hex format.
Our last breakpoint is triggered after the vulnerability is exploited. Observe that the UID is now set to 0, and the user namespace is reset to init_user_ns, which represents the host's init user namespace.
Our shell returns and we now have full root permissions on the host.
Kernel exploit in a container
Next, we will try to execute the very same exploit inside a pod. We have created a very simple pod object definition and deployed it into the cluster.
Let's see what happens in the default configuration.
Root by default
The image that we used in the demo does not specify an unprivileged user, and by default Kubernetes does not enforce a UID. So we already have root access inside the container without needing to exploit the kernel. We re-run the very same exploit and break in the kernel just before it executes the vulnerable path. If you look at the effective capabilities of the process, it's clear that some are missing. The value is set to 2818844155 (0xa80425fb in hex), which represents the default capability set granted by the Docker runtime.
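To see which capabilities that number stands for, the bitmask can be decoded bit by bit. The sketch below is a minimal helper, not part of the original exploit; the name table follows the bit order in linux/capability.h up to CAP_AUDIT_READ, and any higher bit is reported by position only.

```python
# Capability names indexed by bit position, following linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in the mask."""
    names = []
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            # Bits beyond the table are reported by position instead of by name.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit_{bit}")
    return names


if __name__ == "__main__":
    for name in decode_caps(2818844155):
        print(name)
```

The decoded set contains CAP_NET_RAW, which the exploit needs, but not CAP_SYS_ADMIN or CAP_SYS_MODULE.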
After the exploit completes, the effective set once again includes all of the capabilities.
Next, we enforce a non-root user ID on the container by setting the runAs security context attributes.
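The pod definition used in the post is not reproduced here, so the following is only a sketch of what such a manifest could look like, built as a Python dictionary and printed as JSON (which kubectl accepts alongside YAML). The pod name, image reference, and the UID and GID of 1000 are placeholders; runAsUser, runAsGroup, and runAsNonRoot are standard pod security context fields.

```python
import json


def non_root_pod(name, image, uid=1000, gid=1000):
    """Build a pod manifest whose containers run as a fixed non-root UID and GID."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                "runAsUser": uid,      # UID for all container processes
                "runAsGroup": gid,     # primary GID for all container processes
                "runAsNonRoot": True,  # refuse to start a container that would run as UID 0
            },
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, not the ones used in the post.
    print(json.dumps(non_root_pod("exploit-demo", "example.invalid/exploit:latest"), indent=2))
```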
This time we don’t have root permissions out of the box. The exploit, however, performs identically, with one major difference in the final result: it appears that we have all the permissions, but we don’t see everything on the system.
Namespace cage
We managed to get all the capabilities and the root UID, but we only bypassed the capabilities barrier of the container. We still don’t have access to the host's filesystem, and we cannot see all the processes or even communicate over the host’s network interfaces.
At this point, we can load any kernel modules we like, but that is noisy and will trigger most basic intrusion detection systems (one would hope). To test this, we will remove an unused module. Note that your Docker image needs to have the module tools installed; in the case of Debian images, you will need to install the kmod package.
Instead, we can extend our kernel exploit and set the [struct nsproxy](https://github.com/torvalds/linux/blob/master/include/linux/nsproxy.h#L31) object in the current context to point to the namespaces we like. Namespaces are identified by inodes, but the kernel exports the address of [init_nsproxy](https://github.com/torvalds/linux/blob/master/kernel/nsproxy.c#L32), which we can use to copy the host’s init namespaces to our container.
The sys_setns syscall can be used to update the namespaces of the process context. There are three primary namespaces we want to escalate into: PID, network, and mount. First, we need to obtain a reference to the root namespaces, which we can do by moving container PID 1 into the host’s namespaces. Then we can get references to any of the namespaces from the /proc/ file system entries of PID 1. Finally, we move the current process into the required namespaces.
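The user-space half of that sequence can be sketched as follows. This is an illustration, not the exploit's actual code: it assumes the kernel payload has already moved PID 1 into the host's namespaces and that the calling process holds the required capabilities. The helper names are invented for this example, and setns is called through ctypes so the sketch does not depend on a recent Python version.

```python
import ctypes
import os

# Namespace types targeted in the post: PID, network and mount.
TARGET_NAMESPACES = ("pid", "net", "mnt")


def ns_paths(pid, namespaces=TARGET_NAMESPACES):
    """Return the /proc paths that act as handles to a process's namespaces."""
    return [f"/proc/{pid}/ns/{ns}" for ns in namespaces]


def enter_namespaces(pid=1):
    """Join the PID, network and mount namespaces of the given process."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Open every handle first: after the mount namespace changes,
    # /proc paths may no longer resolve to the same files.
    fds = [os.open(path, os.O_RDONLY) for path in ns_paths(pid)]
    try:
        for fd in fds:
            # nstype 0 accepts whatever namespace type the descriptor refers to.
            if libc.setns(fd, 0) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
    finally:
        for fd in fds:
            os.close(fd)


if __name__ == "__main__":
    print(ns_paths(1))
```

Note that joining a PID namespace with setns only takes effect for children created afterwards, so a new shell has to be spawned after the call to see the host's process list.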
After our exploit is executed, we can access all of the interesting system resources.
Usability of capabilities
The default capabilities assigned to Kubernetes containers (with the Docker runtime) include CAP_NET_RAW. Does this mean we would be able to exploit the vulnerability even if unprivileged user namespaces are disabled? We added code to the exploit to set the effective capabilities required to reach the vulnerable code.
As you can see, the exploit fails. But why?
This has to do with inheritable capabilities and how they are implemented. Even though the container runtime has granted these capabilities to the processes in the container, they have to be explicitly set as effective via sys_capset. At the moment, only processes with UID 0 can set effective capabilities. So, if you want to run as a non-root user but still have access to some of the capabilities, you need to include a suid binary in your container to set the effective capabilities. Alternatively, you can simply set the required capabilities on the executable and drop the container capabilities. Note that file capabilities are only available on file systems that support extended attributes.
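One way to reason about this is the capability transformation that capabilities(7) describes for execve. The sketch below is a simplified model of those rules, not kernel code. It assumes an empty ambient set, treats a single uid argument as both the real and effective UID, and assumes the runtime placed the default set in both the inheritable and bounding sets. The function and parameter names are made up for this example.

```python
ALL_CAPS = (1 << 64) - 1
CAP_NET_RAW = 1 << 13
DOCKER_DEFAULT = 0xA80425FB  # default capability mask mentioned earlier in the post


def caps_after_execve(inheritable, bounding, file_inheritable=0,
                      file_permitted=0, file_effective=False, uid=1000):
    """Simplified execve capability rules (ambient set assumed empty)."""
    if uid == 0:
        # For root, the file sets are treated as all ones and the effective bit as set.
        file_inheritable = file_permitted = ALL_CAPS
        file_effective = True
    permitted = (inheritable & file_inheritable) | (file_permitted & bounding)
    effective = permitted if file_effective else 0
    return {"permitted": permitted, "effective": effective}


if __name__ == "__main__":
    # Non-root process running a binary without file capabilities.
    print(caps_after_execve(DOCKER_DEFAULT, DOCKER_DEFAULT, uid=1000))
    # The same binary started as root.
    print(caps_after_execve(DOCKER_DEFAULT, DOCKER_DEFAULT, uid=0))
    # Non-root process running a binary that carries CAP_NET_RAW as a file capability.
    print(caps_after_execve(DOCKER_DEFAULT, DOCKER_DEFAULT,
                            file_permitted=CAP_NET_RAW, file_effective=True))
```

Under this model, the non-root process ends up with empty permitted and effective sets unless the executable itself carries file capabilities, which matches the two workarounds described above.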
Seccomp to the rescue
Let’s now talk about attack vector reachability. Our exploit works because unprivileged users can obtain the CAP_NET_RAW capability in unprivileged user namespaces. We saw how this impacts our exploit in the discussion about capabilities above. There is one more countermeasure we can use to stop this attack, and yes, you can enable it via Kubernetes.
Seccomp is a mechanism which can be utilized to reduce the kernel's attack surface by filtering system calls. Unfortunately, by default, Kubernetes will not apply a seccomp profile to your container. This means that all system calls are allowed, subject to the already discussed permission checks. We can change that by adding an annotation to the object declaration (pre-v1.19), or by adding the seccomp profile attribute to the pod security context.
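Both options can be sketched as small edits to a pod manifest held in a Python dictionary. The helper name and the sample pod are made up for this example. The annotation key and the seccompProfile field are the ones Kubernetes documents for the pre-v1.19 and v1.19+ styles, respectively.

```python
import json


def with_runtime_default_seccomp(pod, legacy_annotation=False):
    """Return a copy of a pod manifest that requests the runtime's default seccomp profile."""
    pod = json.loads(json.dumps(pod))  # cheap deep copy of a JSON-compatible dict
    if legacy_annotation:
        # Pre-v1.19 style: an alpha annotation on the pod metadata.
        annotations = pod.setdefault("metadata", {}).setdefault("annotations", {})
        annotations["seccomp.security.alpha.kubernetes.io/pod"] = "runtime/default"
    else:
        # v1.19+ style: a field in the pod security context.
        ctx = pod.setdefault("spec", {}).setdefault("securityContext", {})
        ctx["seccompProfile"] = {"type": "RuntimeDefault"}
    return pod


if __name__ == "__main__":
    # Placeholder pod, not the one used in the post.
    base = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "exploit-demo"},
        "spec": {"containers": [{"name": "exploit-demo", "image": "example.invalid/exploit:latest"}]},
    }
    print(json.dumps(with_runtime_default_seccomp(base), indent=2))
```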
Let's have a look at how the default profile provided by the container runtime (in this case Docker) affects our exploit. We are greeted with an “Operation not permitted” error, because the default seccomp profile does not allow the unshare syscall.
Seccomp is great at limiting unnecessary kernel entry points. System calls such as unshare or userfaultfd can be safely disabled for most use cases, and blocking them is great at stopping some exploitation techniques. But there are some calls that would be tricky to block, such as waitid. You can find these and more techniques to exploit containers here.
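To illustrate why the call is rejected, here is a toy evaluator for a profile written in the style of Docker's JSON seccomp format. The profile below is a hand-written excerpt for illustration only, not Docker's real default profile, and the evaluator ignores argument filters. It models the idea that some syscalls are only allowed when the container has been granted a specific capability.

```python
# Hypothetical excerpt in the style of a Docker seccomp profile.
TOY_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "mlock"], "action": "SCMP_ACT_ALLOW"},
        {
            "names": ["unshare", "setns", "mount"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        },
    ],
}


def seccomp_action(profile, syscall, container_caps=()):
    """Return the action a profile would take for a syscall (argument filters ignored)."""
    granted = set(container_caps)
    for rule in profile["syscalls"]:
        required = set(rule.get("includes", {}).get("caps", []))
        # A capability-gated rule only applies if the container holds those capabilities.
        if not required <= granted:
            continue
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    print(seccomp_action(TOY_PROFILE, "unshare", ["CAP_NET_RAW"]))
    print(seccomp_action(TOY_PROFILE, "unshare", ["CAP_SYS_ADMIN"]))
    print(seccomp_action(TOY_PROFILE, "mlock", ["CAP_NET_RAW"]))
```

With only the default capabilities, the unshare rule never applies and the call falls through to the default action, which returns an error to the caller.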
Conclusions
We managed to prevent this exploit with a default seccomp profile. As you can see, even though our operating system is vulnerable, the exploit path is unreachable from our container (on this occasion). This technique could give you enough breathing room to plan the very much needed update to the operating system! You should consider these measures as defense-in-depth controls and mitigation strategies.
Always patch your systems! Snyk Infrastructure as Code can help you catch these mitigation options early on in your CI/CD pipeline, way before anything is deployed in production. We use adversarial techniques to identify high-impact security options in Kubernetes and cloud service providers. Use Snyk for free by registering for a free account.