Systems programming · lesson 03

System calls

Your program runs in user space and cannot touch hardware directly. Whenever it needs the operating system to act on its behalf (reading a file, writing to the terminal, requesting more memory), it crosses into kernel space through a controlled gate called a system call.

Two privilege levels

x86-64 CPUs define four privilege levels (rings 0 through 3), but Linux and most other operating systems use only two: ring 0 (kernel) and ring 3 (user). The kernel runs in ring 0 with full access to hardware: it can program the network card, read any physical memory, and change page tables. Your program runs in ring 3 with restricted access: it can touch only its own virtual address space and cannot execute privileged instructions.

This separation is what makes the OS possible: the kernel enforces policy (which process can access which file, how much memory each gets), and user processes cannot bypass it. A bug in your program corrupts your memory, not the kernel's.
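
One way to see the ring 3 restriction in action is to try a privileged instruction and watch the CPU refuse. The sketch below assumes Linux on x86-64; the function name is made up for this lesson. It forks a child so the fault kills the child rather than your program, runs hlt (which is only legal in ring 0), and reports which signal ended the child.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: run a privileged instruction in a child process.
// Returns the signal that killed the child, 0 if it exited normally,
// or -1 on setup failure or on a platform this sketch does not cover.
int signal_from_privileged_instruction(void) {
#if defined(__linux__) && defined(__x86_64__)
    pid_t pid = fork();
    if (pid == -1) return -1;

    if (pid == 0) {
        __asm__ volatile("hlt");  // privileged: faults when executed in ring 3
        _exit(0);                 // only reached if the CPU did not fault
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
#else
    return -1;  // assumption: only Linux on x86-64 is handled here
#endif
}
```

On Linux x86-64 the CPU raises a general protection fault, which the kernel delivers to the child as SIGSEGV, so the function should return SIGSEGV. The kernel and every other process are unaffected.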

What a syscall is

A syscall is a deliberate, controlled transfer from ring 3 to ring 0. The CPU provides a dedicated instruction for this (syscall on x86-64). When your program executes it, the CPU records the return address, switches to ring 0, and jumps to a fixed entry point that the kernel registered at boot. The kernel's entry code saves your registers, performs the requested operation, restores your registers, and returns to ring 3. The whole round trip costs roughly 100–300 nanoseconds: cheap, but measurable.

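Every syscall has a number. On Linux x86-64: read is 0, write is 1, open is 2, exit is 60. The number goes in rax, and up to six arguments go in rdi, rsi, rdx, r10, r8, and r9. The result comes back in rax; on failure, rax holds a negated error code (for example, -2 for ENOENT).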
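
To make the register convention concrete, here is a minimal sketch that issues a syscall without libc's write wrapper. The helper names raw_syscall3 and raw_write are invented for this lesson. The inline assembly path assumes Linux on x86-64 with GCC or Clang; on other platforms the sketch falls back to libc's generic syscall() function and converts its result to the same convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>  // SYS_write and friends: the syscall numbers
#include <unistd.h>

// Hypothetical helper: issue a three-argument syscall directly.
// Returns the kernel's raw result: >= 0 on success, -errno on failure.
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                           // result comes back in rax
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)  // rax = number; rdi, rsi, rdx = args
        : "rcx", "r11", "memory");            // the instruction overwrites rcx and r11
    return ret;
#else
    // Assumption: elsewhere, use libc's generic wrapper and undo its
    // -1/errno translation so the return convention stays the same.
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -errno : ret;
#endif
}

// Write len bytes from buf to fd without calling libc's write().
long raw_write(int fd, const void *buf, unsigned long len) {
    return raw_syscall3(SYS_write, fd, (long)buf, (long)len);
}
```

Calling raw_write(1, "hi\n", 3) should print hi and return 3. Passing an invalid descriptor such as -1 returns -EBADF rather than -1, because no libc wrapper is there to translate the result into errno.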

💡
printf is not a syscall; write is. printf is a C library function that formats its arguments into a buffer and, when that buffer is flushed, calls the write syscall to send the bytes to the file descriptor. The C standard library wraps raw syscalls in friendly, portable functions, but every byte of I/O ultimately goes through a syscall.

The core I/O syscalls

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // open: returns a file descriptor (int), or -1 on error
    int fd = open("file.txt", O_RDONLY);
    if (fd == -1) return 1;

    // read: fills buf with up to count bytes, returns bytes read
    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1) { close(fd); return 1; }

    // write: sends count bytes from buf to fd, returns bytes written
    write(1, "hello\n", 6);  // fd 1 = stdout

    // close: releases the file descriptor
    close(fd);
    return 0;
}
```
⚠️
Always check the return value. Nearly every syscall can fail. open returns -1 if the file doesn't exist or permissions are wrong. read returns -1 on error and 0 at end of file, and may return fewer bytes than requested. write can write fewer bytes than requested. Ignoring return values leads to silent data corruption and processes running with wrong state.
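
Because write can accept fewer bytes than you asked for, a common pattern is to loop until everything is sent. Here is a minimal sketch; write_all is a name chosen for this lesson, not a standard function, and retrying on EINTR is one reasonable policy rather than the only one.

```c
#include <errno.h>
#include <unistd.h>

// Hypothetical helper: keep calling write() until all len bytes are sent.
// Returns 0 on success, -1 on error (errno is left as write() set it).
int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return -1;
        }
        p += n;              // short write: skip past what was accepted
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would use it as `if (write_all(fd, data, size) == -1) perror("write");`, so a short write never silently drops the tail of the buffer.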

Errno: how syscall errors work

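When a syscall fails, its C library wrapper returns -1 and sets errno to an error code. errno looks like a global variable but is actually per-thread, so a failure in one thread does not overwrite another thread's value. Read it immediately after the failing call, because successful calls do not reset it. perror() prints a human-readable message for the current errno value to standard error.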

```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(void) {
    int fd = open("/etc/shadow", O_RDONLY);
    if (fd == -1) {
        perror("open");  // prints: "open: Permission denied"
        return 1;
    }
    close(fd);
    return 0;
}
```
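
When you need to react to a specific error rather than just print it, compare errno against the named constants. The sketch below uses a hypothetical helper, try_open, that copies errno before doing anything else, since later library calls (including fprintf) are allowed to change it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: try to open path read-only.
// Returns 0 on success, otherwise the errno value that open() reported.
int try_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        int saved = errno;  // copy first: later calls may overwrite errno
        fprintf(stderr, "open %s: %s\n", path, strerror(saved));
        return saved;
    }
    close(fd);
    return 0;
}
```

A caller can then branch on the result, for example treating ENOENT (no such file) as "create it" and EACCES (permission denied) as a fatal error.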

strace: seeing every syscall your program makes

strace traces all syscalls made by a process, printing each one with its arguments and return value. It's one of the most useful debugging tools for systems work — when a program silently fails, strace tells you exactly which syscall failed and why.

```shell
$ strace ./hello
execve("./hello", ["./hello"], 0x... /* env */) = 0
brk(NULL)                               = 0x...
...
write(1, "Hello, world!\n", 14)         = 14
exit_group(0)                           = ?
```

Even a trivial "Hello, world!" program makes dozens of syscalls before reaching main — the dynamic linker loading libc, the runtime setting up memory maps, reading configuration files. Only a handful are your code.

The cost of syscalls

Each syscall takes ~100–300 ns on modern hardware: saving registers, switching privilege level, executing kernel code, and reversing all of that. On kernels that keep kernel page tables isolated from user space (a Meltdown mitigation), each crossing also switches page tables, which adds TLB pressure. For most programs this doesn't matter. For high-throughput servers handling millions of requests per second, it does.
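
You can measure the cost on your own machine. This is a rough micro-benchmark sketch, not a rigorous one: the function name is invented, it assumes Linux (or another system that provides SYS_getppid and CLOCK_MONOTONIC), and the result includes loop and wrapper overhead. It times a syscall that does almost no work in the kernel, so most of what you see is the crossing itself.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Hypothetical micro-benchmark: average nanoseconds per getppid syscall.
// Returns a negative value if the arguments or the clock are unusable.
double ns_per_syscall(long iterations) {
    struct timespec start, end;
    if (iterations <= 0) return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &start) == -1) return -1.0;

    for (long i = 0; i < iterations; i++) {
        syscall(SYS_getppid);  // trivial kernel work: mostly crossing cost
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) == -1) return -1.0;
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Try it with a million iterations and print the result. Expect the number to vary with the CPU, the kernel version, and whether the program runs inside a virtual machine or container sandbox.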

This is why printf buffers output instead of calling write for each character. It's why databases use mmap or large read buffers instead of reading one record at a time. Minimizing syscalls is a real optimization technique.
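
The batching idea can be sketched in a few lines. The outbuf type and its functions below are hypothetical, loosely modeled on what stdio does internally, with a 256-byte buffer chosen arbitrarily. The sketch writes to /dev/null and counts how many write calls are issued.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical fixed-size output buffer.
struct outbuf {
    int fd;          // destination file descriptor
    size_t len;      // bytes currently buffered
    long syscalls;   // number of write() calls issued so far
    char data[256];
};

// Send the buffered bytes to fd. Returns 0 on success, -1 on error.
int outbuf_flush(struct outbuf *b) {
    size_t off = 0;
    while (off < b->len) {
        ssize_t n = write(b->fd, b->data + off, b->len - off);
        b->syscalls++;
        if (n == -1) return -1;
        off += (size_t)n;
    }
    b->len = 0;
    return 0;
}

// Append one byte, flushing first if the buffer is already full.
int outbuf_putc(struct outbuf *b, char c) {
    if (b->len == sizeof b->data && outbuf_flush(b) == -1) return -1;
    b->data[b->len++] = c;
    return 0;
}

// Push n bytes through the buffer into /dev/null.
// Returns the number of write() calls made, or -1 on error.
long count_writes_for(size_t n) {
    struct outbuf b = { .fd = open("/dev/null", O_WRONLY) };
    if (b.fd == -1) return -1;
    for (size_t i = 0; i < n; i++) {
        if (outbuf_putc(&b, 'x') == -1) {
            close(b.fd);
            return -1;
        }
    }
    int rc = outbuf_flush(&b);
    close(b.fd);
    return rc == -1 ? -1 : b.syscalls;
}
```

Sending 1000 bytes this way takes 4 write calls (three full buffers plus one final partial flush) instead of the 1000 that a write per byte would need.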

one-line takeaway

A syscall is a controlled trap into the kernel — every file I/O, memory mapping, and process operation goes through one, and each one costs a privilege-level switch.