What are fibers?
The answer is simple: fibers are threads, that is, sequential processes that we can spawn and synchronize with others. Usually, though, when we say “thread” we mean a thread implemented by the operating system, while fibers (AKA lightweight threads or user-mode threads) are implemented in user mode. Actually, as we’ll see, both the abstraction and the implementation are the same; the only difference is the use case: OS threads can be used from any language but require a lot of RAM and are slow to spawn and synchronize, while fibers are specific to a certain language or runtime, but are very lightweight and synchronize with virtually no overhead.
In this post I’ll only cover plain JVM threads, which have been mapped 1:1 to the operating system threads since basically forever, and Quasar fibers, which are the JVM-specific implementation of lightweight threads.
How is a fiber used?
That’s easy too: a Quasar fiber is used exactly like a Java thread, and by that I mean it implements the thread API. Quasar abstracts both thread implementations, Java’s Thread and Quasar’s Fiber, into Strand. This allows fibers and threads to interoperate seamlessly. For example, your Java application will start in the main thread, but if you want to use a fiber instead you can just spawn it from there and then join it; of course you can also do the opposite, that is, spawn a thread from a fiber and then join it. Or you could write methods that work generically with strands and then run them as either fibers or threads: in fact, Quasar also includes a fiber-compatible, strand-based port of java.util.concurrent as well as Go-like channels, Erlang-like actors and dataflow.
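The spawn-and-join pattern looks like this with a plain java.lang.Thread; a Quasar fiber follows exactly the same shape (roughly `new Fiber<Void>(() -> ...).start().join()`), which is the point of the shared Strand abstraction. This sketch is dependency-free, so it uses only the JDK:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpawnJoinDemo {
    // Spawn a "strand", let it do some work, then join it.
    // With Quasar the same pattern would use Fiber instead of Thread;
    // code written against Strand works with either.
    public static int spawnAndJoin() {
        AtomicInteger result = new AtomicInteger();
        Thread worker = new Thread(() -> result.set(42)); // the spawned sequential process
        worker.start();  // hand it to the scheduler
        try {
            worker.join(); // synchronize with its termination
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return result.get();
    }
}
```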
What is a thread?
A thread is a continuation scheduled to run on a CPU core at the appropriate time by a scheduler. A continuation is nothing more than a program counter, marking our point in the sequence of instructions, and a stack, storing the value of our variables. Quasar fibers are implemented just like OS threads, only in JVM bytecode rather than in the OS kernel.
The OS sees and uses the hardware runtime: CPUs, application data registers, code registers (i.e. the program counter), memory registers (e.g. the stack pointer), virtual memory and CPU modes. In particular, CPUs can switch between a limited user mode and a more powerful kernel mode when trap events happen or syscalls are executed. OS threads share most kernel resources (e.g. I/O descriptors) and inhabit the same address space: this makes them lighter than several single-thread OS processes and at the same time allows them to share data. They don’t share CPU registers though and have separate stacks, which allows them to execute distinct control flows concurrently.
There can be more threads than there are available processors, so the OS needs to swap them in and out of CPUs by means of scheduling. A scheduling event can be triggered either by a preemption event trap (when the thread has consumed its CPU lease) or when the code explicitly invokes a syscall trap (to perform a kernel routine, like I/O). The CPU switches to kernel mode and transfers execution to a special memory area containing the trap handler: the kernel now has the power to snapshot the CPU registers, including the program counter and the thread stack pointer, and package them into a Thread Control Block continuation.
After that the scheduler gets invoked to select the next thread to run, possibly in a different process; the kernel then restores all the relevant CPU registers and sets the processor up to switch back again into user mode and to resume execution at the restored program counter’s address.
OS threads give us everything we want, but at a heavy performance penalty: switching between threads involves jumping back and forth between user and kernel mode, possibly even across address-space boundaries. These are expensive operations partly because they cause TLB flushes, cache misses and CPU pipelining havoc: that’s also why traps and syscalls can be orders of magnitude slower than regular procedure calls.
In addition, the kernel schedules threads (i.e. assigns their continuations to CPU cores) using a general-purpose scheduling algorithm, which must account for all kinds of threads, from those serving a single transaction to those playing an entire video. Fibers, because they are scheduled at the application layer, can use a scheduler that is more appropriate for their use case. As most fibers are used to serve transactions, they are usually active for very short periods of time and block very often: they are typically awakened by I/O or by another fiber, run a short processing cycle, and then transfer control to another fiber (using a queue or another synchronization mechanism). Such behavior is best served by a scheduler employing an algorithm called “work-stealing”; this kind of scheduler is indeed employed by Erlang, Go and (by default) Quasar. When fibers behave this way, work-stealing ensures minimal cache misses when switching between fibers.
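To make work-stealing concrete: the JDK’s own work-stealing executor is ForkJoinPool, which is also the default scheduler Quasar uses for fibers. Each worker thread keeps its own deque of tasks; an idle worker steals from busy workers’ deques, so freshly spawned work tends to stay cache-local to the thread that produced it. A minimal sketch:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums the range [from, to) by recursively splitting it into subtasks.
// fork() pushes the left half onto this worker's deque, where an idle
// worker can steal it; the right half is computed locally.
public class WorkStealingSum extends RecursiveTask<Long> {
    private final long from, to;

    public WorkStealingSum(long from, long to) { this.from = from; this.to = to; }

    @Override protected Long compute() {
        if (to - from <= 1_000) {            // small enough: compute sequentially
            long s = 0;
            for (long i = from; i < to; i++) s += i;
            return s;
        }
        long mid = (from + to) / 2;
        WorkStealingSum left = new WorkStealingSum(from, mid);
        left.fork();                          // stealable subtask
        long right = new WorkStealingSum(mid, to).compute();
        return right + left.join();
    }

    public static long sum(long n) {
        return ForkJoinPool.commonPool().invoke(new WorkStealingSum(0, n));
    }
}
```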
The kernel can be our high-concurrency bottleneck. Suppose that thread switches were as fast as normal procedure calls, and we could avoid maintaining kernel data structures for threads: then we could gain a lot in terms of both memory footprint and switch efficiency. We want in-process, user-mode threads or… fibers.
Fibers for the JVM
Some language runtimes, such as Erlang’s, offer fibers natively but the JVM is not (yet) one of those. So how can we implement fibers on the JVM ourselves?
The Quasar approach is exactly the one the kernel uses to implement OS threads, only “reviewed” for the JVM runtime, which is a stack-based rather than a typical register-based machine: we “freeze” the fiber’s whole stack, up to its root run method, into a continuation task that we can re-schedule later on a regular Java executor. This process is called fiber suspension, and I’ll now describe how it works in a bit more detail.
The JVM has a standard, statically typed virtual instruction set (the bytecode format) that can be generated and edited with ASM at build-time, load-time or even run-time via agents. Each method in a call stack has an activation frame containing a current bytecode index (akin to the program counter), a local variables table and an operand stack. Thanks to the statically typed instructions and to the lexical scoping, the locals table and the operand stack can always be determined at compile-time for any method and any location in a method’s body.
Quasar’s Fiber.park suspends the current fiber and is used by all potentially fiber-suspending methods, which we call “suspendable”. These methods throw a SuspendExecution to suspend the fiber: exceptions unwind the thread stack one frame at a time and can be controlled via catch/finally blocks. Quasar already provides several suspendable operations, for example Fiber.sleep and channel and actor communication. Our code will add new suspendable methods built upon the existing ones, and the Quasar instrumentor will:
- Fully analyze the bytecode to find all the calls into suspendable methods. A method that (potentially) calls into other suspendable methods is itself considered suspendable, transitively.
- Inject minimal bytecode into suspendable methods (and only them) to manage a user-mode stack, in the following places:
- At the beginning we check whether we’re resuming the fiber and, only in that case, jump to the relevant bytecode index.
- Before a call into another suspendable method we’ll push a snapshot of the current activation frame, including the resume bytecode index; we can do it because we know the structure statically from the analysis phase.
- After a call into another suspendable method we’ll pop the top activation frame and, if resumed, we’ll restore it in the current fiber.
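The steps above can be sketched by hand. The following toy class (names are illustrative, not Quasar’s actual ones) shows the mechanical shape instrumentation gives a suspendable method: check a user-mode stack on entry and restore the frame if resuming, snapshot the frame before suspending, then unwind:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A hand-written sketch of an "instrumented" suspendable method.
// Frame records the resume point (like a bytecode index) and the
// locals that must survive suspension; the deque is the user-mode stack.
public class ToyFiber {
    static final class Frame {
        final int resumePc; final long local;
        Frame(int resumePc, long local) { this.resumePc = resumePc; this.local = local; }
    }

    final Deque<Frame> stack = new ArrayDeque<>(); // user-mode fiber stack
    boolean suspended;

    // Returns -1 as a sentinel while suspended (standing in for the
    // stack unwind a real SuspendExecution throw would perform).
    long run() {
        long acc = 0;
        int pc = 0;
        if (!stack.isEmpty()) {                 // resuming: restore the activation frame
            Frame f = stack.pop();
            pc = f.resumePc;
            acc = f.local;
        }
        switch (pc) {
            case 0:
                acc = 21;                        // do some work
                stack.push(new Frame(1, acc));   // snapshot frame before suspending
                suspended = true;
                return -1;                       // unwind to the scheduler
            case 1:
                suspended = false;
                return acc * 2;                  // resumed: finish the computation
        }
        throw new IllegalStateException();
    }
}
```

A real fiber repeats this pattern for every suspendable frame on the call stack, which is why the analysis phase must know each frame’s structure statically.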
In general, suspendable methods can be marked either explicitly, by annotating them with @Suspendable or declaring throws SuspendExecution, or found automatically through bytecode graph analysis (which may instrument too much and adversely affect performance).
The Quasar team is currently working with the OpenJDK developers to make the instrumentation process completely transparent and not require manual annotation or bytecode graph analysis by having the JVM expose more information about the stack at runtime. This would also make Quasar automatically support all JVM languages (at present Quasar includes support for Java 7, 8 and Kotlin while Pulsar extends it to Clojure and also offers an idiomatic API).
It turns out that a suspended fiber, just like an unscheduled OS thread, is simply a stack that can be resumed, or a continuation, “packaged” as a task that can be re-scheduled: the only difference is that the fiber stack as well as the scheduler are user-mode and run in the JVM rather than in the kernel.
Of course this is only the general idea and there are many more subtleties to take into account, for example:
- How do we implement the fiber stack, which is clearly a performance-critical piece of the puzzle?
- Which task executors are best for which scenarios, how do we choose them and how do we let the developer choose them?
- How do we change a developer’s catch clauses that would trap our “special” suspension exception?
- Are there cases that can be optimized by avoiding (or reducing) an activation frame?
What about fiber performance?
Managing user-mode stacks for fibers brings some overhead; how much exactly depends on how often instrumented methods are called and how deep the fiber call stack is. For example, Quasar integrations such as Comsat's are often based on the FiberAsync class, which parks the fiber after an async API is invoked and unparks it when the completion callback fires. In this case the stack is very shallow and the call frequency is very low, because fibers are used only to perform I/O operations (which are orders of magnitude slower than a method call).
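The shape of that pattern can be sketched with plain JDK types (the real FiberAsync parks a fiber rather than blocking a thread, so no scheduler thread is wasted while waiting): wrap a callback-based async API behind a blocking facade, and let the completion callback wake the caller up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Sketch of the FiberAsync idea using JDK types only.
// asyncOp is a stand-in for any callback-based async API.
public class AsyncToBlocking {
    static void asyncOp(Consumer<String> callback) {
        // Completes on another thread, like an async I/O library would.
        new Thread(() -> callback.accept("done")).start();
    }

    static String blockingOp() {
        CompletableFuture<String> parked = new CompletableFuture<>();
        asyncOp(parked::complete); // the completion callback "unparks" the caller
        return parked.join();      // with Quasar this wait would park the fiber instead
    }
}
```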
You can find an Akka-Quasar performance analysis in this blog post: even for very small workloads Quasar performs comparably to Akka, which is async and thus has no user-mode stack management overhead.
This benchmark analysis shows that serving HTTP requests with fibers rather than threads also has the benefit of significantly increasing a server’s capacity and makes it much more resilient. As for the client side, an upcoming post will discuss the benefits of fibers through a comparative load test.
Our experience is that Quasar fibers work very well in many concrete situations: they allow you to write straightforward fiber-blocking code, yet at the same time develop highly (and finely) concurrent systems that simply couldn’t run on bulky OS threads. The overhead of fibers is higher but still very low even when compared to async and monadic APIs, which have the disadvantage of introducing a cumbersome, infectious programming style and don’t interoperate with imperative control flow constructs built into a language.
So aren’t fibers generators or async/awaits?
No, as we have seen fibers are real threads: namely a continuation plus a scheduler. Generators and async/awaits are implemented with continuations (often a more limited form of continuation called stackless, which can only capture a single stack frame), but those continuations don’t have a scheduler, and are therefore not threads.
I hope this deep dive into fibers and threads has dispelled some doubts. I look forward to feedback and questions! Ping me on Twitter (@ftudone) or leave a comment below and I’ll do my best to respond swiftly.