Imagine a bacon-wrapped Ferrari. Still not better than our free technical reports.
See all our reports

Is Java 8 the fastest JVM ever? Performance benchmarking of Fork-Join

forks

Today, I want to continue talking about Java 8 and all kinds of tools that can help you on your quest to a better software. Java 8 was released just a couple of weeks ago, but, according to some early results of RebelLabs’ Java Tools & Technologies survey, which you can complete in just 3 minutes and helps a children’s charity (guilt trip!), 5% of respondents have started to check out Java 8! For the tools part, I’m going to look at the infamous Java Microbenchmark Harness project, which allows you to generate better benchmarks and measure performance of your code in a sane way.

This post is also about concurrency updates that landed into the JDK in the latest release. Actually, there were multiple changes to java.util.concurrent, but here the focus will be on the improvements introduced to the Fork-Join framework. We’re doing to discuss Fork-Join a bit and then try to implement a sample benchmark to compare the performance of the FJ in Java 7 vs. 8.

What you might care about Fork/Join

ForkJoin is a framework for parallel computation of, usually, recursive tasks. It was first introduced in the Java 7 and since then served its purpose well. The reason for that is that many large tasks in their nature can be represented as recursive.

For example, consider the most known MapReduce programming example: counting how many times different words occur in a document. Obviously, one can divide the document into parts, compute word counts part by part and combine the results in the end. Indeed, ForkJoin is an implementation of general MapReduce principles, only all the workers are threads in the same JVM instead of being a cluster of machines.

The central piece in the ForkJoin framework is a ForkJoinPool. It is an ExecutorService, so it is capable of accepting asynchronous tasks returning you a Future object, so you can keep track of what’s the status of the computation in question.

What makes it different from the other ExecutorServices is that its thread workers that at the moment don’t execute anything useful check out their buddies and borrow work from them. This technique is called work-stealing. So what’s the deal with work stealing?

queue

Work stealing is a way to manage workload in a decentralized fashion. There’s no-one to assign units of work to all available thread workers. Instead every one of them manages its personal queue of tasks. The trick is to manage those queues effectively.

There are two main questions about having every worker processing its own queue.

  • Where do externally submitted task go?
  • How can we organise work stealing so access to queues is efficient?

Essentially, there should be no difference between an externally submitted tasks and tasks created by workers themselves while executing some larger tasks. Both have similar requirements to be executed and provide a result. However the mechanics is different. The main difference is that tasks created by workers can be stolen. It means that despite being put to a certain worker’s queue they will be executed by someone else.

The way how ForkJoin framework handles it is pretty straightforward, every worker has two queues, one for external tasks and another to satisfy the stealing mechanics. When an external submit happens, random workers queue is picked to add the task. When a task is divided into the smaller counterparts, worker adds them to its own queue and hopes that its peers come to help.

The whole stealing idea is based of the fact that a worker adds tasks to the tail of its queue. Then during normal execution every worker tries to take a task from the head of the queue, and when failing to do so because there’s no work in its personal queue goes and steals a task from the tail of a someone else queue. This allows to omit locking on most queue operations and enhances performance.

Another trick that is employed to make ForkJoin pools work faster is that when a worker steals tasks, it leaves a hint which queue it took work from and the original owner of the work can determine it and help this particular worker, so the progress of the parent tasks moves faster.

All in all it is an extremely complicated machine which requires a lot of scientific background to make it work. Also properties and performance of this system are very implementation specific, so I doubt that it will change dramatically without major refactoring.

What was wrong with ForkJoin in Java 7?

Since ForkJoin framework was introduced in Java 7, it actually worked well there. However it didn’t stop the progress and within concurrency updates in Java 8 the ForkJoin was improved. From this Java Enhancement Proposal we find what exactly was improved.

Added functionality and improved performance for ForkJoinPools that allow them to be used more effectively in the increasingly wider range of applications that users desire. New features include support for completion-based designs that are often most appropriate for IO-bound usages, among others.

Another source of knowledge is of course talking to the authors of the improvements, for example an somewhat older update from Doug Lea:

Substantially better throughput when lots of clients submit lots of tasks. The idea is to treat external submitters in a similar way as workers — using randomized queuing and stealing. This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.

However it is not that easy to find out what exactly has been changed and which scenarios are affected. Instead, let’s take another approach to the problem. I’ll try to create a benchmark that simulates a very simple ForkJoin computation and measures the overhead of handling ForkJoin tasks vs. computing the values sequentially in a single thread. Hopefully it will give us a starting point to speculate about improvements.

Performance comparison

So I’ve set a small benchmark to figure out if the difference between Java 7 and Java 8 is actually noticeable. Here is a Github repo if you want to gaze at it or want to try it yourself.

Thanks to the recent effort by Oracle engineers OpenJDK now contains Java Microbenchmark Harness project, that is specifically aimed at creating benchmarks that are hopefully less affected by usual microbenchmarking problems and mistakes.

So let’s put it to use. JMH comes with a Maven archetype for a project, so setting everything up is even easier than one can imagine.

    <dependencies>
        <dependency>
            <groupId>org.openjdk.jmh</groupId>
            <artifactId>jmh-core</artifactId>
            <version>0.4.1</version>
        </dependency>
    </dependencies>

At the moment of writing this, the latest released version of JMH core is 0.4.1 which includes the @Param annotation to run a benchmark on a series of parameterized inputs. This leaves the pain of executing the pain of running the same benchmark multiple times manually and really simplifies obtaining the results.

Now every benchmark iteration gets its personal ForkJoinPool instance, this also helps us to lessen differences in instantiation of common ForkJoinPool in Java 8 and prior releases.

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class FJPBenchmark {
 
  @Param({ "200", "400", "800", "1600"})
  public int N;
 
  public List<RecursiveTask<Double>> tasks;
  public ForkJoinPool pool = new ForkJoinPool();
 
  @Setup
  public void init() {
    Random r = new Random();
    r.setSeed(0x32106234567L);
    tasks = new ArrayList<RecursiveTask<Double>>(N * 3);
 
    for (int i = 0; i < N; i++) {
      tasks.add(new Sin(r.nextDouble()));
      tasks.add(new Cos(r.nextDouble()));
      tasks.add(new Tan(r.nextDouble()));
    }
  }
 
  @GenerateMicroBenchmark
  public double forkJoinTasks() {
    for (RecursiveTask<Double> task : tasks) {
      pool.submit(task);
    }
    double sum = 0;
    Collections.reverse(tasks);
    for (RecursiveTask<Double> task : tasks) {
      sum += task.join();
    }
    return sum;
  }
 
  @GenerateMicroBenchmark
  public double computeDirectly() {
    double sum = 0;
    for (RecursiveTask<Double> task : tasks) {
      sum += ((DummyComputableThing) task).dummyCompute();
    }
    return sum;
  }
}

Sin, Cos and Tan are instances of the RecursiveTask, where Sin and Cos don’t actually recurse into anything, but compute Math.sin(input) and Math.cos(input) respectively. Tan tasks actually recurse into a pair of Sin and Cos and return the result of a division.

JMH process the code of the project and generates benchmarks from the methods marked with a @GenerateMicroBenchmark annotations. Other annotations that you can see on top of the class specify benchmarking options: number of warmup iterations, number of iterations that contribute then to the final result, if we want to fork another JVM process for benchmark, and the thing we’re measuring. That would be a throughput of the code; how many times these methods can be executed in a period of time.

@Param says that we want to run the benchmark with several input sizes. All in all, the JMH is very straightforward and creating benchmarks doesn’t require fiddling with manually handling iterations, timing and crunching the results.

Running this benchmark with Java 7 and 8 produces following results. I’ve used respectively versions 1.7.0_40 and 1.8.0.

shelajev@shrimp ~/repo/blogposts/fork-join-blocking-perf » java -version
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

Benchmark                   (N)   Mode   Samples         Mean   Mean error    Units
o.s.FJPB.computeDirectly    200  thrpt        20       27.890        0.306   ops/ms
o.s.FJPB.computeDirectly    400  thrpt        20       14.046        0.072   ops/ms
o.s.FJPB.computeDirectly    800  thrpt        20        6.982        0.043   ops/ms
o.s.FJPB.computeDirectly   1600  thrpt        20        3.481        0.122   ops/ms
o.s.FJPB.forkJoinTasks      200  thrpt        20       11.530        0.121   ops/ms
o.s.FJPB.forkJoinTasks      400  thrpt        20        5.936        0.126   ops/ms
o.s.FJPB.forkJoinTasks      800  thrpt        20        2.931        0.027   ops/ms
o.s.FJPB.forkJoinTasks     1600  thrpt        20        1.466        0.012   ops/ms


shelajev@shrimp ~/repo/blogposts/fork-join-blocking-perf » java -version
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

Benchmark                   (N)   Mode   Samples         Mean   Mean error    Units
o.s.FJPB.computeDirectly    200  thrpt        20       27.680        2.050   ops/ms
o.s.FJPB.computeDirectly    400  thrpt        20       13.690        0.994   ops/ms
o.s.FJPB.computeDirectly    800  thrpt        20        6.783        0.548   ops/ms
o.s.FJPB.computeDirectly   1600  thrpt        20        3.364        0.304   ops/ms
o.s.FJPB.forkJoinTasks      200  thrpt        20       15.868        0.291   ops/ms
o.s.FJPB.forkJoinTasks      400  thrpt        20        8.060        0.222   ops/ms
o.s.FJPB.forkJoinTasks      800  thrpt        20        4.006        0.024   ops/ms
o.s.FJPB.forkJoinTasks     1600  thrpt        20        1.968        0.043   ops/ms

In a chart below the results are put, for your convenience, into a graphical form.

fj-results-graph

One can see that the baseline results, which show the throughput of running the math directly in a single thread do not differ between the JDK 7 and 8. However, when we include the overhead of managing recursive tasks and going through a ForkJoin execution then Java 8 is much faster. The numbers for this simple benchmark suggest that the overhead of managing ForkJoin tasks is around 35% more performant in the latest release.

The difference between the baseline and FJ computation comes because we elaborately created our recursive tasks extremely thin. They don’t do anything else than a single method call to a very optimised Math class. Because of that, the direct application of Math is much faster. A beefier tasks would change a picture, but they would lessen the overhead of the ForkJoin management, which we wanted to measure in the first place. In general though, you do want to have recursive tasks that do more than a method call.

Also, there’s a slight difference between the Java 7 and 8 results on the baseline tests. The difference is negligible and most probably is not caused by any changes there could be between the Math implementations in Java 7 vs. 8. It probably is a measurement artifact however well JMH tries to balance these out or avoid it.

disclaimer

Disclaimer: of course these results are synthetic, and you should take them with a grain of salt. However, in addition to discussing Java performance, I wanted to show how easily JMH allows you to create a benchmark that avoids some of the common benchmarking issues like not warming up the JVM. That doesn’t save you if the semantics of the benchmark is flawed, but definitely helps a lot. So if you see a flaw in the logic of the code, please mention it to me. I like getting smarter.

Conclusion

First of all, I want to say that source code for ForkJoinPool, ForkJoinPool.WorkQueue and ForkJoinTask classes is not exactly the easiest read, and it features quite a lot of the Unsafe mechanics and probably won’t answer your questions about the ForkJoin framework in 15 minutes.

fjp-code

However, these classes are well documented and provide a lot of information via internal comments. It is also probably one of the most interesting places in the JDK to dig into, when you have free time.

Another relevant finding is that performance of the ForkJoin pools is better in Java 8, at least on some use-cases. And while I cannot precisely describe what is the reason behind that, if I was map/reducing something with the ForkJoin, I’d take my chances and upgrade.

If you see a flaw in logic or can offer another explanation of the results, or test to run, don’t hesitate to leave a comment below or message me @shelajev.


P.S. Please take the survey yourself and share with friends–remember, each completed survey gives cash to Child’s Play, a charity that helps get children in hospitals games and entertainment to cheer them up. Great cause for them, and for us as developers–MOAR DATA!

SPEND 3 MINUTES HELPING KIDS :-)

 

  • edharned

    Saying that the F/J Classes are not the easiest is an understatement of great magnitude. The code is horrendous; it violates just about every principle of good programming including, but not limited to “spaghetti code”, “extreme complexity”, “instance variables used by other classes” and much more. It is an attempt to port Cilk to Java and it is a failure, more so with Java8. http://coopsoft.com/ar/Calamity2Article.html

    Now to your benchmark. The Java7 version is faster than the Java8 because of the way join() works. In Java7, the framework creates “continuation threads” to compensate for the wait() issued by the join(). In a production environment this is a disaster so in Java8 the threads continue fetching work from the deque until the deque is empty. Then they stall with a wait(). What you end up with is half the threads in a wait state. You can run the example code from my article, above, in a profiler and see for yourself. As above, reading the code is most difficult.

  • thmuders

    @edharned:disqus I read your very informative article. However you write “Now to your benchmark. The Java7 version is faster than the Java8..:”
    I think you misread the benchmark. The Java7 version is NOT faster than the Java 8 version.

  • AnonymousCoward

    Nice article, some nits:

    1. Could you label the axes in your main graph, i.e. dimension and units?

    2. You wrote: “…ForkJoin execution then Java 8 is much faster.” Yes, in the lower numbers on your X axis it is, but towards the right side the difference (as shown by your graph anyway) is less pronounced and perhaps not significant.

    3. “Performant” is not a word, even if it’s in wiktionary.

  • Eric Bresie

    So the Fork Join benchmark seems to show slower for the Java 8 verses the faster Java 7 version. Right?

  • TheJeebus

    No, the graph is poorly named. It shows throughput. Baseline is fastest (single thread), then FJ Java8.