Performant synchronized queue and stack data structures in Java

2 min readJun 12, 2020

While profiling Apache Tomcat trying to figure out why the request throughput under very high load is twice better on x86_64 than on aarch64 CPU architecture I’ve noticed that there is a high thread contention when using org.apache.tomcat.util.collections.SynchronizedQueue and org.apache.tomcat.util.collections.SynchronizedStack.

Executing Async-Profiler’s

./profiler.sh -e lock -d 60 <TOMCAT_PID>

gave me the following output:

Started [lock] profiling                                                                                                                       
    --- Execution profile ---
    Total samples       : 713834                                                                                                                    
    Frame buffer usage  : 0.0288%      
                                                                                                                                                                        
--- 1106165618177 ns (58.97%), 395609 samples
        [ 0] org.apache.tomcat.util.collections.SynchronizedStack
        [ 1] org.apache.tomcat.util.collections.SynchronizedStack.push
        [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
        [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
        [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
        [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
        [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
        [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
        [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
        [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
        [10] org.apache.coyote.http11.Http11OutputBuffer.end
        [11] org.apache.coyote.http11.Http11Processor.finishResponse
        [12] org.apache.coyote.AbstractProcessor.action
        [13] org.apache.coyote.Response.action
        [14] org.apache.catalina.connector.OutputBuffer.close
        [15] org.apache.catalina.connector.Response.finishResponse
        [16] org.apache.catalina.connector.CoyoteAdapter.service
        [17] org.apache.coyote.http11.Http11Processor.service
        [18] org.apache.coyote.AbstractProcessorLight.process
        [19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
        [20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
        [21] org.apache.tomcat.util.net.SocketProcessorBase.run
        [22] java.util.concurrent.ThreadPoolExecutor.runWorker
        [23] java.util.concurrent.ThreadPoolExecutor$Worker.run
        [24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
        [25] java.lang.Thread.run
--- 769519847169 ns (41.03%), 314384 samples
        [ 0] org.apache.tomcat.util.collections.SynchronizedStack
        [ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop
        [ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
        [ 3] org.apache.tomcat.util.net.NioSelectorPool.write
        [ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
        [ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
        [ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
        [ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
        [ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
        [ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
        [10] org.apache.coyote.http11.Http11OutputBuffer.end
        [11] org.apache.coyote.http11.Http11Processor.finishResponse 
        ....

SynchronizedQueue and SynchronizedStack are custom implementations of queue and stack data structures (not implementing Java Collections interfaces!) which are used by Apache Tomcat to keep re-usable instances of classes which are shared between all HTTP processing threads.

These custom queue and stack implementations are backed by an array (Object[]) and the access to this array is synchronized, i.e. the public methods of SynchronizedQueue/Stack are synchronized, so only one thread can modify the array at any given time.

Since version 1.5 Java provides non-blocking ConcurrentLinkedQueue and since 1.7 ConcurrentLinkedDeque (that could be used as a stack).

Let’s measure which implementation performs better — Tomcat’s custom implementations or Java standard libraries!

Tomcat already has a performance tests — TesterPerformanceSynchronizedQueue.java and TesterPerformanceSynchronizedStack.java. Both of them start 4 threads and push/poll 1 million times the data structure under test. They just print the total time in nanoseconds.

I’ve prepared similar performance tests based on JMH (Java Microbenchmark Harness) — SynchronizedCollectionsBenchmark.java

SynchronizedCollectionsBenchmark.concurrentLinkedDeque  thrpt    5        5360524.349 ±   405803.655  ops/s
SynchronizedCollectionsBenchmark.concurrentLinkedQueue  thrpt    5   5935777.516 ±   619664.615  ops/s
SynchronizedCollectionsBenchmark.synchronizedQueue      thrpt    5  12134033.786 ± 3262089.933  ops/s
SynchronizedCollectionsBenchmark.synchronizedStack      thrpt    5  12727145.327 ± 2682451.900  ops/s

The results show that despite the locking done by SynchronizedQueue/Stack implementations the direct access by array index and the lack of CAS (compare-and-set) checks done by ConcurrentLinkedQueue/Deque the data structures used by Tomcat are 2–2.5 times faster!

Performant synchronized queue and stack data structures in Java

Written by Martin Grigorov

No responses yet