Performant synchronized queue and stack data structures in Java
While profiling Apache Tomcat trying to figure out why the request throughput under very high load is twice better on x86_64 than on aarch64 CPU architecture I’ve noticed that there is a high thread contention when using org.apache.tomcat.util.collections.SynchronizedQueue and org.apache.tomcat.util.collections.SynchronizedStack.
Executing Async-Profiler’s
./profiler.sh -e lock -d 60 <TOMCAT_PID>
gave me the following output:
Started [lock] profiling
--- Execution profile ---
Total samples : 713834
Frame buffer usage : 0.0288%
--- 1106165618177 ns (58.97%), 395609 samples
[ 0] org.apache.tomcat.util.collections.SynchronizedStack
[ 1] org.apache.tomcat.util.collections.SynchronizedStack.push
[ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
[ 3] org.apache.tomcat.util.net.NioSelectorPool.write
[ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
[ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
[ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
[ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
[ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
[ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
[10] org.apache.coyote.http11.Http11OutputBuffer.end
[11] org.apache.coyote.http11.Http11Processor.finishResponse
[12] org.apache.coyote.AbstractProcessor.action
[13] org.apache.coyote.Response.action
[14] org.apache.catalina.connector.OutputBuffer.close
[15] org.apache.catalina.connector.Response.finishResponse
[16] org.apache.catalina.connector.CoyoteAdapter.service
[17] org.apache.coyote.http11.Http11Processor.service
[18] org.apache.coyote.AbstractProcessorLight.process
[19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
[20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
[21] org.apache.tomcat.util.net.SocketProcessorBase.run
[22] java.util.concurrent.ThreadPoolExecutor.runWorker
[23] java.util.concurrent.ThreadPoolExecutor$Worker.run
[24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
[25] java.lang.Thread.run
--- 769519847169 ns (41.03%), 314384 samples
[ 0] org.apache.tomcat.util.collections.SynchronizedStack
[ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop
[ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
[ 3] org.apache.tomcat.util.net.NioSelectorPool.write
[ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
[ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
[ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
[ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
[ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
[ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
[10] org.apache.coyote.http11.Http11OutputBuffer.end
[11] org.apache.coyote.http11.Http11Processor.finishResponse
....
SynchronizedQueue and SynchronizedStack are custom implementations of queue and stack data structures (not implementing Java Collections interfaces!) which are used by Apache Tomcat to keep re-usable instances of classes which are shared between all HTTP processing threads.
These custom queue and stack implementations are backed by an array (Object[]) and the access to this array is synchronized, i.e. the public methods of SynchronizedQueue/Stack are synchronized, so only one thread can modify the array at any given time.
Since version 1.5 Java provides non-blocking ConcurrentLinkedQueue and since 1.7 ConcurrentLinkedDeque (that could be used as a stack).
Let’s measure which implementation performs better — Tomcat’s custom implementations or Java standard libraries!
Tomcat already has a performance tests — TesterPerformanceSynchronizedQueue.java and TesterPerformanceSynchronizedStack.java. Both of them start 4 threads and push/poll 1 million times the data structure under test. They just print the total time in nanoseconds.
I’ve prepared similar performance tests based on JMH (Java Microbenchmark Harness) — SynchronizedCollectionsBenchmark.java
SynchronizedCollectionsBenchmark.concurrentLinkedDeque thrpt 5 5360524.349 ± 405803.655 ops/s
SynchronizedCollectionsBenchmark.concurrentLinkedQueue thrpt 5 5935777.516 ± 619664.615 ops/s
SynchronizedCollectionsBenchmark.synchronizedQueue thrpt 5 12134033.786 ± 3262089.933 ops/s
SynchronizedCollectionsBenchmark.synchronizedStack thrpt 5 12727145.327 ± 2682451.900 ops/s
The results show that despite the locking done by SynchronizedQueue/Stack implementations the direct access by array index and the lack of CAS (compare-and-set) checks done by ConcurrentLinkedQueue/Deque the data structures used by Tomcat are 2–2.5 times faster!