Performant synchronized queue and stack data structures in Java

Martin Grigorov
2 min readJun 12, 2020

--

While profiling Apache Tomcat trying to figure out why the request throughput under very high load is twice better on x86_64 than on aarch64 CPU architecture I’ve noticed that there is a high thread contention when using org.apache.tomcat.util.collections.SynchronizedQueue and org.apache.tomcat.util.collections.SynchronizedStack.

Executing Async-Profiler’s

./profiler.sh -e lock -d 60 <TOMCAT_PID>

gave me the following output:

Started [lock] profiling                                                                                                                       
--- Execution profile ---
Total samples : 713834
Frame buffer usage : 0.0288%

--- 1106165618177 ns (58.97%), 395609 samples
[ 0] org.apache.tomcat.util.collections.SynchronizedStack
[ 1] org.apache.tomcat.util.collections.SynchronizedStack.push
[ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
[ 3] org.apache.tomcat.util.net.NioSelectorPool.write
[ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
[ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
[ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
[ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
[ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
[ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
[10] org.apache.coyote.http11.Http11OutputBuffer.end
[11] org.apache.coyote.http11.Http11Processor.finishResponse
[12] org.apache.coyote.AbstractProcessor.action
[13] org.apache.coyote.Response.action
[14] org.apache.catalina.connector.OutputBuffer.close
[15] org.apache.catalina.connector.Response.finishResponse
[16] org.apache.catalina.connector.CoyoteAdapter.service
[17] org.apache.coyote.http11.Http11Processor.service
[18] org.apache.coyote.AbstractProcessorLight.process
[19] org.apache.coyote.AbstractProtocol$ConnectionHandler.process
[20] org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun
[21] org.apache.tomcat.util.net.SocketProcessorBase.run
[22] java.util.concurrent.ThreadPoolExecutor.runWorker
[23] java.util.concurrent.ThreadPoolExecutor$Worker.run
[24] org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run
[25] java.lang.Thread.run

--- 769519847169 ns (41.03%), 314384 samples
[ 0] org.apache.tomcat.util.collections.SynchronizedStack
[ 1] org.apache.tomcat.util.collections.SynchronizedStack.pop
[ 2] org.apache.tomcat.util.net.NioBlockingSelector.write
[ 3] org.apache.tomcat.util.net.NioSelectorPool.write
[ 4] org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doWrite
[ 5] org.apache.tomcat.util.net.SocketWrapperBase.doWrite
[ 6] org.apache.tomcat.util.net.SocketWrapperBase.flushBlocking
[ 7] org.apache.tomcat.util.net.SocketWrapperBase.flush
[ 8] org.apache.coyote.http11.Http11OutputBuffer$SocketOutputBuffer.end
[ 9] org.apache.coyote.http11.filters.IdentityOutputFilter.end
[10] org.apache.coyote.http11.Http11OutputBuffer.end
[11] org.apache.coyote.http11.Http11Processor.finishResponse
....

SynchronizedQueue and SynchronizedStack are custom implementations of queue and stack data structures (not implementing Java Collections interfaces!) which are used by Apache Tomcat to keep re-usable instances of classes which are shared between all HTTP processing threads.

These custom queue and stack implementations are backed by an array (Object[]) and the access to this array is synchronized, i.e. the public methods of SynchronizedQueue/Stack are synchronized, so only one thread can modify the array at any given time.

Since version 1.5 Java provides non-blocking ConcurrentLinkedQueue and since 1.7 ConcurrentLinkedDeque (that could be used as a stack).

Let’s measure which implementation performs better — Tomcat’s custom implementations or Java standard libraries!

Tomcat already has a performance tests — TesterPerformanceSynchronizedQueue.java and TesterPerformanceSynchronizedStack.java. Both of them start 4 threads and push/poll 1 million times the data structure under test. They just print the total time in nanoseconds.

I’ve prepared similar performance tests based on JMH (Java Microbenchmark Harness) — SynchronizedCollectionsBenchmark.java

SynchronizedCollectionsBenchmark.concurrentLinkedDeque  thrpt    5        5360524.349 ±   405803.655  ops/s
SynchronizedCollectionsBenchmark.concurrentLinkedQueue thrpt 5 5935777.516 ± 619664.615 ops/s
SynchronizedCollectionsBenchmark.synchronizedQueue thrpt 5 12134033.786 ± 3262089.933 ops/s
SynchronizedCollectionsBenchmark.synchronizedStack thrpt 5 12727145.327 ± 2682451.900 ops/s

The results show that despite the locking done by SynchronizedQueue/Stack implementations the direct access by array index and the lack of CAS (compare-and-set) checks done by ConcurrentLinkedQueue/Deque the data structures used by Tomcat are 2–2.5 times faster!

--

--

No responses yet