Java’s ServiceLoader API + using native libraries => NOK
In this article I am going to share with you an interesting case of how an optional feature makes a software library completely unusable on non-AMD64 hardware.
The software library is Apache Parquet.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
While storing data, Parquet may use compression to reduce its size on disk. The supported compression formats are listed in parquet.thrift: Snappy, Gzip, LZO, Brotli, LZ4 and ZSTD.
The codec implementations for all formats except Brotli are provided by the Apache Hadoop Common library. The Brotli implementation comes from another dependency, https://github.com/rdblue/brotli-codec. It implements the same APIs as the codecs in Hadoop Common, so it is easy to plug in!
I tried to build Apache Parquet on my ARM64 machine but it failed with:
$ mvn clean verify
Tests in error:
testReadWriteWithCountDeprecated(org.apache.parquet.hadoop.DeprecatedInputFormatTest): org.apache.hadoop.io.compress.CompressionCodec: Provider org.apache.hadoop.io.compress.BrotliCodec could not be instantiated
The weird thing is that this test does not mention Brotli anywhere in its source code! A few of its test methods do use the Gzip compression codec, though.
Apache Hadoop Common's CompressionCodecFactory uses Java’s ServiceLoader API to find and load all implementations of org.apache.hadoop.io.compress.CompressionCodec on the application’s classpath. The ServiceLoader API is the Java standard library's way to implement a plugin system: any dependency of a Java application may provide an implementation of the plugin (an interface or class), and the application will see and use it at runtime without having a hard reference to the implementation classes.
The problem here is that, by using the ServiceLoader API, CompressionCodecFactory discovers all implementations very aggressively. The loader is created at class loading time and covers every provider on the classpath, whether or not the application needs it:
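To make the plugin mechanism concrete, here is a minimal sketch of ServiceLoader discovery. Everything in it is an assumption made for illustration: the Codec interface and GzipLikeCodec are hypothetical stand-ins for CompressionCodec and a real codec, and the registration file that a plugin jar would normally ship under META-INF/services is written to a temporary directory so that the example fits in one file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class PluginDiscoverySketch {

    /** Hypothetical plugin contract; it plays the role of CompressionCodec. */
    public interface Codec {
        String name();
    }

    /** A provider needs to be public and to have a public no-argument constructor. */
    public static class GzipLikeCodec implements Codec {
        public GzipLikeCodec() {}

        @Override
        public String name() {
            return "gzip-like";
        }
    }

    /** Returns the names of all Codec providers that ServiceLoader can find. */
    static List<String> discover() {
        try {
            // A real plugin jar ships this registration file. The sketch writes it
            // to a temporary directory (left behind afterwards) to stay in one file.
            Path root = Files.createTempDirectory("plugin-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            Files.write(services.resolve(Codec.class.getName()),
                    GzipLikeCodec.class.getName().getBytes(StandardCharsets.UTF_8));

            URL[] classpath = {root.toUri().toURL()};
            try (URLClassLoader loader =
                         new URLClassLoader(classpath, Codec.class.getClassLoader())) {
                List<String> names = new ArrayList<>();
                // The application only names the interface; implementations are looked up.
                for (Codec codec : ServiceLoader.load(Codec.class, loader)) {
                    names.add(codec.name());
                }
                return names;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

Note that discover() never references GzipLikeCodec in the lookup loop. This is the convenience of the mechanism, and also the reason a dependency the application never asked for can end up being instantiated.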
private static final ServiceLoader<CompressionCodec> CODEC_PROVIDERS = ServiceLoader.load(CompressionCodec.class);
The second important part of the problem is that BrotliCodec (an implementation of the CompressionCodec interface) uses native code to do its job, and it loads the native library in its constructor! So when the providers found by ServiceLoader.load(CompressionCodec.class) are instantiated, BrotliCodec fails with an UnsatisfiedLinkError if there is no native library for the current CPU architecture, and that surfaces as the "could not be instantiated" error above. The brotli-codec library ships only linux-x86-amd64 and darwin-x86-amd64 binaries.
So any attempt to use any compression format in Parquet on anything other than Linux or macOS on x86_64 will fail, and there is nothing you can do about it!
At the time of writing this article (20.02.2021, Apache Parquet 1.11.1) the only reliable solution is to manually remove brotli-codec.jar from the classpath.
I have created a Pull Request that works around this issue while building Apache Parquet on ARM64 hardware. Once it is merged, one will be able to build Parquet on ARM64 without any extra manual work! But if you use the convenience jars deployed at Maven Central then you will still need to exclude brotli-codec from parquet-hadoop:
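One way a codec could avoid this trap is to keep its constructor cheap and defer the native load until compression is actually requested. The sketch below uses the initialization-on-demand holder idiom for that. It is not how brotli-codec is written: the class, its methods and the library name hypothetical_brotli_sketch are all assumptions made for illustration.

```java
class LazyNativeCodecSketch {

    /** Holder class: the JVM runs this initializer only on first access to AVAILABLE. */
    private static final class NativeLib {
        static final boolean AVAILABLE = tryLoad();

        private static boolean tryLoad() {
            try {
                // Assumed library name; it is not expected to exist on any machine.
                System.loadLibrary("hypothetical_brotli_sketch");
                return true;
            } catch (UnsatisfiedLinkError e) {
                return false;
            }
        }
    }

    /** Cheap constructor: no native code is touched, so instantiation cannot fail here. */
    public LazyNativeCodecSketch() {}

    /** Triggers the one-time load attempt and reports its outcome. */
    public boolean isNativeAvailable() {
        return NativeLib.AVAILABLE;
    }

    public byte[] compress(byte[] input) {
        if (!NativeLib.AVAILABLE) {
            throw new UnsupportedOperationException(
                    "no native library for " + System.getProperty("os.arch"));
        }
        return input.clone(); // placeholder for the real native call
    }
}
```

With this shape, a missing binary only affects callers that actually ask for this codec; merely having the jar on the classpath stays harmless.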
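The failure can be reproduced without Hadoop or Brotli. In the sketch below, which is an illustration under assumptions and not Hadoop's code, NativeCodec is a hypothetical provider that links a non-existent library in its constructor, and PureJavaCodec is a healthy one. The loop shows what a tolerant consumer could look like: it catches the ServiceConfigurationError for the broken provider and moves on. The ServiceLoader documentation describes recovery after such an error as best effort only. A Parquet user cannot apply this fix, because the iteration happens inside Hadoop's own code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.ServiceConfigurationError;
import java.util.ServiceLoader;

class BrokenProviderSketch {

    /** Hypothetical plugin contract; it plays the role of CompressionCodec. */
    public interface Codec {
        String name();
    }

    public static class PureJavaCodec implements Codec {
        public PureJavaCodec() {}

        @Override
        public String name() {
            return "pure-java";
        }
    }

    /** Mimics a codec that links native code in its constructor. */
    public static class NativeCodec implements Codec {
        public NativeCodec() {
            // Assumed library name; loading it throws UnsatisfiedLinkError.
            System.loadLibrary("hypothetical_brotli_sketch");
        }

        @Override
        public String name() {
            return "native";
        }
    }

    /** Instantiates every registered provider, skipping the ones that fail. */
    static List<String> loadTolerantly() {
        try {
            // Register both providers in a temporary META-INF/services file
            // (left behind afterwards); a real jar would ship this file.
            Path root = Files.createTempDirectory("broken-provider-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            String registrations =
                    NativeCodec.class.getName() + "\n" + PureJavaCodec.class.getName() + "\n";
            Files.write(services.resolve(Codec.class.getName()),
                    registrations.getBytes(StandardCharsets.UTF_8));

            URL[] classpath = {root.toUri().toURL()};
            try (URLClassLoader loader =
                         new URLClassLoader(classpath, Codec.class.getClassLoader())) {
                List<String> usable = new ArrayList<>();
                Iterator<Codec> providers = ServiceLoader.load(Codec.class, loader).iterator();
                while (true) {
                    try {
                        if (!providers.hasNext()) {
                            break;
                        }
                        usable.add(providers.next().name());
                    } catch (ServiceConfigurationError e) {
                        // "Provider ... could not be instantiated": skip it and continue.
                    }
                }
                return usable;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Replacing the try/catch with a plain for-each loop over the ServiceLoader reproduces the behaviour from the build log: the first broken provider aborts discovery for every codec.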
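<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.11.1</version> <!-- or later version -->
  <exclusions>
    <exclusion>
      <groupId>com.github.rdblue</groupId>
      <artifactId>brotli-codec</artifactId>
    </exclusion>
  </exclusions>
</dependency>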
I’ve sent a Pull Request to https://github.com/rdblue/brotli-codec migrating the Java Brotli bindings from jbrotli to Brotli4j, because Brotli4j is maintained and provides native binaries for several more CPU architectures. But Ryan Blue is no longer interested in maintaining his library and would prefer that a new Brotli codec be introduced.
I will contact the Hadoop developers to see whether they are interested in adding the new BrotliCodec implementation directly to the hadoop-common project. If they are not, then I will release it myself and suggest it to the parquet-hadoop project as a replacement!