Compression in Parquet

Versions: Parquet 1.9.0

Last time we discovered the different encoding methods available in Apache Parquet. But encoding is not the only technique that helps to reduce the size of files. Another one, very similar, is compression.

This post gives an insight into the compression provided by Apache Parquet. The first part recalls the basics and focuses on compression as a process independent of any framework. The second section lists and explains the compression codecs available in Parquet. The last section shows some compression tests.

Compression definition

Data compression is a technique whose main purpose is the reduction of the logical size of a file. A reduced file takes up less space on disk and can be transferred faster over the network. Very often compression works similarly to encoding: it identifies repetitive patterns and replaces them with more compact representations.
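To see this at work independently of any framework, here is a minimal sketch compressing a highly repetitive text with the JDK's built-in gzip implementation (the class name and the sample text are illustration-only assumptions):

// Minimal, framework-independent sketch: compressing a repetitive text
// with the JDK's own gzip implementation
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.zip.GZIPOutputStream;

public class PlainGzipExample {
  public static void main(String[] args) throws Exception {
    // 3000 bytes made of the repeated pattern "abc"
    byte[] plain = String.join("", Collections.nCopies(1000, "abc"))
        .getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
      gzip.write(plain);
    }
    // the repeated pattern collapses to a few dozen bytes
    System.out.println("Plain size = " + plain.length
        + " vs compressed size = " + compressed.size());
  }
}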

The difference between encoding and compression is subtle. Encoding changes the data representation. For instance, it can represent the word "abc" as "11", and it is often based on the same principles as compression. An example of that is Run-Length Encoding, which replaces repeated symbols (e.g. "aaa") with a shorter form ("3a"). On the other hand, compression deals with the information as a whole and takes advantage of data redundancy to reduce the total size. For instance, it can remove the parts of the information that are useless (e.g. repeated patterns or pieces of information meaningless for the receiver).
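A naive sketch of the Run-Length Encoding principle mentioned above - not Parquet's actual implementation, which uses a hybrid of bit-packing and run-length encoding on integers:

// Naive Run-Length Encoding: every run of a repeated symbol is replaced
// by its length followed by the symbol, so "aaabb" becomes "3a2b"
private static String runLengthEncode(String input) {
  StringBuilder encoded = new StringBuilder();
  int i = 0;
  while (i < input.length()) {
    char symbol = input.charAt(i);
    int runLength = 0;
    while (i < input.length() && input.charAt(i) == symbol) {
      runLength++;
      i++;
    }
    encoded.append(runLength).append(symbol);
  }
  return encoded.toString();
}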

Compression codecs in Parquet

The compression in Parquet is done per column. Besides the uncompressed representation, Apache Parquet comes with 3 compression codecs:

GZIP - a codec based on the DEFLATE algorithm, combining LZ77 and Huffman coding. It usually gives a good compression ratio at the price of higher CPU usage.
Snappy - a codec developed by Google, aiming at high compression and decompression speed rather than the best possible compression ratio.
LZO - a codec focused on fast decompression. Unlike the two previous ones, it requires a native library installed separately, mainly because of licensing constraints.
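In the Java API the codec is declared once on the writer (the format itself stores the codec per column chunk). A minimal sketch, assuming a MessageType schema and an output path defined elsewhere for the example:

// 'schema' is an assumed MessageType built elsewhere for this sketch;
// ExampleParquetWriter is Parquet's Group-based example writer
Path outputPath = new Path("/tmp/compression_sample.parquet");
ParquetWriter<Group> writer = ExampleParquetWriter.builder(outputPath)
    .withType(schema)
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .build();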

Parquet compression examples

In the tests below, because of the complexity of LZO installation, only the gzip and Snappy codecs are presented:

 
private static final CodecFactory CODEC_FACTORY = new CodecFactory(new Configuration(), 1024);
private static final CodecFactory.BytesCompressor GZIP_COMPRESSOR =
        CODEC_FACTORY.getCompressor(CompressionCodecName.GZIP);
private static final CodecFactory.BytesCompressor SNAPPY_COMPRESSOR =
        CODEC_FACTORY.getCompressor(CompressionCodecName.SNAPPY);

private static final BytesInput BIG_INPUT_INTS;
static {
  List<BytesInput> first4000Numbers =
    IntStream.rangeClosed(1, 4000).boxed().map(BytesInput::fromInt).collect(Collectors.toList());
  BIG_INPUT_INTS = BytesInput.concat(first4000Numbers);
}

private static final BytesInput SMALL_INPUT_STRING = BytesInput.concat(
        BytesInput.from(getUtf8("test")), BytesInput.from(getUtf8("test1")), BytesInput.from(getUtf8("test2")),
        BytesInput.from(getUtf8("test3")), BytesInput.from(getUtf8("test4"))
);

private static final BytesInput INPUT_DATA_STRING;
private static final BytesInput INPUT_DATA_STRING_DISTINCT_WORDS;
static {
  URL url = Resources.getResource("lorem_ipsum.txt");
  try {
    String text = Resources.toString(url, Charset.forName("UTF-8"));
    INPUT_DATA_STRING = BytesInput.from(getUtf8(text));
    String textWithDistinctWords = Stream.of(text.split(" ")).distinct().collect(Collectors.joining(", "));
    INPUT_DATA_STRING_DISTINCT_WORDS = BytesInput.from(getUtf8(textWithDistinctWords));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}

@Test
public void should_compress_lorem_ipsum_more_efficiently_with_gzip_than_without_compression() throws IOException {
  BytesInput compressedLorem = GZIP_COMPRESSOR.compress(INPUT_DATA_STRING);

  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+INPUT_DATA_STRING.size());
  assertThat(compressedLorem.size()).isLessThan(INPUT_DATA_STRING.size());
}

@Test
public void should_compress_text_without_repetitions_more_efficiently_with_gzip_than_without_compression() throws IOException {
  BytesInput compressedLorem = GZIP_COMPRESSOR.compress(INPUT_DATA_STRING_DISTINCT_WORDS);

  System.out.println("Compressed size = "+compressedLorem.size() +
          " vs plain size = "+INPUT_DATA_STRING_DISTINCT_WORDS.size());
  assertThat(compressedLorem.size()).isLessThan(INPUT_DATA_STRING_DISTINCT_WORDS.size());
}

@Test
public void should_compress_small_text_less_efficiently_with_gzip_than_without_compression() throws IOException {
  BytesInput compressedLorem = GZIP_COMPRESSOR.compress(SMALL_INPUT_STRING);

  // Compressed size is greater because gzip adds some extra metadata, such as the header
  // It can be seen in java.util.zip.GZIPOutputStream.writeHeader()
  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+SMALL_INPUT_STRING.size());
  assertThat(compressedLorem.size()).isGreaterThan(SMALL_INPUT_STRING.size());
}

@Test
public void should_compress_ints_more_efficiently_with_gzip_than_without_compression() throws IOException {
  BytesInput compressedLorem = GZIP_COMPRESSOR.compress(BIG_INPUT_INTS);

  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+ BIG_INPUT_INTS.size());
  assertThat(compressedLorem.size()).isLessThan(BIG_INPUT_INTS.size());
}

@Test
public void should_compress_lorem_ipsum_more_efficiently_with_snappy_than_without_compression() throws IOException {
  BytesInput compressedLorem = SNAPPY_COMPRESSOR.compress(INPUT_DATA_STRING);

  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+INPUT_DATA_STRING.size());
  assertThat(compressedLorem.size()).isLessThan(INPUT_DATA_STRING.size());
}

@Test
public void should_compress_text_without_repetitions_more_efficiently_with_snappy_than_without_compression() throws IOException {
  BytesInput compressedLorem = SNAPPY_COMPRESSOR.compress(INPUT_DATA_STRING_DISTINCT_WORDS);

  System.out.println("Compressed size = "+compressedLorem.size() +
      " vs plain size = "+INPUT_DATA_STRING_DISTINCT_WORDS.size());
  assertThat(compressedLorem.size()).isLessThan(INPUT_DATA_STRING_DISTINCT_WORDS.size());
}

@Test
public void should_compress_small_text_less_efficiently_with_snappy_than_without_compression() throws IOException {
  BytesInput compressedLorem = SNAPPY_COMPRESSOR.compress(SMALL_INPUT_STRING);

  // For Snappy the difference is much smaller than in the case of gzip (1 byte)
  // The difference comes from the fact that Snappy stores the length of the uncompressed
  // data at the beginning of the compressed output
  // You can see that in the Snappy implementation C++ header file:
  // https://github.com/google/snappy/blob/32d6d7d8a2ef328a2ee1dd40f072e21f4983ebda/snappy.h#L111
  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+SMALL_INPUT_STRING.size());
  assertThat(compressedLorem.size()).isGreaterThan(SMALL_INPUT_STRING.size());
}

@Test
public void should_compress_ints_more_efficiently_with_snappy_than_without_compression() throws IOException {
  BytesInput compressedLorem = SNAPPY_COMPRESSOR.compress(BIG_INPUT_INTS);

  // Snappy's output is only slightly smaller than the plain encoding (1 byte less)
  System.out.println("Compressed size = "+compressedLorem.size() + " vs plain size = "+ BIG_INPUT_INTS.size());
  assertThat(compressedLorem.size()).isLessThan(BIG_INPUT_INTS.size());
}

@Test
public void should_fail_compress_with_lzo_when_the_native_library_is_not_loaded() throws IOException {
  assertThatExceptionOfType(RuntimeException.class).isThrownBy(() -> CODEC_FACTORY.getCompressor(CompressionCodecName.LZO))
    .withMessageContaining("native-lzo library not available");
}

private static byte[] getUtf8(String text) {
  try {
    return text.getBytes("UTF-8");
  } catch (UnsupportedEncodingException e) {
    throw new RuntimeException(e);
  }
}

Compression is another of Parquet's weapons in the battle against inefficient storage. It's complementary to the already explained encoding methods and is applied at column level as well, which brings the possibility of using different compression methods according to the contained data. Apache Parquet provides the 3 compression codecs detailed in the 2nd section: gzip, Snappy and LZO. The first two are included natively, while the last one requires some additional setup. As shown in the final section, compression is not always beneficial. Sometimes the compressed data occupies more space than the uncompressed one. This can be explained by the extra overhead of metadata (such as headers or length information) added by the compression codecs.