Apache Parquet articles

Bartosz Konieczny

Compression in Parquet

Last time we discovered the different encoding methods available in Apache Parquet. But encoding is not the only technique that helps reduce file size. Another, very similar one is compression.


Nested data representation in Parquet

Working with nested structures is a challenge in column-oriented storage. However, thanks to the approach introduced by Google's Dremel, this problem can be solved efficiently.


Schema versions in Parquet

When I started to play with Apache Parquet, I was surprised to find 2 versions of writers. Before approaching the rest of the planned topics, it's a good moment to explain these versions in more detail.


Encodings in Apache Parquet

Efficient data storage is one of the keys to success for a good storage format. One way to improve it is an appropriate encoding, and Parquet comes with several different methods.


Data storage in Apache Parquet

Previously we focused on the types available in Parquet. This time we can move forward and analyze how the format stores the data in files.


Data types in Apache Parquet

Data in Apache Parquet files is written against a specific schema. And whoever says schema automatically implies data types for the fields composing it.


Introduction to Apache Parquet

Very often appropriate storage is as important as the data processing pipeline. Among the different possibilities, we can still store the data in files. Thanks to different formats, such as column-oriented ones, some of the actions in the read path can be optimized.
