Spark SQL schema articles

Apache Spark SQL and types resolution in semi-structured data

One of the goals of data governance is to ensure data consistency across different producers. Unfortunately, that often remains theory, especially when the data format is schemaless. That's why data exploration is an important step in defining a data pipeline. In this post I do a small exercise and check how Apache Spark SQL behaves with inconsistent data.
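
Not from the article itself, just a minimal sketch of such an exercise; the local session, the inline JSON documents and the field names (user_id, amount) are made up for illustration. The same field arrives once as a number and once as a string, and printSchema shows what Spark resolves:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Types resolution exercise")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The same field arrives once as a number and once as a string
val inconsistentJson = Seq(
  """{"user_id": 1, "amount": 30.5}""",
  """{"user_id": "1", "amount": "thirty"}"""
).toDS()

val resolved = spark.read.json(inconsistentJson)
// With mixed types Spark resolves a common, wider type (here: string)
resolved.printSchema()
resolved.show()
```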

Continue Reading →

Defining schemas in Apache Spark SQL with builder design pattern

Schemas are one of the key parts of Apache Spark SQL and one of its main distinctions from the old RDD-based API. When the data comes from a structured source, such as a relational database or a schema-based file format, we can let the framework resolve the schema for us. Things get more complicated with semi-structured data such as JSON, where we must define the schema by hand.
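
As a small sketch of what a hand-written schema can look like with the fluent StructType.add builder; the column names and the /tmp/orders.json path are only placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructType}

// Schema built fluently, field by field, instead of passing an Array of StructFields
val orderSchema = new StructType()
  .add("id", IntegerType, nullable = false)
  .add("customer", new StructType()
    .add("name", StringType)
    .add("city", StringType))
  .add("amount", DoubleType)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// /tmp/orders.json is only a placeholder path
val orders = spark.read.schema(orderSchema).json("/tmp/orders.json")
orders.printSchema()
```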

Continue Reading →

Schema projection

Even if it's always better to be explicit, in programming we often have the possibility to let the computer guess. Spark SQL also has this level of intelligence, for example during schema resolution.
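
For illustration only, a small sketch of letting Spark guess: the inferSchema option on a CSV read (the file path and header option are assumptions) makes Spark resolve the column types itself at the cost of an extra pass over the data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Let Spark guess the column types instead of declaring them up front
val guessed = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // extra pass over the file to resolve types
  .csv("/tmp/visits.csv")        // placeholder path

guessed.printSchema()
```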

Continue Reading →

User Defined Type

Spark SQL's schema is very flexible. It supports common data types, such as booleans, integers and strings, but it also supports custom data types called User Defined Types (UDT).
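
As a rough idea of the shape of a UDT, here is a sketch modeled on Spark's own ExamplePointUDT and written against the older public UserDefinedType API (in more recent Spark versions this API is no longer public, so the exact signatures differ); the Point class itself is hypothetical:

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// A hypothetical 2D point stored in rows as an array of two doubles
@SQLUserDefinedType(udt = classOf[PointUDT])
class Point(val x: Double, val y: Double) extends Serializable

class PointUDT extends UserDefinedType[Point] {
  // Physical representation of the value inside a row
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(obj: Any): GenericArrayData = obj match {
    case p: Point => new GenericArrayData(Array[Any](p.x, p.y))
  }

  override def deserialize(datum: Any): Point = datum match {
    case values: ArrayData => new Point(values.getDouble(0), values.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```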

Continue Reading →

Schemas

Spark SQL - even if the SQL suffix automatically makes us think about RDBMS - works well with other data sources, such as plain CSV or JSON files. This would be difficult to achieve without the concept of a schema.
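
As a small, made-up illustration of that idea, the same hand-written schema can be applied to a CSV file and to a JSON file, which is what makes such different sources look alike to Spark SQL (paths and columns are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A single schema definition...
val userSchema = new StructType()
  .add("id", IntegerType)
  .add("login", StringType)

// ...shared by two very different file formats (placeholder paths)
val fromCsv = spark.read.schema(userSchema).option("header", "true").csv("/tmp/users.csv")
val fromJson = spark.read.schema(userSchema).json("/tmp/users.json")

// Both DataFrames expose the same columns and types
fromCsv.printSchema()
fromJson.printSchema()
```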

Continue Reading →