Introduction to HDFS

HDFS is one of the most popular distributed file systems today. It differs from older distributed file systems thanks to its reliability.

Bartosz Konieczny

In this post we'll look at HDFS from a high level. The first part describes distributed file systems. The second part focuses on the main characteristics of HDFS, while the third covers the Java API exposed by Hadoop's file system.

Distributed file system

A distributed file system is an application working in client/server mode: clients and servers communicate with each other over the network.

The client doesn't need to know where files are physically located because access is transparent. It can use the same commands as if it were working with files stored locally, for instance ls to list the files in a directory. In addition, the client can perform exactly the same operations, such as creating a file or directory, moving a directory, or deleting a file or directory.

But classical distributed file systems had some drawbacks. They didn't take replication into account, so a single server often held all the data and, if it went down, nobody could access the files. Relying on one server also led to network congestion when too many clients were working on files.

HDFS as distributed file system

HDFS (Hadoop Distributed File System) is a distributed file system that solves the issues described in the previous section. It achieves its reliability by running on a cluster of machines rather than on a single server. The other points describing HDFS are:

Java API

An intrinsic part of HDFS is its API, developed in Java. It won't be presented here in detail; only some important classes are listed.

The first one is org.apache.hadoop.conf.Configuration. As the name indicates, it holds the HDFS configuration as named properties, in the same way as the XML configuration files do.
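
As a minimal sketch of how Configuration can be used, the snippet below sets two properties in code and reads them back. It assumes hadoop-common is on the classpath. The class name ConfigurationSketch, the NameNode address and the replication value are hypothetical and serve only as illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigurationSketch {

    // Builds a configuration in code instead of relying only on XML files.
    static Configuration buildConfiguration() {
        // Loads the default resources found on the classpath (e.g. core-site.xml).
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace it with the one of your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode.example:8020");
        // Assumed replication factor, chosen only for this illustration.
        conf.setInt("dfs.replication", 2);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = buildConfiguration();
        // Properties are read back by name; the second argument is a fallback value.
        System.out.println(conf.get("fs.defaultFS"));
        System.out.println(conf.getInt("dfs.replication", 3));
    }
}
```

Setting a property this way only changes the in-memory object; no connection to a cluster is opened until a FileSystem is requested.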

The second important class is the abstract class org.apache.hadoop.fs.FileSystem, which represents the file system. Through its methods we can programmatically modify the file system (create files, list directory contents, etc.). Another class related to the file system is org.apache.hadoop.fs.Path. It represents the name of a file or a directory in the FileSystem.
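
The following sketch shows FileSystem and Path working together: it creates a directory, adds an empty file to it and lists the directory's content. To stay runnable without a cluster, it assumes hadoop-common on the classpath and uses the local implementation returned by FileSystem.getLocal; against a real HDFS, FileSystem.get(conf) would be used instead. The class, method and directory names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemSketch {

    // Creates a directory holding one empty file and returns the names listed in it.
    static List<String> createAndList(FileSystem fs, Path directory) throws IOException {
        fs.mkdirs(directory);
        // A Path can be built from a parent Path and a child name.
        Path file = new Path(directory, "example.txt");
        // create() returns an output stream; closing it right away leaves an empty file.
        fs.create(file, true).close();

        List<String> names = new ArrayList<>();
        for (FileStatus status : fs.listStatus(directory)) {
            names.add(status.getPath().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Local file system, used here so that the sketch runs without a cluster.
        FileSystem fs = FileSystem.getLocal(conf);
        Path directory = new Path(System.getProperty("java.io.tmpdir"), "hdfs-intro-sketch");
        System.out.println(createAndList(fs, directory));
        // Recursive delete removes the directory and everything below it.
        fs.delete(directory, true);
    }
}
```

Because the method only depends on the abstract FileSystem type, the same code should work unchanged with any implementation, which is the point of the abstraction.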

Finally, two classes are used as an abstraction for data read from and written to HDFS: org.apache.hadoop.fs.FSDataInputStream, returned when a file is opened, and org.apache.hadoop.fs.FSDataOutputStream, returned when a file is created.
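
A short sketch of the two stream abstractions follows: FileSystem.create returns an FSDataOutputStream used for writing, and FileSystem.open returns an FSDataInputStream used for reading. As an assumption made to keep the example runnable without a cluster, it uses the local file system and requires hadoop-common on the classpath. The class name StreamSketch and the file name are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamSketch {

    // Writes the text to the file, overwriting any previous content.
    static void write(FileSystem fs, Path file, String text) throws IOException {
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Reads the whole file back as a UTF-8 string.
    static String read(FileSystem fs, Path file) throws IOException {
        try (FSDataInputStream in = fs.open(file)) {
            // The file length tells us how many bytes to read.
            byte[] content = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(content);
            return new String(content, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Local file system, used here so that the sketch runs without a cluster.
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path file = new Path(System.getProperty("java.io.tmpdir"), "hdfs-intro-stream.txt");
        write(fs, file, "hello HDFS");
        System.out.println(read(fs, file));
        fs.delete(file, false);
    }
}
```

Loading a whole file into memory is only reasonable for small files; larger ones would typically be consumed in chunks from the input stream.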

This post introduced HDFS. Its first part described some general points about distributed file systems. The second presented HDFS as a big step forward compared to previous distributed file systems. The last part introduced some important classes for managing HDFS programmatically.
