HDFS is one of the most popular distributed file systems nowadays. It differs from older distributed file systems thanks to its reliability.
In this post we'll discover HDFS in the big picture. The first part describes distributed file systems. The second part focuses on the main characteristics of HDFS, while the third covers the Java API brought by Hadoop's file system.
Distributed file system
A distributed file system is an application working in client/server mode. It means that the communication between the client and the server is done through the network.
The client ignores where files are physically located because all communication is transparent. It means that the client can use the same commands as if it were communicating with files stored locally, for instance ls to list files in a directory. In addition, the client can make exactly the same operations, such as creating a file/directory, moving a directory or deleting a file/directory.
But classical distributed file systems had some drawbacks. They didn't take replication into account. It means that often one server held all the data and if it went down, nobody could access the files. It also led to network congestion when too many clients were working on files.
HDFS as distributed file system
HDFS (Hadoop Distributed File System) is a distributed file system solving the issues described in the previous paragraph. It's reliable because it works on a cluster of machines. The other points describing HDFS are:
- Java-based - HDFS is programmed in Java, so naturally it exposes an API in this language. In addition, other languages are also supported through third-party projects.
- master/slave - HDFS works in a master/slave architecture where a master node coordinates the work of the file system while the data is physically stored on slaves. The master node is called NameNode and its slaves DataNodes. They will be presented in more detail in one of the next posts. For now we can simplify and say that the NameNode stores file metadata, such as block locations, and the DataNodes store the files physically.
- batch oriented - HDFS is based on disk I/O operations. Naturally, it's more adapted to batch tasks than to streaming ones, where data is often stored in memory to guarantee faster access.
- blocks abstraction - files are stored in blocks. Each block has a default size of 128 MB that can be overridden at file creation time (see the sketch after this list). These blocks are stored on DataNodes. But the blocks of one file won't necessarily be stored on the same DataNode.
- fault tolerance - HDFS offers reliability for stored files. It achieves that thanks to block replication and periodic checkpoints. Prior to version 2.0, the NameNode was a single point of failure. But this release brought the possibility to run a second, redundant NameNode in passive mode and activate it when the main NameNode goes down.
- TCP/IP communication - the communication in HDFS is done with the TCP/IP protocol.
- commodity hardware - HDFS is built to work on commodity, widely available hardware.
- write-once - at the beginning, created files were write-once, i.e. once created, nobody could change them. Some time later HDFS allowed clients to append data to or truncate files. But other modifications aren't possible and the only way to edit a file in some other manner is to delete the old one and upload a new one.
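
To illustrate the blocks abstraction, here is a minimal Java sketch creating a file with a custom block size. It assumes a reachable NameNode at hdfs://namenode:8020 (a hypothetical address) and the hadoop-client dependency on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical NameNode address - adapt it to your cluster
        configuration.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fileSystem = FileSystem.get(configuration);

        Path file = new Path("/tmp/custom_block_size.txt");
        short replication = 3;
        long blockSizeBytes = 64L * 1024 * 1024; // 64 MB instead of the 128 MB default
        // create(path, overwrite, bufferSize, replication, blockSize) overrides
        // the default block size for this file only
        try (FSDataOutputStream out =
                fileSystem.create(file, true, 4096, replication, blockSizeBytes)) {
            out.writeUTF("File stored in 64 MB blocks");
        }
    }
}

The blockSize argument applies only to the file being created; other files keep the cluster's default.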
Java API
An intrinsic part of HDFS is its API developed in Java. It won't be presented here in detail; only some important classes are listed.
The first one is org.apache.hadoop.conf.Configuration. As the name indicates, it holds the HDFS configuration, loaded from XML configuration files and optionally overridden programmatically.
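
A minimal sketch of how Configuration can be used, again assuming the hypothetical NameNode address hdfs://namenode:8020:

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        Configuration configuration = new Configuration();
        // Properties can also be set or overridden programmatically
        configuration.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address
        // get() falls back to the supplied default when the property is absent
        String replication = configuration.get("dfs.replication", "3");
        System.out.println("Replication factor: " + replication);
    }
}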
The second important class represents the file system: the abstract class org.apache.hadoop.fs.FileSystem. Through its methods we'll programmatically modify the file system (create files, list directory contents etc.). Another class related to the file system is org.apache.hadoop.fs.Path. It represents the name of a file or a directory in the FileSystem seen previously.
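
A short sketch combining FileSystem and Path to create a directory and list its content, under the same assumptions as before:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemExample {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address
        FileSystem fileSystem = FileSystem.get(configuration);

        // Path identifies a file or a directory inside the file system
        Path directory = new Path("/tmp/hdfs_demo");
        fileSystem.mkdirs(directory);

        // listStatus returns one FileStatus per entry in the directory
        for (FileStatus status : fileSystem.listStatus(directory)) {
            System.out.println(status.getPath() + " (length: " + status.getLen() + ")");
        }
        System.out.println("Exists? " + fileSystem.exists(directory));
    }
}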
Finally, two classes are used as an abstraction for data written to and read from HDFS: org.apache.hadoop.fs.FSDataInputStream and org.apache.hadoop.fs.FSDataOutputStream.
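
And a last sketch writing a file and reading it back through both stream classes, still with the hypothetical address:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamsExample {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address
        FileSystem fileSystem = FileSystem.get(configuration);
        Path file = new Path("/tmp/streams_demo.txt");

        // FSDataOutputStream is returned by FileSystem.create()
        try (FSDataOutputStream out = fileSystem.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // FSDataInputStream is returned by FileSystem.open()
        try (FSDataInputStream in = fileSystem.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}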
This post introduces the topic of HDFS. Its first part describes some generalities about distributed file systems. The second presents HDFS as a big step forward compared to previous distributed file systems. The last part introduces some important classes for programmatic HDFS management.