Introduction to HDFS

Versions: Hadoop 2.7.2

HDFS is one of the most popular distributed file systems today. It differs from older distributed file systems in its reliability.

This post gives a big-picture overview of HDFS. The first part describes distributed file systems in general. The second part focuses on the main characteristics of HDFS, and the third on the Java API exposed by Hadoop's file system.

Distributed file system

A distributed file system is an application working in client/server mode, which means that clients and servers communicate over the network.

The client does not need to know where files are physically located because all communication is transparent. It can use the same commands as if it were working with files stored locally, for instance ls to list the files in a directory. In addition, the client can perform exactly the same operations, such as creating a file or directory, moving a directory, or deleting a file or directory.

But classical distributed file systems had some drawbacks. They didn't take replication into account: often a single server held all the data, and if it went down, nobody could access the files. This design also led to network congestion when too many clients were working on files at the same time.
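The availability problem can be illustrated with a toy model. The sketch below is only an illustration of the reasoning, not how any real file system is implemented; the class ReplicationSketch and the server and file names are hypothetical. A file stays readable as long as at least one reachable server holds a copy of it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of file availability with and without replication. */
public class ReplicationSketch {

    // Server name -> names of the files stored on that server.
    private final Map<String, Set<String>> filesByServer = new HashMap<>();
    // Servers that are currently unreachable.
    private final Set<String> downServers = new HashSet<>();

    /** Stores one copy of the file on every listed server. */
    public void store(String file, List<String> servers) {
        for (String server : servers) {
            filesByServer.computeIfAbsent(server, s -> new HashSet<>()).add(file);
        }
    }

    /** Simulates a server failure. */
    public void markDown(String server) {
        downServers.add(server);
    }

    /** A file is readable when at least one reachable server holds a copy. */
    public boolean isReadable(String file) {
        for (Map.Entry<String, Set<String>> entry : filesByServer.entrySet()) {
            boolean reachable = !downServers.contains(entry.getKey());
            if (reachable && entry.getValue().contains(file)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ReplicationSketch cluster = new ReplicationSketch();
        // One copy only, as in the classical setup described above.
        cluster.store("single.txt", Arrays.asList("server-1"));
        // Three copies spread over three servers.
        cluster.store("replicated.txt", Arrays.asList("server-1", "server-2", "server-3"));

        cluster.markDown("server-1");

        System.out.println("single.txt readable: " + cluster.isReadable("single.txt"));
        System.out.println("replicated.txt readable: " + cluster.isReadable("replicated.txt"));
    }
}
```

After server-1 goes down, single.txt is no longer readable while replicated.txt still is, because two other servers hold a copy.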

HDFS as distributed file system

HDFS (Hadoop Distributed File System) is a distributed file system that solves the issues described in the previous section. It is designed to be reliable, and for that reason it runs on a cluster of machines rather than on a single server. The other points describing HDFS are:

Java API

An intrinsic part of HDFS is its API, developed in Java. It won't be presented here in detail; only some important classes are listed.

The first one is org.apache.hadoop.conf.Configuration. As the name indicates, it builds the HDFS configuration as a set of named properties, much like the XML configuration files do.
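As a minimal sketch, assuming the Hadoop client libraries are on the classpath, a configuration can be built and queried as below. The class ConfigurationSketch and the NameNode address are hypothetical; fs.defaultFS is the property that usually lives in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigurationSketch {

    /** Builds a client-side configuration pointing at the given file system URI. */
    public static Configuration forFileSystem(String defaultFsUri) {
        // The default constructor also picks up core-default.xml and
        // core-site.xml when they are available on the classpath.
        Configuration conf = new Configuration();
        // Programmatic equivalent of a <property> entry in core-site.xml.
        conf.set("fs.defaultFS", defaultFsUri);
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical NameNode address, used only for illustration.
        Configuration conf = forFileSystem("hdfs://namenode.example:8020");
        System.out.println(conf.get("fs.defaultFS"));
        // getInt falls back to the supplied default when the key is not defined.
        System.out.println(conf.getInt("dfs.replication", 3));
    }
}
```

Values set in code take the place of whatever the XML files define for the same key, unless the property was declared final there.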

The second important class is the abstract class org.apache.hadoop.fs.FileSystem, which represents the file system. Through its methods we can programmatically modify the file system (create files, list directory contents, etc.). Another class related to the file system is org.apache.hadoop.fs.Path. It represents the name of a file or a directory in the FileSystem described above.
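The following sketch shows how FileSystem and Path could work together. To stay runnable without a cluster, it assumes Hadoop's local file system implementation obtained with FileSystem.getLocal; against a real cluster, FileSystem.get(conf) with an hdfs:// default file system would be used instead. The class FileSystemSketch and the directory and file names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemSketch {

    /** Creates a directory holding one empty file and returns the names listed in it. */
    public static List<String> createAndList(FileSystem fs, Path dir) throws IOException {
        fs.mkdirs(dir);
        // A Path can be built from a parent Path and a child name.
        fs.createNewFile(new Path(dir, "empty.txt"));

        List<String> names = new ArrayList<>();
        for (FileStatus status : fs.listStatus(dir)) {
            names.add(status.getPath().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Local implementation, chosen so that no running cluster is needed.
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path dir = new Path(System.getProperty("java.io.tmpdir"), "fs-sketch");

        System.out.println(createAndList(fs, dir));

        // The second argument requests a recursive delete.
        fs.delete(dir, true);
    }
}
```

Because createAndList only depends on the abstract FileSystem type, the same method should work unchanged whichever implementation is passed in.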

Finally, two classes serve as abstractions for data read from and written to HDFS: org.apache.hadoop.fs.FSInputStream and org.apache.hadoop.fs.FSDataOutputStream.
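A write followed by a read could look like the sketch below. Note that FileSystem.create returns an FSDataOutputStream, while FileSystem.open returns an org.apache.hadoop.fs.FSDataInputStream, a wrapper around the underlying input stream. The class StreamsSketch and the file name are hypothetical, and the local file system is again assumed so that the example runs without a cluster.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamsSketch {

    /** Writes the text to the file, then reads it back through the same FileSystem. */
    public static String writeThenRead(FileSystem fs, Path file, String text) throws IOException {
        // The boolean flag asks create() to overwrite an existing file.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF(text);
        }
        try (FSDataInputStream in = fs.open(file)) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path file = new Path(System.getProperty("java.io.tmpdir"), "streams-sketch.txt");

        System.out.println(writeThenRead(fs, file, "hello HDFS"));

        // Not a directory, so a non-recursive delete is enough.
        fs.delete(file, false);
    }
}
```

Both stream classes extend the standard java.io data streams, which is why writeUTF and readUTF are available here.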

This post introduced HDFS. Its first part described some general points about distributed file systems. The second presented HDFS as a big step forward compared to previous distributed file systems. The last part introduced some important classes for managing HDFS programmatically.
