Handling small files in HDFS on waitingforcode.com

Versions: Hadoop 2.7.2

HDFS is not well suited to storing a lot of small files. Even so, some methods exist to handle them better.

This post covers some of these techniques. The first part describes Hadoop Archives, probably the most often quoted solution to the small files problem. The second part shows how to group small files with the help of the SequenceFile format. The third part lists several alternative solutions coming from the scientific world.

Small files problem

In this post, small files are files smaller than the HDFS block size. Why are they so problematic for HDFS? There are several reasons. Since the NameNode keeps the file system image in memory, it has more metadata to store for many small files than for a few big ones holding the same amount of data. It also has more work to do with edit logs and FSImages, since more file operations have to be recorded.

This point about transaction logs brings another one: startup performance. When the edit logs and the FSImage are bigger, the NameNode has more metadata to load, so it naturally takes more time to start.

A lot of small files also brings a communication overhead: the NameNode tracks more blocks, so its communication with the DataNodes is heavier.
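
The metadata pressure described above can be put into rough numbers. The sketch below is only a back-of-the-envelope estimate: it assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and a 128 MB block size. Neither value comes from this post, and directories are ignored. The class and method names are made up for the illustration.

```java
// Rough estimate of NameNode metadata for the same data volume stored
// either as a few big files or as many small ones.
class NameNodeFootprintSketch {
    // Assumption: ~150 bytes of heap per namespace object (rule of thumb, not measured).
    static final long ASSUMED_BYTES_PER_OBJECT = 150L;

    // Number of blocks used by one non-empty file (ceiling division).
    static long blocksPerFile(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // One object for the file itself plus one object per block.
    static long namespaceObjects(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        return fileCount * (1L + blocksPerFile(fileSizeBytes, blockSizeBytes));
    }

    static long estimatedHeapBytes(long objects) {
        return objects * ASSUMED_BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;       // assumed block size: 128 MB
        long oneGb = 1024L * 1024 * 1024;
        long tenKb = 10L * 1024;

        // 10 GB stored as 10 files of 1 GB each
        long bigObjects = namespaceObjects(10, oneGb, blockSize);
        // 10 GB stored as 1,048,576 files of 10 KB each
        long smallObjects = namespaceObjects(10 * oneGb / tenKb, tenKb, blockSize);

        System.out.println("big files:   " + bigObjects + " objects, ~"
            + estimatedHeapBytes(bigObjects) + " bytes");
        System.out.println("small files: " + smallObjects + " objects, ~"
            + estimatedHeapBytes(smallObjects) + " bytes");
    }
}
```

Under these assumptions the same 10 GB needs 90 namespace objects as big files and more than 2 million as small files, which is why the number of files matters more to the NameNode than the volume of data.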

Small files also hurt MapReduce jobs. They increase the number of map tasks, which is not a good thing when each task only gets an insignificant amount of input. In addition, small files require more disk seeks.
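
To illustrate the effect on the number of map tasks, here is a simplified sketch. It assumes that an input split never spans two files and that each file is cut every splitSize bytes. The real FileInputFormat logic has more details (for instance a tolerance on the last split), so treat the method below as an approximation with hypothetical names.

```java
import java.util.Arrays;

// Approximates how many input splits (and thus map tasks) a set of files produces.
class MapTaskCountSketch {
    // Simplification: each non-empty file yields ceil(size / splitSize) splits,
    // and a split never contains data from two different files.
    static long splitsFor(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            if (size > 0) {
                splits += (size + splitSize - 1) / splitSize;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = 128L * 1024 * 1024;   // assumed split size: 128 MB

        // 1000 files of 100 KB each
        long[] smallFiles = new long[1000];
        Arrays.fill(smallFiles, 100L * 1024);

        // the same amount of data in a single file
        long[] oneBigFile = {1000L * 100 * 1024};

        System.out.println("small files: " + splitsFor(smallFiles, splitSize) + " splits");
        System.out.println("one file:    " + splitsFor(oneBigFile, splitSize) + " splits");
    }
}
```

With these inputs the small files lead to 1000 splits while the single file fits in 1 split, although both hold the same amount of data.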

Hadoop Archive

The first solution, maybe the most often quoted, is the Hadoop Archive (HAR) file. It's an immutable layered file system built on top of HDFS. Each archive has the .har extension. Inside it we can find:

metadata - _index and _masterindex files
the content of archived files - as part-* files

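One of the drawbacks of HAR is related to read performance. Files aren't read directly as usual but through 2 stored indexes. The _index file contains the directory structure and the location of each archived file (part file, offset and length). The _masterindex file points to ranges of _index entries, so that a lookup doesn't have to scan the whole _index.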

The other drawback is that HAR creates a copy of the original files. Thus, additional work may be needed to clean up the originals.

Let's create a simple archive to illustrate this:

hdfs dfs -mkdir /archivable_dir

hadoop fs -copyFromLocal ~/tested_file.txt /archivable_dir/1.txt
hadoop fs -copyFromLocal ~/tested_file.txt /archivable_dir/2.txt
hadoop fs -copyFromLocal ~/tested_file.txt /archivable_dir/3.txt

hadoop fs -ls  /archivable_dir
Found 3 items
-rw-r--r--   1 bartosz supergroup          7 2016-12-16 13:03 /archivable_dir/1.txt
-rw-r--r--   1 bartosz supergroup          7 2016-12-16 13:03 /archivable_dir/2.txt
-rw-r--r--   1 bartosz supergroup          7 2016-12-16 13:03 /archivable_dir/3.txt

hadoop fs -cat  /archivable_dir/1.txt
Test 1

The HAR is created with the command below:

# -archiveName 3_small_files.har - the name of the HAR
# -p /archivable_dir - parent directory of the small files to archive
# -r 1 - desired replication factor
# . - archives all files in /archivable_dir
# small_files - the destination directory of the created HAR
hadoop archive -archiveName 3_small_files.har -p /archivable_dir -r 1 . small_files

hadoop fs -ls  /user/bartosz/small_files/3_small_files.har
Found 4 items
-rw-r--r--   1 bartosz supergroup          0 2016-12-16 13:11 /user/bartosz/small_files/3_small_files.har/_SUCCESS
-rw-r--r--   5 bartosz supergroup        266 2016-12-16 13:11 /user/bartosz/small_files/3_small_files.har/_index
-rw-r--r--   5 bartosz supergroup         23 2016-12-16 13:11 /user/bartosz/small_files/3_small_files.har/_masterindex
-rw-r--r--   1 bartosz supergroup         21 2016-12-16 13:11 /user/bartosz/small_files/3_small_files.har/part-0

hadoop fs -cat  /user/bartosz/small_files/3_small_files.har/part-0
Test 1
Test 1
Test 1

hadoop fs -cat  /user/bartosz/small_files/3_small_files.har/_index
%2F dir 1481717019765+493+bartosz+supergroup 0 0 1.txt 2.txt 3.txt 
%2F1.txt file part-0 0 7 1481716999330+420+bartosz+supergroup 
%2F2.txt file part-0 7 7 1481717016200+420+bartosz+supergroup 
%2F3.txt file part-0 14 7 1481717019748+420+bartosz+supergroup

hadoop fs -cat  /user/bartosz/small_files/3_small_files.har/_masterindex
3 
0 1394155366 0 266
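
The _index listing above is enough to locate an archived file inside part-0. The sketch below is not the HAR implementation. It only parses one "file" line in the layout shown above (encoded name, type, part file, offset, length, properties) and slices the matching bytes out of an in-memory copy of part-0. The class and method names are hypothetical, and the part content assumes that each archived file holds "Test 1" followed by a newline (7 bytes).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HarIndexSketch {
    // Location of one archived file inside a part-* file.
    static final class FileEntry {
        final String path;
        final String partName;
        final int offset;
        final int length;

        FileEntry(String path, String partName, int offset, int length) {
            this.path = path;
            this.partName = partName;
            this.offset = offset;
            this.length = length;
        }
    }

    // Parses a "file" line of _index, e.g. "%2F1.txt file part-0 0 7 <properties>".
    static FileEntry parseFileLine(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length < 5 || !"file".equals(fields[1])) {
            throw new IllegalArgumentException("Not a file entry: " + line);
        }
        return new FileEntry(decode(fields[0]), fields[2],
            Integer.parseInt(fields[3]), Integer.parseInt(fields[4]));
    }

    // Names are URL-encoded in the listing above (%2F stands for "/").
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Extracts the bytes of one archived file from the content of its part file.
    static byte[] read(byte[] partContent, FileEntry entry) {
        return Arrays.copyOfRange(partContent, entry.offset, entry.offset + entry.length);
    }

    public static void main(String[] args) {
        // Assumed content of part-0: 3 files of 7 bytes each
        byte[] part0 = "Test 1\nTest 1\nTest 1\n".getBytes(StandardCharsets.UTF_8);
        FileEntry second = parseFileLine(
            "%2F2.txt file part-0 7 7 1481717016200+420+bartosz+supergroup ");
        String content = new String(read(part0, second), StandardCharsets.UTF_8);
        System.out.println(second.path + " in " + second.partName + " -> " + content.trim());
    }
}
```

The extra hop through the index before reaching the data is what makes HAR reads slower than direct HDFS reads.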

Sequence files

Another well-known solution to the small files problem is sequence files. The idea is to use the small file's name as the key in the sequence file and its content as the value. It could look like the schema below:

# 3 small files, file1.txt, file2.txt, file3.txt represented as 
# sequence file
file1.txt -> "test 1"
file2.txt -> "test 2"
file3.txt -> "test 3"

Sequence files can be used as the output of a consolidation task that groups all small files into one or several sequence files before saving them to HDFS. Small files already stored in HDFS can also be consolidated afterwards with the help of a MapReduce job.

Below you can find an example of merging the content of 3 small text files into 1 sequence file:

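// Imports needed by this snippet. HdfsConfiguration appears to be a helper class
// from the post's project (not shown here), and assertThat is assumed to come from AssertJ.
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

import static org.assertj.core.api.Assertions.assertThat;

// Initialized in openFileSystem() before each test
private FileSystem fileSystem;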

@Before
public void openFileSystem() throws IOException {
  fileSystem = HdfsConfiguration.getFileSystem();
  FileUtils.writeStringToFile(new File("./1.txt"), "Test1");
  FileUtils.writeStringToFile(new File("./2.txt"), "Test2");
  FileUtils.writeStringToFile(new File("./3.txt"), "Test3");
}

@After
public void closeFileSystem() throws IOException {
  fileSystem.close();
  FileUtils.deleteQuietly(new File("./1.txt"));
  FileUtils.deleteQuietly(new File("./2.txt"));
  FileUtils.deleteQuietly(new File("./3.txt"));
}

@Test
public void should_merge_3_small_text_files_to_one_sequence_file() throws IOException {
  Path filePath = new Path("sequence_file_example");
  SequenceFile.Writer writer = SequenceFile.createWriter(HdfsConfiguration.get(),
    SequenceFile.Writer.file(filePath), SequenceFile.Writer.keyClass(Text.class),
    SequenceFile.Writer.valueClass(Text.class));
  for (int i = 1; i <= 3; i++) {
    String fileName = i+".txt";
    writer.append(new Text(fileName), new Text(FileUtils.readFileToString(new File("./"+fileName))));
  }
  writer.close();

  SequenceFile.Reader reader = new SequenceFile.Reader(HdfsConfiguration.get(),
    SequenceFile.Reader.file(filePath));

  Text value = new Text();
  Text key = new Text();
  String[] keys = new String[3];
  String[] values = new String[3];
  int i = 0;
  while (reader.next(key, value)) {
    keys[i] = key.toString();
    values[i] = value.toString();
    i++;
  }
  reader.close();
  assertThat(keys).containsOnly("1.txt", "2.txt", "3.txt");
  assertThat(values).containsOnly("Test1", "Test2", "Test3");
}

Other methods

The problem of small files is interesting enough that the scientific world has worked on it. In the article Dealing with Small Files Problem in Hadoop Distributed File System, the authors analyze and test some alternatives to the well-known HAR and sequence file solutions:

Improved HDFS (IHDFS) - in this mechanism, the client is responsible for merging small files from the same directory into a bigger file. Each big file contains an index with the length and offset of each contained small file.

In addition, IHDFS introduces a new cache management on the DataNode's side, reserved for small files. When a small file is accessed, its content is first searched in the cache. If it's present, it's returned to the client. Otherwise, a standard disk read is done to get it.
Extended HDFS (EHDFS) - it works similarly to IHDFS since it also combines small files into bigger ones. Each block of the bigger file contains an index table listing the small files stored in that block. Each entry holds the small file's offset and length.
[Figure: format of an EHDFS index table entry (offset, length)]
Thanks to that, the NameNode stores only the metadata of the big files. Apart from that, it also stores a separate ConstituentFileMap. This structure maps each small file to its logical block number. The NameNode also keeps the mapping between small files and the index table entries added at the beginning of file blocks.
New Hadoop Archive (NHAR) - unlike the traditional HAR, this improved version allows adding new files to an already created archive. It also tunes file access by creating a 1-level index instead of the traditional 2-level one. This single index is a hash table split among multiple smaller indexes (index_0, index_1,...,index_n). The hashes are derived from small file names and the number of indexed files.

File storage is the same as in the traditional HAR. Small files are stored as members of part-n files.

[Figure: NHAR logic]
CombineFileInputFormat - it is more oriented to MapReduce processing because it packs multiple small files into a single input split. Thanks to that, each map task has more data to process, which reduces the number of tasks and thus the processing time.
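
The packing idea behind the last item of the list can be sketched with a simple greedy grouping. This is only an illustration with hypothetical names: it packs file sizes, in their input order, into splits of a maximal size. It ignores the data locality that the real CombineFileInputFormat takes into account when it builds its splits.

```java
import java.util.ArrayList;
import java.util.List;

class CombineSplitSketch {
    // Greedily packs files (represented by their sizes) into splits of at most
    // maxSplitBytes. A file bigger than the limit gets a split of its own.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split when the next file would overflow it
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] fileSizes = {40, 50, 30, 60, 10};   // arbitrary size units
        List<List<Long>> splits = pack(fileSizes, 100);
        // 5 files end up in 2 splits, so 2 map tasks instead of 5
        System.out.println(splits.size() + " splits: " + splits);
    }
}
```

Without the grouping, each of the 5 files would be processed by its own map task. With it, the same data is handled by 2 tasks.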

Small files can be a real problem for HDFS. They can overload the NameNode with metadata that could be handled in a smarter way. The first method to handle small files consists of grouping them in a Hadoop Archive (HAR). However, it can lead to read performance problems. The other solution is SequenceFile, with file names as keys and contents as values. It also needs some additional consolidation work. The last part showed alternatives to these solutions, extending either HDFS or HAR features. Some of them are based on a similar concept of merging small files into a big one and handling the mapping on the NameNode side.
