Files in HDFS are different from files in a local file system. They're fault-tolerant, can be stored in different abstractions and are based on quite big blocks compared to the blocks of a local file system.
This article describes files in HDFS. The first part shows how files are treated by HDFS. The second part describes some of the file formats we can meet in HDFS-based applications. The last part shows how to manipulate files through the Java API.
Storage
The file abstraction in HDFS is based on blocks: every file is split into one or more blocks that are stored and replicated across the cluster's DataNodes.
By default, a block is 128 MB large (64 MB in Hadoop 1), but the size can be overridden at the moment of file creation. The same rule applies to the number of file replicas. This parameter can also be defined specifically for a file and is called the replication factor.
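As an illustration, here is a minimal sketch of both options; the property keys dfs.blocksize and dfs.replication are the standard Hadoop 2+ names, while the path and the values are made up for the example:

// Sketch: overriding the defaults through the client configuration
// (assumes fs.defaultFS already points at the HDFS cluster).
Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB blocks for files created by this client
conf.setInt("dfs.replication", 2);                 // 2 replicas instead of the default 3
FileSystem fs = FileSystem.get(conf);

// The replication factor of an already existing file can also be changed afterwards
fs.setReplication(new Path("/existing_file.txt"), (short) 4);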
Why are blocks so large?
There are several reasons. First of all, block metadata is stored in the NameNode, and a lot of small blocks would add a lot of additional data to manage. Secondly, a big block size reduces the share of seek time. "Hadoop: The Definitive Guide" gives the example of a disk with a transfer rate of 100 MB/s and a seek time of 10 ms. In that case, transferring a 100 MB block takes about 1 second, so the seek represents only 1% of the total transfer time.
But what happens if a file is smaller than the block size? Files are physically stored on the underlying file system, which treats them as any other files and stores them in increments of its own block size. So even if a file is smaller than the HDFS block size, it won't physically occupy a whole HDFS block.
However, HDFS is not well suited to handle files smaller than blocks, because of memory and disk seeks. The NameNode holds the information about file locations in memory, with each object taking around 150 bytes. Thus, the fewer files are created, the less memory is used; as a rough example, 10 million small files, each represented by a file object and a block object, already consume about 3 GB of NameNode heap. The same logic applies to the physical files: the more small files are stored, the more disk seeks must be done to read the data.
Every HDFS block is also accompanied by a metadata file containing the CRC32 checksums of this block. This metadata file is used to detect corrupted blocks.
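The checksums are also exposed through the Java API. A minimal sketch, assuming an already initialized fileSystem instance like the one used in the test cases at the end of this post (the path is made up):

// Sketch: retrieving the checksum HDFS computed for a file
Path checkedFile = new Path("/hello_world.txt");
FileChecksum checksum = fileSystem.getFileChecksum(checkedFile);
// For HDFS the algorithm name is a composite MD5-of-CRC checksum
System.out.println(checksum.getAlgorithmName() + " -> " + checksum);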
What is a corrupted block?
Corrupted block - a block is considered corrupted when at least 1 of its replicas is not correct. It's an indicator telling that the given block has a great chance to become unavailable sooner or later.
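Corrupted blocks can also be detected programmatically. A hedged sketch, assuming the same fileSystem instance backed by HDFS (listCorruptFileBlocks is not supported by every FileSystem implementation and the path is made up):

// Sketch: listing the files having at least one corrupted block under a path
RemoteIterator<Path> corruptedFiles = fileSystem.listCorruptFileBlocks(new Path("/"));
while (corruptedFiles.hasNext()) {
    System.out.println("File with corrupted block(s): " + corruptedFiles.next());
}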
File types
One of the solutions to the small files problem is sequence files, represented by the SequenceFile class in the Java API. This kind of file can be thought of as a map with file names as keys and file contents as values. Thanks to that, a lot of small files can be collected and put into a single SequenceFile. SequenceFile also offers a lot of advantages for MapReduce processing, especially splittability, which helps to parallelize the work on file parts in the map stage.
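As a hedged sketch of that idea (the SmallFilesPacker class, the directory and the target path are made up for the example), small local files can be appended to a single SequenceFile with the file name as key and the raw content as value:

// Sketch: packing a directory of small local files into one SequenceFile
// (file name -> file content); all names and paths here are illustrative.
public class SmallFilesPacker {

    public static void pack(Configuration conf, File smallFilesDir, Path targetSequenceFile) throws IOException {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(targetSequenceFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File smallFile : smallFilesDir.listFiles()) {
                byte[] content = java.nio.file.Files.readAllBytes(smallFile.toPath());
                // one record per small file: the name becomes the key, the content the value
                writer.append(new Text(smallFile.getName()), new BytesWritable(content));
            }
        }
    }
}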
The second solution is another type of file - Hadoop Archives (HAR). The archive brings another file layer on top of the current HDFS one. Thanks to it, files are grouped together, reducing the total number of files. However, the use of HAR can decrease performance because reading a file requires accessing 2 files (the index and the data).
Another interesting file type stored in HDFS is the columnar file format, such as Parquet or RCFile. These formats are especially useful when a data processor needs to read only a subset of the row's information, for example 2 attributes out of 10.
But globally, HDFS supports any file format. Just keep in mind that HDFS is widely used in a data processing context and not all file formats are adapted to that. A good HDFS file format should offer quick data access, be splittable (partial reads, parallelization) and, ideally, support compression and schema evolution.
Operations examples
The test cases below show some basic operations on files:
@Test
public void should_put_text_file_on_hdfs() throws IOException {
    Path fileToCreate = new Path("/hello_world.txt");

    FSDataOutputStream helloWorldFile = fileSystem.create(fileToCreate);
    helloWorldFile.writeUTF("Hello world");
    helloWorldFile.close();

    assertThat(fileSystem.exists(fileToCreate)).isTrue();
}

@Test
public void should_fail_on_creating_file_with_block_smaller_than_configured_minimum() throws IOException {
    Path fileToCreate = new Path("/hello_world_custom.txt");
    boolean overrideExistent = true;
    // in bytes
    int bufferSize = 4096;
    short replicationFactor = 1;
    // in bytes
    long blockSize = 10L;

    try {
        fileSystem.create(fileToCreate, overrideExistent, bufferSize, replicationFactor, blockSize);
        fail("File creation should fail when the block size is lower than specified minimum");
    } catch (RemoteException re) {
        assertThat(re.getMessage()).contains("Specified block size is less than configured minimum value");
    }
}

@Test
public void should_create_file_with_custom_block_size_and_replication_factor() throws IOException {
    Path fileToCreate = new Path("/hello_world_custom.txt");
    boolean overrideExistent = true;
    // in bytes
    int bufferSize = 4096;
    short replicationFactor = 2;
    // in bytes (twice minimum block size)
    long blockSize = 1048576L * 2L;

    FSDataOutputStream helloWorldFile = fileSystem.create(fileToCreate, overrideExistent, bufferSize,
            replicationFactor, blockSize);
    helloWorldFile.writeUTF("Hello world");
    helloWorldFile.close();

    assertThat(fileSystem.exists(fileToCreate)).isTrue();
    FileStatus status = fileSystem.getFileStatus(fileToCreate);
    assertThat(status.getReplication()).isEqualTo(replicationFactor);
    assertThat(status.getBlockSize()).isEqualTo(blockSize);
}

@Test
public void should_create_sequence_file() throws IOException {
    Path filePath = new Path("sequence_file_example");
    Writer writer = SequenceFile.createWriter(HdfsConfiguration.get(), Writer.file(filePath),
            Writer.keyClass(IntWritable.class), Writer.valueClass(Text.class));
    writer.append(new IntWritable(1), new Text("A"));
    writer.append(new IntWritable(2), new Text("B"));
    writer.append(new IntWritable(3), new Text("C"));
    writer.close();

    SequenceFile.Reader reader = new SequenceFile.Reader(HdfsConfiguration.get(), SequenceFile.Reader.file(filePath));
    Text value = new Text();
    IntWritable key = new IntWritable();
    int[] keys = new int[3];
    String[] values = new String[3];
    int i = 0;
    while (reader.next(key, value)) {
        keys[i] = key.get();
        values[i] = value.toString();
        i++;
    }
    reader.close();

    assertThat(keys).containsOnly(1, 2, 3);
    assertThat(values).containsOnly("A", "B", "C");
}
This post introduces the main concepts about files in HDFS by focusing on block storage. The first part explains how files are stored on disk. The second part briefly talks about small files and explains why HDFS is not well suited for them. The last part shows some basic operations on files through JUnit tests.