Kelvin Tan - Solr/Elasticsearch Consultant - Exploring Hadoop SequenceFile

Hadoop's SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.

So what exactly is SequenceFile? Its class javadoc tells us: "Support for flat files of binary key/value pairs." Not very helpful.

Let's dig through the code and find out more:

- Supports key/value pairs, where the key and value can be any classes that implement org.apache.hadoop.io.Writable
- Contains 3 inner classes: Reader, Writer and Sorter
- Sorts are performed as external merge-sorts
- Designed to be modified in batch, i.e. it does NOT support appends or incremental updates. Modifying or appending means creating a new SequenceFile and copying from the old file to the new one, adding/changing values along the way
- Compresses values using java.util.zip.Deflater after version 3
- From the code comments: "Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value"
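
The sync-marker idea in that last point is easier to see in code. What follows is a minimal, JDK-only sketch and not Hadoop's actual implementation or on-disk format: the class name SyncMarkerSketch, the SYNC_INTERVAL of 3 records, the -1 "length" used to announce a marker, the length-prefixed record layout and the UUID-based marker are all assumptions made for illustration. Plain strings stand in for Writable keys and values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class SyncMarkerSketch {
    static final int SYNC_SIZE = 16;     // marker length, per the quoted code comment
    static final int SYNC_INTERVAL = 3;  // hypothetical: one marker every 3 records
    static final int SYNC_ESCAPE = -1;   // sketch convention: a "length" of -1 announces a marker

    /** One way to get 16 practically unique bytes (an assumption, not Hadoop's method). */
    static byte[] newMarker() {
        UUID id = UUID.randomUUID();
        return ByteBuffer.allocate(SYNC_SIZE)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    private static void putInt(ByteArrayOutputStream out, int n) {
        out.write(ByteBuffer.allocate(4).putInt(n).array(), 0, 4);
    }

    private static void putString(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        putInt(out, b.length);
        out.write(b, 0, b.length);
    }

    /** Writes length-prefixed key/value pairs, with a marker before every SYNC_INTERVAL-th record. */
    static byte[] write(List<String[]> records, byte[] marker) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int count = 0;
        for (String[] kv : records) {
            if (count > 0 && count % SYNC_INTERVAL == 0) {
                putInt(out, SYNC_ESCAPE);
                out.write(marker, 0, marker.length);
            }
            putString(out, kv[0]);
            putString(out, kv[1]);
            count++;
        }
        return out.toByteArray();
    }

    /** Scans forward from any offset; returns the offset just past the next marker, or -1 if none. */
    static int sync(byte[] file, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= file.length; i++) {
            int j = 0;
            while (j < marker.length && file[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i + marker.length;
            }
        }
        return -1;
    }

    /** Reads records from a known record boundary to the end, skipping markers on the way. */
    static List<String[]> readFrom(byte[] file, int start) {
        List<String[]> records = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(file);
        in.position(start);
        while (in.hasRemaining()) {
            int length = in.getInt();
            if (length == SYNC_ESCAPE) {
                in.position(in.position() + SYNC_SIZE); // skip the marker bytes
                continue;
            }
            String key = getString(in, length);
            String value = getString(in, in.getInt());
            records.add(new String[] {key, value});
        }
        return records;
    }

    private static String getString(ByteBuffer in, int length) {
        byte[] b = new byte[length];
        in.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] marker = newMarker();
        List<String[]> records = new ArrayList<>();
        for (String k : new String[] {"a", "b", "c", "d", "e", "f", "g"}) {
            records.add(new String[] {k, k.toUpperCase()});
        }
        byte[] file = write(records, marker);
        // Jump to an arbitrary offset, then resynchronize on the next marker.
        int boundary = sync(file, marker, file.length / 2);
        for (String[] kv : readFrom(file, boundary)) {
            System.out.println(kv[0] + " -> " + kv[1]);
        }
    }
}
```

The point of the marker is that a reader dropped at an arbitrary byte offset cannot tell where a record starts, but it can scan for the 16-byte value and begin reading right after it. That is what makes it possible to split one large file into chunks that are processed independently.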

You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.


Supermind Search Consulting Blog Solr - Elasticsearch - Big Data
