Exploring Hadoop SequenceFile
Posted by Kelvin on 03 Jan 2007 at 12:14 am | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Hadoop's SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.
So what exactly is SequenceFile? Its class javadoc tells us: "Support for flat files of binary key/value pairs." Not very helpful.
Let's dig through the code and find out more:
- supports key/value pairs, where the key and value can be any classes that implement org.apache.hadoop.io.Writable
- contains 3 inner classes: Reader, Writer and Sorter
- sorts are performed as external merge sorts
- designed to be modified in batch, i.e. it does NOT support appends or incremental updates. Modifying or appending means creating a new SequenceFile and copying from the old file to the new one, adding/changing values along the way
- compresses values using java.util.zip.Deflater (as of version 3 of the file format)
- from code comments: Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value
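To make the sync marker idea from the last point more concrete, here is a minimal, self-contained sketch in plain Java. It is not Hadoop's code and does not reproduce the real on-disk format: the class name SyncMarkerSketch, the record layout (a 4-byte length followed by UTF-8 bytes), the interval of 3 records and the -1 escape value written before each marker are all assumptions made for illustration. Only the 16-byte marker size comes from the code comment quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of sync markers in a flat record file; not Hadoop's implementation.
class SyncMarkerSketch {
    static final int SYNC_HASH_SIZE = 16; // marker length, taken from the quoted code comment
    static final int SYNC_INTERVAL = 3;   // assumption: one marker before every 3rd record
    static final int SYNC_ESCAPE = -1;    // assumption: an impossible record length flags a marker

    static void putInt(ByteArrayOutputStream out, int value) {
        out.write(ByteBuffer.allocate(4).putInt(value).array(), 0, 4);
    }

    // Serializes each record as [int length][bytes]; every SYNC_INTERVAL records,
    // writes [SYNC_ESCAPE][marker] in front of the next record.
    static byte[] write(List<String> records, byte[] marker) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.size(); i++) {
            if (i > 0 && i % SYNC_INTERVAL == 0) {
                putInt(out, SYNC_ESCAPE);
                out.write(marker, 0, marker.length);
            }
            byte[] payload = records.get(i).getBytes(StandardCharsets.UTF_8);
            putInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    static boolean matchesAt(byte[] file, int pos, byte[] marker) {
        for (int i = 0; i < marker.length; i++) {
            if (file[pos + i] != marker[i]) return false;
        }
        return true;
    }

    // Scans forward from an arbitrary offset and returns the offset just past
    // the next marker (a record boundary), or -1 if no marker follows.
    static int sync(byte[] file, byte[] marker, int from) {
        for (int pos = Math.max(from, 0); pos + marker.length <= file.length; pos++) {
            if (matchesAt(file, pos, marker)) return pos + marker.length;
        }
        return -1;
    }

    // Reads all records from a known record boundary to the end, skipping markers.
    static List<String> readFrom(byte[] file, int boundary) {
        List<String> result = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(file);
        in.position(boundary);
        while (in.remaining() > 0) {
            int length = in.getInt();
            if (length == SYNC_ESCAPE) {
                in.position(in.position() + SYNC_HASH_SIZE); // step over the marker
                continue;
            }
            byte[] payload = new byte[length];
            in.get(payload);
            result.add(new String(payload, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] marker = new byte[SYNC_HASH_SIZE];
        new SecureRandom().nextBytes(marker); // stands in for the "globally unique" value
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 8; i++) records.add("record-" + i);
        byte[] file = write(records, marker);
        // Jump into the middle of the file, then resynchronize on the next marker.
        int boundary = sync(file, marker, file.length / 2);
        System.out.println(readFrom(file, boundary));
    }
}
```

The point of the sketch is that a reader dropped at an arbitrary byte offset cannot tell where a record starts, but it can scan forward for the marker and resume reading from there. Presumably this is what lets a large file be processed in independent chunks.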
You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.