Practical introduction to Nutch MapReduce
Posted by Kelvin on 28 Sep 2005 at 04:04 pm | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, work
Some terminology first:
- Mapper
- Performs the map() function. The name will make sense when you look at it as "mapping" a function/operation to elements in a list.
- Reducer
- Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.
- OutputFormat/InputFormat
- Classes which tell Nutch how to process input files and what output format the MapReduce job is to produce.
Currently implemented classes: MapFile (output only), Text and SequenceFile. In practice this means that, out of the box, Nutch supports SequenceFiles and text files as input formats for MapReduce jobs.
- Partitioner
- Splits up the map output into n different partitions, where n is the number of reduce tasks.
- Combiner
- The Javadoc says: "Implements partial value reduction during mapping." There is no special interface for a Combiner. Rather, it is a Reducer that is configured to run during the mapping phase.
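To make the terminology a bit more concrete, here is a minimal in-memory word count in plain Java. It does not use any Nutch classes: the names TerminologySketch, map, reduce, partition, groupAndReduce and run are made up for illustration, and the hash-based partitioning is an assumption about a typical scheme, not a description of what Nutch does by default. Note how the combiner is simply the reduce logic applied early, to one mapper's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from Nutch.
final class TerminologySketch {

    // Mapper: called once per input record; emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reducer: merges all values that share a key into a single value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Partitioner: picks one of n partitions for a key (hash scheme assumed).
    static int partition(String key, int n) {
        return (key.hashCode() & Integer.MAX_VALUE) % n;
    }

    // Groups pairs by key and reduces each group. Used both as the combiner
    // (on one record's map output) and as the final reduce step.
    static Map<String, Integer> groupAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }

    static Map<String, Integer> run(List<String> lines, int n) {
        // One bucket of intermediate pairs per partition.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String line : lines) {
            // Combiner: pre-reduce this record's map output before partitioning it.
            Map<String, Integer> combined = groupAndReduce(map(line));
            combined.forEach((k, v) -> partitions.get(partition(k, n)).add(Map.entry(k, v)));
        }
        // Reduce each partition independently, then collect the results.
        // A key always lands in the same partition, so no result is overwritten.
        Map<String, Integer> result = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            result.putAll(groupAndReduce(part));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "to map or to reduce"), 3));
    }
}
```

The counts come out the same whether or not the combiner step runs. It only reduces the number of intermediate pairs that have to be moved around, which is why an ordinary Reducer can double as a Combiner when its operation (here, addition) can safely be applied in stages.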
To complete a MapReduce job successfully, Nutch requires at least
- a directory containing input files (in the format determined by JobConf.setInputFormat, which defaults to newline-terminated text files)
- a mapper or reducer (although technically the job will still complete if neither is provided, as the HelloWorld example shows)
- an output format
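The three requirements above can be sketched as a toy job object. This is plain Java and not Nutch's JobConf: the class JobSketch and its methods are hypothetical, the input directory is stood in for by a string of newline-terminated text, and the idea that a missing mapper or reducer behaves like a pass-through is an assumption made for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a job description; not Nutch's JobConf.
final class JobSketch {
    // Newline-terminated text, standing in for the input directory.
    private final String input;
    // Mapper and reducer both default to pass-through (assumed behaviour).
    private UnaryOperator<String> mapper = UnaryOperator.identity();
    private UnaryOperator<List<String>> reducer = UnaryOperator.identity();
    // Output format: decides how the final records are written out.
    private Function<List<String>, String> outputFormat =
            records -> String.join("\n", records);

    JobSketch(String input) {
        this.input = input;
    }

    JobSketch mapper(UnaryOperator<String> m) {
        this.mapper = m;
        return this;
    }

    JobSketch reducer(UnaryOperator<List<String>> r) {
        this.reducer = r;
        return this;
    }

    JobSketch outputFormat(Function<List<String>, String> f) {
        this.outputFormat = f;
        return this;
    }

    String run() {
        // Map phase: one call per non-empty input line.
        List<String> mapped = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (!line.isEmpty()) {
                mapped.add(mapper.apply(line));
            }
        }
        // Reduce phase, then hand the records to the output format.
        return outputFormat.apply(reducer.apply(mapped));
    }

    public static void main(String[] args) {
        // No mapper or reducer configured: the job still completes.
        System.out.println(new JobSketch("hello\nworld\n").run());
        // Same job with a mapper plugged in.
        System.out.println(new JobSketch("hello\nworld\n").mapper(String::toUpperCase).run());
    }
}
```

With nothing configured, the input lines simply come back out unchanged, which is the sense in which a job can "complete" without a mapper or reducer.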
To map or to reduce?
Rule of thumb: when performing an operation on a single input key/value pair, use a Mapper; when the operation requires multiple inputs, or merges files, use a Reducer.
Technically, the implementation of MapReduce in Nutch allows the map function to be performed by the Reducer alone, although this would seem to be a rather inappropriate use of a Reducer.
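A small sketch of the rule of thumb, again in plain Java with made-up names (MapOrReduceSketch, normalize, merge, mergeFiles) and an invented "host to score" record layout. Lowercasing a host name needs nothing but the record itself, so it is mapper-style work. Picking the best score for a host that appears in several files needs every value for that key, so it is reducer-style work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the record layout (host -> score) is invented.
final class MapOrReduceSketch {

    // Mapper-style: the result depends on this one record only.
    static Map.Entry<String, Integer> normalize(String host, int score) {
        return Map.entry(host.toLowerCase(Locale.ROOT), score);
    }

    // Reducer-style: needs all values collected for one key.
    static int merge(List<Integer> scores) {
        int best = Integer.MIN_VALUE;
        for (int s : scores) {
            best = Math.max(best, s);
        }
        return best;
    }

    // Merges several "files" (maps) into one, keeping the highest score per host.
    static Map<String, Integer> mergeFiles(List<Map<String, Integer>> files) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> file : files) {
            for (Map.Entry<String, Integer> rec : file.entrySet()) {
                // Per-record step: no knowledge of other records required.
                Map.Entry<String, Integer> n = normalize(rec.getKey(), rec.getValue());
                grouped.computeIfAbsent(n.getKey(), k -> new ArrayList<>()).add(n.getValue());
            }
        }
        Map<String, Integer> merged = new TreeMap<>();
        // Per-key step: sees every value that ended up under the same host.
        grouped.forEach((host, scores) -> merged.put(host, merge(scores)));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeFiles(List.of(
                Map.of("Example.ORG", 1, "nutch.test", 5),
                Map.of("example.org", 3))));
    }
}
```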