Practical introduction to Nutch MapReduce
Posted by Kelvin on 28 Sep 2005 at 04:04 pm | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling
Some terminology first:
- Mapper
- Performs the map() function. The name makes sense when you think of it as "mapping" a function/operation onto each element of a list (see the sketch after this terminology list).
- Reducer
- Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.
- OutputFormat/InputFormat
- Classes which tell Nutch how to process input files and what output format the MapReduce job is to produce.
Currently implemented classes: MapFile (output only), Text, and SequenceFile. In other words, out of the box Nutch supports SequenceFiles and text files as input formats for MapReduce jobs.
- Partitioner
- Splits up the map output into n partitions, where n is the number of reduce tasks.
- Combiner
- The Javadoc says it "implements partial value reduction during mapping". There is no special interface for a Combiner; rather, it is a Reducer which is configured to be used during the mapping phase.
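To make the Mapper and Reducer roles concrete, here is a minimal word-count-style sketch. The interface signatures (map() taking an OutputCollector and Reporter, reduce() receiving a raw Iterator of values) and the org.apache.nutch.mapred / org.apache.nutch.io package names follow the 2005-era Nutch code as I understand it; the class names are my own, so treat this as a sketch to check against your checkout rather than a verbatim recipe.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.nutch.io.IntWritable;
import org.apache.nutch.io.UTF8;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.Mapper;
import org.apache.nutch.mapred.OutputCollector;
import org.apache.nutch.mapred.Reducer;
import org.apache.nutch.mapred.Reporter;

// WordCountMapper.java -- emits (word, 1) for every word in a line of input.
public class WordCountMapper implements Mapper {
  public void configure(JobConf job) {} // no per-job setup needed
  public void close() {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    // With the default text input format, key is the byte offset
    // into the file and value is one line of text.
    String[] words = value.toString().split("\\s+");
    for (int i = 0; i < words.length; i++) {
      output.collect(new UTF8(words[i]), new IntWritable(1));
    }
  }
}

// WordCountReducer.java (its own source file) -- merges all the 1s emitted
// for a word into a single count: the "multiple input values with the
// same key" case described above.
public class WordCountReducer implements Reducer {
  public void configure(JobConf job) {}
  public void close() {}

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
```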
To complete a MapReduce job successfully, Nutch requires at least
- a directory containing input files (in the format determined by JobConf.setInputFormat, which defaults to newline-terminated text files)
- a mapper or a reducer (though technically the job will still complete even if neither is provided, as the HelloWorld example shows)
- an output format
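Putting the three requirements together, a job driver looks something like the sketch below. The JobConf setter names and the File-based input/output directories are my reconstruction of the 2005-era API (these details later evolved in Hadoop), so verify them against the JobConf in your checkout. Note the combiner line, which is just the Reducer-as-Combiner configuration described in the terminology list.

```java
import java.io.File;

import org.apache.nutch.io.IntWritable;
import org.apache.nutch.io.UTF8;
import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.JobConf;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf();

    // Requirement 1: a directory of input files. The input format is
    // left at its default of newline-terminated text files.
    job.setInputDir(new File("wordcount/input"));
    job.setOutputDir(new File("wordcount/output"));

    // Requirement 2: a mapper and/or a reducer.
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // A Combiner is just a Reducer run during the map phase.
    job.setCombinerClass(WordCountReducer.class);

    // Requirement 3: an output format and its key/value types.
    job.setOutputKeyClass(UTF8.class);
    job.setOutputValueClass(IntWritable.class);

    // Submit the job and block until it finishes.
    JobClient.runJob(job);
  }
}
```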
To map or to reduce?
Rule of thumb: when performing an operation on a single input key/value pair, use a Mapper; when the operation requires multiple inputs, or merges files, use a Reducer.
Technically, Nutch's MapReduce implementation allows the map work to be performed entirely in the Reducer, although that would seem a rather inappropriate use of a Reducer.
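To illustrate the rule of thumb: normalizing URLs is a per-record operation, so it fits a Mapper; de-duplicating URLs needs every record with the same key at once, so it fits a Reducer. The method bodies below are a sketch against the same assumed interfaces as the earlier example.

```java
// Per-record work fits a Mapper: each URL is normalized without
// looking at any other record.
public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter) throws IOException {
  output.collect(new UTF8(key.toString().toLowerCase()), value);
}

// Multi-record work fits a Reducer: all records sharing a URL arrive
// together, so duplicates collapse by emitting only the first value.
public void reduce(WritableComparable key, Iterator values,
                   OutputCollector output, Reporter reporter) throws IOException {
  output.collect(key, (Writable) values.next());
}
```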