17. MapReduce

MapReduce

Big Data -> it cannot be processed on a single machine -> parallelize computation across multiple machines

👉 MapReduce

Map-Reduce Framework

High-level programming paradigm

  • want to process lots of data, parallelized across hundreds/thousands of nodes
    • automatic parallelization & distribution
  • want to handle failing nodes
    • fault-tolerant, status and monitoring tools
  • want to make it easy
    • clean abstraction for programmers
  • processes large data by
    • applying a function to each logical record in the input 👉 Map
    • categorizing and combining the intermediate results into summary values 👉 Reduce

Why MapReduce is Used?

  • With MapReduce handling the distributed computation, the focus shifts to writing the Map and Reduce functions that process and analyze the data effectively
  • Developers can focus on the problem and let the library deal with the messy details

 

Map-Reduce Steps

  1. Read data
  2. Map: extract some information of interest in (key, value) form
  3. Shuffle and sort: send the same keys to the same reduce process
  4. Reduce: operate on the values of the same key
  5. Output the results (key, final-results)
  • Input is typically (key, value) pairs; a worked trace is shown below
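
To make these steps concrete, here is a word-count trace on a tiny made-up input of two text lines (keys in the input are byte offsets):

Input:            (0, "the cat sat")   (12, "the dog sat")
Map output:       ("the",1) ("cat",1) ("sat",1) ("the",1) ("dog",1) ("sat",1)
Shuffle & sort:   ("cat",[1]) ("dog",[1]) ("sat",[1,1]) ("the",[1,1])
Reduce output:    ("cat",1) ("dog",1) ("sat",2) ("the",2)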

 

Map Function

  • Input: (key, value)
  • Output: (output-key, intermediate-value) list
  • applies a function to each key-value input pair

Reduce Function

  • Input: (output-key, list-of-intermediate-values-for-this-key)
  • Output: (output-key, final-value) list
  • Combines those intermediate values into one or more final values for that same output key
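
Schematically, using the WordCount example that appears later in this post:

map:    (k1, v1)          -> list of (k2, v2)      e.g. (byte offset, line of text) -> [(word, 1), ...]
reduce: (k2, list of v2)  -> list of (k2, v3)      e.g. (word, [1, 1, ...])         -> (word, total count)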

 

Parallelism of MapReduce

  • Map functions run in parallel, creating different intermediate values from different input data sets
  • Reduce functions also run in parallel, each working on a different output key
  • All values are processed independently
  • However, Reduce cannot start until the Map phase has completely finished
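
Each intermediate key is routed to exactly one reduce task, which is what lets different keys be processed independently in parallel. A minimal sketch of the usual hash-partitioning rule (this mirrors the behavior of Hadoop's default HashPartitioner):

// Sketch: assigning an intermediate key to one of numReduceTasks parallel reduce tasks
int getPartition(Text key, int numReduceTasks) {
  // the same key always lands in the same partition, so one reduce task sees all of its values
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}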

 

More about MapReduce

Handling Fault Tolerance in MapReduce

  • Worker failure
    • The master sends heartbeat to each worker node
    • If a worker node fails, the master reschedules the tasks
  • Master failure
    • The whole MapReduce job gets restarted through a different master

MapReduce

  • The master program divides up tasks based on the location of data to exploit locality
    • it tries to schedule map tasks on the same machine as the physical file data, or at least on the same rack
  • The original Google implementation is in C++, using a MapReduce library
    • bindings for Python and Java exist via interfaces
  • MapReduce is good for off-line batch jobs on large data sets
  • MapReduce is bad for jobs on small datasets and jobs that require low-latency response

Apache Hadoop examples

Apache Hadoop

  • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model
  • It includes these basic modules:
    • Hadoop Common: the common library and utilities
    • Hadoop Distributed File System (HDFS)
    • Hadoop YARN: a framework for job scheduling and resource management
    • Hadoop MapReduce

Use Cases of Apache Hadoop

  • Log data analysis: the most common use case, and a natural fit for HDFS; a hypothetical mapper sketch is shown below
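
As a rough illustration of such a job, the sketch below counts HTTP status codes in web-server access logs using the same old mapred API as the WordCount example further down. The class name and the assumption that the status code is the 9th whitespace-separated field (as in the common Apache access-log layout) are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LogStatusMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // value: one raw log line; split it into whitespace-separated fields
    String[] fields = value.toString().split(" ");
    if (fields.length > 8) {
      // assumed layout: the HTTP status code is the 9th field
      output.collect(new Text(fields[8]), new IntWritable(1));
    }
  }
}

A summing reducer like the WordCountReducer below would then total the count per status code.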

 

WordCount Example

class WordCountMapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // key: byte offset of the line in the file, value: one line of text
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // emit (word, 1) for every token in the line
      Text outputKey = new Text(tokenizer.nextToken());
      output.collect(outputKey, new IntWritable(1));
    }
  }
}
  • One map task is spawned for each InputSplit generated by the InputFormat for the job
    • InputFormat describes the input specification for a MapReduce job
  • The map() method is called once for each key/value pair in the InputSplit
  • OutputCollector collects data output by either the Mapper or the Reducer

 

class WordCountReducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // sum all the 1s emitted by the mapper for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
  • The reduce() method is called once for each key
  • Reduce has 3 primary phases: Shuffle -> Sort -> Reduce (the reduce() method is called in the final phase)

 

class WordCount

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static void main(String[] args) throws IOException {
    // 1. configure the Hadoop Mapper & Reducer
    JobConf conf = new JobConf();
    conf.setJobName("wordcount");
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // 2. final output key type & value type
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // 3. input/output format
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // 4. set the paths of the files to read and write
    // input path : args[0]
    // output path : args[1]
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // 5. run the job
    JobClient.runJob(conf);
  }
}
  • JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution
  • InputFormat, OutputFormat: describe the input/output specification
  • TextOutputFormat: writes plain text files
  • TextInputFormat: plain text files, files are broken into lines, keys = position in the file, values = line of text
  • setInputPaths(): sets the array of Paths as the list of inputs for the MapReduce job
  • setOutputPath(): sets the Path of the output directory for the MapReduce job
  • JobClient: the primary interface for the user job to interact with the cluster
  • runJob: submits the job and returns only after the job has completed

 

Sparse Dot Product Example

  • Sparse vectors are specified in a sparse format
    • (offset, value) pairs are recorded only when value is non-zero
  • perform a dot product on two sparse vectors of the same size
  • solving this problem with MapReduce is not efficient; it is just an example (a hypothetical sketch follows)
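
One way to express this, sketched with the same old mapred API as the WordCount example; the class names and the "offset value" line format of the input files are assumptions, and each public class would live in its own source file:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical mapper: every input line is "offset value" taken from one of the two sparse vectors
public class SparseDotMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    String[] parts = value.toString().trim().split("\\s+");
    // emit (offset, value) so entries with the same offset meet in one reduce call
    output.collect(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
  }
}

// Hypothetical reducer: an offset contributes to the dot product only if both vectors
// are non-zero there, i.e. exactly two values arrive for that key
public class SparseDotReducer extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<DoubleWritable> values,
   OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    int count = 0;
    double product = 1.0;
    while (values.hasNext()) {
      product *= values.next().get();
      count++;
    }
    if (count == 2) {
      output.collect(key, new DoubleWritable(product));   // per-offset partial product
    }
  }
}

The per-offset products still have to be summed, for example in a second pass or by running the job with a single reducer that accumulates them, which is part of why this problem is an awkward fit for MapReduce.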