17. MapReduce

MapReduce

Big Data -> it cannot be processed on a single machine -> parallelize computation across multiple machines

👉 MapReduce

Map-Reduce Framework

High-level programming paradigm

  • want to process lots of data, parallelized across hundreds/thousands of nodes
    • automatic parallelization & distribution
  • want to handle failing nodes
    • fault-tolerant, status and monitoring tools
  • want to make it easy
    • clean abstraction for programmers
  • processes large data by
    • applying a function to each logical record in the input 👉 Map
    • categorizing and combining the intermediate results into summary values 👉 Reduce

Why MapReduce is Used?

  • With MapReduce handling the distributed computation, the focus shifts to writing the Map and Reduce functions that process and analyze the data effectively
  • Developers can focus on the problem and let the library deal with the messy details

 

Map-Reduce Steps

  1. Read data
  2. Map: extract some information of interest in (key, value) form
  3. Shuffle and sort: send the same keys to the same reduce process
  4. Reduce: operate on the values of the same key
  5. Output the results (key, final-results)
  • Input is typically (key, value) pairs; a worked trace is shown below
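
To make these steps concrete, here is a word-count trace on a tiny made-up input of two text lines (keys in the input are byte offsets):

Input:            (0, "the cat sat")   (12, "the dog sat")
Map output:       ("the",1) ("cat",1) ("sat",1) ("the",1) ("dog",1) ("sat",1)
Shuffle & sort:   ("cat",[1]) ("dog",[1]) ("sat",[1,1]) ("the",[1,1])
Reduce output:    ("cat",1) ("dog",1) ("sat",2) ("the",2)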

 

Map Function

  • Input: (key, value)
  • Output: (output-key, intermediate-value) list
  • applies a function to each key-value input pair

Reduce Function

  • Input: (output-key, list-of-intermediate-values-for-this-key)
  • Output: (output-key, final-value) list
  • Combines those intermediate values into one or more final values for that same output key
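
Schematically, using the WordCount example that appears later in this post:

map:    (k1, v1)          -> list of (k2, v2)      e.g. (byte offset, line of text) -> [(word, 1), ...]
reduce: (k2, list of v2)  -> list of (k2, v3)      e.g. (word, [1, 1, ...])         -> (word, total count)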

 

Parallelism of MapReduce

  • Map functions run in parallel, creating different intermediate values from different input data sets
  • Reduce functions also run in parallel, each working on a different output key
  • All values are processed independently
  • However, Reduce cannot start until the Map phase has completely finished
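
Each intermediate key is routed to exactly one reduce task, which is what lets different keys be processed independently in parallel. A minimal sketch of the usual hash-partitioning rule (this mirrors the behavior of Hadoop's default HashPartitioner):

// Sketch: assigning an intermediate key to one of numReduceTasks parallel reduce tasks
int getPartition(Text key, int numReduceTasks) {
  // the same key always lands in the same partition, so one reduce task sees all of its values
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}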

 

More about MapReduce

Handling Fault Tolerance in MapReduce

  • Worker failure
    • The master sends heartbeat to each worker node
    • If a worker node fails, the master reschedules the tasks
  • Master failure
    • The whole MapReduce job gets restarted through a different master

MapReduce

  • The master program divides up tasks based on the location of data to exploit locality
    • it tries to schedule map tasks on the same machine as the physical file data, or at least on the same rack
  • The original Google implementation is in C++, using a MapReduce library
    • bindings for Python and Java exist via interfaces
  • MapReduce is good for off-line batch jobs on large data sets
  • MapReduce is bad for jobs on small datasets and jobs that require low-latency response

Apache Hadoop examples

Apache Hadoop

  • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
  • The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model
  • It includes these basic modules:
    • Hadoop Common: the common library and utilities
    • Hadoop Distributed File System (HDFS)
    • Hadoop YARN: a framework for job scheduling and resource management
    • Hadoop MapReduce

Use Cases of Apache Hadoop

  • Log data analysis: the most common use case, and a natural fit for HDFS; a hypothetical mapper sketch is shown below
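
As a rough illustration of such a job, the sketch below counts HTTP status codes in web-server access logs using the same old mapred API as the WordCount example further down. The class name and the assumption that the status code is the 9th whitespace-separated field (as in the common Apache access-log layout) are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LogStatusMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // value: one raw log line; split it into whitespace-separated fields
    String[] fields = value.toString().split(" ");
    if (fields.length > 8) {
      // assumed layout: the HTTP status code is the 9th field
      output.collect(new Text(fields[8]), new IntWritable(1));
    }
  }
}

A summing reducer like the WordCountReducer below would then total the count per status code.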

 

WordCount Example

class WordCountMapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // key: byte offset of the line in the file, value: one line of text
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // emit (word, 1) for every token in the line
      Text outputKey = new Text(tokenizer.nextToken());
      output.collect(outputKey, new IntWritable(1));
    }
  }
}
  • One map task is spawned for each InputSplit generated by the InputFormat for the job
    • InputFormat describes the input specification for a MapReduce job
  • The map() method is called once for each key/value pair in the InputSplit
  • OutputCollector collects data output by either the Mapper or the Reducer

 

class WordCountReducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
   OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // sum all the 1s emitted by the mapper for this word
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
  • The reduce() method is called once for each key
  • Reduce has 3 primary phases: Shuffle -> Sort -> Reduce (the reduce() method is called in the final phase)

 

class WordCount

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static void main(String[] args) throws IOException {
    // 1. configure the Hadoop Mapper & Reducer
    JobConf conf = new JobConf();
    conf.setJobName("wordcount");
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // 2. final output key type & value type
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // 3. input/output format
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // 4. set the paths of the files to read and write
    // input path : args[0]
    // output path : args[1]
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // 5. run the job
    JobClient.runJob(conf);
  }
}
  • JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution
  • InputFormat, OutputFormat: describe the input/output specification
  • TextOutputFormat: writes plain text files
  • TextInputFormat: plain text files, files are broken into lines, keys = position in the file, values = line of text
  • setInputPaths(): sets the array of Paths as the list of inputs for the MapReduce job
  • setOutputPath(): sets the Path of the output directory for the MapReduce job
  • JobClient: the primary interface for the user job to interact with the cluster
  • runJob: submits the job and returns only after the job has completed

 

Sparse Dot Product Example

  • Sparse vectors are specified in a sparse format
    • (offset, value) pairs are recorded only when value is non-zero
  • perform a dot product on two sparse vectors of the same size
  • solving this problem with MapReduce is not efficient; it is just an example (a hypothetical sketch follows)
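
One way to express this, sketched with the same old mapred API as the WordCount example; the class names and the "offset value" line format of the input files are assumptions, and each public class would live in its own source file:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical mapper: every input line is "offset value" taken from one of the two sparse vectors
public class SparseDotMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
  public void map(LongWritable key, Text value,
   OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    String[] parts = value.toString().trim().split("\\s+");
    // emit (offset, value) so entries with the same offset meet in one reduce call
    output.collect(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
  }
}

// Hypothetical reducer: an offset contributes to the dot product only if both vectors
// are non-zero there, i.e. exactly two values arrive for that key
public class SparseDotReducer extends MapReduceBase implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<DoubleWritable> values,
   OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    int count = 0;
    double product = 1.0;
    while (values.hasNext()) {
      product *= values.next().get();
      count++;
    }
    if (count == 2) {
      output.collect(key, new DoubleWritable(product));   // per-offset partial product
    }
  }
}

The per-offset products still have to be summed, for example in a second pass or by running the job with a single reducer that accumulates them, which is part of why this problem is an awkward fit for MapReduce.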