MapReduce
Big Data -> cannot be processed on a single machine -> parallelize computation across multiple machines
👉 MapReduce
Map-Reduce Framework
High-level programming paradigm
- want to process lots of data, parallelize across hundreds/thousands of nodes
- automatic parallelization & distribution
- want to handle failing nodes
- fault-tolerant, status and monitoring tools
- want to make it easy
- clean abstraction for programmers
- processes large data by
- applying a function to each logical record in the input 👉 Map
- categorizing and combining the intermediate results into summary values 👉 Reduce
Why Is MapReduce Used?
- With MapReduce handling the distributed computation, the focus shifts to writing the Map and Reduce functions that process and analyze the data
- Developers can focus on the problem and let the library deal with the messy details
Map-Reduce Steps
- Read data
- Map: extract some information of interest in (key, value) form
- Shuffle and sort: send pairs with the same key to the same reduce process
- Reduce: operate on the values of the same key
- Output the results (key, final-results)
- Input is typically (key, value) pairs
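For example, counting words in the two input lines "the cat sat" and "the dog sat" flows through these steps as follows:

Map -> (the, 1), (cat, 1), (sat, 1), (the, 1), (dog, 1), (sat, 1)
Shuffle & sort -> (cat, [1]), (dog, [1]), (sat, [1, 1]), (the, [1, 1])
Reduce -> (cat, 1), (dog, 1), (sat, 2), (the, 2)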
Map Function
- Input: (key, value)
- Output: (output-key, intermediate-value) list
- applies a function to each input key-value pair
Reduce Function
- Input: (output-key, list-of-intermediate-values-for-this-key)
- Output: (output-key, final-value) list
- Combines those intermediate values into one or more final values for that same output key
Parallelism of MapReduce
- Map functions run in parallel, creating different intermediate values from different input data sets
- Reduce functions also run in parallel, each working on a different output key
- All values are processed independently
- But Reduce can't start until the Map phase has completely finished
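The number of parallel reduce tasks is a per-job setting in Hadoop's classic API; a minimal sketch, assuming this line is added to the WordCount driver shown later (the value 4 is arbitrary):

// each of the 4 reduce tasks processes a disjoint subset of the output keys
conf.setNumReduceTasks(4);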
More about MapReduce
Handling Fault Tolerance in MapReduce
- Worker failure
- The master sends periodic heartbeats to each worker node
- If a worker node fails, the master reschedules its tasks
- Master failure
- The whole MapReduce job gets restarted through a different master
MapReduce
- Master program divides up tasks based on the location of data to exploit locality
- tries to schedule map tasks on the same machine as the physical file data, or at least in the same rack
- Google's original implementation is in C++, using a MapReduce library
- Bindings for Python and Java exist via interfaces
- MapReduce is good for off-line batch jobs on large data sets
- MapReduce is bad for jobs on small datasets and jobs that require low-latency response
Apache Hadoop examples
Apache Hadoop
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
- The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model
- includes basic modules:
- Hadoop Common: the common library and utilities
- Hadoop Distributed File System (HDFS)
- Hadoop YARN: a framework for job scheduling and resource management
- Hadoop MapReduce
Use Cases of Apache Hadoop
- Log data analysis: the most common use case; a perfect fit for the HDFS scenario (logs are written once and scanned many times)
WordCount Example
class WordCountMapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // key = byte offset in the file, value = one line of text
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            Text outputKey = new Text(tokenizer.nextToken());
            output.collect(outputKey, new IntWritable(1)); // emit (word, 1)
        }
    }
}
- One map task is spawned for each InputSplit generated by the InputFormat for the job
- InputFormat describes the input specification for a MapReduce job
- The map() method is called once for each key/value pair in the InputSplit
- OutputCollector collects data output by either the Mapper or the Reducer
class WordCountReducer
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up all the 1s for this word
        }
        output.collect(key, new IntWritable(sum)); // emit (word, total count)
    }
}
- The reduce() method is called once for each key
- Reduce has 3 primary phases: Shuffle -> Sort -> Reduce (which calls the reduce() method)
class WordCount
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws IOException {
        // 1. configure the Mapper & Reducer of the Hadoop job
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        // 2. final output key type & value type
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // 3. input/output format
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // 4. set the paths of the files to read and write
        //    input path : args[0], output path : args[1]
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // 5. run the job
        JobClient.runJob(conf);
    }
}
- JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution
- OutputFormat, InputFormat: describe the output/input specification
- TextOutputFormat: writes plain text files
- TextInputFormat: plain text files, files are broken into lines, keys = position in the file, values = line of text
- setInputPaths(): sets the array of Paths as the list of inputs for the MapReduce job
- setOutputPath(): sets the Path of the output directory for the MapReduce job
- JobClient: primary library for the user-job to interact with the cluster
- runJob: submits the job and returns only after the job has completed
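Once compiled, the job is typically packaged into a jar and launched with the hadoop command; a minimal usage sketch, assuming the classes above are packaged as wordcount.jar (the jar name and paths are placeholders):

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output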
Sparse Dot Product Example
- Sparse vectors are specified in a sparse format
- (offset, value) pairs are recorded only when value is non-zero
- perform a dot product on two sparse vectors of the same size
- Solving this problem with MapReduce is not efficient -> it is just an example
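A minimal sketch of what this could look like with the same classic Hadoop API as the WordCount example (this code is an illustration, not from the original lecture): it assumes each input line holds one "offset value" entry and that an offset appears at most once per vector, so a reduce key that receives two values means both vectors are non-zero at that offset. The emitted partial products would still need a final summation step, e.g., a second job with a single reduce key.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SparseDotProduct {
    public static class DotMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<IntWritable, DoubleWritable> output, Reporter reporter)
                throws IOException {
            // each line is assumed to hold one sparse entry: "offset value"
            String[] parts = value.toString().trim().split("\\s+");
            // emit (offset, value) so entries from both vectors with the
            // same offset meet in the same reduce call
            output.collect(new IntWritable(Integer.parseInt(parts[0])),
                           new DoubleWritable(Double.parseDouble(parts[1])));
        }
    }

    public static class DotReducer extends MapReduceBase
            implements Reducer<IntWritable, DoubleWritable, Text, DoubleWritable> {
        public void reduce(IntWritable key, Iterator<DoubleWritable> values,
                OutputCollector<Text, DoubleWritable> output, Reporter reporter)
                throws IOException {
            double product = 1.0;
            int count = 0;
            while (values.hasNext()) {
                product *= values.next().get();
                count++;
            }
            // an offset contributes only if it is non-zero in both vectors
            if (count == 2) {
                output.collect(new Text("partial"), new DoubleWritable(product));
            }
        }
    }
}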