Hadoop Learning Notes IV: Introduction to MapReduce
What is MapReduce?
MapReduce is a distributed computing model that solves the problem of computing over massive amounts of data. It derives from Google's MapReduce paper, and Hadoop MapReduce is a clone of Google's MapReduce. The model abstracts the entire parallel computation into two functions, map() and reduce(): a simple MapReduce program only needs to implement map() and reduce() and specify the input and output formats, and the framework does the rest. Data is passed through the system in the form of key/value pairs. Writing distributed parallel programs on top of the MapReduce model is therefore very simple: the programmer's main coding effort is implementing the Map and Reduce functions, while the other complexities of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, are handled by the YARN framework.
map(): performs a specified operation on each element of a list of independent elements; because the elements are independent, this step is highly parallelizable.
reduce(): merges the elements of a list into a single result.
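As a concrete illustration, here is a minimal word-count sketch using the Hadoop MapReduce API. The class names WordCountMapper and WordCountReducer are my own choices for this example, not names from the framework, and they are shown as top-level classes here; the programming template later in these notes nests them as inner classes.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(): emit a (word, 1) pair for every word in the input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue; // skip leading/duplicate whitespace
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// reduce(): sum the counts gathered for each distinct word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```

The framework calls map() once per input record and reduce() once per distinct key, so neither class needs to know anything about splitting, shuffling, or scheduling.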
Advantages
1. Easy to program
2. Good scalability
3. High fault tolerance
4. Well suited to offline processing of petabyte-scale and larger datasets
Processing Flow
1. The framework splits the input file into splits (slices) and parses each split into key/value pairs line by line. This step is done automatically by the MapReduce framework; the key is the byte offset of the line within the file, and that offset counts the line-terminator characters (whose number differs between Windows and Linux environments).
2. Each key/value pair is passed to the user-defined map() method, which produces new key/value pairs.
3. The framework sorts the key/value pairs output by map() by key and, if a Combiner is configured, pre-aggregates the values that share a key; the result is the final output of the Mapper.
4. The Reducer first sorts the data it copies from the Mappers, then hands each key and its group of values to the user-defined reduce() method; the new key/value pairs it produces are the output of the MapReduce program. (A toy trace of this flow appears after this list.)
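To make the four steps concrete, here is a toy single-process simulation of the word-count flow. It deliberately uses plain Java collections rather than the Hadoop API (the class name FlowDemo and the sample lines are invented for this sketch), so the intermediate data at each step is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy single-process simulation of the split/map/combine/reduce flow
// for word count (plain Java collections, not the Hadoop API; Java 9+).
public class FlowDemo {
    public static void main(String[] args) {
        // Step 1: the "split" - the input arrives as lines.
        List<String> lines = List.of("hello world", "hello hadoop");

        // Step 2: map() - emit a (word, 1) pair for every word.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                mapped.add(Map.entry(word, 1));
        // mapped: (hello,1) (world,1) (hello,1) (hadoop,1)

        // Step 3: sort by key and combine values that share a key.
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            combined.merge(e.getKey(), e.getValue(), Integer::sum);
        // combined: (hadoop,1) (hello,2) (world,1)

        // Step 4: reduce() - the combine already summed the counts here,
        // so the "reducer" just writes the final (word, count) pairs.
        combined.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real job these steps run on different machines, and the sorted, partitioned map output is shuffled across the network to the Reducers.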
The whole process is divided into two stages:
Mapper phase (made up of some number of Map Tasks)
1. Parse the input data format (InputFormat)
2. Process the input records (map())
3. Partition the data across reducers (Partitioner; a sketch follows after these lists)
Reducer phase (made up of some number of Reduce Tasks)
1. Copy data remotely from the Mappers (copy)
2. Sort the data by key (sort)
3. Process the data (reduce())
4. Write the output in the configured format (OutputFormat)
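The Partitioner decides which Reduce Task receives each map output key; by default Hadoop uses HashPartitioner (the key's hash modulo the number of reducers). As a hedged sketch of a custom one, assuming a job configured with two reducers, something like the following would work; the class name AlphabetPartitioner and the routing rule are invented for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: send words starting with a-m to reducer 0 and all
// other keys to reducer 1, assuming the job runs with two reduce tasks.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty() || numPartitions < 2) return 0; // degenerate cases
        char first = Character.toLowerCase(k.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}
```

It would be registered on the job with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2).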
MapReduce Programming Template (Optimized)
First, create a MapReduce main class that extends Configured (the base class that carries the job configuration) and implements the Tool interface (run through ToolRunner, Hadoop's helper for parsing standard command-line options and submitting MapReduce jobs). Then:
Write the inner Mapper class
Write the inner Reducer class
Create a Job and set its input and output paths
Configure the job (Mapper, Reducer, and output key/value types)
Submit the job and wait for it to complete (a full driver sketch follows)
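Putting these steps together, here is a minimal driver sketch following the template. WordCountMapper and WordCountReducer refer to the hypothetical classes sketched earlier in these notes; for the full, working version see the GitHub link below.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal Tool-based driver skeleton following the template above.
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the Mapper and Reducer classes.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Declare the output key/value types and the input/output locations.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for completion.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options (-D, -conf, ...) for us.
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}
```

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCountDriver /input /output (the jar and path names here are placeholders).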
Full code github address: https://github.com/PeaceBao/Project01