Hadoop Source Code Explained: InputFormat

1. Source

InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:
1. Validate the input-specification of the job.
2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize. Clearly, logical splits based on input size are insufficient for many applications, since record boundaries are to be respected. In such cases, the application has to also implement a RecordReader, on whom lies the responsibility to respect record boundaries and present a record-oriented view of the logical InputSplit to the individual task.
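As a quick illustration of those two bounds, here is a minimal driver sketch (not from the Hadoop source; the job name and sizes are placeholders) that sets them through FileInputFormat's helper setters, which write the properties mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitBoundsDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-bounds-demo");
        // Lower bound: sets mapreduce.input.fileinputformat.split.minsize
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // 64 MB
        // Upper bound (alongside the HDFS block size):
        // sets mapreduce.input.fileinputformat.split.maxsize
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB
    }
}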

2. Source in Detail

public abstract class InputFormat<K, V> {

    // Two abstract methods, examined one by one below:
    //   getSplits(JobContext)
    //   createRecordReader(InputSplit, TaskAttemptContext)
    ...
}

3. Methods in Detail

InputFormat has two abstract methods: getSplits and createRecordReader.

3.1 getSplits
  • Method description

Logically split the set of input files for the job.

Each InputSplit is then assigned to an individual Mapper for processing.
# [How exactly are splits assigned to Mappers?]

Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be a <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.
# The split is logical only; files are never physically chopped into chunks. [Creating that RecordReader is the second abstract method, described below.]

  • Method source

/**
 * @param context job configuration.
 * @return an array of InputSplits for the job.
 */
public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;
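To make the <input-file-path, start, offset> tuple concrete: FileInputFormat-based formats return FileSplit objects, which carry exactly that information plus the locations of the underlying block. A small hypothetical sketch (the path, length, and host names are invented):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileSplitDemo {
    public static void main(String[] args) throws Exception {
        // A logical slice: bytes [0, 128 MB) of one input file,
        // preferably scheduled on the nodes that host that block.
        FileSplit split = new FileSplit(
                new Path("/demo/input/part-00000"),  // hypothetical path
                0L,                                  // start offset
                128L * 1024 * 1024,                  // length in bytes
                new String[] { "node1", "node2" });  // hypothetical hosts
        System.out.println(split.getPath() + " @ " + split.getStart()
                + " + " + split.getLength());
    }
}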
3.2 createRecordReader
  • Method description

Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
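To see where that initialize() call fits, here is a simplified sketch of how the framework consumes one split; the real driver code inside Hadoop is more involved, and runSplit is an invented helper name used only for illustration:

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RecordReaderLoop {
    // Simplified stand-in for what the framework does per split.
    static <K, V> void runSplit(InputFormat<K, V> format, InputSplit split,
                                TaskAttemptContext context)
            throws IOException, InterruptedException {
        RecordReader<K, V> reader = format.createRecordReader(split, context);
        reader.initialize(split, context); // guaranteed to run before reading
        try {
            while (reader.nextKeyValue()) {
                K key = reader.getCurrentKey();
                V value = reader.getCurrentValue();
                // ... hand (key, value) to the Mapper here ...
            }
        } finally {
            reader.close();
        }
    }
}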

  • Method source

/**
 * @param split the split to be read
 * @param context the information about the task
 * @return a new record reader
 * @throws IOException
 * @throws InterruptedException
 */
public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                                                      TaskAttemptContext context)
    throws IOException, InterruptedException;
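Tying the two abstract methods together, below is a minimal, hypothetical subclass (the name SimpleTextInputFormat is invented): it inherits getSplits() from FileInputFormat and hands back Hadoop's stock LineRecordReader, which the framework will then initialize() as described above. A sketch, not production code:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // The framework calls initialize(split, context) on this reader
        // before the first nextKeyValue().
        return new LineRecordReader();
    }
}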

4. Summary

  • 01. job.setInputFormatClass(xxxx.class); here xxxx must be a class that extends InputFormat (see the driver sketch after this list).
  • 02. All input-format classes inherit from InputFormat, which is an abstract class.
  • 03. Different InputFormat implementations define their own file-reading and splitting strategies; each input split serves as the data source for a separate map task.
  • 04. A Mapper's input is a single input split, called an InputSplit.
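
A hedged driver sketch for point 01 (TextInputFormat is just one built-in subclass, shown explicitly even though it is the default; the input path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inputformat-demo");
        // The class passed here must extend InputFormat (point 01 above).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/demo/input")); // hypothetical path
    }
}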
