The diagnostics of the failed application read: Application application_1634012636701_0009 failed 2 times due to AM Container for appattempt_1634012636701_0009_000002 exited with exitCode: 0. Running yarn logs -applicationId application_1634012636701_0009 on the machine shows no errors in the logs either. The job ran successfully and produced normal output:
Setting up env variables
Setting up job resources
Copying debugging information
Launching container
Number of Maps = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 39.979 seconds
Estimated value of Pi is 3.20000000000000000000
...
2021-10-12 06:44:53,819 INFO mapreduce.JobSubmissionFiles: Permissions on staging directory /tmp/hadoop-yarn/staging/root/.staging are incorrect: rwxrwxrwx. Fixing permissions to correct value rwx------
2021-10-12 06:44:54,867 INFO input.FileInputFormat: Total input files to process : 10
2021-10-12 06:44:55,166 INFO mapreduce.JobSubmitter: number of splits:10
2021-10-12 06:44:55,579 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1634012636701_0010
2021-10-12 06:44:55,706 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: YARN_AM_RM_TOKEN, Service: , Ident: (appAttemptId { application_id { id: 9 cluster_timestamp: 1634012636701 } attemptId: 1 } keyId: 2035624994)]
2021-10-12 06:44:56,240 INFO conf.Configuration: resource-types.xml not found
2021-10-12 06:44:56,241 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-10-12 06:44:56,926 INFO impl.YarnClientImpl: Submitted application application_1634012636701_0010
2021-10-12 06:44:56,977 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1634012636701_0010/
2021-10-12 06:44:56,977 INFO mapreduce.Job: Running job: job_1634012636701_0010
...
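So during the first application attempt, the container submitted a new job with ID job_1634012636701_0010, which is the first extra application we saw in the UI. Only in that job's logs do we see an actual Application Master starting up: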
2021-10-12 06:47:34,910 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1634012636701_0010_000001
2021-10-12 06:47:35,032 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster:
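To find which applications a launcher container spawned, one option is to scan its aggregated log for the "Submitted application" lines written by YarnClientImpl. The following is a minimal sketch in Python; the function name submitted_applications is hypothetical, and the sample text is copied from the log excerpt shown earlier.

```python
import re
from typing import List

# Matches the line YarnClientImpl logs after a successful submission,
# as it appears in the container log excerpt.
SUBMITTED = re.compile(r"Submitted application (application_\d+_\d+)")


def submitted_applications(log_text: str) -> List[str]:
    """Return every application id that the log reports as submitted."""
    return SUBMITTED.findall(log_text)


sample_log = (
    "2021-10-12 06:44:56,926 INFO impl.YarnClientImpl: "
    "Submitted application application_1634012636701_0010\n"
    "2021-10-12 06:44:56,977 INFO mapreduce.Job: "
    "Running job: job_1634012636701_0010\n"
)

children = submitted_applications(sample_log)
print(children)
```

In practice the input would be the text returned by yarn logs for the parent application; any id found this way belongs to a child application, not to the one whose log is being read.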
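Back to application_1634012636701_0009: because it merely ran a hadoop jar command and never actually started an Application Master, the RM never saw an AM come up and marked the attempt as failed. The second attempt then began: a new container ran the hadoop jar command again, launched another new application, application_1634012636701_0011, and exited. After both attempts failed, the application itself was marked as failed. This is why MapReduce shows duplicate applications while the original one fails.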
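The sequence of events can be mimicked with a toy model. This is only an illustrative sketch, not YARN code: ToyResourceManager, run_hadoop_jar and run_application are hypothetical names, and the model reduces the RM's behavior to "an attempt succeeds only if the launched command acts as an AM".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToyResourceManager:
    """Toy stand-in for the RM: hands out ids and records final states."""
    next_id: int = 9  # start at 9 to mirror the ids in the logs
    states: Dict[int, str] = field(default_factory=dict)

    def new_application(self) -> int:
        app_id = self.next_id
        self.next_id += 1
        self.states[app_id] = "ACCEPTED"
        return app_id


def run_hadoop_jar(rm: ToyResourceManager) -> bool:
    """Models the launcher command: it submits a real job and exits cleanly.

    Returns False because the command itself never behaves as an AM.
    """
    child = rm.new_application()
    rm.states[child] = "FINISHED"  # the child job has a real AM and succeeds
    return False


def run_application(rm: ToyResourceManager,
                    command: Callable[[ToyResourceManager], bool],
                    max_attempts: int = 2) -> int:
    """Run `command` as the AM container, retrying up to max_attempts."""
    app_id = rm.new_application()
    for _ in range(max_attempts):
        if command(rm):  # the attempt counts only if an AM came up
            rm.states[app_id] = "FINISHED"
            return app_id
    rm.states[app_id] = "FAILED"
    return app_id


rm = ToyResourceManager()
parent = run_application(rm, run_hadoop_jar)
print(parent, rm.states)
```

With two attempts, the model ends with the parent application failed and two extra child applications finished, matching the pattern of ids 0009, 0010 and 0011 seen above.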
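From this we can see that the YARN REST API is similar to the RPC API: its purpose is to launch the Application Master, not the job itself. Consequently, any MapReduce job submitted in the way described above will show this behavior.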
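To make the point concrete, here is a rough sketch of what a submission body for the ResourceManager's Submit Application endpoint (POST /ws/v1/cluster/apps) might look like when it is built in Python. The field names follow the Cluster Applications API. The application name, jar path, and resource sizes are placeholders assumed for illustration, not values from the original setup. Nothing is sent over the network.

```python
import json


def build_submission(app_id: str, am_command: str) -> dict:
    """Build a submission body; am_command becomes the AM container command."""
    return {
        "application-id": app_id,
        "application-name": "pi-via-rest",  # hypothetical name
        "max-app-attempts": 2,  # matches "failed 2 times" in the diagnostics
        "am-container-spec": {
            # Whatever is placed here is what YARN launches as the AM.
            "commands": {"command": am_command},
        },
        "resource": {"memory": 1024, "vCores": 1},  # assumed sizes
    }


# Hypothetical jar path; the arguments mirror "Number of Maps = 10" and
# "Samples per Map = 10" from the container output.
payload = build_submission(
    "application_1634012636701_0009",
    "hadoop jar hadoop-mapreduce-examples.jar pi 10 10",
)
print(json.dumps(payload, indent=2))
```

The only place a command can go is am-container-spec, so a hadoop jar invocation placed there is launched in the AM's role even though it is really just a client that submits another application.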