Recently I have been studying Spark for a project and produced some technical notes along the way; I am posting them here as a record. I did not bother to translate them, so they are in English.

There are some areas that the official documentation does not explain very clearly, especially regarding certain details. Below are some additional explanations based on the experiments I ran and the source code I read over the past few days.

(The official document link is http://spark.apache.org/docs/latest/job-scheduling.html)

  1. There are two schedulers in the current Spark implementation: FIFO, which is the default and the original scheduling mode, and FAIR, which is enabled by setting spark.scheduler.mode to FAIR.
  2. Both the FIFO and the FAIR scheduler support the basic functionality of running multiple jobs in parallel, provided the jobs are submitted from separate threads (within a single thread, jobs are executed sequentially). A minimal multi-threaded submission sketch is given after the scheduler source code below.
  3. With the FIFO scheduler, jobs submitted earlier have higher priority and a better chance of getting resources than jobs submitted later. This does not mean the first job always runs first: a later job can start before an earlier one if the cluster still has unoccupied resources. The worst case of the FIFO scheduler is that when the jobs at the head of the queue are large, the later jobs may suffer significant delays.
  4. The FAIR scheduler corresponds to the Hadoop Fair Scheduler and is an enhancement over FIFO. In FIFO mode only one factor, the priority, is considered when ordering the SchedulableQueue, while in FAIR mode several factors are taken into account, including minShare, runningTasks and weight (see the source code below if interested). Again, jobs do not always follow the ordering produced by FairSchedulingAlgorithm strictly, but as a whole, in my observation through concurrent JMeter tests, the FAIR scheduler with tuned parameters greatly reduces the delays of small jobs that were delayed significantly under FIFO.

private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    if (res < 0) {
      true
    } else {
      false
    }
  }
}

private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0).toDouble
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0).toDouble
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
    var compare: Int = 0

    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }

    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
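Item 2 above notes that jobs only run in parallel when they are submitted from separate threads. The sketch below is not from the original post; the application name, the local master and the dummy count() jobs are purely illustrative, but it shows the basic pattern of submitting jobs from several threads against a single SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelJobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parallel-jobs-sketch")   // illustrative app name
      .setMaster("local[4]")                // illustrative local master
      .set("spark.scheduler.mode", "FAIR")  // or "FIFO" (the default)
    val sc = new SparkContext(conf)

    // Each action (count() here) triggers one job; jobs submitted from
    // different threads are eligible to run concurrently.
    val threads = (1 to 3).map { i =>
      new Thread(new Runnable {
        override def run(): Unit = {
          val n = sc.parallelize(1 to 1000000).filter(_ % i == 0).count()
          println(s"job $i counted $n elements")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}

In FIFO mode the three jobs are still largely served in submission order; in FAIR mode their tasks share the executors in a round-robin fashion.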

  5. The pools in FIFO and FAIR schedulers
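The original post stops at this heading. For reference, the official documentation wires pools up through an XML allocation file (fairscheduler.xml, referenced by spark.scheduler.allocation.file) plus a thread-local property. The sketch below is only an illustration; the pool name "production" and the file path are made up:

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-pool-sketch")   // illustrative app name
      .setMaster("local[4]")            // illustrative local master
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // illustrative path
    val sc = new SparkContext(conf)

    // The pool assignment is a thread-local property: every job submitted
    // from this thread afterwards goes into the "production" pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    sc.parallelize(1 to 1000).count()

    // Setting the property back to null returns this thread to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)
    sc.stop()
  }
}

If a job references a pool that is not declared in the allocation file, Spark creates it on the fly with default settings (FIFO scheduling inside the pool, weight 1, minShare 0).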

Reposted from: https://www.cnblogs.com/taxuexunmei/p/4991250.html
