coalesce

repartition

  1. When reading from HDFS, there is one partition per block.
Steps Spark takes when reading HDFS file blocks:
1. Take small files 1–5 (i.e., 5 partitions) and hand them to 3 executors to read (Spark schedules in units of vcores; in practice that could mean 5 executors with 10 tasks reading 10 partitions).
2. If the 3 executors run at the same speed, files 6–10 are then handed to the same 3 executors in turn.
3. In practice execution speeds are never identical, so whichever task finishes first picks up the next partition to read, and so on.
4. As a result, the scheduling overhead often exceeds the time spent actually reading the files, and file handles are opened and closed frequently, wasting relatively precious I/O resources and greatly reducing execution efficiency.
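The "whichever task finishes first picks up the next partition" behavior in the steps above can be sketched without Spark at all. The following is a minimal pure-Scala simulation (the partition read times and slot count are made-up numbers, not Spark internals): each executor slot grabs the next unread partition as soon as it becomes free.

```scala
object GreedyScheduleSketch {
  // Simulate greedy task scheduling: returns, for each executor slot,
  // the partition ids it ends up reading.
  def assign(readTimes: Seq[Int], slots: Int): Vector[Vector[Int]] = {
    val finish = Array.fill(slots)(0)                 // current finish time per slot
    val taken  = Array.fill(slots)(Vector.empty[Int]) // partitions read per slot
    for ((t, pid) <- readTimes.zipWithIndex) {
      val s = finish.indexOf(finish.min)              // earliest-free slot takes the next partition
      finish(s) += t
      taken(s) = taken(s) :+ pid
    }
    taken.toVector
  }

  def main(args: Array[String]): Unit = {
    // 10 small-file partitions with unequal (hypothetical) read times, 3 executor slots.
    val plan = assign(Seq(3, 1, 2, 5, 1, 2, 2, 4, 1, 3), slots = 3)
    plan.zipWithIndex.foreach { case (ps, s) => println(s"slot $s reads partitions $ps") }
  }
}
```

With many tiny partitions the faster slots end up handling far more files than the slow ones, which is exactly why per-file scheduling and handle churn dominate the actual read time.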
  1. What repartition calls under the hood

repartition simply calls coalesce with shuffle = true (shuffle defaults to false).
Source:

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

coalesce

Source:

/**
 * Return a new RDD that is reduced into `numPartitions` partitions.
 *
 * This results in a narrow dependency, e.g. if you go from 1000 partitions
 * to 100 partitions, there will not be a shuffle, instead each of the 100
 * new partitions will claim 10 of the current partitions.
 *
 * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
 * this may result in your computation taking place on fewer nodes than
 * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
 * you can pass shuffle = true. This will add a shuffle step, but means the
 * current upstream partitions will be executed in parallel (per whatever
 * the current partitioning is).
 *
 * Note: With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    // Include a shuffle step so that our upstream tasks are still distributed.
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
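The shuffle = true branch can be illustrated without a cluster. The pure-Scala sketch below mirrors the distributePartition logic from the source above (random starting offset, then an increasing position per element) and applies a HashPartitioner-style non-negative mod, showing how one input partition's elements land round-robin across the output partitions. The data and partition counts here are made up for illustration:

```scala
import scala.util.Random

object DistributeSketch {
  // Mirror of distributePartition: tag each element of input partition `index`
  // with an increasing position starting at a seeded random offset.
  def distribute[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t) // the HashPartitioner later reduces this key mod numPartitions
    }
  }

  // Non-negative mod, as HashPartitioner.getPartition does for Int keys.
  def targetPartition(key: Int, numPartitions: Int): Int =
    ((key % numPartitions) + numPartitions) % numPartitions

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    val tagged = distribute(index = 0, items = (1 to 8).iterator, numPartitions).toVector
    val placed = tagged.map { case (pos, v) => (targetPartition(pos, numPartitions), v) }
    placed.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (p, vs) =>
      println(s"output partition $p gets ${vs.map(_._2)}")
    }
  }
}
```

Because the positions are consecutive integers, taking them mod numPartitions spreads the 8 elements evenly, 2 per output partition, regardless of the random starting offset.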
  1. coalesce does not produce a shuffle by default
The DataFrame version likewise defaults to shuffle = false:

def coalesce(numPartitions: Int): DataFrame = withPlan {
  Repartition(numPartitions, shuffle = false, logicalPlan)
}
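The practical rule implied by these defaults: coalesce with shuffle = false can only merge partitions (a narrow dependency), so increasing the partition count requires shuffle = true, i.e. repartition. A small hypothetical helper (not part of Spark) sketching that decision:

```scala
object RepartitionChoice {
  // Hypothetical helper: does going from `current` to `target` partitions require a shuffle?
  // coalesce(shuffle = false) can only merge partitions, never split them.
  def needsShuffle(current: Int, target: Int): Boolean = target > current

  def suggest(current: Int, target: Int): String =
    if (needsShuffle(current, target)) s"repartition($target)" // i.e. coalesce(target, shuffle = true)
    else s"coalesce($target)"                                  // narrow dependency, no shuffle

  def main(args: Array[String]): Unit = {
    println(suggest(1000, 100)) // merging partitions: coalesce is enough
    println(suggest(100, 1000)) // splitting partitions: needs a shuffle
  }
}
```

Note that even when reducing partitions, repartition (shuffle = true) can still be preferable if the data is badly skewed, since the shuffle rebalances partition sizes.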


