【spark系列10】spark logicalPlan Statistics (逻辑计划阶段的统计信息)

背景

本文版本是spark 3.0.1

分析

逻辑阶段的统计信息，对于逻辑阶段的优化也是很重要的，比如broadcathashJoin,dynamic partitions pruning，本文分析一下spark 是怎么获取stastatics信息的
直接到LogicalPlanStats:

trait LogicalPlanStats { self: LogicalPlan =>/*** Returns the estimated statistics for the current logical plan node. Under the hood, this* method caches the return value, which is computed based on the configuration passed in the* first time. If the configuration changes, the cache can be invalidated by calling* [[invalidateStatsCache()]].*/def stats: Statistics = statsCache.getOrElse {if (conf.cboEnabled) {statsCache = Option(BasicStatsPlanVisitor.visit(self))} else {statsCache = Option(SizeInBytesOnlyStatsPlanVisitor.visit(self))}statsCache.get}/** A cache for the estimated statistics, such that it will only be computed once. */protected var statsCache: Option[Statistics] = None/** Invalidates the stats cache. See [[stats]] for more information. */final def invalidateStatsCache(): Unit = {statsCache = Nonechildren.foreach(_.invalidateStatsCache())}
}

该stats方法用来计算statistics,如果开启了cbo,则用BasicStatsPlanVisitor的visit，否则调用SizeInBytesOnlyStatsPlanVisitor的visit方法。我们可以看一下SizeInBytesOnlyStatsPlanVisitor.visit方法，因为BasicStatsPlanVisitor的很多方法都是调用SizeInBytesOnlyStatsPlanVisitor方法。而我们可以重点看一下default方法:

override def default(p: LogicalPlan): Statistics = p match {case p: LeafNode => p.computeStats()case _: LogicalPlan => Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).product)}

因为统计信息都是一层一层从叶子节点往上传递的,当匹配到叶子节点的时候，则直接调用该computeStats方法,对于不同版本的dataSource是有区别的：

对于v1版本的，拿hiveTableRelation举例：

override def computeStats(): Statistics = {tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled)).orElse(tableStats).getOrElse {throw new IllegalStateException("table stats must be specified.")}}

直接从元数据中获取信息，如果开启了cbo或者planstats,则还会获取行信息和列的统计信息

对于v2版本的，拿DataSourceV2Relation举例：

 override def computeStats(): Statistics = {if (Utils.isTesting) {// when testing, throw an exception if this computeStats method is called because stats should// not be accessed before pushing the projection and filters to create a scan. otherwise, the// stats are not accurate because they are based on a full table scan of all columns.throw new IllegalStateException(s"BUG: computeStats called before pushdown on DSv2 relation: $name")} else {// when not testing, return stats because bad stats are better than failing a querytable.asReadable.newScanBuilder(options) match {case r: SupportsReportStatistics =>val statistics = r.estimateStatistics()DataSourceV2Relation.transformV2Stats(statistics, None, conf.defaultSizeInBytes)case _ =>Statistics(sizeInBytes = conf.defaultSizeInBytes)}}

直接调用table.newScanBuilder.如果继承了SupportsReportStatistics，则调用该estimateStatistics方法，这里涉及到的Table SupportsRead SupportsReportStatistics 都是spark 3引入的新类,我们直接看ParquetScan,默认是继承FileScan的estimateStatistics方法：

override def estimateStatistics(): Statistics = {new Statistics {override def sizeInBytes(): OptionalLong = {val compressionFactor = sparkSession.sessionState.conf.fileCompressionFactorval size = (compressionFactor * fileIndex.sizeInBytes).toLongOptionalLong.of(size)}override def numRows(): OptionalLong = OptionalLong.empty()}}

其实可以看出v2版本的没有列统计信息，至少目前是没有,而v1版本的部分是有列统计信息的, 毕竟统计每一列的信息是耗时的.

【spark系列10】spark logicalPlan Statistics (逻辑计划阶段的统计信息)相关推荐

【大数据Spark系列】Spark教程：详细全部
Spark作为Apache顶级的开源项目,是一个快速.通用的大规模数据处理引擎,和Hadoop的MapReduce计算框架类似,但是相对于MapReduce,Spark凭借其可伸缩.基于内存计算等特点 ...
oracle收集统计计划,oracle收集统计信息之analyze
oracle收集统计信息之analyze 1.analyze 收集表,索引的统计信息,现在oracle不推荐用analyze收集统计信息收集表的统计信息Analyze table tablename ...
Spark系列之Spark体系架构
title: Spark系列第四章 Spark体系架构 4.1 Spark核心功能 Alluxio 原来叫 tachyon 分布式内存文件系统 Spark Core提供Spark最基础的最核心的功能 ...
Spark系列之Spark在不同集群中的架构
title: Spark系列第十二章 Spark在不同集群中的架构 Spark 注重建立良好的生态系统,它不仅支持多种外部文件存储系统,提供了多种多样的集群运行模式.部署在单台机器上时,既可以用 ...
Spark系列之Spark的资源调优
title: Spark系列第十一章 Spark的资源调优 11.1 概述在开发完Spark作业之后,就该为作业配置合适的资源了.Spark的资源参数,基本都可以在sparksubmit命令中 ...
Spark系列之Spark启动与基础使用
title: Spark系列第三章 Spark启动与基础使用 3.1 Spark Shell 3.1.1 Spark Shell启动安装目录的bin目录下面,启动命令: spark-shell $ ...
Spark系列之Spark应用程序运行机制
声明: 文章中代码及相关语句为自己根据相应理解编写,文章中出现的相关图片为自己实践中的截图和相关技术对应的图片,若有相关异议,请联系删除.感谢.转载请注明出处,感谢. By luoye ...
Spark系列之Spark概述
title: Spark系列 What is Apache Spark™? Apache Spark™ is a multi-language engine for executing data en ...
大数据Spark系列之Spark单机环境搭建
1. 下载spark与scala Spark下载地址 http://mirrors.hust.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoo ...

【spark系列10】spark logicalPlan Statistics (逻辑计划阶段的统计信息)

背景

分析

【spark系列10】spark logicalPlan Statistics (逻辑计划阶段的统计信息)相关推荐

最新文章

热门文章