Problems Caused by Spark String/Decimal Type Comparisons
Background
From Spark 2 through Spark 3, Spark automatically coerces a comparison between a String and a Decimal value to Double. The rewritten filter can then no longer be pushed down to the data source. Related community tickets:
- [SPARK-17913][SQL] compare atomic and string type column may return confusing result
- [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
- SPARK-29274: Should not coerce decimal type to double type when it's join column
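A minimal, pure-Scala illustration of the precision problem behind these tickets: once both sides of a comparison are coerced to Double, large decimal values that differ only in the last digits become indistinguishable.

```scala
// Two distinct 19-digit decimals...
val a = BigDecimal("1000000000000000001")
val b = BigDecimal("1000000000000000002")
// ...compare unequal as decimals, but equal once coerced to Double,
// because Double carries only about 16 significant decimal digits.
assert(a != b)
assert(a.toDouble == b.toDouble) // both round to 1.0E18
```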
Test Queries
Query 1
withTable("t1") {
  sql("CREATE TABLE t1 USING PARQUET " +
    "SELECT cast(id + 0.1 as decimal(13,2)) as salary FROM range(0, 100)")
  sql("select * from t1 where salary = '12.1' ").collect()
}
This query cannot push its data filter down, because the filter casts the Decimal column to Double:
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (cast(salary#276 as double) = 12.1))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (cast(salary#276 as double) = 12.1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
Detailed plan transformation:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings ===
 'Project [*]                                           'Project [*]
!+- 'Filter (salary#277 = 12.1)                         +- Filter (cast(salary#277 as double) = cast(12.1 as double))
 +- SubqueryAlias spark_catalog.default.t1              +- SubqueryAlias spark_catalog.default.t1
    +- Relation default.t1[id#276L,salary#277] parquet     +- Relation default.t1[id#276L,salary#277] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                  Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as double) = cast(12.1 as double))  +- Filter (cast(salary#277 as double) = 12.1)
 +- Relation default.t1[id#276L,salary#277] parquet             +- Relation default.t1[id#276L,salary#277] parquet
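Why the cast blocks pushdown: data-source filter translation only recognizes comparisons against a bare column reference, so once the column is wrapped in a cast the predicate no longer matches and stays in the Spark-side Filter. A simplified sketch of that idea (the mini-AST and `translate` function below are illustrative, not Spark's actual API):

```scala
// Illustrative mini-AST; not Spark's real expression classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Cast(child: Expr, to: String) extends Expr
case class EqualTo(left: Expr, right: Any) extends Expr

// Pushdown translation: only `column = literal` survives;
// `cast(column) = literal` falls through and cannot be pushed.
def translate(p: Expr): Option[String] = p match {
  case EqualTo(Attr(name), v) => Some(s"EqualTo($name,$v)")
  case _                      => None
}

assert(translate(EqualTo(Attr("salary"), "12.10")) == Some("EqualTo(salary,12.10)"))
assert(translate(EqualTo(Cast(Attr("salary"), "double"), 12.1)) == None)
```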
Query 2
sql("select * from t1 where salary = cast('12.1' as decimal) ").collect()
This form is incorrect: cast('12.1' as decimal) targets decimal(10,0), so the literal is rounded to 12 and the filter compares salary against 12.00, returning wrong results.
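The rounding can be reproduced with plain `java.math.BigDecimal` (a sketch assuming Spark's HALF_UP rounding when casting to a decimal type):

```scala
import java.math.{BigDecimal => JBigDecimal}
import java.math.RoundingMode

// `decimal` with no arguments means decimal(10,0): scale 0, so the
// fractional part of the literal is rounded away before the filter runs.
val literal = new JBigDecimal("12.1").setScale(0, RoundingMode.HALF_UP)
assert(literal.toString == "12") // the filter effectively becomes salary = 12.00
```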
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.00))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.00)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.00)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
Query 3
sql("select * from t1 where salary = cast('12.1' as decimal(13,2)) ").collect()
This form is correct: the literal is cast to the column's exact type, decimal(13,2), so no cast wraps the column and the filter is pushed down.
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.10))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.10)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.10)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
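For contrast with Query 2, the same literal cast to the column's exact scale keeps its fractional digits (again a plain-`BigDecimal` sketch of the cast semantics):

```scala
import java.math.{BigDecimal => JBigDecimal}
import java.math.RoundingMode

// decimal(13,2) keeps two fractional digits, so the literal matches the
// column's type and the comparison needs no cast on the column itself.
val literal = new JBigDecimal("12.1").setScale(2, RoundingMode.HALF_UP)
assert(literal.toString == "12.10")
```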
Detailed plan transformation:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DecimalPrecision ===
 'Project [*]                                           'Project [*]
!+- 'Filter (salary#277 = 12.1)                         +- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))
 +- SubqueryAlias spark_catalog.default.t1              +- SubqueryAlias spark_catalog.default.t1
    +- Relation default.t1[id#276L,salary#277] parquet     +- Relation default.t1[id#276L,salary#277] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                              Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))  +- Filter (cast(salary#277 as decimal(13,2)) = 12.10)
 +- Relation default.t1[id#276L,salary#277] parquet                         +- Relation default.t1[id#276L,salary#277] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.SimplifyCasts ===
 Project [id#276L, salary#277]                          Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = 12.10)  +- Filter (salary#277 = 12.10)
 +- Relation default.t1[id#276L,salary#277] parquet     +- Relation default.t1[id#276L,salary#277] parquet
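The SimplifyCasts step in the trace can be sketched in a few lines: a cast whose target type equals its child's type is a no-op and is removed, leaving the bare column that the scan can push down (the classes below are illustrative, not Spark's actual Expression hierarchy):

```scala
// Illustrative mini-AST; not Spark's real expression classes.
sealed trait Expr { def dataType: String }
case class Attr(name: String, dataType: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

// Drop a cast that would not change the type, as SimplifyCasts does.
def simplifyCast(e: Expr): Expr = e match {
  case Cast(child, to) if child.dataType == to => child
  case other                                   => other
}

val salary = Attr("salary", "decimal(13,2)")
assert(simplifyCast(Cast(salary, "decimal(13,2)")) == salary)
assert(simplifyCast(Cast(salary, "double")) == Cast(salary, "double"))
```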