Problem Background

From Spark 2 through Spark 3, Spark automatically coerces a comparison between a String and a Decimal column to Double. After this coercion the Filter can no longer be pushed down as a data filter (a self-contained illustration follows the ticket list below). Related community tickets:

[SPARK-17913][SQL] compare atomic and string type column may return confusing result
[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
[SPARK-29274][SQL] Should not coerce decimal type to double type when it's join column
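
A self-contained illustration of the coercion (a minimal sketch, assuming a SparkSession named spark; the expression itself is only for demonstration): the analyzed plan already contains the Double casts inserted by TypeCoercion's PromoteStrings rule, before any optimizer rule runs.

// Hedged sketch: compare a decimal expression with a string literal and inspect
// the analyzed plan; both sides should appear wrapped in cast(... as double).
spark.sql("select cast(12.10 as decimal(13,2)) = '12.1' as eq").explain(true)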

Test Query

Query 1

withTable("t1") {sql("CREATE TABLE t1 USING PARQUET " +"SELECT cast(id + 0.1 as decimal(13,2)) as salary FROM range(0, 100)")sql("select * from t1 where salary = '12.1' ").collect()
}

Because the Filter casts the Decimal column to Double, this query cannot push the predicate down to the data source:

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (cast(salary#276 as double) = 12.1))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (cast(salary#276 as double) = 12.1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []

Detailed plan transformation

=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings ===
 'Project [*]                                                   'Project [*]
!+- 'Filter (salary#277 = 12.1)                                 +- Filter (cast(salary#277 as double) = cast(12.1 as double))
    +- SubqueryAlias spark_catalog.default.t1                      +- SubqueryAlias spark_catalog.default.t1
       +- Relation default.t1[id#276L,salary#277] parquet             +- Relation default.t1[id#276L,salary#277] parquet

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                   Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as double) = cast(12.1 as double))   +- Filter (cast(salary#277 as double) = 12.1)
    +- Relation default.t1[id#276L,salary#277] parquet              +- Relation default.t1[id#276L,salary#277] parquet
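
As an aside, a rule-by-rule trace like the one above can be produced with Spark's plan-change logging. This is a hedged sketch: the config is named spark.sql.planChangeLog.level in Spark 3.1 and later (Spark 3.0 only exposes spark.sql.optimizer.planChangeLog.level, which covers optimizer rules); adjust to your version.

// Hedged sketch: log every rule application at ERROR level so the trace shows up
// even with a quiet log4j configuration (Spark 3.1+ config name assumed).
spark.conf.set("spark.sql.planChangeLog.level", "ERROR")
sql("select * from t1 where salary = '12.1' ").collect()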

Query 2

sql("select * from t1 where salary = cast('12.1' as decimal) ").collect()

This form is wrong: cast('12.1' as decimal) casts the literal to decimal(10,0), i.e. 12 (rendered as 12.00 once promoted to the column's decimal(13,2)), so the query returns incorrect results. A quick check follows the plan below:

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.00))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.00)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.00)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
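
A quick check of why the literal loses its fraction (a hedged sketch, assuming a SparkSession named spark): a bare DECIMAL in Spark SQL defaults to decimal(10,0), so the cast rounds 12.1 to 12 before the comparison ever runs.

// Hedged sketch: the bare DECIMAL type defaults to precision 10, scale 0.
spark.sql("select cast('12.1' as decimal) as d").printSchema()  // d: decimal(10,0)
spark.sql("select cast('12.1' as decimal) as d").show()         // prints 12, not 12.1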

Query 3

sql("select * from t1 where salary = cast('12.1' as decimal(13,2)) ").collect()

Only this form is correct (see also the DataFrame-API sketch at the end of this section):

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.10))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.10)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.10)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []

Detailed plan transformation

17:59:55.133 ERROR org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DecimalPrecision ===
 'Project [*]                                               'Project [*]
!+- 'Filter (salary#277 = 12.1)                             +- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))
    +- SubqueryAlias spark_catalog.default.t1                  +- SubqueryAlias spark_catalog.default.t1
       +- Relation default.t1[id#276L,salary#277] parquet         +- Relation default.t1[id#276L,salary#277] parquet

17:59:55.150 ERROR org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                                 Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))   +- Filter (cast(salary#277 as decimal(13,2)) = 12.10)
    +- Relation default.t1[id#276L,salary#277] parquet                            +- Relation default.t1[id#276L,salary#277] parquet

17:59:55.159 ERROR org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.SimplifyCasts ===
 Project [id#276L, salary#277]                           Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = 12.10)   +- Filter (salary#277 = 12.10)
    +- Relation default.t1[id#276L,salary#277] parquet      +- Relation default.t1[id#276L,salary#277] parquet
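
The same fix expressed with the DataFrame API (a hedged sketch, assuming a SparkSession named spark and the t1 table above): cast the literal, not the column, to the column's exact decimal type, so the column-side cast disappears and EqualTo(salary,12.10) can still be pushed down.

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DecimalType

// Hedged sketch: the literal folds to a decimal(13,2) constant, so the comparison
// needs no cast on the salary column and the equality filter is pushed down.
spark.table("t1")
  .filter(col("salary") === lit("12.1").cast(DecimalType(13, 2)))
  .explain()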
