Group Aggregation

Batch Streaming

Like most data systems, Apache Flink supports aggregate functions; both built-in and user-defined. User-defined functions must be registered in a catalog before use.
与大多数数据系统一样,Apache Flink支持聚合函数;内置和用户定义。用户定义的函数必须在使用前在目录中注册。

An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the COUNT, SUM, AVG (average), MAX (maximum) and MIN (minimum) over a set of rows.
聚合函数从多个输入行计算单个结果。例如,聚合计算一组行的COUNT、SUM、AVG(平均值)、MAX(最大值)和MIN(最小值)。

SELECT COUNT(*) FROM Orders

For streaming queries, it is important to understand that Flink runs continuous queries that never terminate. Instead, they update their result table according to the updates on its input tables. For the above query, Flink will output an updated count each time a new row is inserted into the Orders table.
对于流式查询,重要的是要了解Flink运行的是永不终止的持续查询。相反,它们根据输入表的更新更新结果表。对于上面的查询,每次在Orders表中插入新行时,Flink都会输出更新的计数。

Apache Flink supports the standard GROUP BY clause for aggregating data.
Apache Flink支持聚合数据的标准GROUP BY子句。

SELECT COUNT(*)
FROM Orders
GROUP BY order_id

For streaming queries, the required state for computing the query result might grow infinitely. State size depends on the number of groups and the number and type of aggregation functions. For example MIN/MAX are heavy on state size while COUNT is cheap. You can provide a query configuration with an appropriate state time-to-live (TTL) to prevent excessive state size. Note that this might affect the correctness of the query result. See query configuration for details.
对于流式查询,计算查询结果所需的状态可能会无限增长。状态大小取决于分组的数量以及聚合函数的数量和类型。例如,MIN/MAX对状态大小影响比较大,而COUNT影响较小。您可以为查询配置提供适当的状态生存时间(TTL) ,以防止状态大小过大。请注意,这可能会影响查询结果的正确性。有关详细信息,请参阅查询配置。

Apache Flink provides a set of performance tuning ways for Group Aggregation, see more Performance Tuning.
Apache Flink为分组聚合提供了一组性能调优方法,请参阅更多性能调优。

DISTINCT Aggregation

Distinct aggregates remove duplicate values before applying an aggregation function. The following example counts the number of distinct order_ids instead of the total number of rows in the Orders table.
Distinct聚合在应用聚合函数之前会删除重复的值。下面的示例统计不同order_id的数量,而不是Orders表中的总行数。

SELECT COUNT(DISTINCT order_id) FROM Orders

For streaming queries, the required state for computing the query result might grow infinitely. State size is mostly depends on the number of distinct rows and the time that a group is maintained, short lived group by windows are not a problem. You can provide a query configuration with an appropriate state time-to-live (TTL) to prevent excessive state size. Note that this might affect the correctness of the query result. See query configuration for details.
对于流式查询,计算查询结果所需的状态可能会无限增长。状态大小主要取决于不同行的数量和维护组的时间,短时间的逐窗口分组不是问题。您可以为查询配置提供适当的状态生存时间(TTL),以防止状态大小过大。请注意,这可能会影响查询结果的正确性。有关详细信息,请参阅查询配置。

GROUPING SETS

Grouping sets allow for more complex grouping operations than those describable by a standard GROUP BY. Rows are grouped separately by each specified grouping set and aggregates are computed for each group just as for simple GROUP BY clauses.
Grouping sets允许比标准GROUP BY描述的操作更复杂的分组操作。行按每个指定的分组集单独分组,并为每个组计算聚合,就像简单的GROUP BY子句一样。

SELECT supplier_id, rating, COUNT(*) AS total
FROM (VALUES('supplier1', 'product1', 4),('supplier1', 'product2', 3),('supplier2', 'product3', 3),('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS ((supplier_id, rating), (supplier_id), ())

Results:

+-------------+--------+-------+
| supplier_id | rating | total |
+-------------+--------+-------+
|   supplier1 |      4 |     1 |
|   supplier1 | (NULL) |     2 |
|      (NULL) | (NULL) |     4 |
|   supplier1 |      3 |     1 |
|   supplier2 |      3 |     1 |
|   supplier2 | (NULL) |     2 |
|   supplier2 |      4 |     1 |
+-------------+--------+-------+

Each sublist of GROUPING SETS may specify zero or more columns or expressions and is interpreted the same way as though it was used directly in the GROUP BY clause. An empty grouping set means that all rows are aggregated down to a single group, which is output even if no input rows were present.
GROUPING SETS的每个子列表可以指定零个或多个列或表达式,其解释方式与GROUP BY子句中直接使用的相同。空的分组集意味着所有行都被聚合到一个组中,即使没有输入行也会输出该组。

References to the grouping columns or expressions are replaced by null values in result rows for grouping sets in which those columns do not appear.
对分组列或表达式的引用将被结果行中的空值替换,这些空值用于对不显示这些列的集合进行分组。

For streaming queries, the required state for computing the query result might grow infinitely. State size depends on number of group sets and type of aggregation functions. You can provide a query configuration with an appropriate state time-to-live (TTL) to prevent excessive state size. Note that this might affect the correctness of the query result. See query configuration for details.
对于流式查询,计算查询结果所需的状态可能会无限增长。状态大小取决于分组集的数量和聚合函数的类型。您可以为查询配置提供适当的状态生存时间(TTL),以防止状态大小过大。请注意,这可能会影响查询结果的正确性。有关详细信息,请参阅查询配置。

ROLLUP

ROLLUP is a shorthand notation for specifying a common type of grouping set. It represents the given list of expressions and all prefixes of the list, including the empty list.
ROLLUP是用于指定通用类型分组集的简写符号。它表示给定的表达式列表和列表的所有前置列,包括空列表。

For example, the following query is equivalent to the one above.
例如,下面的查询等同于上面的查询。

SELECT supplier_id, rating, COUNT(*)
FROM (VALUES('supplier1', 'product1', 4),('supplier1', 'product2', 3),('supplier2', 'product3', 3),('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP (supplier_id, rating)

CUBE

CUBE is a shorthand notation for specifying a common type of grouping set. It represents the given list and all of its possible subsets - the power set.
CUBE是用于指定常用分组集类型的简写符号。它表示给定的列表及其所有可能的子集-幂集。

For example, the following two queries are equivalent.
例如,以下两个查询是等效的。

SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES('supplier1', 'product1', 4),('supplier1', 'product2', 3),('supplier2', 'product3', 3),('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY CUBE (supplier_id, rating, product_id)SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES('supplier1', 'product1', 4),('supplier1', 'product2', 3),('supplier2', 'product3', 3),('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SET (( supplier_id, product_id, rating ),( supplier_id, product_id         ),( supplier_id,             rating ),( supplier_id                     ),(              product_id, rating ),(              product_id         ),(                          rating ),(                                 )
)

HAVING

HAVING eliminates group rows that do not satisfy the condition. HAVING is different from WHERE: WHERE filters individual rows before the GROUP BY while HAVING filters group rows created by GROUP BY. Each column referenced in condition must unambiguously reference a grouping column unless it appears within an aggregate function.
HAVING消除不满足条件的组行。HAVING与WHERE不同:WHERE在GROUP BY之前过滤单个行,而HAVING则过滤GROUP B创建的组行。条件中引用的每个列必须明确引用分组列,除非它出现在聚合函数中。

SELECT SUM(amount)
FROM Orders
GROUP BY users
HAVING SUM(amount) > 50

The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. It is the same as what happens when the query contains aggregate functions but no GROUP BY clause. The query considers all selected rows to form a single group, and the SELECT list and HAVING clause can only reference table columns from within aggregate functions. Such a query will emit a single row if the HAVING condition is true, zero rows if it is not true.
即使没有GROUP BY子句,HAVING的存在也会将查询转换为分组查询。这与查询包含聚合函数但不包含GROUP BY子句时的情况相同。查询将所有选定行视为一个组,SELECT列表和HAVING子句只能引用聚合函数中的表列。如果HAVING条件为真,这样的查询将发出一行,如果不为真,则发出零行。

Flink SQL:Queries(Group Aggregation)相关推荐

  1. Flink SQL:Queries(Joins)

    Joins Batch Streaming Flink SQL supports complex and flexible join operations over dynamic tables. T ...

  2. Flink SQL:Queries(Windowing TVF)

    Windowing table-valued functions (Windowing TVFs) Batch Streaming Windows are at the heart of proces ...

  3. Flink SQL:Queries(Hints)

    SQL Hints Batch Streaming SQL hints can be used with SQL statements to alter execution plans. This c ...

  4. Flink SQL Client的Rolling Aggregation实验解析

    基本概念 stddev 这个stddev是Strandard Deviation的缩写 下面来分析一个FLINK SQL 执行Rolling Aggregation的例子 如下: SELECT mea ...

  5. Demo:基于 Flink SQL 构建流式应用

    摘要:上周四在 Flink 中文社区钉钉群中直播分享了<Demo:基于 Flink SQL 构建流式应用>,直播内容偏向实战演示.这篇文章是对直播内容的一个总结,并且改善了部分内容,比如除 ...

  6. Flink SQL 在字节跳动的优化与实践

    简介:Flink 在字节的应用实战 整理 | Aven (Flink 社区志愿者) 摘要:本文由 Apache Flink Committer,字节跳动架构研发工程师李本超分享,以四个章节来介绍 Fl ...

  7. 最佳实践|如何写出简单高效的 Flink SQL?

    摘要:本文整理自阿里巴巴高级技术专家.Apache Flink PMC 贺小令,在Flink Forward Asia 2022 生产实践专场的分享.本篇内容主要分为三个部分: Flink SQL I ...

  8. flink sql 知其所以然(八):flink sql tumble window 的奇妙解析之路

    感谢您的小爱心(关注  +  点赞 + 再看),对博主的肯定,会督促博主持续的输出更多的优质实战内容!!! 1.序篇-本文结构 大数据羊说 用数据提升美好事物发生的概率~ 34篇原创内容 公众号 源码 ...

  9. 万字详述 Flink SQL 4 种时间窗口语义!(收藏)

    DML:窗口聚合 大家好我是老羊,由于窗口涉及到的知识内容比较多,所以博主先为大家说明介绍下面内容时的思路,大家跟着思路走.思路如下: ⭐ 先介绍 Flink SQL 支持的 4 种时间窗口 ⭐ 分别 ...

最新文章

  1. Charles的功能操作
  2. Go连接及操作MySQL
  3. [转]xargs详解
  4. 记录ionic 最小化应用时所遇的问题
  5. 鸿蒙引领着未来,华为智慧屏V65图赏:鸿蒙OS引领未来
  6. excel单元格内加空格_Excel基础知识,你懂多少?
  7. Atitti 文本分类  以及 垃圾邮件 判断原理 以及贝叶斯算法的应用解决方案
  8. 微服务调用组件Ribbon底层调用流程分析
  9. 如何使用Bootbox?
  10. 关于训练误差、测试误差、泛化误差
  11. PowerPoint 在播放时自动运行宏
  12. Virtual Friends HDU - 3172(并查集)
  13. SAP SD客户寄售案例教程1
  14. D. Flood Fill
  15. substr()函数的用法
  16. jnz和djnz_djnz指令的应用方法
  17. 大龄程序员求职四处碰壁,不知今后该怎么办!网友:老码农有咩用
  18. 帮您解决开发SPI4W常见问题
  19. CTP程序化交易入门系列之四:行情订阅常见问题解答
  20. iOS内购代码(苹果支付ApplePay)

热门文章

  1. java获取时间戳和随机数
  2. 7-32 验证“哥德巴赫猜想” (20 分)
  3. 谈软件工程各环节中的辅助工具
  4. 打造 NestJS 日志系统
  5. Spring事务管理四大特性
  6. C# 静态类与非静态类、静态成员的区别
  7. 华为音量键只能调通话_原来华为手机的音量键要这么用,别只当做调音量,不然要吃大亏...
  8. 定制 Jenkins 镜像说明
  9. Linux红帽8.2系统中密钥加装和日志管理
  10. dapper mysql 批量_Dapper批量更新