
Azure Data Explorer (Kusto) is one of the most specialized relational databases on the market. The whole system runs on SSD and in memory to offer fast, responsive data analysis. It can be a good option for warm-path data storage.

For various reasons, such as a malfunctioning client or an imperfect data pipeline, data can be ingested into Kusto multiple times, which leads to data duplication. The problem is even more severe if the ingested data is summary data, such as the overall revenue for a group of stores, because every duplicated row inflates the totals computed from it.

Data duplication can corrupt all downstream data analysis, and people may make wrong decisions based on it. Therefore, data cleaning and deduplication are necessary. Before that, we first need to confirm whether the current Kusto table has a duplication issue.

The confirmation step is the main focus of this article.


The main idea contains the following steps:


  1. Connect to the Kusto cluster.
  2. Query the table schema.
  3. Create a unique identification per row.
  4. Count the rows that share the same identification.
  5. Find any identification value with count > 1 and mark it as a duplication.

Connect to Kusto Cluster

Python can connect to Kusto through the Azure Data Explorer Python SDK. Here, we use the azure-kusto-data package.

To query a Kusto cluster, we create a KustoClient. Before connecting to Kusto, we need to create an AppId and register it with the Kusto cluster.
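
The client creation can be sketched roughly as follows. This is a minimal sketch, not the original snippet: the helper name make_client and its parameters are hypothetical, and the import path assumes a recent release of azure-kusto-data (older releases exposed the same classes under azure.kusto.data.request).

```python
def make_client(cluster_url, app_id, app_key, tenant_id):
    """Build a KustoClient that authenticates as a registered AAD application.

    All four arguments are placeholders that you supply for your own cluster.
    """
    # Imported inside the function so that this helper can be defined even
    # before the azure-kusto-data package is installed.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
        cluster_url, app_id, app_key, tenant_id
    )
    return KustoClient(kcsb)


# Example call (values are placeholders, not real credentials):
# client = make_client("https://<cluster>.<region>.kusto.windows.net",
#                      "<app id>", "<app key>", "<tenant id>")
```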

Query Table Schema

getschema returns the table schema. The query result can be converted into a familiar Pandas DataFrame, which makes further processing easy.
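
A schema query along these lines might look like the sketch below. The function names, and the database and table arguments, are assumptions for illustration; dataframe_from_result_table is the helper shipped with azure-kusto-data for converting a result table to a DataFrame.

```python
def schema_query(table):
    """Return the Kusto query that describes the columns of `table`."""
    return f"{table} | getschema"


def get_schema(client, database, table):
    """Run getschema and return the result as a Pandas DataFrame.

    The resulting frame has one row per table column, including the
    ColumnName and ColumnType fields used later on.
    """
    # Imported lazily so the query builder above works without the SDK.
    from azure.kusto.data.helpers import dataframe_from_result_table

    response = client.execute(database, schema_query(table))
    return dataframe_from_result_table(response.primary_results[0])


# Example call (names are placeholders):
# schema = get_schema(client, "<database>", "<table>")
```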

Unique Identification Per Row


Assume the table is a summary table and that no subset of columns can uniquely identify a row. We therefore use all the columns in the schema to create the identification: the concatenation of all the column values.

To do that, we convert all non-string data to strings with Kusto's tostring() function. That is the purpose of schema.apply(..., axis=1): with axis=1, the function is applied to the schema DataFrame row by row, that is, once per table column.

Finally, Kusto's strcat() concatenates all the columns, using the per-column expressions defined by hashOp.
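
One way to build that expression is sketched below. The names hash_op and build_hash_kusto stand in for the article's hashOp and hashKusto, and the sample schema rows are made up for illustration. The sketch uses plain dictionaries so that it runs without a cluster; with a real schema DataFrame, the list comprehension would be replaced by schema.apply(hash_op, axis=1).

```python
def hash_op(row):
    """Return the Kusto expression that yields one column as a string.

    `row` is one row of the getschema result: anything that supports
    row["ColumnName"] and row["ColumnType"], such as a dict or a
    Pandas Series.
    """
    name, col_type = row["ColumnName"], row["ColumnType"]
    # String columns are used as-is; every other type goes through tostring().
    return name if col_type == "string" else f"tostring({name})"


def build_hash_kusto(ops):
    """Join the per-column expressions into a single strcat(...) call."""
    return "strcat(" + ", ".join(ops) + ")"


# Hypothetical schema of a summary table, one entry per column.
schema_rows = [
    {"ColumnName": "store_group", "ColumnType": "string"},
    {"ColumnName": "day", "ColumnType": "datetime"},
    {"ColumnName": "revenue", "ColumnType": "real"},
]

ops = [hash_op(row) for row in schema_rows]
hash_kusto = build_hash_kusto(ops)
print(hash_kusto)
# strcat(store_group, tostring(day), tostring(revenue))
```

The sketch assumes simple column names; names containing spaces or other special characters would need Kusto's bracket quoting.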


If, for another table, we know that a subset of columns uniquely identifies a row, such as the combination of user_id and order_id, we can build hashKusto from only those columns instead.
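
For the key-column case, a sketch could look like this. The helper name build_key_hash is hypothetical; user_id and order_id come from the example above. Every key column is wrapped in tostring(), which leaves string columns unchanged, so the schema lookup is not needed here.

```python
def build_key_hash(key_columns):
    """Build a strcat(...) expression over the identifying columns only."""
    parts = [f"tostring({column})" for column in key_columns]
    return "strcat(" + ", ".join(parts) + ")"


hash_kusto = build_key_hash(["user_id", "order_id"])
print(hash_kusto)
# strcat(tostring(user_id), tostring(order_id))
```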

Same Identification Value Count and Find Duplications

Notice that the hashKusto value we created above is used with the extend operator in the Kusto query. That creates an additional column, hash, in the query result. We then use summarize to get the count for each identification hash.

Finally, the duplicated records are the ones with recordsCount > 1.
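
Put together, the duplication check could be written along the following lines. The function names and the table and database arguments are assumptions for illustration. The column names hash and recordsCount follow the text above.

```python
def duplicate_query(table, hash_kusto):
    """Return a Kusto query that lists identification values seen more than once."""
    return (
        f"{table}\n"
        f"| extend hash = {hash_kusto}\n"
        "| summarize recordsCount = count() by hash\n"
        "| where recordsCount > 1"
    )


def find_duplicates(client, database, table, hash_kusto):
    """Run the duplication query and return the result as a Pandas DataFrame."""
    # Imported lazily so the query builder above works without the SDK.
    from azure.kusto.data.helpers import dataframe_from_result_table

    response = client.execute(database, duplicate_query(table, hash_kusto))
    return dataframe_from_result_table(response.primary_results[0])


# Inspect the generated query for a hypothetical Orders table.
print(duplicate_query("Orders", "strcat(tostring(user_id), tostring(order_id))"))
```

An empty result means that no identification value appears more than once, so the table has no duplicated rows under the chosen identification.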

Takeaway:

Using Python, we established a simple and straightforward way to verify and identify duplicated rows within a Kusto table. This offers solid ground for all subsequent data analysis.

