Using Python to Find Kusto Tables with Data Duplication Issues

Azure Data Explorer (Kusto) is one of the most specialized relational databases on the market. The whole system runs on SSD and memory to offer fast, responsive data analysis. It can be a good option for warm-path data storage.

For various reasons, such as a malfunctioning client or an imperfect data pipeline, data can be ingested into Kusto multiple times, which leads to data duplication. The problem can be even more severe if the ingested data is summary data, such as the overall revenue for a group of stores.

Data duplication can distort all downstream data analysis, and people may make wrong decisions based on it. Therefore, data cleaning and deduplication are necessary. Before that, we first need to confirm whether the current Kusto table has a duplication issue.

The confirmation step is the main focus of this article.

The approach consists of the following steps:

1. Connect to the Kusto cluster.
2. Query the table schema.
3. Create a unique identification for each row.
4. Count the rows that share the same identification.
5. Find any identification value with count > 1 and mark it as a duplication.
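
Before involving Kusto at all, the last three steps can be illustrated on a small in-memory table. This is only a sketch in plain Python: the sample rows, the "|" separator, and the function name find_duplicates are made up for illustration and do not come from the original article.

```python
from collections import Counter


def find_duplicates(rows):
    """Return {identification: count} for identifications seen more than once."""
    # Create one identification string per row from all of its column values.
    ids = ["|".join(str(value) for value in row) for row in rows]
    # Count the rows that share the same identification.
    counts = Counter(ids)
    # Keep only the identifications that occur more than once.
    return {ident: n for ident, n in counts.items() if n > 1}


# Hypothetical summary rows: (store group, day, revenue).
rows = [
    ("store_a", "2020-08-01", 1200.0),
    ("store_b", "2020-08-01", 950.0),
    ("store_a", "2020-08-01", 1200.0),  # the same record ingested twice
]
duplicates = find_duplicates(rows)
print(duplicates)
```

The rest of the article does the same thing, except that the concatenation and counting run inside Kusto instead of in Python.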

Connect to Kusto Cluster

Python can connect to Kusto through the Azure Data Explorer Python SDK. Here, we use the azure-kusto-data package.

We first create a KustoClient, which is used to query the Kusto cluster. Before we connect to Kusto, we need to create an AppId and register it with the Kusto cluster.
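
A minimal sketch of creating the client with azure-kusto-data might look like the following. The function name create_kusto_client and its parameter names are illustrative, and the placeholder values in the commented usage are assumptions to be replaced with your own cluster URI, AppId, app key, and tenant ID. Older releases of the package expose the same classes under azure.kusto.data.request instead.

```python
def create_kusto_client(cluster_uri, app_id, app_key, tenant_id):
    """Create a KustoClient authenticated as a registered AAD application."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    # Build a connection string that authenticates with the AppId and its key.
    kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(
        cluster_uri, app_id, app_key, tenant_id
    )
    return KustoClient(kcsb)


# Hypothetical usage (all values are placeholders):
# client = create_kusto_client(
#     "https://<cluster>.<region>.kusto.windows.net",
#     "<app-id>", "<app-key>", "<tenant-id>",
# )
```

In practice the app key would usually come from a secret store or an environment variable, not from source code.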

Query Table Schema

The getschema operator returns the table schema. The result returned by KustoClient can be converted into a familiar Pandas DataFrame, which makes further processing easy.
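
As a rough sketch, the schema query and the conversion to a DataFrame could be written as below. The function names and the table name StoreRevenue are hypothetical. The conversion relies on the dataframe_from_result_table helper shipped with azure-kusto-data, which needs pandas installed.

```python
def schema_query(table):
    """Build the KQL query that returns the schema of `table`."""
    return f"{table} | getschema"


def fetch_schema(client, database, table):
    """Run getschema and return the result as a pandas DataFrame."""
    # Imported lazily so the query builder above works without the SDK.
    from azure.kusto.data.helpers import dataframe_from_result_table

    response = client.execute(database, schema_query(table))
    # The first primary result holds one row per column of the table.
    return dataframe_from_result_table(response.primary_results[0])


print(schema_query("StoreRevenue"))
```

The returned DataFrame has one row per table column, including the ColumnName and ColumnType fields that the later steps rely on.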

Unique Identification Per Row

Assume the table is a summary table, and no subset of columns uniquely identifies a row. Therefore, we use all the columns in the schema to create the identification. The identification is the concatenation of all the column values.

To concatenate the values, we convert all non-string data to strings by using the tostring() function. That is the purpose of schema.apply(axis=1), where axis=1 walks over the schema DataFrame row by row, that is, one table column at a time.

Finally, Kusto's strcat() concatenates all the columns, based on the operations defined in hashOp.
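
One way the hashOp and hashKusto construction might look is sketched below. The schema is represented here as a plain list of dictionaries with made-up column names so that the example runs without pandas; ColumnName and ColumnType are the field names that getschema returns. The helper names are illustrative, not the original author's code.

```python
def to_string_expr(row):
    """Return a KQL expression that yields the column value as a string."""
    name, kusto_type = row["ColumnName"], row["ColumnType"]
    # String columns are used as-is; every other type is wrapped in tostring().
    return name if kusto_type == "string" else f"tostring({name})"


def build_hash_extension(schema_rows):
    """Build the `hash = strcat(...)` expression over all columns."""
    hash_op = [to_string_expr(row) for row in schema_rows]
    return f"hash = strcat({', '.join(hash_op)})"


# Hypothetical schema of a summary table.
schema_rows = [
    {"ColumnName": "store_group", "ColumnType": "string"},
    {"ColumnName": "day", "ColumnType": "datetime"},
    {"ColumnName": "revenue", "ColumnType": "real"},
]
hash_kusto = build_hash_extension(schema_rows)
print(hash_kusto)
```

With a Pandas DataFrame named schema, the same per-row function can be applied as schema.apply(to_string_expr, axis=1). One caveat worth keeping in mind: concatenating without a separator can make different rows look identical (for example "ab" + "c" and "a" + "bc"), so adding a delimiter between the parts may be safer.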

For another table, we may know that a subset of columns uniquely identifies a row, such as the combination of user_id and order_id. In that case, we can build hashKusto from only those columns.
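
A sketch of this subset variant follows. The column names, their types, and the function name build_key_hash_extension are assumptions for illustration. The function also checks that the key columns exist in the schema, which is an addition not described in the article.

```python
def build_key_hash_extension(key_columns, column_types):
    """Build `hash = strcat(...)` from the key columns only."""
    missing = [col for col in key_columns if col not in column_types]
    if missing:
        raise KeyError(f"columns not in schema: {missing}")
    # Wrap non-string key columns in tostring(), as in the all-columns case.
    parts = [
        col if column_types[col] == "string" else f"tostring({col})"
        for col in key_columns
    ]
    return f"hash = strcat({', '.join(parts)})"


# Hypothetical mapping of column name to Kusto type, e.g. taken from getschema.
column_types = {"user_id": "string", "order_id": "long", "amount": "real"}
key_hash_kusto = build_key_hash_extension(["user_id", "order_id"], column_types)
print(key_hash_kusto)
```

Using only the key columns keeps the concatenated string short, but it treats two rows with the same key and different remaining values as duplicates, which may or may not be what is wanted.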

Same Identification Value Count and Find Duplications

Notice that the hashKusto value we created above is used with the extend operator in the Kusto query. That creates an additional column, hash, in the query result. We then use summarize to get the count for each identification hash.

Finally, the duplicated records are the ones with recordsCount > 1.
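
Putting the extend, summarize, and filter steps together, the duplication check could be assembled roughly as follows. The table name and the example hashKusto string are hypothetical. The function only builds the query text, and the commented lines show how it might be executed with a KustoClient.

```python
def build_duplicate_query(table, hash_kusto):
    """Build a KQL query that lists identifications occurring more than once."""
    return (
        f"{table}\n"
        f"| extend {hash_kusto}\n"  # adds the `hash` column
        "| summarize recordsCount = count() by hash\n"  # rows per identification
        "| where recordsCount > 1"  # keep only duplicated identifications
    )


query = build_duplicate_query(
    "StoreRevenue",
    "hash = strcat(store_group, tostring(day), tostring(revenue))",
)
print(query)

# Hypothetical execution, assuming `client` is a KustoClient and "<database>"
# is replaced with a real database name:
# response = client.execute("<database>", query)
# has_duplicates = len(response.primary_results[0].rows) > 0
```

An empty result means that no identification occurs more than once, so the table shows no duplication under the chosen identification.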

Takeaway

By using Python, we establish a simple and straightforward way to verify and identify duplicated rows within a Kusto table. This offers solid ground for all subsequent data analysis.

Translated from: https://towardsdatascience.com/use-data-brick-to-verify-azure-explore-kusto-data-duplication-issue-36abd238d582
