Data Wrangling

  • Using `|`
  • sed:stream editor
  • wc
  • sort
  • uniq
  • paste
  • awk
  • R
  • BC
  • gnuplot
  • tee
  • Wrangling binary data
  • Regular expressions
    • capture groups
  • Exercise
    • 1. Take this [short interactive regex tutorial](https://regexone.com/).
    • 2. Find the number of words
    • 3. Use `man sed`
    • 4. boot time
    • 5.
    • 6.

https://missing.csail.mit.edu/2020/data-wrangling/

Using |

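  • ssh myserver journalctl | grep sshd | less: each | sends the output of the command on its left to the command on its right as input, so the output of ssh myserver journalctl goes to grep, and grep's output goes to less. less is a Unix command-line pager for browsing and searching text. It shows the text one screen at a time and supports searching and jumping to a given line. Use the arrow keys to scroll, / to start a search, and q to quit.
  • ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log: why is journalctl | grep sshd | grep "Disconnected from" wrapped in single quotes? The quotes pass the whole pipeline to ssh as one argument, so it runs on the remote machine and only the matching lines cross the network. The > ssh.log redirect sits outside the quotes and is handled by the local shell. In other words: do the filtering on the remote server, and then massage the data locally.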
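The same filtering can be tried locally without a server. This is a minimal sketch: the three log lines are made up for illustration, and sh -c stands in for the remote shell that ssh would start.

```shell
# Hypothetical journal excerpt standing in for `ssh myserver journalctl`.
log='Jan 17 03:13:00 host sshd[2631]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]
Jan 17 03:13:05 host cron[300]: job started
Jan 17 03:14:10 host sshd[2640]: Accepted publickey for alice from 10.0.0.5 port 50022'

# Each | feeds the previous command's stdout into the next command's stdin.
filtered=$(printf '%s\n' "$log" | grep sshd | grep "Disconnected from")
printf '%s\n' "$filtered"

# A quoted pipeline is handed to another shell as a single argument,
# which is what ssh does with the remote shell.
count=$(printf '%s\n' "$log" | sh -c 'grep sshd | grep -c "Disconnected from"')
echo "matching lines: $count"
```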

sed:stream editor

  With sed, you basically give short commands for how to modify the file, rather than manipulate its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is s: substitution. For example, we can write:

ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed 's/.*Disconnected from //'

  The s command is written in the form s/REGEX/SUBSTITUTION/, where REGEX is the regular expression you want to search for, and SUBSTITUTION is the text you want to substitute matching text with.
  sed's regular expressions are somewhat weird (and dated), and will require you to put a \ before most special characters to give them their special meaning. Or you can pass -E. For example, sed -E 's/(ab)//g' and sed 's/\(ab\)//g' are equivalent: both remove every occurrence of the string ab.
  sed can do all sorts of other interesting things, like injecting text (with the i command), explicitly printing lines (with the p command), selecting lines by index, and lots of other things. Check man sed!
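A small sketch of the s and p commands on made-up input (the two input lines are assumptions chosen for illustration):

```shell
# Made-up input lines, used only for illustration.
input='xabyabz
keep me'

# s/REGEX/SUBSTITUTION/ with the g flag: delete every "ab".
# With -E the parentheses group without needing backslashes.
no_ab=$(printf '%s\n' "$input" | sed -E 's/(ab)//g')
printf '%s\n' "$no_ab"

# -n suppresses automatic printing; "2p" prints only line 2,
# which is one way of selecting lines by index.
second=$(printf '%s\n' "$input" | sed -n '2p')
printf '%s\n' "$second"
```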

wc

  A word count program: it counts the lines, words, and bytes of its input.
wc -l counts lines only.

sort

  Sorts its input. By default, lines are compared in lexicographic order.
sort -n sorts in numeric order.
-k1,1 means "sort by only the first whitespace-separated column". The ,n part says "sort until the n-th field", where the default is the end of the line.
-r sorts in reverse order.

uniq

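  Given a sorted list of lines, uniq collapses adjacent duplicate lines so that each distinct line appears once.
uniq -c also counts how many times each line occurred, and prints each line in the form "count line".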

paste

  paste -sd, lets you combine lines (-s) by a given single-character delimiter (-d; , in this case).
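The tools above are usually chained together. The following sketch runs them on a made-up list of usernames (the data is hypothetical; the flags are the ones described in these notes):

```shell
# Hypothetical list of usernames, one per line.
users='root
admin
root
guest
root
admin'

# sort groups identical lines, uniq -c counts them,
# sort -nk1,1 orders by that count, tail keeps the two most frequent.
top=$(printf '%s\n' "$users" | sort | uniq -c | sort -nk1,1 | tail -n2)
printf '%s\n' "$top"

# Keep only the name column and join the lines with commas.
joined=$(printf '%s\n' "$top" | awk '{print $2}' | paste -sd, -)
echo "$joined"

# wc -l counts the input lines.
total=$(printf '%s\n' "$users" | wc -l)
echo "lines: $total"
```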

awk

  Consider awk '{print $2}' first: what does {print $2} do? awk programs take the form of an optional pattern plus a block saying what to do if the pattern matches a given line. The default pattern (used when none is given) matches all lines. Inside the block, $0 is set to the entire line's contents, and $1 through $n are set to the n-th field of that line, when separated by the awk field separator (whitespace by default, change with -F). In this case, we're saying that, for every line, print the contents of the second field!
| awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }'
  Here we now have a pattern (the stuff that goes before {...}). The pattern says that the first field of the line should be equal to 1 (that's the count from uniq -c), and that the second field should match the given regular expression. And the block just says to print the username.

awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 } END { print rows }'
  BEGIN is a pattern that matches the start of the input (and END matches the end). Now, the per-line block just adds the count from the first field (although it'll always be 1 in this case), and then we print it out at the end.
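Both awk programs can be tried on a few made-up lines in the "count name" format that uniq -c produces (the names and counts below are assumptions for illustration):

```shell
# Hypothetical `uniq -c` style output: a count followed by a username.
counts='1 claire
3 root
1 chase
1 bob
2 carlie'

# Pattern + block: print the name only when the count is 1 and the
# name starts with c and ends with e.
names=$(printf '%s\n' "$counts" | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }')
printf '%s\n' "$names"

# BEGIN/END: accumulate over the matching lines, print once at the end.
rows=$(printf '%s\n' "$counts" | awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += $1 } END { print rows }')
echo "$rows"
```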

R

  Another (weird) programming language.
| awk '{print $1}' | R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
  scan reads the input stream of numbers into a vector, and summary prints summary statistics for that vector.

BC

  bc is an arbitrary-precision calculator language; the name is usually expanded as "basic calculator".
echo "1+2" | bc evaluates 1+2 and prints 3.

gnuplot

  A simple plotting tool.
| sort -nk1,1 | tail -n10 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'

tee

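  Saves the output to a file while still showing it on the screen. Usage: tee [OPTION] FILE
-a appends to the file instead of overwriting it
-i ignores interrupt signals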
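A short sketch of tee and tee -a, using a temporary directory (the file name out.txt is an arbitrary choice for this example):

```shell
# A throwaway directory so the example leaves nothing behind.
tmpdir=$(mktemp -d)

# tee writes its input to the file and also passes it on down the pipe.
shown=$(printf 'one\ntwo\n' | tee "$tmpdir/out.txt" | grep two)
echo "$shown"

# -a appends to the file instead of overwriting it.
printf 'three\n' | tee -a "$tmpdir/out.txt" >/dev/null

saved=$(cat "$tmpdir/out.txt")
printf '%s\n' "$saved"
rm -r "$tmpdir"
```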

Wrangling binary data

  For example, we can use ffmpeg to capture an image from our camera, convert it to grayscale, compress it, send it to a remote machine over SSH, decompress it there, make a copy, and then display it.
ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - | convert - -colorspace gray - | gzip | ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'

Regular expressions

  Regular expressions are usually (though not always) surrounded by /. For example, /.*Disconnected from / is a regular expression. An online regex debugger is useful for testing patterns like this.

  • . matches any single character except newline
  • * matches zero or more of the preceding match
  • + matches one or more of the preceding match
  • ? matches zero or one of the preceding match

  * and + are, by default, “greedy”. They will match as much text as they can.
   For example, the line Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user Disconnected from 46.97.239.16 port 55920 [preauth] becomes 46.97.239.16 port 55920 [preauth] after sed 's/.*Disconnected from //', because .* greedily matches up to the last "Disconnected from ".
  In some regular expression implementations, you can just suffix * or + with a ? to make them non-greedy, but sadly sed doesn't support that. We could switch to perl's command-line mode though, which does support that construct: perl -pe 's/.*?Disconnected from //'

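  • \d any digit
  • \w any letter, digit, or underscore
  • \s any whitespace character (space, tab, newline, etc.)

The uppercase form negates the class, e.g. \D matches any non-digit.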

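  • [abc] any one character of a, b, and c. [a-z] matches any lowercase letter; [^0-9] matches any non-digit character (including newline).
  • [^ ] matches any character that is not a space. Inside square brackets [], a leading ^ negates the set, so [^ ] means "any non-space character".
  • (RX1|RX2) either something that matches RX1 or RX2; | means "or".
  • ^ and $ match the start and end of the line: ^a matches an a at the start of a line, a$ matches an a at the end. (In vim, ^ jumps to the first non-blank character of the line, 0 jumps to the very start of the line, and $ jumps to the end.)
  • \b matches a word boundary (the boundary between a word and a non-word character), which is particularly useful for matching whole words, e.g. \b\w+\b.
  • {} matches the preceding item a specific number of times: b{4} matches bbbb, a{2,4} matches 2 to 4 a's, and c{3,} matches 3 or more c's.
  • () capture group
  • (?:...) non-capturing group, in engines that support it (e.g. Perl); a ? after a group, as in (...)?, makes the group optional.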
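A few of the constructs listed above can be tried with grep -E. The sample lines are made up, and only portable extended-regex syntax is used (no \d, \w, or \b, which not every grep supports):

```shell
# Made-up sample lines.
sample='bbbb
bbb
cat
dog
bird
a1'

# ^ and $ anchor the match to the whole line; {4} means exactly four.
four_b=$(printf '%s\n' "$sample" | grep -E '^b{4}$')

# (RX1|RX2) matches either alternative.
pets=$(printf '%s\n' "$sample" | grep -E '^(cat|dog)$')

# [^0-9] is any non-digit; -c counts the lines made only of non-digits.
no_digits=$(printf '%s\n' "$sample" | grep -cE '^[^0-9]+$')

printf '%s\n' "$four_b" "$pets" "$no_digits"
```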

capture groups

  Any text matched by a regex surrounded by parentheses is stored in a numbered capture group. These are available in the substitution (and in some engines, even in the pattern itself!) as \1, \2, \3, etc.
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
  Here \2 refers to the text matched by the second group, (.*), which is the username.
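Applied to two made-up log lines in the same format (the usernames and addresses are assumptions for illustration), the substitution keeps only the username:

```shell
# Two hypothetical log lines in the format used throughout these notes.
log='Jan 17 03:13:00 thesquareplanet.com sshd[2631]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]
Jan 17 03:14:00 thesquareplanet.com sshd[2700]: Disconnected from authenticating user root 10.0.0.7 port 40000 [preauth]'

# \2 is the second parenthesised group, i.e. the username.
users=$(printf '%s\n' "$log" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')
printf '%s\n' "$users"
```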

Exercise

1. Take this short interactive regex tutorial.

2. Find the number of words

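  Find the number of words (in /usr/share/dict/words) that contain at least three a's and don't have a 's ending. What are the three most common last two letters of those words? sed's y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?

  1. grep -iE '(.*a){3}[a-z]*$' words | tr '[:upper:]' '[:lower:]' | sed -E 's/.*(..)$/\1/' | sort | uniq -c | sort -nr -k1,1 | wc -l prints the number of distinct two-letter endings. Dropping the trailing | wc -l prints the endings ordered by how often they occur, most common first. (grep -E, awk, and sed -E use POSIX extended regular expressions, which have no non-greedy .*?, so the pattern uses plain .* instead.)
  2. Challenge: comm -23 <(echo {a..z}{a..z} | tr ' ' '\n' | sort) <(grep -iE '(.*a){3}[a-z]*$' words | tr '[:upper:]' '[:lower:]' | sed -E 's/.*(..)$/\1/' | sort -u) lists the two-letter combinations that never occur. (rev | cut -c1-2 | rev is an alternative way to take the last two letters.)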

3. Use man sed

  To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

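  sed s/REGEX/SUBSTITUTION/ input.txt > input.txt destroys the file: the shell truncates input.txt to set up the redirection before sed even starts, so sed reads an empty file and the original data is lost.
  Use the -i option instead, which edits the file in place, e.g. sed -i 's/REGEX/SUBSTITUTION/' input.txt (GNU sed; BSD/macOS sed requires a suffix argument, which may be empty: -i '').
  The full form is -i[SUFFIX], where the optional SUFFIX is appended to the name of a backup file; by default no backup is kept. For example, sed -i.bak 's/old/new/' input.txt replaces the first "old" on each line of input.txt with "new" and saves the original file as input.txt.bak, so the substitution happens in place without losing data.
  This problem is not specific to sed. The truncation is done by the shell, so any command (awk, perl, grep, ...) loses its input when its output is redirected to the file it reads.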
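Both behaviours can be reproduced safely in a temporary directory. This is a sketch with a made-up one-line file:

```shell
tmpdir=$(mktemp -d)
printf 'old value\n' > "$tmpdir/input.txt"

# The shell truncates the target of > before sed starts, so sed reads
# an already empty file and writes nothing back.
sed 's/old/new/' "$tmpdir/input.txt" > "$tmpdir/input.txt"
broken_size=$(wc -c < "$tmpdir/input.txt")

# Safe variant: -i edits in place and the .bak suffix keeps a backup.
printf 'old value\n' > "$tmpdir/input.txt"
sed -i.bak 's/old/new/' "$tmpdir/input.txt"
fixed=$(cat "$tmpdir/input.txt")
backup=$(cat "$tmpdir/input.txt.bak")

echo "bytes left after the bad redirect: $broken_size"
echo "after -i.bak: $fixed (backup: $backup)"
rm -r "$tmpdir"
```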

4. boot time

  Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
Logs begin at ... and systemd[577]: Startup finished in ...
On macOS, look for:
=== system boot: and Previous shutdown cause: 5

journalctl --list-boots | sed -E 's/^.{51}(.{8}).{20}(.{8}).{4}$/\1 \2/' | awk '{split($1, a, ":"); split($2, b, ":"); if (b[3]<a[3]) {b[2]--; b[3]+=60} if (b[2]<a[2]) {b[1]--; b[2]+=60} if (b[1]<a[1]) {b[1]+=24} print ((b[1]-a[1])*3600 + (b[2]-a[2])*60 + (b[3]-a[3]))}' | sort -nr | awk '{a[NR]=$0;s++}END{if (s%2==1) print a[int(s/2)+1]; else print (a[s/2]+a[s/2+1])/2}'
  This prints the median, in seconds, of the interval between the first and last timestamps that journalctl --list-boots reports for each boot. The fixed offsets in the sed expression (.{51}, .{8}, .{20}, .{4}) assume one particular column layout of that output, so they may need adjusting on other systems, and the subtraction assumes each interval is shorter than 24 hours.
  In awk, NR is a predefined variable holding the number of records read so far: it is 1 while the first line is processed, 2 for the second, and so on. The last awk command uses NR as the index when storing each number in the array a (a[1], a[2], ...), and then picks the middle element (or the mean of the two middle elements) as the median.
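Once the durations are available one per line, the average, median, and maximum can all come from a single awk program. A sketch with made-up durations in seconds:

```shell
# Hypothetical boot durations in seconds, one per line.
times='12
7
30
9
15'

stats=$(printf '%s\n' "$times" | sort -n | awk '
  { a[NR] = $1; sum += $1 }               # after sort -n, a[] is ascending
  END {
    if (NR % 2 == 1) median = a[(NR + 1) / 2]
    else             median = (a[NR / 2] + a[NR / 2 + 1]) / 2
    print "avg=" sum / NR, "median=" median, "max=" a[NR]
  }')
echo "$stats"
```

Sorting first means the maximum is simply the last array element and the median is the middle one.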

5.

Look for boot messages that are not shared between your past three reboots (see journalctl's -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always vary (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
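The steps can be sketched on made-up data. The three files below are hypothetical stand-ins for the output of journalctl -b, journalctl -b -1, and journalctl -b -2, and the timestamp pattern assumes the "Mon DD HH:MM:SS host" prefix shown in these notes:

```shell
tmpdir=$(mktemp -d)
# Hypothetical logs from three boots.
printf 'Jan 01 10:00:01 host kernel: Linux version\nJan 01 10:00:02 host systemd: Started A\n' > "$tmpdir/boot0"
printf 'Jan 02 09:00:01 host kernel: Linux version\nJan 02 09:00:03 host systemd: Started A\nJan 02 09:00:04 host systemd: Failed B\n' > "$tmpdir/boot1"
printf 'Jan 03 08:00:01 host kernel: Linux version\nJan 03 08:00:02 host systemd: Started A\n' > "$tmpdir/boot2"

unshared=$(
  for f in "$tmpdir/boot0" "$tmpdir/boot1" "$tmpdir/boot2"; do
    # Drop the varying timestamp and host prefix, then count each
    # message at most once per boot.
    sed -E 's/^[A-Za-z]+ +[0-9]+ [0-9:]+ [^ ]+ //' "$f" | sort -u
  done | sort | uniq -c | awk '$1 != 3' | sed -E 's/^ *[0-9]+ //'
)
printf '%s\n' "$unshared"
rm -r "$tmpdir"
```

The sort -u inside the loop is a design choice: without it, a message repeated within one boot would inflate the count and could be mistaken for a shared one.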

6.

Find an online data set. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.
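The last two steps can be sketched on local data before trying a real download. The comma-separated values below are made up, and fetching them with curl is left out:

```shell
# Hypothetical two-column numeric data (comma separated).
data='3,10
8,2
5,7'

# Min and max of column 1 in one pipeline: sort numerically,
# then print the first and last lines.
minmax=$(printf '%s\n' "$data" | cut -d, -f1 | sort -n | sed -n '1p;$p' | paste -sd' ' -)

# Difference between the sums of the two columns.
diff_sums=$(printf '%s\n' "$data" | awk -F, '{ s1 += $1; s2 += $2 } END { print s1 - s2 }')
echo "min/max: $minmax, sum difference: $diff_sums"
```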
