perf使用实例详解
perf
架构图总览
Events
事件主要有哪些
hardware events:CPU performance monitoring counters
software events: 基于kernel counters的低水平事件,比如cpu迁移、minor faults、major faults等等
kernel tracepoint events:编码嵌入在内核中的内核级别静态测试点
User statically-defined tracing(USDT): 用户级别的静态测试点
Dynamic Tracing: dynamic software instrumentation points that can be created anywhere. Kernel-side tracing uses the kprobes framework; user-side tracing uses uprobes.
Timed Profiling: perf record -F <Hz> samples at the specified frequency. This is commonly used for CPU usage profiling and works by creating custom timed interrupt events.
Hardware [Cache] Events: These instrument low-level processor activity based on CPU performance counters. For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. Some will be listed as Hardware Cache Events.
Software Events: These are low-level events based on kernel counters. For example, CPU migrations, minor faults, major faults, etc.
Tracepoint Events: These are kernel-level events based on the ftrace framework. The tracepoints are placed in interesting and logical locations of the kernel, so that higher-level behavior can be easily traced. For example, system calls, TCP events, file system I/O, disk I/O, etc. They are grouped into libraries of tracepoints; e.g. "sock:" for socket events, "sched:" for CPU scheduler events.
Dynamic Tracing: Software can be dynamically instrumented, creating events in any location. For kernel software, this uses the kprobes framework; for user-level software, uprobes.
Timed Profiling: Snapshots can be collected at an arbitrary frequency, using perf record -F <Hz>. This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.
PMU:
Most processors nowadays have special on-chip hardware that monitors microarchitectural events such as elapsed cycles, cache hits, and cache misses. The PMU is the subsystem that helps analyze how an application or operating system performs on the processor.
The Performance Monitoring Events can be broadly categorized in two types
• Hardware
Ex: CPU‐Cycles, Instructions, Cache References
• Software
Ex: Page Fault, Context Switch, etc
page fault
Linux 内核给每个进程都提供了一个独立的虚拟地址空间,并且这个地址空间是连续的。这样,进程就可以很方便地访问内存,更确切地说是访问虚拟内存。虚拟地址空间的内部又被分为内核空间和用户空间两部分。并不是所有的虚拟内存都会分配物理内存,只有那些实际使用的虚拟内存才分配物理内存,并且分配后的物理内存,是通过内存映射来管理的。
Memory mapping means mapping virtual memory addresses to physical memory addresses. To implement it, the kernel maintains a page table for every process, recording the virtual-to-physical mappings. (The page table itself lives in main memory; the CPU's memory management unit, the MMU, walks it and caches recently used entries in the TLB.) When a process accesses a virtual address with no valid page-table entry, the CPU raises a page fault: the kernel allocates physical memory, updates the process's page table, then returns to user space and resumes the process.
Page faults come in several kinds: major page faults, minor page faults, and invalid faults (segment faults).
A major page fault, also called a hard page fault, occurs when the needed data is not mapped and not in physical memory either, so it must be loaded from a slow device such as a disk; swapping a page back in from swap space is also a hard page fault. While the page is read in, the faulting process cannot make progress; only after the data is in physical memory and the mapping is established can the process access that part of its address space.
A minor page fault, also called a soft page fault, occurs when the needed page is not yet mapped in the process's page table but is already in physical memory, so the kernel only has to establish the mapping between the physical page and the process's virtual address space.
- When a process calls malloc to obtain a virtual address range, the first access to that range triggers a soft page fault.
- When several processes access data in the same shared memory, some of them may not have established the mapping yet, so their first access triggers a soft page fault.
An invalid fault, also called a segment fault, occurs when a process accesses an address outside its virtual address space. This is an out-of-bounds access, and the kernel reports a segmentation fault.
linux内核映像文件分类
zImage
zImage is the compressed kernel image commonly used on ARM Linux; on x86, the classic zImage had to fit in low memory (roughly 512 KB).
bzImage
"big zImage", for kernels too large for the classic zImage limit; it is gzip-compressed just like zImage.
uImage
u-boot专用的映像文件,它是在zImage上加上一个长度为0x40的“头部”,包含了这个映像文件的类型、加载位置、生成时间、大小等信息。如果直接从zImage的0x40位置开始加载,其和zImage就没有区别。
vmlinuz
可引导、压缩的内核。“vm”代表“virtual memory”。Linux支持虚拟内存。vmlinuz是可执行的linux内核。
vmlinux
未压缩的linux内核,vmlinuz是vmlinux的压缩文件。
initrd-xxx.img
initrd是initial ramdisk的缩写,initrd一般被用来临时的引导硬件到实际内核vmlinuz能够接管并继续引导的状态
perf详解
1. 简介
perf工具是基于linux内核提供的perf_event接口工作的。
2. 命令行
root@ubuntu:~# perf -h

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.
子功能表
annotate | perf annotate用于解析由perf record记录的数据文件perf.data并将代码注解显示。如果源代码开启了debug符号,则源码和汇编一起解析。如果源码未开启debug,则解析汇编代码 |
archive | Packs all sampled ELF files, located via the build-ids recorded in the data file, into an archive. With this archive, the samples recorded in the data file can be analyzed on any machine. |
bench | perf中内置的benchmark。子系统:调度器和IPC机制、内存管理、NUMA调度、futex压力基准、epoll压力基准等 |
buildid-cache | 管理perf的buildid缓存,每个elf文件都有一个独一无二的buildid。buildid被perf用来关联性能数据与elf文件。 |
buildid-list | 列出perf.data文件中的buildid |
c2c | 用于调试cache to cache的false sharing问题,用于Shared Data C2C/HITM分析,可以追踪cacheline竞争问题 |
config | perf config用于读取和配置 .perfconfig配置文件 |
diff | 对比两个数据文件的差异。能够给出每个符号(函数)在热点分析上的具体差异。 |
evlist | 列出数据文件perf.data中所有性能事件 |
ftrace | 是内核ftrace功能的简化封装,可以跟踪指定进程的内核函数调用栈 |
inject | 该工具读取perf record工具记录的事件流,并将其定向到标准输出 |
kallsyms | 查找运行中的内核符号 |
kmem | 针对内核内存(slab)子系统进行追踪测量的工具 |
kvm | 用于测试kvm客户机的性能参数 |
list | 列出event事件 |
lock | 分析内核锁统计信息 |
mem | 测试内存存取性能数据 |
record | 运行一个命令,并将其数据保存到perf.data中。随后,可以使用perf report进行分析 |
report | 显示perf数据 |
sched | 分析调度器性能 |
script | Reads perf.data (created by perf record) and displays the trace output; can also drive analysis scripts |
stat | perf stat能完整统计应用整个生命周期的信息 |
test | 用于sanity test |
timechart | Generates a chart visualizing total system behavior during a workload |
top | 类似linux的top命令,查看整体性能 |
version | 查看版本信息 |
probe | 动态监测点 |
trace | 跟踪系统调用 |
2.1 annotate
annotate中文意思:
vi. 注释;给…作注释或评注
vt. 注释;作注解
perf annotate用于解析由perf record记录的数据文件perf.data并将代码注解显示。如果源代码开启了debug符号,则源码和汇编一起解析。如果源码未开启debug,则解析汇编代码。
用法
 Usage: perf annotate [<options>]

    -C, --cpu <cpu>       list of cpus to profile
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -k, --vmlinux <file>  vmlinux pathname
    -l, --print-line      print matching source lines (may be slow)
    -M, --disassembler-style <disassembler style>
                          Specify disassembler style (e.g. -M intel for intel syntax)
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -n, --show-nr-samples Show a column with the number of samples
    -P, --full-paths      Don't shorten the displayed pathnames
    -q, --quiet           do not show any message
    -s, --symbol <symbol> symbol to annotate
    -v, --verbose         be more verbose (show symbol address, etc)
        --asm-raw         Display raw encoding of assembly instructions (default)
        --group           Show event group information together
        --gtk             Use the GTK interface
        --ignore-vmlinux  don't load vmlinux even if found
        --objdump <path>  objdump binary to use for disassembly and annotations
        --percent-type <local-period>
                          Set percent type local/global-period/hits
        --show-total-period
                          Show a column with the sum of periods
        --skip-missing    Skip symbols that cannot be annotated
        --source          Interleave source code with assembly code (default)
        --stdio           Use the stdio interface
        --stdio-color <mode>
                          'always' (default), 'never' or 'auto' only applicable to --stdio mode
        --stdio2          Use the stdio interface
        --symfs <directory>
                          Look for files with symbols relative to this directory
        --tui             Use the TUI interface
举例
Run perf annotate -i perf.data -C0 and observe the annotated result:
2.2 archive
Packs all sampled ELF files, located via the build-ids recorded in the data file, into an archive. With this archive, the samples recorded in the data file can be analyzed on any machine.
该命令需要perf buildid-list --with-hits配合使用。
用法
perf archive [file]
举例
I haven't figured out how this is supposed to be used; it always fails with an error.
https://linux-perf-users.vger.kernel.narkive.com/gjAAds7D/perf-archive-is-not-a-perf-command
Following the example at the link above, I added the build-id and fno-xxx flags to CFLAGS, but it still fails:
root@ubuntu:test# make
gcc -g -Wl,--build-id -fno-omit-frame-pointer -o t1 test.c
root@ubuntu:test# perf record -e cpu-clock ./t1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (207 samples) ]
root@ubuntu:test# perf buildid-list -i perf.data
8b5069415e14c65b746661feb0b23246a1d44ea7 [kernel.kallsyms]
e22e5fae1bd7e9508834fdfce490ba5b12f6bcf6 /root/test/t1
cbbbd6f042b731b98a8df7ecc2de408198cf3506 [vdso]
root@ubuntu:test# perf archive perf.data
perf: 'archive' is not a perf-command. See 'perf --help'.
root@ubuntu:test# perf archive
perf: 'archive' is not a perf-command. See 'perf --help'.
root@ubuntu:test# perf buildid-list -i perf.data -H
8b5069415e14c65b746661feb0b23246a1d44ea7 /proc/kcore
e22e5fae1bd7e9508834fdfce490ba5b12f6bcf6 /root/test/t1
2.3 bench
Beyond the scheduler, people often need to measure how their own work affects system performance. A benchmark is the standard way to measure performance: for a given target, a benchmark everyone accepts is a great help to the job of "improving kernel performance".
用法
root@ubuntu:~# perf bench -h
 Usage: perf bench [<common options>] <collection> <benchmark> [<options>]

    -f, --format <default|simple>
                          Specify the output formatting style
    -r, --repeat <n>      Specify amount of times to repeat the run

# Usage: perf bench [<common options>] <subsystem> <suite> [<options>]
# <subsystem> is one of sched, mem, numa, futex, epoll, or all
子系统 | 说明 |
---|---|
sched | 测试调度器和IPC机制 |
mem | 测试内存性能 |
numa | NUMA内存和调度 |
futex | futex压力测试 |
epoll | epoll压力测试 |
all | 所有benchmark子系统 |
举例:
2.3.1 sched
sched messaging is ported from the classic test program hackbench and measures scheduler performance, overhead, and scalability. The benchmark starts N reader/sender process or thread pairs that read and write concurrently over IPC (sockets or pipes). People typically keep increasing N to measure scheduler scalability. Its usage and purpose are the same as hackbench's.
sched pipe is ported from Ingo Molnar's pipe-test-1m.c, which was originally written to test the performance and fairness of different schedulers. Its principle is simple: two processes frantically exchange 1,000,000 integers over pipes, process A sending to B while B sends to A. Because A and B depend on each other, if the scheduler is unfair and favors A over B, the total time A and B need together grows.
本地虚拟机和树莓派4B数据对比
2.3.2 mem
This is a memcpy benchmark written by the perf bench author Hitoshi Mitake himself. It measures the time a memcpy() of 1 MB of data takes. I am still not sure what the intended use case of this benchmark is; perhaps it is an example that shows people how to build more benchmarks on top of the perf bench framework.
memcpy
用于评估简单的内存复制性能
memset
用于简单评估内存写性能
2.3.3 numa
NUMA(Non Uniform Memory Access)即非一致内存访问架构,市面上主要有X86_64(JASPER)和MIPS64(XLP)体系。
测试:
perf bench numa mem
NUMA架构介绍
1. SMP vs AMP
- SMP(Symmetric Multiprocessing), 即对称多处理器架构,是目前最常见的多处理器计算机架构。
- AMP(Asymmetric Multiprocessing), 即非对称多处理器架构,则是与SMP相对的概念。
那么两者之间的主要区别是什么呢? 总结下来有这么几点,
- SMP的多个处理器都是同构的,使用相同架构的CPU;而AMP的多个处理器则可能是异构的。
- SMP的多个处理器共享同一内存地址空间;而AMP的每个处理器则拥有自己独立的地址空间。
- The processors of an SMP system usually share a single operating system instance; in an AMP system, each processor may or may not run an operating system, and the ones that do run independent instances.
- SMP的多处理器之间可以通过共享内存来协同通信;而AMP则需要提供一种处理器间的通信机制。
现今主流的x86多处理器服务器都是SMP架构的, 而很多嵌入式系统则是AMP架构的
2. NUMA vs UMA
NUMA(Non-Uniform Memory Access) 非均匀内存访问架构是指多处理器系统中,内存的访问时间是依赖于处理器和内存之间的相对位置的。 这种设计里存在和处理器相对近的内存,通常被称作本地内存;还有和处理器相对远的内存, 通常被称为非本地内存。
UMA (Uniform Memory Access) is the opposite of NUMA: every processor has the same distance to, and access time for, the shared memory.
由此可知,不论是NUMA还是UMA都是SMP架构的一种设计和实现上的选择。
阅读文档时,也常常能看到ccNUMA(Cache Coherent NUMA),即缓存一致性NUMA架构。 这种架构主要是在NUMA架构之上保证了多处理器之间的缓存一致性。降低了系统程序的编写难度。
In the history of x86 multiprocessors, early multi-core and multi-processor systems were all UMA: multiple CPUs connected to memory through the same North Bridge chip, which integrated the memory controller.
参考:https://houmin.cc/posts/b893097a/
2.3.4 futex
Futex 是Fast Userspace muTexes的缩写。
Futex translates to "fast userspace mutex". The design idea is not hard to understand. In traditional Unix systems, inter-process synchronization mechanisms such as System V IPC (semaphores, message queues, sockets) and file locking (flock()) all operate on a kernel object. That kernel object is visible to every process being synchronized and provides shared state and atomic operations, so whenever processes want to synchronize they must enter the kernel through a system call (such as semop()). But research found that much synchronization is uncontended: from the moment a process enters a critical section until it leaves, there is often no other process trying to enter the same critical section or touch the same synchronization variable. Even so, the process still had to trap into the kernel on entry to check for competitors, and again on exit to check for waiters on the same variable. These unnecessary system calls (kernel traps) caused substantial overhead. Futex was created to solve this problem: it is a hybrid user-space/kernel synchronization mechanism. The synchronizing processes share a region of memory via mmap; the futex variable lives in that shared memory and is operated on atomically. When a process tries to enter or leave a critical section, it first checks the futex variable in shared memory. If no contention has occurred, it only modifies the futex and performs no system call at all. Only when the futex variable indicates contention does the process make the system call to complete the corresponding handling (wait or wake up). In short, futex checks in user space and, if it learns there is no contention, skips the kernel trap entirely (the motivation), greatly improving efficiency in the low-contention case. Linux has supported futex since 2.5.7.
Because futex is a hybrid user/kernel mechanism, the two halves must cooperate: Linux provides the sys_futex system call to handle synchronization when processes do contend.
Every futex synchronization operation starts in user space, by creating a futex synchronization variable: an integer counter located in shared memory.
When a process tries to take the lock, or to enter the critical section, it performs a "down" on the futex: atomically decrement the synchronization variable. If the variable becomes 0, there is no contention and the process simply continues.
If the variable is negative, contention has occurred, and the process calls the futex_wait operation of the futex system call to put itself to sleep.
When a process releases the lock, or is about to leave the critical section, it performs an "up" on the futex: atomically increment the synchronization variable. If the variable goes from 0 to 1, there is no contention and the process continues.
If the variable was negative before the increment, contention has occurred, and the process calls the futex_wake operation of the futex system call to wake one or more waiting processes.
hash | 评估哈希表性能 |
wake | Suite for evaluating wake calls |
wake-parallel | Suite for evaluating parallel wake calls |
requeue | Suite for evaluating requeue calls |
lock-pi | Suite for evaluating futex lock_pi calls |
2.3.5 epoll
epoll is the improved successor of select/poll for monitoring many file descriptors.
wait | Suite for evaluating concurrent epoll_wait calls |
ctl | Suite for evaluating multiple epoll_ctl calls |
2.4 buildid-cache
管理perf的buildid缓存,每个elf文件都有一个独一无二的buildid。buildid被perf用来关联性能数据与elf文件。
用法
[root@localhost jrg]# perf buildid-cache -h

 Usage: perf buildid-cache [<options>]

    -a, --add <file list> file(s) to add
    -f, --force           don't complain, do it
    -k, --kcore <file>    kcore file to add
    -l, --list            list all cached files
    -M, --missing <file>  to find missing build ids in the cache
    -p, --purge <file list>
                          file(s) to remove (remove old caches too)
    -P, --purge-all       purge all cached files
    -r, --remove <file list>
                          file(s) to remove
    -u, --update <file list>
                          file(s) to update
    -v, --verbose         be more verbose
        --target-ns <n>   target pid for namespace context
2.5 buildid-list
列出perf.data中的buildids。
用法
[root@localhost jrg]# perf buildid-list -h

 Usage: perf buildid-list [<options>]

    -f, --force           don't complain, do it
    -H, --with-hits       Show only DSOs with hits
    -i, --input <file>    input file name
    -k, --kernel          Show current kernel build id
    -v, --verbose         be more verbose
举例
root@ubuntu:~# perf buildid-list -i perf.data
8b5069415e14c65b746661feb0b23246a1d44ea7 [kernel.kallsyms]
cbbbd6f042b731b98a8df7ecc2de408198cf3506 [vdso]
705933c4b146d0227e659a71b02f8fc187f20029 /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.3
何为build id?
It is a 160-bit SHA-1 value computed over the ELF header bits and section contents of the binary.
The build ID is a 160-bit SHA1 string computed over the elf header bits and section contents in the file. It is bundled in the elf file as an entry in the notes section.
+----------------+
| namesz | 32-bit, size of "name" field
+----------------+
| descsz | 32-bit, size of "desc" field
+----------------+
| type | 32-bit, vendor specific "type"
+----------------+
| name | "namesz" bytes, null-terminated string
+----------------+
| desc | "descsz" bytes, binary data
+----------------+
In GCC, you can enable build IDs with -Wl,--build-id, which passes the --build-id flag to the linker. You can then read it back by dumping the notes section of the resulting ELF file with readelf -n.
build id有何用?
- When debugging a device, given a pile of debug symbol files, the build ID locates the symbol information matching a binary with a specific build ID
- Distinguishing between binary files
build id举例:
root@ubuntu:test# make clean
root@ubuntu:test# make
gcc -g -Wl,--build-id -o t1 test.c
root@ubuntu:test# ls
Makefile perf.data t1 test.c
root@ubuntu:test# readelf -n t1

 Displaying notes found in: .note.gnu.property
   Owner                 Data size       Description
   GNU                   0x00000010      NT_GNU_PROPERTY_TYPE_0
      Properties: x86 feature: IBT, SHSTK

 Displaying notes found in: .note.gnu.build-id
   Owner                 Data size       Description
   GNU                   0x00000014      NT_GNU_BUILD_ID (unique build ID bitstring)
     Build ID: 104006ed448e657d6a4160f3718a72231df54b06

 Displaying notes found in: .note.ABI-tag
   Owner                 Data size       Description
   GNU                   0x00000010      NT_GNU_ABI_TAG (ABI version tag)
     OS: Linux, ABI: 3.2.0
root@ubuntu:test# rm perf.data
root@ubuntu:test# perf record -e cpu-clock ./t1
now into main
now foo1 over
now foo2 over
now main over
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (222 samples) ]
root@ubuntu:test# perf report
root@ubuntu:test# perf archive
perf: 'archive' is not a perf-command. See 'perf --help'.
root@ubuntu:test# perf buildid-list -i perf.data
8b5069415e14c65b746661feb0b23246a1d44ea7 [kernel.kallsyms]
104006ed448e657d6a4160f3718a72231df54b06 /root/test/t1
cbbbd6f042b731b98a8df7ecc2de408198cf3506 [vdso]
2.6 c2c
CPU缓存机制参考:https://www.cnblogs.com/jokerjason/p/10711022.html
perf c2c的使用主要参考博客:https://joemario.github.io/blog/2016/09/01/c2c-blog/
shared data c2c/HITM分析。
cache to cache:
缓存的命中率,是 CPU 性能的一个关键性能指标。我们知道,CPU 里面有好几级缓存(Cache),每一级缓存都比后面一级缓存访问速度快。最后一级缓存叫 LLC(Last Level Cache);LLC 的后面就是内存。
当 CPU 需要访问一块数据或者指令时,它会首先查看最靠近的一级缓存(L1);如果数据存在,那么就是缓存命中(Cache Hit),否则就是不命中(Cache Miss),需要继续查询下一级缓存。
c2c detects shared-cacheline contention: when one processor modifies data in a cache line and another processor then accesses that line, the line has to be refreshed (transferred between caches). perf c2c is the command for debugging this class of problem.
At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples a cacheline came from and which cpus were involved
HITM:
Notice the term "HITM", which stands for a load that hit in a modified cacheline. This is where false sharing shows up.
Remote HITMs, meaning across NUMA nodes, are the most expensive, especially when there are lots of readers and writers.
perf c2c report输出含义:
 =================================================
 Trace Event Information
 =================================================
 Total records                     :     329219  << Total loads and stores sampled.
 Locked Load/Store Operations      :      14654
 Load Operations                   :      69679  << Total loads
 Loads - uncacheable               :          0
 Loads - IO                        :          0
 Loads - Miss                      :       3972
 Loads - no mapping                :          0
 Load Fill Buffer Hit              :      11958
 Load L1D hit                      :      17235  << loads that hit in the L1 cache.
 Load L2D hit                      :         21
 Load LLC hit                      :      14219  << loads that hit in the last level cache (LLC).
 Load Local HITM                   :       3402  << loads that hit in a modified cache on the same numa node (local HITM).
 Load Remote HITM                  :      12757  << loads that hit in a modified cache on a remote numa node (remote HITM).
 Load Remote HIT                   :       5295
 Load Local DRAM                   :        976  << loads that hit in the local node's main memory.
 Load Remote DRAM                  :       3246  << loads that hit in a remote node's main memory.
 Load MESI State Exclusive         :       4222
 Load MESI State Shared            :          0
 Load LLC Misses                   :      22274  << loads not found in any local node caches.
 LLC Misses to Local DRAM          :       4.4%  << % hitting in local node's main memory.
 LLC Misses to Remote DRAM         :      14.6%  << % hitting in a remote node's main memory.
 LLC Misses to Remote cache (HIT)  :      23.8%  << % hitting in a clean cache in a remote node.
 LLC Misses to Remote cache (HITM) :      57.3%  << % hitting in remote modified cache. (most expensive - false sharing)
 Store Operations                  :     259539  << store instruction sample count
 Store - uncacheable               :          0
 Store - no mapping                :         11
 Store L1D Hit                     :     256696  << stores that got L1 cache when requested.
 Store L1D Miss                    :       2832  << stores that couldn't get the L1 cache when requested (L1 miss).
 No Page Map Rejects               :       2376
 Unable to parse data source       :          1
The Shared Data Cache Line Table shows where false sharing occurred; it is sorted by the number of HITMs each line owns, highest first.
The Shared Cache Line Distribution Pareto is the most important table: it shows the details of every contended cacheline.
举例
参考文章:https://www.bookstack.cn/read/perf-little-book/posts-check-cache-false-sharing.md
对比下数据:
- test.c
#include <omp.h>
#define N 100000000
#define THRAED_NUM 8
int values[N];
int main(void)
{
    int sum[THRAED_NUM];
    #pragma omp parallel for
    for (int i = 0; i < THRAED_NUM; i++)
    {
        //int local_sum;
        for (int j = 0; j < N; j++)
        {
            //local_sum += values[j] >> i;
            sum[i] += values[j] >> i;
        }
        //sum[i] = local_sum;
    }
    return 0;
}
编译:
gcc -fopenmp -g test.c -o test
数据:
[root@localhost test]# gcc -fopenmp -g test.c -o test
[root@localhost test]# perf c2c record ./test
[ perf record: Woken up 139 times to write data ]
[ perf record: Captured and wrote 36.329 MB perf.data (424292 samples) ]
[root@localhost test]# perf c2c report -i perf.data --stdio

 =================================================
 Trace Event Information
 =================================================
 Total records                     :     424292
 Locked Load/Store Operations      :        129
 Load Operations                   :     182641
 Loads - uncacheable               :          5
 Loads - IO                        :          0
 Loads - Miss                      :          2
 Loads - no mapping                :          0
 Load Fill Buffer Hit              :     153672
 Load L1D hit                      :      26947
 Load L2D hit                      :        278
 Load LLC hit                      :       1263
 Load Local HITM                   :         78
 Load Remote HITM                  :        456
 Load Remote HIT                   :          6
 Load Local DRAM                   :         10
 Load Remote DRAM                  :        464
 Load MESI State Exclusive         :        468
 Load MESI State Shared            :          6
 Load LLC Misses                   :        936
 LLC Misses to Local DRAM          :       1.1%
 LLC Misses to Remote DRAM         :      49.6%
 LLC Misses to Remote cache (HIT)  :       0.6%
 LLC Misses to Remote cache (HITM) :      48.7%
 Store Operations                  :     241651
 Store - uncacheable               :          0
 Store - no mapping                :         32
 Store L1D Hit                     :     212012
 Store L1D Miss                    :      29607
 No Page Map Rejects               :       5684
 Unable to parse data source       :          0

 =================================================
 Global Shared Cache Line Event Information
 =================================================
 Total Shared Cache Lines          :         35
 Load HITs on shared lines         :     180894
 Fill Buffer Hits on shared lines  :     153200
 L1D hits on shared lines          :      26135
 L2D hits on shared lines          :        240
 LLC hits on shared lines          :        859
 Locked Access on shared lines     :          4
 Store HITs on shared lines        :     219944
 Store L1D hits on shared lines    :     194838
 Total Merged records              :     220478
The report points at line 15 of the program as the problem.
- test.c
#include <omp.h>
#define N 100000000
#define THRAED_NUM 8
int values[N];
int main(void)
{
    int sum[THRAED_NUM];
    #pragma omp parallel for
    for (int i = 0; i < THRAED_NUM; i++)
    {
        int local_sum;
        for (int j = 0; j < N; j++)
        {
            local_sum += values[j] >> i;
            //sum[i] += values[j] >> i;
        }
        sum[i] = local_sum;
    }
    return 0;
}
数据:
[root@localhost test]# perf c2c record ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.741 MB perf.data (20209 samples) ]

 =================================================
 Trace Event Information
 =================================================
 Total records                     :      20052
 Locked Load/Store Operations      :         28
 Load Operations                   :       9740
 Loads - uncacheable               :          0
 Loads - IO                        :          0
 Loads - Miss                      :          0
 Loads - no mapping                :          1
 Load Fill Buffer Hit              :        102
 Load L1D hit                      :       9517
 Load L2D hit                      :         14
 Load LLC hit                      :         99
 Load Local HITM                   :          4
 Load Remote HITM                  :          3
 Load Remote HIT                   :          0
 Load Local DRAM                   :          2
 Load Remote DRAM                  :          5
 Load MESI State Exclusive         :          7
 Load MESI State Shared            :          0
 Load LLC Misses                   :         10
 LLC Misses to Local DRAM          :      20.0%
 LLC Misses to Remote DRAM         :      50.0%
 LLC Misses to Remote cache (HIT)  :       0.0%
 LLC Misses to Remote cache (HITM) :      30.0%
 Store Operations                  :      10312
 Store - uncacheable               :          0
 Store - no mapping                :         26
 Store L1D Hit                     :      10107
 Store L1D Miss                    :        179
 No Page Map Rejects               :        397
 Unable to parse data source       :          0

 =================================================
 Global Shared Cache Line Event Information
 =================================================
 Total Shared Cache Lines          :          7
 Load HITs on shared lines         :          8
 Fill Buffer Hits on shared lines  :          1
 L1D hits on shared lines          :          0
 L2D hits on shared lines          :          0
 LLC hits on shared lines          :          4
 Locked Access on shared lines     :          0
 Store HITs on shared lines        :          0
 Store L1D hits on shared lines    :          0
 Total Merged records              :          7

 =================================================
 c2c details
 =================================================
 Events                            : cpu/mem-loads,ldlat=30/P
                                   : cpu/mem-stores/P
 Cachelines sort on                : Total HITMs
 Cacheline data grouping           : offset,iaddr
 =================================================
I still don't fully understand the cache-to-cache part; I need to find a proper computer architecture book to study.
2.7 config
perf config用于读取和配置 .perfconfig配置文件, name = value 格式;
用法
 Usage: perf config [<file-option>] [options] [section.name[=value] ...]

    -l, --list            show current config variables
        --system          use system config file
        --user            use user config file
2.8 data
perf data可以将perf数据文件格式转换为其他格式(目前只支持ctf格式);
用法
Usage: perf data [<common options>] <command> [<options>]
举例
root@ubuntu:~# perf data convert --to-ctf
No conversion support compiled in. perf should be compiled with environment variables LIBBABELTRACE=1 and LIBBABELTRACE_DIR=/path/to/libbabeltrace/
2.9 diff
This command displays the performance difference amongst two or more perf.data files captured via perf record.
用于比较多个perf.data数据之间的不同。如果没有输入参数,默认比较perf.data.old和perf.data文件;
用法
 Usage: perf diff [<options>] [old_file] [new_file]

    -b, --baseline-only   Show only items with match in baseline
    -C, --comms <comm[,comm...]>
                          only consider symbols in these comms
    -c, --compute <delta,delta-abs,ratio,wdiff:w1,w2 (default delta-abs),cycles>
                          Entries differential computation selection
    -d, --dsos <dso[,dso...]>
                          only consider symbols in these dsos
    -D, --dump-raw-trace  dump raw trace in ASCII
    -f, --force           don't complain, do it
    -F, --formula         Show formula.
    -m, --modules         load module symbols - WARNING: use only with -k and LIVE kernel
    -o, --order <n>       Specify compute sorting.
    -p, --period          Show period values.
    -q, --quiet           Do not show any message
    -s, --sort <key[,key2...]>
                          sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
    -S, --symbols <symbol[,symbol...]>
                          only consider these symbols
    -t, --field-separator <separator>
                          separator for columns, no spaces will be added between columns '.' is reserved.
    -v, --verbose         be more verbose (show symbol address, etc)
        --cpu <cpu>       list of cpus to profile
        --kallsyms <file> kallsyms pathname
        --percentage <relative|absolute>
                          How to display percentage of filtered entries
        --pid <pid[,pid...]>
                          only consider symbols in these pids
        --symfs <directory>
                          Look for files with symbols relative to this directory
        --tid <tid[,tid...]>
                          only consider symbols in these tids
        --time <str>      Time span (time percent or absolute timestamp)
举例
[root@localhost test]# ls
perf.data perf.data.old test test.c
[root@localhost test]# perf diff perf.data.old perf.data --cpu 0
# Event 'cpu/mem-loads,ldlat=30/P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................. ............................
#
   77.78%            [kernel.kallsyms]  [k] find_busiest_group
   11.11%            [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
   11.11%            [kernel.kallsyms]  [k] account_user_time

# Event 'cpu/mem-stores/P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................. ..........................
#
   54.13%            [kernel.kallsyms]  [k] tick_sched_timer
   37.15%            [kernel.kallsyms]  [k] repeat_nmi
    4.59%            [kernel.kallsyms]  [k] __native_set_fixmap
    3.36%            [kernel.kallsyms]  [k] finish_task_switch
    0.39%            [kernel.kallsyms]  [k] perf_event_nmi_handler
    0.29%            [kernel.kallsyms]  [k] nmi_handle
    0.09%            [kernel.kallsyms]  [k] native_apic_mem_write
2.10 evlist
列出数据文件perf.data中所有性能事件。
用法
 Usage: perf evlist [<options>]

    -f, --force           don't complain, do it
    -F, --freq            Show the sample frequency
    -g, --group           Show event group information
    -i, --input <file>    Input file name
    -v, --verbose         Show all event attr details
        --trace-fields    Show tracepoint fields
举例
[root@localhost test]# perf evlist -i perf.data
cpu/mem-loads,ldlat=30/P
cpu/mem-stores/P
2.11 ftrace
是内核ftrace功能的简化封装,可以跟踪指定进程的内核函数调用栈。目前perf ftrace只支持单线程跟踪,仅仅是从ftrace管道中读取数据输出到标准输出。
用法
 Usage: perf ftrace [<options>] [<command>]
    or: perf ftrace [<options>] -- <command> [<options>]

    -a, --all-cpus        system-wide collection from all CPUs
    -C, --cpu <cpu>       list of cpus to monitor
    -D, --graph-depth <n> Max depth for function graph tracer
    -G, --graph-funcs <func>
                          Set graph filter on given functions
    -g, --nograph-funcs <func>
                          Set nograph filter on given functions
    -N, --notrace-funcs <func>
                          do not trace given functions
    -p, --pid <pid>       trace on existing process id
    -T, --trace-funcs <func>
                          trace given functions only
    -t, --tracer <tracer> tracer to use: function_graph(default) or function
    -v, --verbose         be more verbose
举例
[root@localhost jrg]# perf ftrace -g -p 120058
 2)               |  switch_mm_irqs_off() {
 2)   0.327 us    |    load_new_mm_cr3();
 2)   2.808 us    |  }
 ------------------------------------------
 2)  <idle>-0    =>  <...>-122943
 ------------------------------------------
 2)               |  finish_task_switch() {
 2)   ==========> |
 2)               |    smp_irq_work_interrupt() {
 2)               |      irq_enter() {
 2)   0.052 us    |        rcu_irq_enter();
 2)   0.096 us    |        irqtime_account_irq();
 2)   1.029 us    |      }
 2)               |      __wake_up() {
 2)               |        __wake_up_common_lock() {
 2)   0.128 us    |          _raw_spin_lock_irqsave();
 2)   0.057 us    |          __wake_up_common();
 2)   0.048 us    |          _raw_spin_unlock_irqrestore();
 2)   1.842 us    |        }
 2)   2.550 us    |      }
 2)               |      irq_exit() {
 2)   0.080 us    |        irqtime_account_irq();
 2)   0.037 us    |        idle_cpu();
 2)   0.035 us    |        rcu_irq_exit();
 2)   1.136 us    |      }
 2)   6.958 us    |    }
 2)   <========== |
 2)   8.031 us    |  }
 2)   0.047 us    |  finish_wait();
 2)               |  mutex_lock() {
 2)               |    _cond_resched() {
 2)   0.033 us    |      rcu_all_qs();
 2)   0.361 us    |    }
 2)   0.766 us    |  }
 2)   0.031 us    |  generic_pipe_buf_confirm();
 2)               |  _cond_resched() {
 2)   0.032 us    |    rcu_all_qs();
 2)   0.360 us    |  }
 2)   0.034 us    |  anon_pipe_buf_release();
 2)   0.049 us    |  mutex_unlock();
 2)               |  __wake_up_sync_key() {
 2)               |    __wake_up_common_lock() {
 2)   0.035 us    |      _raw_spin_lock_irqsave();
 2)   0.036 us    |      __wake_up_common();
 2)   0.045 us    |      _raw_spin_unlock_irqrestore();
 2)   1.069 us    |    }
 2)   1.397 us    |  }
 2)   0.042 us    |  kill_fasync();
 2)               |  touch_atime() {
 2)               |    atime_needs_update() {
 2)               |      current_time() {
 2)   0.034 us    |        ktime_get_coarse_real_ts64();
 2)   0.042 us    |        timespec64_trunc();
 2)   0.740 us    |      }
 2)   1.277 us    |    }
 2)   0.163 us    |    __sb_start_write();
 2)   0.116 us    |    __mnt_want_write();
2.12 inject
该工具读取perf record工具记录的事件流,并将其定向到标准输出。在被分析代码中的任何一点,都可以向事件流中注入其它事件。
用法
 Usage: perf inject [<options>]

    -b, --build-ids       Inject build-ids into the output stream
    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -j, --jit             merge jitdump files into perf.data file
    -o, --output <file>   output file name
    -s, --sched-stat      Merge sched-stat and sched-switch for getting events where and how long tasks slept
    -v, --verbose         be more verbose (show build ids, etc)
        --itrace[=<opts>] Instruction Tracing options
                            i: synthesize instructions events
                            b: synthesize branches events
                            c: synthesize branches events (calls only)
                            r: synthesize branches events (returns only)
                            x: synthesize transactions events
                            w: synthesize ptwrite events
                            p: synthesize power events
                            e: synthesize error events
                            d: create a debug log
                            g[len]: synthesize a call chain (use with i or x)
                            l[len]: synthesize last branch entries (use with i or x)
                            sNUMBER: skip initial number of events
                            PERIOD[ns|us|ms|i|t]: specify period to sample stream
                            concatenate multiple options. Default is ibxwpe or cewp
        --kallsyms <file> kallsyms pathname
        --strip           strip non-synthesized events (use with --itrace)
举例
[root@localhost test]# ls
perf.data perf.data.old test test.c
[root@localhost test]# perf inject -i perf.data --jit -o perf.data.jitted
[root@localhost test]# ls
perf.data perf.data.jitted perf.data.old test test.c
2.13 kallsyms
查找运行中的内核符号;
用法
root@ubuntu:python# perf kallsyms -h

 Usage: perf kallsyms [<options>] symbol_name

    -v, --verbose         be more verbose (show counter open errors, etc)
举例
root@ubuntu:python# perf kallsyms -v vmw_cmdbuf_header_submit
vmw_cmdbuf_header_submit: [vmwgfx] /lib/modules/5.4.0-56-generic/kernel/drivers/gpu/drm/vmwgfx/vmwgfx.ko 0xffffffffc040c300-0xffffffffc040c3a7 (0x1b380-0x1b427)
2.14 kmem
针对内核内存(slab)子系统进行追踪测量的工具。
比如内存分配/释放等。可以用来研究程序在哪里分配了大量内存,或者在什么地方产生碎片之类的和内存管理相关的问题。
perf kmem and perf lock are in effect specializations of perf's tracepoint support, equivalent to perf record -e 'kmem:*' and perf record -e 'lock:*'.
However, these tools aggregate and analyze the raw data internally, so their statistics reports are far more readable.
perf kmem record:抓取命令的内核slab分配器事件
perf kmem stat:生成内核slab分配器统计信息
用法
 Usage: perf kmem [<options>] {record|stat}

    -f, --force           don't complain, do it
    -i, --input <file>    input file name
    -l, --line <num>      show n lines
    -s, --sort <key[,key2...]>
                          sort by keys: ptr, callsite, bytes, hit, pingpong, frag, page, order, migtype, gfp
    -v, --verbose         be more verbose (show symbol address, etc)
        --alloc           show per-allocation statistics
        --caller          show per-callsite statistics
        --live            Show live page stat
        --page            Analyze page allocator
        --raw-ip          show raw ip instead of symbol
        --slab            Analyze slab allocator
        --time <str>      Time span of interest (start,stop)
举例
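A typical session is perf kmem record ls followed by perf kmem stat --caller. Among the sort keys, frag reports per-callsite internal fragmentation, that is, the share of bytes the slab allocator handed out beyond what was requested. A small sketch of my reading of that metric (the bytes_req/bytes_alloc fields come from the kmem:kmalloc tracepoint; the exact formula is an assumption here):

```python
# Sketch: perf kmem's "frag" column, derived from the kmem tracepoints'
# bytes_req (what the caller asked for) and bytes_alloc (what the slab
# allocator actually handed out, rounded up to the slab object size).

def frag_percent(bytes_req, bytes_alloc):
    """Internal fragmentation: allocated-but-unrequested bytes."""
    return 100.0 * (bytes_alloc - bytes_req) / bytes_alloc

# e.g. a 100-byte request served from the kmalloc-128 slab cache:
print(f"{frag_percent(100, 128):.3f}%")
```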
2.15 kvm
用于测试kvm客户机的性能参数。
用法
Usage: perf kvm [<options>] {top|record|report|diff|buildid-list|stat}-i, --input <file> Input file name-o, --output <file> Output file name-v, --verbose be more verbose (show counter open errors, etc)--guest Collect guest os data--guestkallsyms <file>file saving guest os /proc/kallsyms--guestmodules <file>file saving guest os /proc/modules--guestmount <directory>guest mount directory under which every guest os instance has a subdir--guestvmlinux <file>file saving guest os vmlinux--host Collect host os data
举例
perf kvm --host record
perf kvm --host report
Samples: 41K of event 'cycles', Event count (approx.): 8303321915
Overhead  Command          Shared Object       Symbol
  21.84%  qemu-system-x86  [kernel.kallsyms]   [k] vmx_vmexit
  18.92%  swapper          [kernel.kallsyms]   [k] intel_idle
   3.23%  qemu-system-x86  [kernel.kallsyms]   [k] do_syscall_64
   2.15%  qemu-system-x86  [kernel.kallsyms]   [k] native_write_msr
   1.49%  qemu-system-x86  [kernel.kallsyms]   [k] syscall_return_via_sysret
   1.39%  qemu-system-x86  [kernel.kallsyms]   [k] kvm_arch_vcpu_ioctl_run
   1.25%  qemu-system-x86  [kernel.kallsyms]   [k] entry_SYSCALL_64
   0.96%  qemu-system-x86  [kernel.kallsyms]   [k] kvm_put_guest_fpu
   0.90%  qemu-system-x86  [kernel.kallsyms]   [k] kvm_on_user_return
   0.86%  qemu-system-x86  [kernel.kallsyms]   [k] vcpu_enter_guest
   0.79%  qemu-system-x86  [kernel.kallsyms]   [k] native_write_msr_safe
   0.78%  qemu-system-x86  [kernel.kallsyms]   [k] vmx_vcpu_run.part.75
   0.75%  qemu-system-x86  [kernel.kallsyms]   [k] native_set_debugreg
   0.71%  qemu-system-x86  [kernel.kallsyms]   [k] __fget
   0.60%  qemu-system-x86  [kernel.kallsyms]   [k] __x86_indirect_thunk_rax
   0.55%  qemu-system-x86  qemu-system-x86_64  [.] object_dynamic_cast_assert
   0.55%  qemu-system-x86  [kernel.kallsyms]   [k] native_load_gdt
   0.52%  qemu-system-x86  [kernel.kallsyms]   [k] kvm_vcpu_ioctl
   0.51%  qemu-system-x86  [kernel.kallsyms]   [k] __audit_syscall_exit
   0.49%  qemu-system-x86  [kernel.kallsyms]   [k] vmx_prepare_switch_to_guest
   0.47%  qemu-system-x86  [kernel.kallsyms]   [k] __srcu_read_lock
...
2.16 list
Lists every event that can trigger a perf sample, e.g. cpu-clock, task-clock, context-switches, and so on.
By kernel 2.6.35 the list had already grown quite long, but however many entries there are, they fall into three classes:
Hardware Events are generated by the PMU hardware, such as cache hits; sample these when you need to know how a program exercises hardware features.
Software Events are generated by kernel software, such as process switches and tick counts.
Tracepoint Events are triggered by static tracepoints in the kernel; they reveal details of kernel behavior while the program runs, such as the number of slab allocator allocations.
举例
[root@localhost jrg]# perf list
List of pre-defined events (to be used in -e):
  branch-instructions OR branches          [Hardware event]
  branch-misses                            [Hardware event]
  bus-cycles                               [Hardware event]
  cache-misses                             [Hardware event]
  cache-references                         [Hardware event]
  cpu-cycles OR cycles                     [Hardware event]
  instructions                             [Hardware event]
  ref-cycles                               [Hardware event]
  alignment-faults                         [Software event]
  bpf-output                               [Software event]
  context-switches OR cs                   [Software event]
  cpu-clock                                [Software event]
  cpu-migrations OR migrations             [Software event]
  dummy                                    [Software event]
  emulation-faults                         [Software event]
  major-faults                             [Software event]
  minor-faults                             [Software event]
  page-faults OR faults                    [Software event]
  task-clock                               [Software event]
  duration_time                            [Tool event]
  L1-dcache-load-misses                    [Hardware cache event]
  L1-dcache-loads                          [Hardware cache event]
  L1-dcache-stores                         [Hardware cache event]
  L1-icache-load-misses                    [Hardware cache event]
  LLC-load-misses                          [Hardware cache event]
  LLC-loads                                [Hardware cache event]
  LLC-store-misses                         [Hardware cache event]
2.17 lock
要使用此功能,内核需要编译选项的支持:CONFIG_LOCKDEP、CONFIG_LOCK_STAT
分析内核锁统计信息。
锁是内核用于同步的方法,一旦加了锁,其他加锁的内核执行路径就必须等待,降低了并行。同时,如果加锁不正确还会造成死锁。
因此对于内核锁进行分析是一项重要的调优工作。
用法
Usage: perf lock [<options>] {record|report|script|info}-D, --dump-raw-trace dump raw trace in ASCII-f, --force don't complain, do it-i, --input <file> input file name-v, --verbose be more verbose (show symbol address, etc)
2.18 mem
测试内存存取性能数据。
用法
Usage: perf mem [<options>] {record|report}-C, --cpu <cpu> list of cpus to profile-D, --dump-raw-samplesdump raw samples in ASCII-f, --force don't complain, do it-i, --input <file> input file name-p, --phys-data Record/Report sample physical addresses-t, --type <type> memory operations(load,store) Default load,store-U, --hide-unresolvedOnly display entries resolved to a symbol-x, --field-separator <separator>separator for columns, no spaces will be added between columns '.' is reserved.
举例
2.19 record
运行一个命令,并将其数据保存到perf.data中。随后,可以使用perf report进行分析。
perf record和perf report可以更精确的分析一个应用,perf record可以精确到函数级别。并且在函数里面混合显示汇编语言和代码。
perf record [-e <EVENT> | --event=EVENT] [-a] <command>
perf record [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
The --call-graph option
Selects how call graphs (call chains, i.e. each sample's function stack) are collected:
fp (the default) uses frame pointers. It is very efficient but can be unreliable, especially for optimized code. Building your own code with -fno-omit-frame-pointer guarantees it works for that code, though results for libraries may still vary.
dwarf makes perf capture and store a chunk of the stack memory itself and unwind it in post-processing. This can be very resource-intensive, and the recorded stack depth is limited; the default stack chunk is 8 KiB, but it is configurable.
lbr stands for Last Branch Record, a hardware mechanism supported by Intel CPUs. It may give the best performance, at the cost of portability, and it only covers user-space functions.
2.20 report
读取perf record创建的数据文件,并给出热点分析结果。
top suits whole-system overviews and stat suits a single program's summary; report goes finer-grained still, down to individual instructions in the code.
用法
-i: input data file name; defaults to perf.data if omitted
-g: generate the function call graph. **The kernel must be built with CONFIG_KALLSYMS; user-space libraries and executables need symbol information (not stripped) and should be compiled with -g.**
--sort: display grouped statistics at a higher level, e.g. by pid, comm, dso, symbol, parent, cpu, socket, srcline, weight, local_weight.
2.21 sched
调度器的好坏直接影响一个系统的整体运行效率。在这个领域,内核黑客们常会发生争执,一个重要原因是对于不同的调度器,每个人给出的评测报告都各不相同,甚至常常有相反的结论。因此一个权威的统一的评测工具将对结束这种争论有益。Perf sched 便是这种尝试。
用法
子命令:
record:统计数据
latency:输出每个任务的延迟数据
map:显示上下文切换的映射
replay: 仿真perf.data数据,可重复仿真运行测试性能
script:同perf script功能
timehist:提供scheduling事件分析报告
Usage: perf sched [<options>] {record|latency|map|replay|script|timehist}-D, --dump-raw-trace dump raw trace in ASCII-f, --force don't complain, do it-i, --input <file> input file name-v, --verbose be more verbose (show symbol address, etc)
Typically you run perf sched record to collect scheduling data, then perf sched latency to view scheduler statistics such as scheduling delay.
The other subcommands likewise read the data collected by record and present it from different angles.
举例
[root@localhost jrg]# perf sched record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 8.557 MB perf.data (49023 samples) ]

perf sched latency:
-----------------------------------------------------------------------------------------------------------------
 Task                     | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
-----------------------------------------------------------------------------------------------------------------
 ksoftirqd/14:96          |  0.451 ms  |        3 | avg: 2.315 ms    | max: 3.981 ms    | max at: 280823.049244 s
 kworker/53:0-mm:1635514  |  0.008 ms  |        1 | avg: 0.293 ms    | max: 0.293 ms    | max at: 280823.408523 s
 kworker/49:3-mm:1693592  |  0.008 ms  |        1 | avg: 0.188 ms    | max: 0.188 ms    | max at: 280823.134432 s
 kworker/31:0-mm:1181706  |  0.023 ms  |        2 | avg: 0.065 ms    | max: 0.105 ms    | max at: 280823.183359 s
 kworker/41:0-ev:1674483  |  0.007 ms  |        1 | avg: 0.062 ms    | max: 0.062 ms    | max at: 280823.256325 s
 tail:1881449             |  1.235 ms  |        1 | avg: 0.061 ms    | max: 0.061 ms    | max at: 280823.136554 s
 kworker/40:2-ev:933115   |  0.027 ms  |        2 | avg: 0.059 ms    | max: 0.106 ms    | max at: 280823.881309 s
 kworker/38:3-ev:1698336  |  0.019 ms  |        3 | avg: 0.053 ms    | max: 0.141 ms    | max at: 280823.600360 s
 kworker/32:0-mm:1784542  |  0.007 ms  |        1 | avg: 0.051 ms    | max: 0.051 ms    | max at: 280823.366285 s
 kworker/48:1-ev:1877889  |  0.052 ms  |        4 | avg: 0.041 ms    | max: 0.110 ms    | max at: 280823.744327 s
 QuorumPeer[myid:76988    |  0.084 ms  |        1 | avg: 0.034 ms    | max: 0.034 ms    | max at: 280823.441806 s
 who:1881451              |  1.315 ms  |        1 | avg: 0.031 ms    | max: 0.031 ms    | max at: 280823.139770 s
 kworker/4:2-mm_:1866327  |  0.007 ms  |        1 | avg: 0.031 ms    | max: 0.031 ms    | max at: 280823.281280 s
 kworker/29:2-mm:1823927  |  0.013 ms  |        1 | avg: 0.030 ms    | max: 0.030 ms    | max at: 280823.263316 s
 SPICE Worker:(5)         |  7.424 ms  |      172 | avg: 0.026 ms    | max: 0.048 ms    | max at: 280823.087379 s
 gvfsd-fuse:(2)           |  0.075 ms  |        2 | avg: 0.025 ms    | max: 0.045 ms    | max at: 280823.139171 s
 bash:(2)                 |  1.604 ms  |        9 | avg: 0.023 ms    | max: 0.036 ms    | max at: 280823.254885 s
 kworker/u113:2-:1880846  |  0.115 ms  |        6 | avg: 0.022 ms    | max: 0.036 ms    | max at: 280823.254836 s
 kworker/35:0-mm:1862481  |  0.017 ms  |        2 | avg: 0.022 ms    | max: 0.038 ms    | max at: 280823.263328 s
 rngd:1631                |  0.165 ms  |       14 | avg: 0.022 ms    | max: 0.045 ms    | max at: 280823.805026 s
...
The columns in the latency output above are:
Task: the process name and pid
Runtime: actual running time
Switches: number of context switches
Average delay: average scheduling delay
Maximum delay: maximum scheduling delay
Maximum delay deserves the most attention: scheduling delay is usually what hurts interactivity most, and when it grows large the user perceives stutter in video or audio playback.
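The latency columns are straightforward aggregates over the per-wakeup samples that perf sched record collected. A hypothetical sketch (task name and per-sample values here are invented for illustration):

```python
# Aggregate per-task scheduling samples the way `perf sched latency` does.
# Each sample: (runtime_ms, delay_ms, timestamp_s) -- delay is the time
# between a task becoming runnable and actually getting onto a CPU.

def latency_row(task, samples):
    runtime = sum(r for r, _, _ in samples)
    switches = len(samples)
    avg_delay = sum(d for _, d, _ in samples) / switches
    max_delay, max_at = max((d, t) for _, d, t in samples)
    return (f"{task:24s} | {runtime:9.3f} ms | {switches:8d} | "
            f"avg: {avg_delay:9.3f} ms | max: {max_delay:9.3f} ms | "
            f"max at: {max_at:.6f} s")

# Three invented wakeups of a ksoftirqd-like task:
row = latency_row("ksoftirqd/14:96",
                  [(0.150, 2.0, 280823.045), (0.150, 1.0, 280823.047),
                   (0.151, 3.981, 280823.049244)])
print(row)
```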
perf sched map
结果:
星号表示调度事件发生所在的 CPU。
点号表示该 CPU 正在 IDLE。
Map 的好处在于提供了一个的总的视图,将成百上千的调度事件进行总结,显示了系统任务在 CPU 之间的分布,假如有不好的调度迁移,比如一个任务没有被及时迁移到 idle 的 CPU 却被迁移到其他忙碌的 CPU,类似这种调度器的问题可以从 map 的报告中一眼看出来。
perf sched replay:
This tool is designed specifically for scheduler developers: it attempts to replay the scheduling scenario recorded in perf.data. Often, when a user notices odd scheduler behavior, they cannot describe precisely the scenario that produced it; some test scenarios are also hard to reproduce, or reproducing them is simply tedious. With perf sched replay, perf simulates the recorded scenario, so developers need not spend time recreating the past. That is especially valuable during debugging, where one must check over and over whether a new change improves the problem observed in the original scenario.
[root@localhost jrg]# perf sched replay
run measurement overhead: 61 nsecs
sleep measurement overhead: 52656 nsecs
the run test took 1000030 nsecs
the sleep test took 1053803 nsecs
nr_run_events: 24218
nr_sleep_events: 26210
nr_wakeup_events: 12036
target-less wakeups: 16
multi-target wakeups: 53
task 0 ( swapper: 0), nr_events: 34269
task 1 ( swapper: 1), nr_events: 1
task 2 ( :2: 2), nr_events: 1
task 3 ( :10: 10), nr_events: 1
task 4 ( :100: 100), nr_events: 1
task 5 ( :9687: 9687), nr_events: 1
task 6 ( :10004: 10004), nr_events: 1
task 7 ( gsd-power: 10049), nr_events: 1
task 8 ( gsd-power: 10053), nr_events: 1
task 9 ( gsd-power: 10085), nr_events: 1
task 10 ( :10006: 10006), nr_events: 1
task 11 ( gsd-print-notif: 10019), nr_events: 1
task 12 ( gsd-print-notif: 10023), nr_events: 1
2.22 script
根据脚本来分析perf.data数据。可以编写perl和python脚本来辅助分析。自带的脚本有:
[root@localhost jrg]# perf script -l
List of available trace scripts:
  event_analyzing_sample                analyze all perf samples
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex]  display time taken by mm compaction
  mem-phys-addr                         resolve physical address samples
  stackcollapse                         produce callgraphs in short form for scripting use
  netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time
  net_dropmonitor                       display a table of dropped frames
  syscall-counts [comm]                 system-wide syscall counts
  sched-migration                       sched migration overview
  export-to-sqlite [database name] [columns] [calls]  export perf data to a sqlite3 database
  powerpc-hcalls
  export-to-postgresql [database name] [columns] [calls]  export perf data to a postgresql database
  sctop [comm] [interval]               syscall top
  syscall-counts-by-pid [comm]          system-wide syscall counts, by pid
  failed-syscalls-by-pid [comm]         system-wide failed syscalls, by pid
  futex-contention                      futext contention measurement
  intel-pt-events                       print Intel PT Power Events and PTWRITE
  rw-by-file <comm>                     r/w activity for a program, by file
  failed-syscalls [comm]                system-wide failed syscalls
  rwtop [interval]                      system-wide r/w top
  wakeup-latency                        system-wide min/max/avg wakeup latency
  rw-by-pid                             system-wide r/w activity
perf script record <script> <command> records data, and the matching perf script report <script> <command> displays it.
perf script <script> <required-script-args> <command> processes events live, without saving data to disk.
举例
[root@localhost jrg]# perf script report event_analyzing_sample
In trace_begin:comm=bytearray(b'perf\x00)\x00\x000\x00\x00\x00\x00\x00\x00\x00') common_callchain=[] common_comm=:1848848 common_cpu=7 common_ns=128968792 common_pid=1848848 common_s=277782 pid=1848855 prio=120 success=1 target_cpu=11
comm=bytearray(b'kworker/u113:2\x00\x00') common_callchain=[] common_comm=ls common_cpu=11 common_ns=131027987 common_pid=1848855 common_s=277782 pid=1845831 prio=120 success=1 target_cpu=36
In trace_end:
There is 680264 records in gen_events table
Statistics about the general events grouped by thread/symbol/dso:

comm              number    histogram
==========================================
swapper           380530    ###################
qemu-system-x86   191107    ##################
perf               94123    #################
......
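Scripts like event_analyzing_sample above are ordinary Python files driven by perf. As a minimal sketch (the file name count_events.py and the recorded event are my own choices), perf calls trace_begin()/trace_end() around the trace and hands any event without a dedicated handler to trace_unhandled():

```python
# Skeleton of a perf-script Python script: save as count_events.py, then
#   perf record -e 'sched:sched_wakeup' -a -- sleep 1
#   perf script -s count_events.py
# perf invokes trace_begin()/trace_end() around the trace and passes every
# event lacking a dedicated handler to trace_unhandled().
from collections import Counter

counts = Counter()

def trace_begin():
    print("in trace_begin")

def trace_unhandled(event_name, context, event_fields_dict):
    # Tally events by name; event_fields_dict carries the tracepoint fields.
    counts[event_name] += 1

def trace_end():
    for name, n in counts.most_common():
        print(f"{name:30s} {n}")
```

Per-event handlers can also be defined using perf's <subsystem>__<event> naming convention (e.g. sched__sched_wakeup) to receive the fields as positional arguments.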
2.23 stat
perf stat runs a command and reports statistics about it. perf top can also be pointed at a pid, but the application must already be running before it can be observed; perf stat, by contrast, covers the application's entire lifetime.
用法
Usage: perf stat [<options>] [<command>]-a, --all-cpus system-wide collection from all CPUs-A, --no-aggr disable CPU count aggregation-B, --big-num print large numbers with thousands' separators-C, --cpu <cpu> list of cpus to monitor in system-wide-D, --delay <n> ms to wait before starting measurement after program start-d, --detailed detailed run - start a lot of events-e, --event <event> event selector. use 'perf list' to list available events-G, --cgroup <name> monitor event in cgroup name only-g, --group put the counters into a counter group-I, --interval-print <n>print counts at regular interval in ms (overhead is possible for values <= 100ms)-i, --no-inherit child tasks do not inherit counters-M, --metrics <metric/metric group list>monitor specified metrics or metric groups (separated by ,)-n, --null null run - dont start any counters-o, --output <file> output file name-p, --pid <pid> stat events on existing process id //指定进程pid-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)-S, --sync call sync() before starting a run-t, --tid <tid> stat events on existing thread id-T, --transaction hardware transaction statistics-v, --verbose be more verbose (show counter open errors, etc)-x, --field-separator <separator>print counts with custom separator--append append to the output file--filter <filter>event filter--interval-clear clear screen in between new interval--interval-count <n>print counts for fixed number of times--log-fd <n> log output to fd, instead of stderr--metric-only Only print computed metrics. 
No raw values--no-merge Do not merge identical named events--per-core aggregate counts per physical processor core--per-die aggregate counts per processor die--per-socket aggregate counts per processor socket--per-thread aggregate counts per thread--post <command> command to run after to the measured command--pre <command> command to run prior to the measured command--scale Use --no-scale to disable counter scaling for multiplexing--smi-cost measure SMI cost--table display details about each run (only with -r option)--timeout <n> stop workload and print counts after a timeout period in ms (>= 10ms)--topdown measure topdown level 1 statistics
举例
[root@localhost jrg]# perf stat ls

 Performance counter stats for 'ls':

              0.98 msec task-clock                #    0.726 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                96      page-faults               #    0.098 M/sec
         2,427,546      cycles                    #    2.486 GHz
         1,963,341      instructions              #    0.81  insn per cycle
           391,211      branches                  #  400.601 M/sec
            14,916      branch-misses             #    3.81% of all branches

       0.001345674 seconds time elapsed

       0.000683000 seconds user
       0.000683000 seconds sys

[root@localhost jrg]# perf stat
^C
 Performance counter stats for 'system wide':

         58,817.62 msec cpu-clock                 #   55.475 CPUs utilized
            32,551      context-switches          #    0.553 K/sec
                62      cpu-migrations            #    0.001 K/sec
               749      page-faults               #    0.013 K/sec
     2,901,510,136      cycles                    #    0.049 GHz
     1,016,253,852      instructions              #    0.35  insn per cycle
       202,174,764      branches                  #    3.437 M/sec
        11,997,035      branch-misses             #    5.93% of all branches

       1.060261691 seconds time elapsed

cpu-clock: the processor time the task actually occupied, in ms. CPUs utilized = task-clock / time elapsed, i.e. CPU usage.
context-switches: how many context switches occurred while the program ran.
CPU-migrations: how many times the process migrated between processors. To balance load across CPUs, Linux will, under certain conditions, move a task from one CPU to another.
CPU migration vs. context switch: a context switch does not necessarily imply a CPU migration, but a CPU migration always implies a context switch. A context switch may merely swap the context out of the current CPU, and at the next scheduling decision the scheduler may place the process on the same CPU again.
page-faults: the number of page faults. A page fault is raised when the requested page has not been set up, is not resident in memory, or is resident but the virtual-to-physical mapping has not yet been established. TLB misses and page-permission mismatches also raise page faults.
cycles: processor cycles consumed. If the CPU cycles used by ls were treated as belonging to one processor, its clock rate would be 2.486 GHz, computed as cycles / task-clock.
instructions: how many instructions were executed. IPC (insn per cycle) is the average number of instructions executed per CPU cycle.
branches: the number of branch instructions encountered.
branch-misses: the number of mispredicted branch instructions.
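Every '#' annotation in perf stat output is derived from the raw counters to its left. Recomputing them for the ls run above (the printed counters are rounded, so the results only approximately match the report):

```python
# Recompute perf stat's derived metrics from the raw counters of the
# `perf stat ls` run above. Inputs are the rounded printed values, so
# results match the report only approximately.
task_clock_ms = 0.98
elapsed_s     = 0.001345674
cycles        = 2_427_546
instructions  = 1_963_341
branches      = 391_211
branch_misses = 14_916

cpus_utilized = (task_clock_ms / 1000) / elapsed_s      # ~0.73
ghz           = cycles / (task_clock_ms / 1000) / 1e9   # ~2.48
ipc           = instructions / cycles                   # ~0.81
miss_rate     = 100 * branch_misses / branches          # ~3.81%

print(f"{cpus_utilized:.3f} CPUs utilized, {ghz:.3f} GHz, "
      f"{ipc:.2f} insn per cycle, {miss_rate:.2f}% branch misses")
```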
统计更多选项
[root@localhost jrg]# perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses
^C
 Performance counter stats for 'system wide':

         70,320.23 msec task-clock                #   55.846 CPUs utilized
            39,186      context-switches          #    0.557 K/sec
                87      cpu-migrations            #    0.001 K/sec
               741      page-faults               #    0.011 K/sec
     5,744,216,840      cycles                    #    0.082 GHz                      (21.45%)
     1,460,300,936      instructions              #    0.25  insn per cycle           (30.52%)
       276,595,324      branches                  #    3.933 M/sec                    (35.96%)
        20,787,859      branch-misses             #    7.52% of all branches          (40.15%)
       429,103,243      L1-dcache-loads           #    6.102 M/sec                    (23.70%)
        46,528,058      L1-dcache-load-misses     #   10.84% of all L1-dcache hits    (20.50%)
        23,797,515      LLC-loads                 #    0.338 M/sec                    (19.81%)
           398,316      LLC-load-misses           #    1.67% of all LL-cache hits     (24.26%)
       442,296,850      dTLB-loads                #    6.290 M/sec                    (19.25%)
         5,169,611      dTLB-load-misses          #    1.17% of all dTLB cache hits   (18.79%)

       1.259180585 seconds time elapsed
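The cache and TLB percentages are likewise plain miss/access ratios of the counters above:

```python
# Miss ratios printed by perf stat are misses / accesses per cache level,
# using the counters from the system-wide run above.
def miss_pct(misses, loads):
    return 100 * misses / loads

l1  = miss_pct(46_528_058, 429_103_243)   # L1-dcache
llc = miss_pct(398_316, 23_797_515)       # last-level cache
tlb = miss_pct(5_169_611, 442_296_850)    # dTLB

print(f"L1d {l1:.2f}%  LLC {llc:.2f}%  dTLB {tlb:.2f}%")
```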
2.24 test
用于sanity test。perf自带了不少测试用例,可用于测试。
用法
root@ubuntu:~# perf test -hUsage: perf test [<options>] [{list <test-name-fragment>|[<test-name-fragments>|<test-numbers>]}]-F, --dont-fork Do not fork for testcase-s, --skip <tests> tests to skip-v, --verbose be more verbose (show symbol address, etc)
举例
Use perf test list to see the available test cases:
root@ubuntu:~# perf test list
 1: vmlinux symtab matches kallsyms
 2: Detect openat syscall event
 3: Detect openat syscall event on all cpus
 4: Read samples using the mmap interface
 5: Test data source output
 6: Parse event definition strings
 7: Simple expression parser
 8: PERF_RECORD_* events & perf_sample fields
 9: Parse perf pmu format
10: DSO data read
11: DSO data cache
12: DSO data reopen
...
Take test 51 as an example:
root@ubuntu:~# perf test list 51
51: Print cpu map
root@ubuntu:~# perf test 51
51: Print cpu map : Ok
root@ubuntu:~# perf test Print
51: Print cpu map : Ok
53: is_printable_array : Ok
54: Print bitmap : Ok
57: unit_number__scnprintf : Ok
root@ubuntu:~# perf test -F -v Print
51: Print cpu map :
--- start ---
---- end ----
Print cpu map: Ok
53: is_printable_array :
--- start ---
---- end ----
is_printable_array: Ok
54: Print bitmap :
--- start ---
bitmap: 1
bitmap: 1,5
bitmap: 1,3,5,7,9,11,13,15,17,19,21-40
bitmap: 2-5
bitmap: 1,3-6,8-10,24,35-37
bitmap: 1,3-6,8-10,24,35-37
bitmap: 1-10,12-20,22-30,32-40
---- end ----
Print bitmap: Ok
57: unit_number__scnprintf :
--- start ---
n 1, str '1B', buf '1B'
n 10240, str '10K', buf '10K'
n 20971520, str '20M', buf '20M'
n 32212254720, str '30G', buf '30G'
n 0, str '0B', buf '0B'
---- end ----
unit_number__scnprintf: Ok
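The unit_number__scnprintf cases show perf scaling a byte count down by powers of 1024 for as long as it divides evenly. A Python re-implementation of that behavior, for illustration only (perf's own version is C):

```python
# Mimic perf's unit_number__scnprintf: divide n by 1024 while it divides
# evenly, then append the unit letter reached (B, K, M, G, ...).
def unit_number(n):
    for unit in "BKMGTPE":
        if n % 1024 != 0 or n == 0:
            return f"{n}{unit}"
        n //= 1024

for n in (1, 10240, 20971520, 32212254720, 0):
    print(unit_number(n))
```

This reproduces the test vectors above: 1 stays '1B', 10240 becomes '10K', 20971520 becomes '20M', and 32212254720 becomes '30G'.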
2.25 timechart
将统计信息转换为图形显示模式的工具。
It has two main uses:
perf timechart record <command> records system-wide events. By default only scheduler and CPU events are recorded (process switches, run time, CPU power states, etc.), but it can record I/O activity as well.
perf timechart then renders the data in perf.data as a chart.
用法
Options for record:
-P record only power-related events
-T record only task-related events
-I record only IO-related events
-g record call chains
[root@localhost test]# perf timechart record -h
Usage: perf timechart record [<options>]
-g, --callchain record callchain
-I, --io-only record only IO data
timechart参数
[root@localhost test]# perf timechart -hUsage: perf timechart [<options>] {record}-f, --force don't complain, do it-i, --input <file> input file name-n, --proc-num <n> min. number of tasks to print-o, --output <file> output file name-p, --process <process>process selector. Pass a pid or process name.-t, --topology sort CPUs according to topology-w, --width <n> page width //调整输出图宽度--highlight <duration or task name>highlight tasks. Pass duration in ns or process name.--io-merge-dist <time>merge events that are merge-dist us apart--io-min-time <time>all IO faster than min-time will visually appear longer--io-skip-eagain skip EAGAIN errors--symfs <directory>Look for files with symbols relative to this directory
举例
命令perf timechart record git pull
2.26 top
实时显示性能统计数据。
Samples: 2K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 414561842 lost: 0/0 drop: 0/0
Overhead  Shared Object          Symbol
  37.18%  [kernel]               [k] __lock_text_start
  12.92%  [kernel]               [k] vmw_cmdbuf_header_submit
  12.03%  [kernel]               [k] clear_page_orig
   1.93%  [kernel]               [k] finish_task_switch
   0.61%  libc-2.31.so           [.] __memmove_avx_unaligned_erms
   0.58%  [kernel]               [k] do_syscall_64
   0.55%  [kernel]               [k] exit_to_usermode_loop
   0.50%  [kernel]               [k] mpt_put_msg_frame
   0.46%  [kernel]               [k] rmqueue
   0.46%  [kernel]               [k] number
   0.43%  [kernel]               [k] kallsyms_expand_symbol.constprop.0
   0.43%  perf                   [.] rb_next
   0.42%  libslang.so.2.3.2      [.] SLsmg_write_chars
   0.38%  [kernel]               [k] arch_local_irq_enable
   0.33%  libc-2.31.so           [.] malloc
   0.33%  [kernel]               [k] vsnprintf
   0.31%  [kernel]               [k] memcg_kmem_get_cache
   0.29%  [kernel]               [k] memset_orig
   0.28%  libc-2.31.so           [.] read
   0.28%  perf                   [.] __symbols__insert.constprop.0
   0.26%  libpixman-1.so.0.38.4  [.] 0x000000000006e0ff
   0.26%  [kernel]               [k] native_write_msr
   0.24%  [kernel]               [k] __fget
   0.23%  libpixman-1.so.0.38.4  [.] 0x000000000008cac5
   0.23%  libpixman-1.so.0.38.4  [.] 0x000000000008cadb
   0.23%  perf                   [.] rust_demangle_callback
...
Column 1: the share of performance events attributed to the symbol, i.e. the proportion of CPU cycles it consumed.
Column 2: the DSO (Dynamic Shared Object) containing the symbol: an application, the kernel, a shared library, or a module.
Column 3: the DSO type. [.] means the symbol belongs to a user-space ELF file (executable or shared library); [k] means it belongs to the kernel or a module.
Column 4: the symbol name. Some symbols cannot be resolved to a function name and are shown only as addresses.
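Those four columns can be pulled apart mechanically; here is a small hypothetical parser for one line of the output above:

```python
import re

# Split one line of perf top/report output into its four columns:
# overhead, DSO, symbol type ([.] user-space / [k] kernel), symbol name.
LINE = re.compile(r"\s*([\d.]+)%\s+(\S+)\s+(\[[.k]\])\s+(.+)")

def parse_top_line(line):
    m = LINE.match(line)
    overhead, dso, kind, symbol = m.groups()
    return float(overhead), dso, kind, symbol

print(parse_top_line("37.18%  [kernel]  [k] __lock_text_start"))
```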
用法
Usage: perf top [<options>]-a, --all-cpus system-wide collection from all CPUs-b, --branch-any sample any taken branches-c, --count <n> event period to sample-C, --cpu <cpu> list of cpus to monitor-d, --delay <n> number of seconds to delay between refreshes-D, --dump-symtab dump the symbol table used for profiling-E, --entries <n> display this many functions-e, --event <event> event selector. use 'perf list' to list available events //指定event-f, --count-filter <n>only display functions with more events than this-F, --freq <freq or 'max'>profile at this frequency-g enables call-graph recording and display //得到函数调用关系图-i, --no-inherit child tasks do not inherit counters-j, --branch-filter <branch filter mask>branch stack filter modes-K, --hide_kernel_symbolshide kernel symbols-k, --vmlinux <file> vmlinux pathname-M, --disassembler-style <disassembler style>Specify disassembler style (e.g. -M intel for intel syntax)-m, --mmap-pages <pages>number of mmap data pages-n, --show-nr-samplesShow a column with the number of samples-p, --pid <pid> profile events on existing process id-r, --realtime <n> collect data with this RT SCHED_FIFO priority-s, --sort <key[,key2...]>sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... 
Please refer the man page for the complete list.-t, --tid <tid> profile events on existing thread id-U, --hide_user_symbolshide user symbols-u, --uid <user> user to profile-v, --verbose be more verbose (show counter open errors, etc)-w, --column-widths <width[,width...]>don't try to adjust column width, use these fixed values-z, --zero zero history across updates--asm-raw Display raw encoding of assembly instructions (default)--call-graph <record_mode[,record_size],print_type,threshold[,print_limit],order,sort_key[,branch]>setup and enables call-graph (stack chain/backtrace):record_mode: call graph recording mode (fp|dwarf|lbr)record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)default: 8192 (bytes)print_type: call graph printing style (graph|flat|fractal|folded|none)threshold: minimum call graph inclusion threshold (<percent>)print_limit: maximum number of call graph entry (<number>)order: call graph order (caller|callee)sort_key: call graph sort key (function|address)branch: include last branch info to call graph (branch)value: call graph value (percent|period|count)Default: fp,graph,0.5,caller,function--children Accumulate callchains of children and show total overhead as well--comms <comm[,comm...]>only consider symbols in these comms--demangle-kernelEnable kernel symbol demangling--dsos <dso[,dso...]>only consider symbols in these dsos--fields <key[,keys...]>output field(s): overhead, period, sample plus all of sort keys--force don't complain, do it--group put the counters into a counter group--hierarchy Show entries in a hierarchy--ignore-callees <regex>ignore callees of these functions in call graphs--ignore-vmlinux don't load vmlinux even if found--kallsyms <file>kallsyms pathname--max-stack <n> Set the maximum stack depth when parsing the callchain. 
Default: kernel.perf_event_max_stack or 127--namespaces Record namespaces events--no-bpf-event do not record bpf events--num-thread-synthesize <n>number of thread to run event synthesize--objdump <path> objdump binary to use for disassembly and annotations--overwrite Use a backward ring buffer, default: no--percent-limit <percent>Don't show entries under that percent--percentage <relative|absolute>How to display percentage of filtered entries--proc-map-timeout <n>per thread proc mmap processing timeout in ms--raw-trace Show raw trace event output (do not use print fmt or plugins)--show-total-periodShow a column with the sum of periods--source Interleave source code with assembly code (default)--stdio Use the stdio interface--sym-annotate <symbol name>symbol to annotate--symbols <symbol[,symbol...]>only consider these symbols--tui Use the TUI interface
举例
写一个死循环函数t1,perf top查看性能;
示例:
#include <stdio.h>
#include <stdlib.h>

void longa()
{
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i;  /* am I silly or crazy? I feel boring and desperate. */
}

void foo2()
{
    int i;
    for (i = 0; i < 10; i++)
        longa();
}

void foo1()
{
    int i;
    for (i = 0; i < 100; i++)
        longa();
}

int main(int argc, char *argv[])
{
    int i = 0;
    foo1();
    foo2();
    while (1) {
        foo1();
        foo2();
        i++;
    }
    return 0;
}
t1's main function accounts for 96% of the CPU cycles; essentially all the time is spent in t1. Real-world problems are rarely this obvious; combine perf top with the -e option to select specific events when narrowing things down.
Samples: 80K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 7881621045 lost: 0/0 drop: 0/0
Overhead  Shared Object  Symbol
  96.05%  t1             [.] main
   0.56%  [kernel]       [k] __softirqentry_text_start
   0.55%  [kernel]       [k] clear_page_orig
   0.17%  [kernel]       [k] vmw_cmdbuf_header_submit
   0.16%  [kernel]       [k] __lock_text_start
   0.06%  ld-2.31.so     [.] do_lookup_x
   0.04%  [kernel]       [k] mpt_put_msg_frame
   0.04%  [kernel]       [k] exit_to_usermode_loop
   0.04%  [kernel]       [k] do_syscall_64
perf top --call-graph graph
Samples: 25K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 6058000000 lost: 0/0 drop: 0/0
  Children      Self  Shared Object  Symbol
-   94.37%    94.25%  t1             [.] longa
       94.25% __libc_start_main
          - main
             - 85.62% foo1
                  longa
             - 8.63% foo2
                  longa
-   59.45%     0.00%  libc-2.31.so   [.] __libc_start_main
       __libc_start_main
       - main
          - 54.05% foo1
               longa
          - 5.40% foo2
               longa
-   59.45%     0.00%  t1             [.] main
     - main
        - 54.05% foo1
             longa
        - 5.40% foo2
             longa
-   54.05%     0.00%  t1             [.] foo1
       foo1
       longa
-    5.40%     0.00%  t1             [.] foo2
       foo2
       longa
+ 1.82% 0.07% [kernel] [k] do_syscall_64
...
2.27 version
查看当前perf版本
root@pi:~# perf version
perf version 5.4.73
2.28 probe
动态插入采样监测点。
内核需要编译支持CONFIG_DEBUG_INFO
2.29 trace
Traces system calls, similar in spirit to the strace tool.
用法
Usage: perf trace [<options>] [<command>]or: perf trace [<options>] -- <command> [<options>]or: perf trace record [<options>] [<command>]or: perf trace record [<options>] -- <command> [<options>]-a, --all-cpus system-wide collection from all CPUs-C, --cpu <cpu> list of cpus to monitor-D, --delay <n> ms to wait before starting measurement after program start-e, --event <event> event/syscall selector. use 'perf list' to list available events-f, --force don't complain, do it-F, --pf <all|maj|min>Trace pagefaults-G, --cgroup <name> monitor event in cgroup name only-i, --input <file> Analyze events in file-m, --mmap-pages <pages>number of mmap data pages-o, --output <file> output file name-p, --pid <pid> trace events on existing process id-s, --summary Show only syscall summary with statistics-S, --with-summary Show all syscalls and summary with statistics-t, --tid <tid> trace events on existing thread id-T, --time Show full timestamp, not time relative to first start-u, --uid <user> user to profile-v, --verbose be more verbose--call-graph <record_mode[,record_size]>setup and enables call-graph (stack chain/backtrace):record_mode: call graph recording mode (fp|dwarf|lbr)record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)default: 8192 (bytes)Default: fp--comm show the thread COMM next to its id--duration <float>show only events with duration > N.M ms--expr <expr> list of syscalls/events to trace--failure Show only syscalls that failed--filter-pids <CSV list of pids>pids to filter (by the kernel)--kernel-syscall-graphShow the kernel callchains on the syscall exit path--map-dump <BPF map>BPF map to periodically dump--max-events <n> Set the maximum number of events to print, exit after that is reached.--max-stack <n> Set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. 
Default: kernel.perf_event_max_stack or 127--min-stack <n> Set the minimum stack depth when parsing the callchain, anything below the specified depth will be ignored.--no-inherit child tasks do not inherit counters--print-sample print the PERF_RECORD_SAMPLE PERF_SAMPLE_ info, for debugging--proc-map-timeout <n>per thread proc mmap processing timeout in ms--sched show blocking scheduler events--sort-events Sort batch of events before processing, use if getting out of order events--syscalls Trace syscalls--tool_stats show tool stats
举例
[root@localhost jrg]# perf trace
     0.000 ( 1.008 ms): qemu-system-x8/2231208 futex(uaddr: 0x55d15469b5a8, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7fb6395fd630, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
     1.013 ( 1.006 ms): qemu-system-x8/2231208 futex(uaddr: 0x55d15469b5a8, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7fb6395fd630, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
     2.024 (         ): qemu-system-x8/2231208 futex(uaddr: 0x55d15469b5a8, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7fb6395fd630, val3: MATCH_ANY) ...
18446744073709.520 ( 1.020 ms): qemu-system-x8/2342046 futex(uaddr: 0x55f9bd192208, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f69d65fd630, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
     1.004 ( 0.995 ms): qemu-system-x8/2342046 futex(uaddr: 0x55f9bd192208, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f69d65fd630, val3: MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
     2.004 (         ): qemu-system-x8/2342046 futex(uaddr: 0x55f9bd192208, op: WAIT_BITSET|PRIVATE_FLAG|CLOCK_REALTIME, utime: 0x7f69d65fd630, val3: MATCH_ANY) ...
     0.920 (         ): qemu-system-x8/2364252 ppoll(ufds: 0x562288f9fa70, nfds: 27, tsp: 0x7ffe1f525420, sigsetsize: 8) ...
     0.942 ( 0.988 ms): qemu-system-x8/2364252 ppoll(ufds: 0x562288f9fa70, nfds: 27, tsp: 0x7ffe1f525420, sigsetsize: 8) = 0 (Timeout)
     1.951 ( 0.006 ms): qemu-system-x8/2364252 ppoll(ufds: 0x562288f9fa70, nfds: 27, tsp: 0x7ffe1f525420, sigsetsize: 8) = 0 (Timeout)
     1.971 (         ): qemu-system-x8/2364252 ppoll(ufds: 0x562288f9fa70, nfds: 27, tsp: 0x7ffe1f525420, sigsetsize: 8) ...
     0.525 (         ): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) ...
     0.539 ( 0.997 ms): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) = 0
     1.543 ( 0.023 ms): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) = 0
     1.569 ( 0.955 ms): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) = 0
     2.533 ( 0.010 ms): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) = 0
     2.545 (         ): qemu-system-x8/11141 ioctl(fd: 17<anon_inode:kvm-vcpu:1>, cmd: KVM_RUN) ...
     0.385 (         ): qemu-system-x8/11140 ioctl(fd: 16<anon_inode:kvm-vcpu:0>, cmd: KVM_RUN) ...
     0.397 (         ): qemu-system-x8/11140 ioctl(fd: 16<anon_inode:kvm-vcpu:0>, cmd: KVM_RUN) ...
     0.405 (         ): qemu-system-x8/11140 ioctl(fd: 16<anon_inode:kvm-vcpu:0>, cmd: KVM_RUN) ...
     0.413 (         ): qemu-system-x8/11140 ioctl(fd: 16<anon_inode:kvm-vcpu:0>, cmd: KVM_RUN) ...
参考文档:
https://www.cnblogs.com/arnoldlu/p/6241297.html
https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/
https://www.ibm.com/developerworks/cn/linux/l-cn-perf2/
perf的man手册