
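1. golang pprof
- 1.1. A pprof example
2. go tool
- 2.1. Difference between `--inuse/alloc_space` and `--inuse/alloc_objects`
3. go-torch
4. Optimization suggestions
- 4.1. Combine many small objects into one larger object
- 4.2. Eliminate unnecessary pointer indirection; prefer embedding by value
- 4.3. When local variables escape, group them together
- 4.4. Preallocate `[]byte`
- 4.5. Use the smallest types that fit
- 4.6. Reduce unnecessary pointers
- 4.7. Use `sync.Pool` to cache frequently used objects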

1. golang pprof

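When a Go program consumes more memory at runtime than you can account for, you need to find out which code is responsible. A compiled Go binary is a black box at that point, so how do you see what its memory is used for? Fortunately, Go ships with built-in mechanisms for analysis and tracing.

In this situation we can usually use Go's pprof to analyze the memory usage of a Go process.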

1.1. A pprof example

pprof data is usually exposed over an HTTP API for analysis, using the net/http/pprof package. A simple example:

// The init function of net/http/pprof registers its handlers on http.DefaultServeMux.
// If you do not serve from http.DefaultServeMux, read that init function and register the handlers yourself.
import _ "net/http/pprof"

go func() {
	http.ListenAndServe("0.0.0.0:8080", nil)
}()
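
The comment in the snippet mentions registering the handlers yourself when http.DefaultServeMux is not used. A minimal sketch of that follows. The helper names newDebugMux and heapHeader are made up for this illustration, and the program checks the mux in-process instead of opening a port:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"strings"
)

// newDebugMux registers the pprof handlers on a private mux, mirroring
// what the init function of net/http/pprof does for http.DefaultServeMux.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// heapHeader asks the handler for /debug/pprof/heap?debug=1 in-process
// and returns the first line of the response.
func heapHeader(h http.Handler) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/heap?debug=1", nil)
	h.ServeHTTP(rec, req)
	line, _, _ := strings.Cut(rec.Body.String(), "\n")
	return line
}

func main() {
	mux := newDebugMux()
	fmt.Println(heapHeader(mux))
	// A real service would now serve the mux, for example on loopback only:
	//   log.Fatal(http.ListenAndServe("127.0.0.1:8080", mux))
}
```

Binding the debug endpoints to a loopback or otherwise internal address is a design choice of this sketch, not something the original snippet does; the profiles reveal internals of the process.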

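Now start the process and open http://localhost:8080/debug/pprof/ to see a simple index page.

Note: all of the data below, including what go tool pprof collects, depends on the pprof sampling rate inside the process. By default, one allocation is sampled per 512 KB allocated on average. If the data is not detailed enough, adjust runtime.MemProfileRate; the smaller its value (that is, the more often allocations are sampled), the slower the process runs.

The page shows: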

/debug/pprof/profiles:
0         block
136840    goroutine
902       heap
0         mutex
40        threadcreate

full goroutine stack dump

The page exposes several built-in profiles. For example, 136840 goroutines exist in this process; click a link to see the details.
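
Adjusting the sampling rate through runtime.MemProfileRate can be sketched as follows. The helper name setMemProfileRate and the 64 KB value are illustrative assumptions; the runtime documentation advises changing the rate only once, as early as possible in the program:

```go
package main

import (
	"fmt"
	"runtime"
)

// setMemProfileRate sets how many bytes are allocated, on average, between
// two heap samples and returns the previous value.
func setMemProfileRate(bytes int) int {
	prev := runtime.MemProfileRate
	runtime.MemProfileRate = bytes
	return prev
}

func main() {
	// Change the rate once, before the program allocates much.
	prev := setMemProfileRate(64 * 1024) // 64 KB is an illustrative value
	fmt.Printf("MemProfileRate: %d -> %d\n", prev, runtime.MemProfileRate)
}
```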

To analyze memory problems, click the heap entry, which opens http://127.0.0.1:8080/debug/pprof/heap?debug=1 and shows the following:

heap profile: 3190: 77516056 [54762: 612664248] @ heap/1048576
1: 29081600 [1: 29081600] @ 0x89368e 0x894cd9 0x8a5a9d 0x8a9b7c 0x8af578 0x8b4441 0x8b4c6d 0x8b8504 0x8b2bc3 0x45b1c1
#    0x89368d    github.com/syndtr/goleveldb/leveldb/memdb.(*DB).Put+0x59d
#    0x894cd8    xxxxx/storage/internal/memtable.(*MemTable).Set+0x88
#    0x8a5a9c    xxxxx/storage.(*snapshotter).AppendCommitLog+0x1cc
#    0x8a9b7b    xxxxx/storage.(*store).Update+0x26b
#    0x8af577    xxxxx/config.(*config).Update+0xa7
#    0x8b4440    xxxxx/naming.(*naming).update+0x120
#    0x8b4c6c    xxxxx/naming.(*naming).instanceTimeout+0x27c
#    0x8b8503    xxxxx/naming.(*naming).(xxxxx/naming.instanceTimeout)-fm+0x63
......
# runtime.MemStats
# Alloc = 2463648064
# TotalAlloc = 31707239480
# Sys = 4831318840
# Lookups = 2690464
# Mallocs = 274619648
# Frees = 262711312
# HeapAlloc = 2463648064
# HeapSys = 3877830656
# HeapIdle = 854990848
# HeapInuse = 3022839808
# HeapReleased = 0
# HeapObjects = 11908336
# Stack = 655949824 / 655949824
# MSpan = 63329432 / 72040448
# MCache = 38400 / 49152
# BuckHashSys = 1706593
# GCSys = 170819584
# OtherSys = 52922583
# NextGC = 3570699312
# PauseNs = [1052815 217503 208124 233034 1146462 456882 1098525 530706 551702 419372 768322 596273 387826 455807 563621 587849 416204 599143 572823 488681 701731 656358 2476770 12141392 5827253 3508261 1715582 1295487 908563 788435 718700 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# NumGC = 31
# DebugGC = false

The output is long, but it has two main parts. The first part prints the runtime.MemProfileRecord entries obtained through runtime.MemProfile(). They read as follows:

heap profile: 3190(in-use objects): 77516056(in-use bytes) [54762(alloc objects): 612664248(alloc bytes)] @ heap/1048576(2*MemProfileRate)
1: 29081600 [1: 29081600] (the four numbers have the same meaning as in the first line; each line from here on is one record, and the addresses after @ are the return addresses of the recorded call stack) @ 0x89368e 0x894cd9 0x8a5a9d 0x8a9b7c 0x8af578 0x8b4441 0x8b4c6d 0x8b8504 0x8b2bc3 0x45b1c1
#    0x89368d    github.com/syndtr/goleveldb/leveldb/memdb.(*DB).Put+0x59d (symbolized stack frame)
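
To make the four header numbers concrete, here is a rough sketch that reads the same records with runtime.MemProfile and sums them. The helper names heapRecords and totals are assumptions of this example, and the totals only approximate how the page builds its first line:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapRecords returns a snapshot of the sampled allocation records.
func heapRecords() []runtime.MemProfileRecord {
	n, _ := runtime.MemProfile(nil, true)
	for {
		// Leave headroom: more records may appear between the two calls.
		p := make([]runtime.MemProfileRecord, n+50)
		var ok bool
		if n, ok = runtime.MemProfile(p, true); ok {
			return p[:n]
		}
	}
}

// totals sums the records into in-use objects/bytes and allocated objects/bytes.
func totals(recs []runtime.MemProfileRecord) (inuseObjs, inuseBytes, allocObjs, allocBytes int64) {
	for i := range recs {
		r := &recs[i]
		inuseObjs += r.InUseObjects()
		inuseBytes += r.InUseBytes()
		allocObjs += r.AllocObjects
		allocBytes += r.AllocBytes
	}
	return
}

func main() {
	iObjs, iBytes, aObjs, aBytes := totals(heapRecords())
	fmt.Printf("heap profile: %d: %d [%d: %d] @ heap/%d\n",
		iObjs, iBytes, aObjs, aBytes, 2*runtime.MemProfileRate)
}
```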

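The second part is easier to understand: it prints the runtime.MemStats data read by runtime.ReadMemStats(). The fields worth watching are:

Sys: memory the process has obtained from the OS (virtual address space).
HeapAlloc: bytes of allocated heap objects, typically objects the program created on the heap, including objects the GC has not yet freed.
HeapSys: heap memory obtained from the OS (virtual address space). Go's allocator follows a TCMalloc-style design and caches part of the heap.
PauseNs: the duration of each GC pause in nanoseconds; only the most recent 256 are kept.
NumGC: the number of GC cycles completed.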
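
The same fields can be read inside the program. A small sketch, assuming the helper names readStats and lastPause (they are not part of the runtime package):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// readStats takes a snapshot of the runtime memory statistics.
func readStats() runtime.MemStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m
}

// lastPause returns the most recent GC pause. PauseNs is a circular buffer
// whose newest entry sits at index (NumGC+255)%256.
func lastPause(m *runtime.MemStats) time.Duration {
	if m.NumGC == 0 {
		return 0
	}
	return time.Duration(m.PauseNs[(m.NumGC+255)%256])
}

func main() {
	runtime.GC() // force one cycle so NumGC and PauseNs are populated
	m := readStats()
	fmt.Printf("Sys=%d HeapSys=%d HeapAlloc=%d NumGC=%d lastPause=%v\n",
		m.Sys, m.HeapSys, m.HeapAlloc, m.NumGC, lastPause(&m))
}
```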

Readers unfamiliar with pprof will find it hard to get much more out of the raw output above, so we bring in other tools to make the pprof data easier to read.

2. go tool

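Run go tool pprof -inuse_space http://127.0.0.1:8080/debug/pprof/heap to connect to the process and inspect the memory currently in use. This opens an interactive command line.

Enter top10 to list the 10 functions holding the most in-use memory. This is usually how we check for object references that keep memory alive unexpectedly.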

(pprof) top10
1355.47MB of 1436.26MB total (94.38%)
Dropped 371 nodes (cum <= 7.18MB)
Showing top 10 nodes out of 61 (cum >= 23.50MB)
      flat  flat%   sum%        cum   cum%
  512.96MB 35.71% 35.71%   512.96MB 35.71%  net/http.newBufioWriterSize
  503.93MB 35.09% 70.80%   503.93MB 35.09%  net/http.newBufioReader
  113.04MB  7.87% 78.67%   113.04MB  7.87%  runtime.rawstringtmp
   55.02MB  3.83% 82.50%    55.02MB  3.83%  runtime.malg
   45.01MB  3.13% 85.64%    45.01MB  3.13%  xxxxx/storage.(*Node).clone
   26.50MB  1.85% 87.48%    52.50MB  3.66%  context.WithCancel
   25.50MB  1.78% 89.26%    83.58MB  5.82%  runtime.systemstack
   25.01MB  1.74% 91.00%    58.51MB  4.07%  net/http.readRequest
      25MB  1.74% 92.74%    29.03MB  2.02%  runtime.mapassign
   23.50MB  1.64% 94.38%    23.50MB  1.64%  net/http.(*Server).newConn

Next, run go tool pprof -alloc_space http://127.0.0.1:8080/debug/pprof/heap to connect to the program and inspect allocations. Then enter top to list the function calls with the most cumulative allocated memory:

(pprof) top
523.38GB of 650.90GB total (80.41%)
Dropped 342 nodes (cum <= 3.25GB)
Showing top 10 nodes out of 106 (cum >= 28.02GB)
      flat  flat%   sum%        cum   cum%
  147.59GB 22.68% 22.68%   147.59GB 22.68%  runtime.rawstringtmp
  129.23GB 19.85% 42.53%   129.24GB 19.86%  runtime.mapassign
   48.23GB  7.41% 49.94%    48.23GB  7.41%  bytes.makeSlice
   46.25GB  7.11% 57.05%    71.06GB 10.92%  encoding/json.Unmarshal
   31.41GB  4.83% 61.87%   113.86GB 17.49%  net/http.readRequest
   30.55GB  4.69% 66.57%   171.20GB 26.30%  net/http.(*conn).readRequest
   22.95GB  3.53% 70.09%    22.95GB  3.53%  net/url.parse
   22.70GB  3.49% 73.58%    22.70GB  3.49%  runtime.stringtoslicebyte
   22.70GB  3.49% 77.07%    22.70GB  3.49%  runtime.makemap
   21.75GB  3.34% 80.41%    28.02GB  4.31%  context.WithCancel

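The conversions between string and []byte, map allocation, bytes.makeSlice, encoding/json.Unmarshal and similar calls account for most of the cumulative allocation. At this point we can review the code to see how to reduce these calls or optimize the logic around them.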
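
As one hedged illustration of such a review, the sketch below compares filling a map that grows from empty with one created with a size hint, which is one common way to cut runtime.mapassign and runtime.makemap work. The names grown, presized and allocs, and the entry count, are assumptions of this example; it uses testing.AllocsPerRun only as a convenient allocation counter:

```go
package main

import (
	"fmt"
	"testing"
)

const entries = 1000 // illustrative size

// grown fills a map that starts empty and has to grow repeatedly.
func grown() map[int]int {
	m := make(map[int]int)
	for i := 0; i < entries; i++ {
		m[i] = i
	}
	return m
}

// presized fills a map created with a size hint, so growth is avoided.
func presized() map[int]int {
	m := make(map[int]int, entries)
	for i := 0; i < entries; i++ {
		m[i] = i
	}
	return m
}

var sink map[int]int // keeps the result alive so the map is heap-allocated

// allocs reports the average number of heap allocations per call of f.
func allocs(f func() map[int]int) float64 {
	return testing.AllocsPerRun(100, func() { sink = f() })
}

func main() {
	fmt.Printf("allocations per call: grown=%.0f presized=%.0f\n",
		allocs(grown), allocs(presized))
}
```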

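When it is unclear which functions trigger these calls, enter top -cum. -cum accumulates data along the call graph: if function A calls function B, the memory allocated in B is also counted toward A. This makes the call chain easy to find.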

(pprof) top20 -cum
322890.40MB of 666518.53MB total (48.44%)
Dropped 342 nodes (cum <= 3332.59MB)
Showing top 20 nodes out of 106 (cum >= 122316.23MB)
       flat   flat%   sum%         cum   cum%
          0      0%     0% 643525.16MB 96.55%  runtime.goexit
  2184.63MB   0.33%  0.33% 620745.26MB 93.13%  net/http.(*conn).serve
          0      0%  0.33% 435300.50MB 65.31%  xxxxx/api/server.(*HTTPServer).ServeHTTP
  5865.22MB   0.88%  1.21% 435300.50MB 65.31%  xxxxx/api/server/router.(*httpRouter).ServeHTTP
          0      0%  1.21% 433121.39MB 64.98%  net/http.serverHandler.ServeHTTP
          0      0%  1.21% 430456.29MB 64.58%  xxxxx/api/server/filter.(*chain).Next
    43.50MB 0.0065%  1.21% 429469.71MB 64.43%  xxxxx/api/server/filter.TransURLTov1
          0      0%  1.21% 346440.39MB 51.98%  xxxxx/api/server/filter.Role30x
 31283.56MB   4.69%  5.91% 175309.48MB 26.30%  net/http.(*conn).readRequest
          0      0%  5.91% 153589.85MB 23.04%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP
          0      0%  5.91% 153589.85MB 23.04%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP-fm
          0      0%  5.91% 153540.85MB 23.04%  xxxxx/api/server/router.(*httpRouter).Register.func1
        2MB 0.0003%  5.91% 153117.78MB 22.97%  xxxxx/api/server/filter.Validate
151134.52MB  22.68% 28.58% 151135.02MB 22.68%  runtime.rawstringtmp
          0      0% 28.58% 150714.90MB 22.61%  xxxxx/api/server/router/naming/v1.(*serviceRouter).(git.intra.weibo.com/platform/vintage/api/server/router/naming/v1.service)-fm
          0      0% 28.58% 150714.90MB 22.61%  xxxxx/api/server/router/naming/v1.(*serviceRouter).service
          0      0% 28.58% 141200.76MB 21.18%  net/http.Redirect
132334.96MB  19.85% 48.44% 132342.95MB 19.86%  runtime.mapassign
       42MB 0.0063% 48.44% 125834.16MB 18.88%  xxxxx/api/server/router/naming/v1.heartbeat
          0      0% 48.44% 122316.23MB 18.35%  xxxxxx/config.(*config).Lookup

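As shown above, it is now easy to see which functions these calls come from.

According to the call relationships in the code, filter.TransURLTov1 calls filter.Role30x, yet their cum% values differ by 12.45%. Since the flat value of filter.TransURLTov1 is only 43.50MB, this means filter.TransURLTov1 and the functions it calls other than filter.Role30x allocate 12.45% of all memory allocated by the process, which makes it well worth optimizing.

Next, enter the web command. It opens an .svg image in the browser that draws these cumulative relationships as a graph. Alternatively, run go tool pprof -alloc_space -cum -svg http://127.0.0.1:8080/debug/pprof/heap > heap.svg to generate heap.svg directly.

Below we analyze a fragment of that graph:

Each box is a function in a call stack recorded by pprof. The number on an arrow pointing into a box is the cumulative memory allocated under that stack; the number on an arrow leaving a box is the cumulative memory allocated by a function it calls. The difference between them can be read as what the function allocates itself, excluding its callees. The numbers inside the box express the same thing: (memory allocated by the function itself of memory allocated cumulatively by the function).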
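
The relationship between the two numbers in a box can be shown with a toy model. This is only a sketch over a hypothetical three-function call tree with made-up byte counts, not pprof's implementation:

```go
package main

import "fmt"

// node is one function in a hypothetical call tree.
type node struct {
	name     string
	flat     int64 // bytes allocated directly by the function
	children []*node
}

// cum returns flat plus the cumulative total of every callee.
func (n *node) cum() int64 {
	total := n.flat
	for _, c := range n.children {
		total += c.cum()
	}
	return total
}

// sample builds handler -> helper -> leaf with made-up numbers.
func sample() *node {
	leaf := &node{name: "leaf", flat: 30}
	helper := &node{name: "helper", flat: 50, children: []*node{leaf}}
	return &node{name: "handler", flat: 20, children: []*node{helper}}
}

func main() {
	h := sample()
	// Same layout as the label inside a box: "flat of cum".
	fmt.Printf("%s: %d of %d\n", h.name, h.flat, h.cum()) // prints "handler: 20 of 100"
}
```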

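2.1. Difference between `--inuse/alloc_space` and `--inuse/alloc_objects`

In general:

Use --inuse_space to analyze the program's resident memory usage;
Use --alloc_objects to analyze temporary allocations; reducing them can speed the program up.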
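
The difference between "in use" and "allocated" can be felt with the runtime counters. In the sketch below, MemStats.TotalAlloc stands in for the alloc_* view (it only grows) and MemStats.HeapAlloc for the inuse_* view (it drops after GC). This is an analogy under that assumption, not pprof's own accounting, and the name churn is made up:

```go
package main

import (
	"fmt"
	"runtime"
)

var sink []byte // package-level so the buffers are heap-allocated

// churn allocates count temporary buffers of size bytes and drops them all.
// It returns how much the cumulative counter grew and what is live afterwards.
func churn(count, size int) (allocated, live uint64) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < count; i++ {
		sink = make([]byte, size)
	}
	sink = nil
	runtime.GC() // the temporary buffers are unreachable now
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc, after.HeapAlloc
}

func main() {
	allocated, live := churn(64, 1<<20)
	fmt.Printf("allocated during churn: %d bytes, live afterwards: %d bytes\n", allocated, live)
}
```

Code like this barely shows up under --inuse_space but dominates --alloc_space and --alloc_objects, which is why the allocation views are the ones to check for speed.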

3. go-torch

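Besides using go tool pprof directly, we can use the more intuitive flame graph. go-torch generates flame graphs for Go programs, and it relies on pprof and go tool pprof itself. For installation, see the project's documentation. Memory profiling is supported only in versions after commit a4daa2b.

We can run: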

go-torch -alloc_space http://127.0.0.1:8080/debug/pprof/heap --colors=mem
go-torch -inuse_space http://127.0.0.1:8080/debug/pprof/heap --colors=mem

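Note: the -alloc_space/-inuse_space flags conflict with flags such as -u/-b. When using -alloc_space/-inuse_space, append the pprof resource directly after the flags instead of specifying it with -u/-b. This comes from how go-torch parses its arguments, which becomes clear from its source code. Also make sure the URL used for memory analysis ends in heap: the default path is profile, which is for CPU analysis.

The two commands above produce two flame graphs with alloc_space and inuse_space semantics, for example alloc_space.svg and inuse_space.svg. Open them in a browser. Each graph looks like the cross-section of a mountain range: from bottom to top is the call stack of each function, so the height of a peak tracks the call depth, and its width is proportional to the amount of memory in use or allocated. Look for wide, flat peaks; those are usually the places to optimize.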

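4. Optimization suggestions

Debugging performance issues in Go programs offers some common optimization suggestions:

4.1. Combine many small objects into one larger object

4.2. Eliminate unnecessary pointer indirection; prefer embedding by value

For example, use bytes.Buffer instead of *bytes.Buffer as a struct member: with the pointer, two objects are allocated to complete the reference.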
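
A small sketch of the bytes.Buffer case, counting allocations with testing.AllocsPerRun. The type and function names (withPointer, withValue and so on) are invented for this illustration:

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

type withPointer struct{ buf *bytes.Buffer } // buffer is a separate heap object
type withValue struct{ buf bytes.Buffer }    // buffer lives inside the struct

// Package-level sinks force the objects to escape to the heap.
var (
	sinkP *withPointer
	sinkV *withValue
)

func newWithPointer() *withPointer { return &withPointer{buf: new(bytes.Buffer)} }
func newWithValue() *withValue     { return &withValue{} }

// allocsPointer and allocsValue report heap allocations per constructed object.
func allocsPointer() float64 {
	return testing.AllocsPerRun(100, func() { sinkP = newWithPointer() })
}

func allocsValue() float64 {
	return testing.AllocsPerRun(100, func() { sinkV = newWithValue() })
}

func main() {
	fmt.Printf("allocations per object: pointer member=%.0f, value member=%.0f\n",
		allocsPointer(), allocsValue())
}
```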

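4.3. When local variables escape, group them together

The idea is the same as in 4.1: reduce the number of allocated objects and thereby the pressure on the GC. For example, the following code

for k, v := range m {
	k, v := k, v   // copy for capturing by the goroutine
	go func() {
		// use k and v
	}()
}

can be changed to:

for k, v := range m {
	x := struct{ k, v string }{k, v}   // copy for capturing by the goroutine
	go func() {
		// use x.k and x.v
	}()
}

After the change, the escaping object is x, so the two objects k and v are reduced to one.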

4.4. Preallocate `[]byte`

When we know fairly well how many bytes a []byte will need, we can use an array to preallocate that memory. For example:

type X struct {
	buf      []byte
	bufArray [16]byte // Buf usually does not grow beyond 16 bytes.
}

func MakeX() *X {
	x := &X{}
	// Preinitialize buf with the backing array.
	x.buf = x.bufArray[:0]
	return x
}
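
A usage sketch for the type above: as long as appends stay within 16 bytes, buf keeps using bufArray and no extra allocation is needed; once it outgrows the array, append moves the data to a new backing array. The inArray helper is an addition made for this illustration:

```go
package main

import "fmt"

type X struct {
	buf      []byte
	bufArray [16]byte // buf usually does not grow beyond 16 bytes
}

func MakeX() *X {
	x := &X{}
	x.buf = x.bufArray[:0] // preinitialize buf with the backing array
	return x
}

// inArray reports whether buf still uses bufArray as its backing store.
func (x *X) inArray() bool {
	return cap(x.buf) > 0 && &x.buf[:1][0] == &x.bufArray[0]
}

func main() {
	x := MakeX()
	x.buf = append(x.buf, "hello"...)
	fmt.Println(x.inArray()) // true: 5 bytes fit into bufArray

	x.buf = append(x.buf, make([]byte, 32)...)
	fmt.Println(x.inArray()) // false: 37 bytes forced a new allocation
}
```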

4.5. Use the smallest types that fit

When constants or counter fields do not need a wide range, we can usually declare them as int8.
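
The effect on object size can be checked with unsafe.Sizeof. The two struct types below are hypothetical; note that int8 only holds values from -128 to 127, so this applies only when the range is known to be small:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Two hypothetical counter structs that differ only in field width.
type wideCounters struct{ hits, misses, retries int64 }
type narrowCounters struct{ hits, misses, retries int8 }

// sizes returns the in-memory size of each struct in bytes.
func sizes() (wide, narrow uintptr) {
	return unsafe.Sizeof(wideCounters{}), unsafe.Sizeof(narrowCounters{})
}

func main() {
	w, n := sizes()
	fmt.Printf("wide=%d bytes, narrow=%d bytes\n", w, n) // prints "wide=24 bytes, narrow=3 bytes"
}
```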

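4.6. Reduce unnecessary pointers

When an object contains no pointers (note: strings, slices, maps and chans contain implicit pointers), it has little effect on GC scanning. For example, a 1GB byte slice is in fact only a handful of objects and does not affect garbage collection time. So we should reduce pointer use wherever we can.

4.7. Use `sync.Pool` to cache frequently used objects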
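
A minimal sketch of this suggestion, assuming the cached object is a *bytes.Buffer; the names bufPool and greet are made up for the example:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet builds a string with a pooled buffer instead of allocating a new one.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from its last use
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so reuse of buf is safe
}

func main() {
	fmt.Println(greet("pprof")) // prints "hello, pprof"
}
```

Objects in a sync.Pool may be dropped by the garbage collector at any time, so the pool suits short-lived scratch objects and not state that must persist.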
