ZFS case : top CPU 100%sy, when no free memory trigger it.

Recently a system has repeatedly shown its load average suddenly spiking to several hundred and then dropping back down.
Matching the spike timestamps against the database log shows large numbers of entries like the following at exactly those times:

"UPDATE waiting",2015-01-09 01:38:47 CST,979/7,2927976054,LOG,00000,"process 26366 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1117.676 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:36 CST,541/8,2927976307,LOG,00000,"process 25936 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1219.762 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:48 CST,1018/64892,2929458056,LOG,00000,"process 26439 still waiting for ExclusiveLock on extension of relation 686061993 of database 35078604 after 1000.105 ms",
.........

The waits are for extension locks on a handful of relations:

select 686062002::regclass;
          regclass
-----------------------------
 pg_toast.pg_toast_686061993
(1 row)

select relname from pg_class where reltoastrelid=686062002;
     relname
------------------
 tbl_xxx_20150109
(1 row)
Time: 4.643 ms
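
To catch the waiters in the act during a spike, pg_locks can be polled for extension locks. A minimal sketch in bash + psql (the database name and the pg_locks column names, per PostgreSQL 9.2+, are assumptions; adjust for your environment):

# poll for relation-extension lock waits once a second (hypothetical DB name)
while sleep 1; do
  psql -d postgres -At -c "
    SELECT now(), pid, relation::regclass, granted
    FROM pg_locks
    WHERE locktype = 'extend';"
done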

At the same time, dmesg on the system shows:

postgres: page allocation failure. order:1, mode:0x20
Pid: 20427, comm: postgres Tainted: P           ---------------    2.6.32-504.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff81173332>] ? kmem_getpages+0x62/0x170
 [<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8144d992>] ? sk_clone+0x22/0x2e0
 [<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470
 [<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460
 [<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490
 [<ffffffffa0207557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
 [<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff81496ac5>] ? ip_rcv+0x275/0x350
 [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff81460588>] ? netif_receive_skb+0x58/0x60
 [<ffffffff81460690>] ? napi_skb_finish+0x50/0x70
 [<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01a7d91>] ? igb_poll+0x981/0x1010 [igb]
 [<ffffffff814b59c0>] ? tcp_delack_timer+0x0/0x270
 [<ffffffff814b3af9>] ? tcp_send_ack+0xd9/0x120
 [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
 [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0
 [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
 [<ffffffff8107d765>] ? irq_exit+0x85/0x90
 [<ffffffff81533b45>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff8116f5f9>] ? compaction_alloc+0x269/0x4b0
 [<ffffffff8116f552>] ? compaction_alloc+0x1c2/0x4b0
 [<ffffffff811799fa>] ? migrate_pages+0xaa/0x480
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff8116f390>] ? compaction_alloc+0x0/0x4b0
 [<ffffffff8116e9ea>] ? compact_zone+0x61a/0xba0
 [<ffffffff8116f01c>] ? compact_zone_order+0xac/0x100
 [<ffffffff8116f151>] ? try_to_compact_pages+0xe1/0x120
 [<ffffffff81133b6a>] ? __alloc_pages_direct_compact+0xda/0x1b0
 [<ffffffff81134055>] ? __alloc_pages_nodemask+0x415/0x8d0
 [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8118845d>] ? do_huge_pmd_anonymous_page+0x14d/0x3b0
 [<ffffffff8114fdb0>] ? handle_mm_fault+0x2f0/0x300
 [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff8152ae5e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8152ffbe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152d375>] ? page_fault+0x25/0x30

This is a log table with 4 indexes. One variable-length column stores fairly long values (hence the TOAST storage), for example:

DxxxxxxxxxxxxzwLlyyDd7xGd7^7xxwLDxyD@5xHB7^if5^vv4&DJCEL7xxxCFyhsxxxd4x~j2%$BB%ChkzHlzzvxBwqn5^DDCFexzwC@zyLDz
zC~zyDDzyCbAyyh3M~v5^DDCHvBBy%j0%iL4^fJB%K1xxxB%G1wz~h2M%B4%qn5&7xxwyPs!$xJ!Dd7xCb3^DFCGLnzyzlzyP7zyCJ7x)Lx^xxxxxxxy73$rLB&DND
zL5zy~xxyCt4xPj4%DJCE~DzyP#zyLPzyypxxx3&~DB^P1zzC5zye5wzz10MCb3^Gp4^DLCEiNywi$yzvxBwL73&$F7%7xzwG5zyy5wyah4MbzB%C1DzL9zyf5yyG9z
y!1zyLJxyCt4xPP5%nLB&xxxx
&7xxwjHzzi#yyi$yzi$yzmHIPm^K@CbAzzh5MDBCGLxxxxwz~h5M$JB%DxDzGlzyH5zyL7zzylzyC9AyLxxx7xxwC5AyzNxxxxxx5%Pp0
^~d0&6NzwK1wyzN1M%xxxx^P74^DJCGD7yyvBByiF4&Pt0%~d0&6hwwvDByiF4&et5xxxxxx17%$$1^DBCFGfwxzh2M!j1%qv5^DLCF!NzyH5zzvJBy%h4
%aD4&%v4^61zwDnyyK5wyzN0Mxxxx5$71zwCd7xCv0&fj5&(h1%yNc%mf7%71zwxxx%a@4%rpd^a5d$71zwCt5x!l1$~^l^LDx&K1wzmh5MxxxxxxxxxxGr4&Dvd&$jl
%LBx%K1wzPh5M$Ll^yn1&Ht6+fxxxxxxwC@5xqnxxx&~90^fj5+Oh5%71zwDJ7xH~j&yDh$G@5^7xxxx^Gp4@DLCEf1yw7b0~bn0^%^i&HDzyebz
y6)zyLBzyfdyyvxBwH#5%GP5^nvd&$LB^DN5zqHA^%P1^nLlEDL5%$@n^i#4^$J7%nPn+bzF@Ct4xD~1&GB4^add%7xxw!7zybBxyC@5xqn5xxxxx!DLCEvBxxxd6&!vL@7xx
w)dyECn5&)B1zxxxxxOz)D&HND&C9DOOl1yzpD&xxxxxx1^6hzwLrzyf3zxxx&L73+Gp4@DLCEiNywi$yzvxBw$h5%Hdf&(l9%zh0%nHB^D1zz~7zyvzB
yHl6%jl4!!Fg^rLB%DhDzC5xxxxb7j&aH4^)txxxDzvxBw$v0%DJCEzNxxxxzyzr1y~Lzx(vA&(@zyrtCyy5DxxxHl5OHbDO$3DxxxyyN0MD~1%afd&71z
wDd7xj17xxxxx$)7^7N2wq8*=
A new table is created every day, so data blocks are constantly being extended. In theory extension is fast and should not cause waits like these; moreover, the data volume and concurrency at the time of the problem were normal.
For background on this wait event, see an earlier article about extension-lock waits during bulk loading:
http://blog.163.com/digoal@126/blog/static/163877040201392641033482

That case is unrelated to the performance problem described here.
This one looked like a ZFS problem, and the final diagnosis confirmed it:
free memory shrinks steadily, and the moment it reaches 0, the load spikes.
Environment:
CentOS 6.x x64
kernel 2.6.32-504.el6.x86_64
zfs versions:
zfs-0.6.3-1.1.el6.x86_64
libzfs2-0.6.3-1.1.el6.x86_64
zfs-dkms-0.6.3-1.1.el6.noarch

Server memory: 384 GB.
Database settings: shared_buffers=20GB, maintenance_work_mem=2GB, autovacuum_max_workers=6.
Not counting work_mem, the database can use at most about 32 GB (20 GB of shared buffers plus up to 6 autovacuum workers x 2 GB maintenance_work_mem).
That still leaves 300+ GB for the OS and ZFS.
The zfs module parameters are as follows:

cd /sys/module/zfs/parameters
# grep '' *|sort
l2arc_feed_again:1
l2arc_feed_min_ms:200
l2arc_feed_secs:1
l2arc_headroom:2
l2arc_headroom_boost:200
l2arc_nocompress:0
l2arc_noprefetch:1
l2arc_norw:0
l2arc_write_boost:8388608
l2arc_write_max:8388608
metaslab_debug_load:0
metaslab_debug_unload:0
spa_asize_inflation:24
spa_config_path:/etc/zfs/zpool.cache
zfetch_array_rd_sz:1048576
zfetch_block_cap:256
zfetch_max_streams:8
zfetch_min_sec_reap:2
zfs_arc_grow_retry:5
zfs_arc_max:10240000000
zfs_arc_memory_throttle_disable:1
zfs_arc_meta_limit:0
zfs_arc_meta_prune:1048576
zfs_arc_min:0
zfs_arc_min_prefetch_lifespan:1000
zfs_arc_p_aggressive_disable:1
zfs_arc_p_dampener_disable:1
zfs_arc_shrink_shift:5
zfs_autoimport_disable:0
zfs_dbuf_state_index:0
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:1000000
zfs_dedup_prefetch:1
zfs_delay_min_dirty_percent:60
zfs_delay_scale:500000
zfs_dirty_data_max:10240000000
zfs_dirty_data_max_max:101595342848
zfs_dirty_data_max_max_percent:25
zfs_dirty_data_max_percent:10
zfs_dirty_data_sync:67108864
zfs_disable_dup_eviction:0
zfs_expire_snapshot:300
zfs_flags:1
zfs_free_min_time_ms:1000
zfs_immediate_write_sz:32768
zfs_mdcomp_disable:0
zfs_nocacheflush:0
zfs_nopwrite_enabled:1
zfs_no_scrub_io:0
zfs_no_scrub_prefetch:0
zfs_pd_blks_max:100
zfs_prefetch_disable:0
zfs_read_chunk_size:1048576
zfs_read_history:0
zfs_read_history_hits:0
zfs_recover:0
zfs_resilver_delay:2
zfs_resilver_min_time_ms:3000
zfs_scan_idle:50
zfs_scan_min_time_ms:1000
zfs_scrub_delay:4
zfs_send_corrupt_data:0
zfs_sync_pass_deferred_free:2
zfs_sync_pass_dont_compress:5
zfs_sync_pass_rewrite:2
zfs_top_maxinflight:32
zfs_txg_history:0
zfs_txg_timeout:5
zfs_vdev_aggregation_limit:131072
zfs_vdev_async_read_max_active:3
zfs_vdev_async_read_min_active:1
zfs_vdev_async_write_active_max_dirty_percent:60
zfs_vdev_async_write_active_min_dirty_percent:30
zfs_vdev_async_write_max_active:10
zfs_vdev_async_write_min_active:1
zfs_vdev_cache_bshift:16
zfs_vdev_cache_max:16384
zfs_vdev_cache_size:0
zfs_vdev_max_active:1000
zfs_vdev_mirror_switch_us:10000
zfs_vdev_read_gap_limit:32768
zfs_vdev_scheduler:noop
zfs_vdev_scrub_max_active:2
zfs_vdev_scrub_min_active:1
zfs_vdev_sync_read_max_active:10
zfs_vdev_sync_read_min_active:10
zfs_vdev_sync_write_max_active:10
zfs_vdev_sync_write_min_active:10
zfs_vdev_write_gap_limit:4096
zfs_zevent_cols:80
zfs_zevent_console:0
zfs_zevent_len_max:768
zil_replay_disable:0
zil_slog_limit:1048576
zio_bulk_flags:0
zio_delay_max:30000
zio_injection_enabled:0
zio_requeue_io_start_cut_in_line:1
zvol_inhibit_dev:0
zvol_major:230
zvol_max_discard_blocks:16384
zvol_threads:32

These parameters are documented in:

man /usr/share/man/man5/zfs-module-parameters.5.gz

zpool properties:

# zpool get all zp1
NAME  PROPERTY               VALUE                  SOURCE
zp1   size                   40T                    -
zp1   capacity               2%                     -
zp1   altroot                -                      default
zp1   health                 ONLINE                 -
zp1   guid                   15254203672861282738   default
zp1   version                -                      default
zp1   bootfs                 -                      default
zp1   delegation             on                     default
zp1   autoreplace            off                    default
zp1   cachefile              -                      default
zp1   failmode               wait                   default
zp1   listsnapshots          off                    default
zp1   autoexpand             off                    default
zp1   dedupditto             0                      default
zp1   dedupratio             1.00x                  -
zp1   free                   39.0T                  -
zp1   allocated              995G                   -
zp1   readonly               off                    -
zp1   ashift                 12                     local
zp1   comment                -                      default
zp1   expandsize             0                      -
zp1   freeing                0                      default
zp1   feature@async_destroy  enabled                local
zp1   feature@empty_bpobj    active                 local
zp1   feature@lz4_compress   active                 local

zfs dataset properties:

# zfs get all zp1/data_a0
NAME         PROPERTY              VALUE                  SOURCE
zp1/data_a0  type                  filesystem             -
zp1/data_a0  creation              Thu Dec 18 10:30 2014  -
zp1/data_a0  used                  98.8G                  -
zp1/data_a0  available             34.1T                  -
zp1/data_a0  referenced            98.8G                  -
zp1/data_a0  compressratio         1.00x                  -
zp1/data_a0  mounted               yes                    -
zp1/data_a0  quota                 none                   default
zp1/data_a0  reservation           none                   default
zp1/data_a0  recordsize            128K                   default
zp1/data_a0  mountpoint            /data_a0               local
zp1/data_a0  sharenfs              off                    default
zp1/data_a0  checksum              on                     default
zp1/data_a0  compression           off                    local
zp1/data_a0  atime                 off                    inherited from zp1
zp1/data_a0  devices               on                     default
zp1/data_a0  exec                  on                     default
zp1/data_a0  setuid                on                     default
zp1/data_a0  readonly              off                    default
zp1/data_a0  zoned                 off                    default
zp1/data_a0  snapdir               hidden                 default
zp1/data_a0  aclinherit            restricted             default
zp1/data_a0  canmount              on                     default
zp1/data_a0  xattr                 sa                     local
zp1/data_a0  copies                1                      default
zp1/data_a0  version               5                      -
zp1/data_a0  utf8only              off                    -
zp1/data_a0  normalization         none                   -
zp1/data_a0  casesensitivity       sensitive              -
zp1/data_a0  vscan                 off                    default
zp1/data_a0  nbmand                off                    default
zp1/data_a0  sharesmb              off                    default
zp1/data_a0  refquota              none                   default
zp1/data_a0  refreservation        none                   default
zp1/data_a0  primarycache          metadata               local
zp1/data_a0  secondarycache        all                    local
zp1/data_a0  usedbysnapshots       0                      -
zp1/data_a0  usedbydataset         98.8G                  -
zp1/data_a0  usedbychildren        0                      -
zp1/data_a0  usedbyrefreservation  0                      -
zp1/data_a0  logbias               latency                default
zp1/data_a0  dedup                 off                    default
zp1/data_a0  mlslabel              none                   default
zp1/data_a0  sync                  standard               default
zp1/data_a0  refcompressratio      1.00x                  -
zp1/data_a0  written               98.8G                  -
zp1/data_a0  logicalused           98.7G                  -
zp1/data_a0  logicalreferenced     98.7G                  -
zp1/data_a0  snapdev               hidden                 default
zp1/data_a0  acltype               off                    default
zp1/data_a0  context               none                   default
zp1/data_a0  fscontext             none                   default
zp1/data_a0  defcontext            none                   default
zp1/data_a0  rootcontext           none                   default
zp1/data_a0  relatime              off                    default

The fix most likely has to start with the ARC.
ARC internals:
https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
man zfs-module-parameters
ARC tuning case studies:
http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/
https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2
On hosts with large memory, it is advisable to lower the ARC shrink shift so that each shrink pass reclaims only ~100 MB:

       zfs_arc_shrink_shift (int)
              log2(fraction of arc to reclaim)
              Default value: 5.
The default is 5, i.e. 1/32 of the ARC per shrink pass. With 384 GB of memory that can be as much as 12 GB, and shrinking 12 GB of ARC in one pass can hang the system for a long time.
To bring each pass down to roughly 100-200 MB, set zfs_arc_shrink_shift=11, i.e. 1/2048, which here is about 187.5 MB.
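
To sanity-check the arithmetic, here is a throwaway sketch that prints the per-pass reclaim size (arc size >> shift) for a few shift values, assuming for illustration that the ARC could grow to the machine's full 384 GB:

# per-pass ARC reclaim size = arc_size >> zfs_arc_shrink_shift
arc_bytes=$((384 * 1024 * 1024 * 1024))   # assumed 384 GB ARC, for illustration only
for shift in 5 10 11; do
  echo "shift=$shift -> $(( (arc_bytes >> shift) / 1024 / 1024 )) MB per shrink pass"
done
# shift=5  -> 12288 MB
# shift=10 -> 384 MB
# shift=11 -> 192 MB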

Quoting the dtrace.org case study referenced above:

Description: Semi-regular spikes in I/O latency on an SSD postgres server.
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.

And from the cupfighter.net article:

ARC shrink shift
Every second a process runs which checks if data can be removed from the ARC and evicts it. Default max 1/32nd of the ARC can be evicted at a time. This is limited because evicting large amounts of data from ARC stalls all other processes. Back when 8GB was a lot of memory 1/32nd meant 256MB max at a time. When you have 196GB of memory 1/32nd is 6.3GB, which can cause up to 20-30 seconds of unresponsiveness (depending on the record size).
This 1/32nd needs to be changed to make sure the max is set to ~100-200MB again, by adding the following to /etc/system:
set zfs:zfs_arc_shrink_shift=11
(where 11 means 1/2^11 or 1/2048th, 10 means 1/2^10 or 1/1024th, etc. Change depending on the amount of RAM in your system).

Note that /etc/system is the Solaris mechanism; on ZFS on Linux the same tunable is a module parameter, set later in this post. Combining the ARC behavior with the asynchronous dirty-write throttle, tune as follows:

       zfs_vdev_async_write_active_min_dirty_percent (int)
              When the pool has less than zfs_vdev_async_write_active_min_dirty_percent dirty data, use
              zfs_vdev_async_write_min_active to limit active async writes.  If the dirty data is between
              min and max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
              Default value: 30.

       zfs_vdev_async_write_active_max_dirty_percent (int)
              When the pool has more than zfs_vdev_async_write_active_max_dirty_percent dirty data, use
              zfs_vdev_async_write_max_active to limit active async writes.  If the dirty data is between
              min and max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
              Default value: 60.

       zfs_vdev_async_write_max_active (int)
              Maximum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".
              Default value: 10.

       zfs_vdev_async_write_min_active (int)
              Minimum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".
              Default value: 1.
The diagram below shows how asynchronous dirty writes are ramped up and throttled. Lowering zfs_vdev_async_write_active_min_dirty_percent shrinks the region where writes run at the minimum rate, and lowering zfs_vdev_async_write_active_max_dirty_percent makes the maximum rate kick in earlier; both make dirty data flush faster.
The trade-off is possible I/O contention with synchronous writes.

        |              o---------| <-- zfs_vdev_async_write_max_active
   ^    |             /^         |
   |    |            / |         |
 active |           /  |         |
  I/O   |          /   |         |
 count  |         /    |         |
   |    |        /     |         |
   |    |-------o      |         | <-- zfs_vdev_async_write_min_active
  0|____|_______^______|_________|
        0%      |      |       100% of zfs_dirty_data_max
                |      |
                |      `-- zfs_vdev_async_write_active_max_dirty_percent
                `--------- zfs_vdev_async_write_active_min_dirty_percent
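
As a worked illustration of the interpolation described in the man page (this mirrors the formula, it is not code taken from ZFS):

# queue depth for async writes at a given dirty percentage (linear interpolation)
min_active=1; max_active=10   # zfs_vdev_async_write_{min,max}_active defaults
min_pct=30;  max_pct=60       # zfs_vdev_async_write_active_{min,max}_dirty_percent defaults
dirty_pct=45                  # example: 45% of zfs_dirty_data_max is dirty
if   [ "$dirty_pct" -le "$min_pct" ]; then active=$min_active
elif [ "$dirty_pct" -ge "$max_pct" ]; then active=$max_active
else
  active=$(( min_active + (max_active - min_active) * (dirty_pct - min_pct) / (max_pct - min_pct) ))
fi
echo "dirty=$dirty_pct% -> up to $active concurrent async writes per vdev"   # prints 5
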
Separately, we need to cap the ARC size (arc max, not to be confused with the dirty-data max):
the database already takes a large share of memory, and an uncapped ARC grows without restraint.
Some articles suggest capping the ARC at 40% of total memory. (Total memory here is 384 GB, of which PostgreSQL shared buffers take 20 GB.)
http://blog.163.com/digoal@126/blog/static/163877040201462204333503

So what exactly should the cap be?
Check the current state, with the database already running:

# free
             total       used       free     shared    buffers     cached
Mem:     396856808  228812456  168044352   21633868      58744   45380060
The system has about 168 GB of free memory.
The ARC is already using about 20 GB:

# cat /proc/spl/kstat/zfs/arcstats |grep size
size                            4    19751851104
Reserving another 48 GB of that free memory for the OS and the database leaves 120 GB for ZFS.
Together with the ~20 GB the ARC already holds, it can use about 140 GB.
So set arc max to 140 GB (roughly 40% of total memory):

# echo 140000000000 > /sys/module/zfs/parameters/zfs_arc_max
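
Whether the new ceiling took effect can be verified against c_max in arcstats, which should now report the same value:

# runtime parameter vs. the ARC's own view of its maximum
cat /sys/module/zfs/parameters/zfs_arc_max
grep -E '^c_max' /proc/spl/kstat/zfs/arcstats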

Next, tune the dirty-data parameters.

Lower zfs_dirty_data_max to 1/5 of arc max = 28000000000 (it can be adjusted at runtime).

Speed up asynchronous writes:

zfs_vdev_async_write_active_min_dirty_percent=10
zfs_vdev_async_write_active_max_dirty_percent=30  (must be less than zfs_delay_min_dirty_percent)
zfs_delay_min_dirty_percent=60

After the runtime changes, also persist them as module load parameters:

# cd /sys/module/zfs/parameters/
# echo 140000000000 >zfs_arc_max
# echo 28000000000 >zfs_dirty_data_max
# echo 10 > zfs_vdev_async_write_active_min_dirty_percent
# echo 30 > zfs_vdev_async_write_active_max_dirty_percent
# echo 60 > zfs_delay_min_dirty_percent
# echo 11 > zfs_arc_shrink_shift

zfs module load options:

# vi /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=140000000000
options zfs zfs_dirty_data_max=28000000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11
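
A quick sanity check that the live values and the persisted file agree (just a convenience loop over the parameters touched above):

# compare live module parameters with the persisted configuration
cd /sys/module/zfs/parameters
for p in zfs_arc_max zfs_dirty_data_max zfs_vdev_async_write_active_min_dirty_percent \
         zfs_vdev_async_write_active_max_dirty_percent zfs_delay_min_dirty_percent \
         zfs_arc_shrink_shift; do
  echo "$p = $(cat $p)"
done
grep "options zfs" /etc/modprobe.d/zfs.conf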

Observation period...
Same picture as before: memory is eventually exhausted, and then CPU usage spikes.
Per-process memory consumption, however, is normal:

# ps -e --width=1024 -o pid,%mem,rss,size,sz,vsz,cmd --sort rss

rss    RSS    resident set size, the non-swapped physical memory that a task has used (in kilobytes). (alias rssize, rsz).
size   SZ     approximate amount of swap space that would be required if the process were to dirty all writable pages and then be swapped out. This number is very rough!
sz     SZ     size in physical pages of the core image of the process. This includes text, data, and stack space. Device mappings are currently excluded; this is subject to change. See vsz and rss.
vsz    VSZ    virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change. (alias vsize).

Meanwhile, sar shows the page cache (kbcached) growing steadily:
06:10:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
06:20:01 AM 219447748 177409060     44.70     23196  22965260  27351828      6.75
06:30:01 AM 219304016 177552792     44.74     24628  23080756  27348820      6.75
06:40:01 AM 218698000 178158808     44.89     26276  23638736  27365764      6.75
06:50:01 AM 218454732 178402076     44.95     27588  23852552  27365664      6.75
07:00:01 AM 218211060 178645748     45.02     28840  24066384  27365736      6.75
07:10:01 AM 218006588 178850220     45.07     30144  24231036  27366528      6.75
07:20:01 AM 217784072 179072736     45.12     31424  24412084  27365496      6.75
07:30:01 AM 217128620 179728188     45.29     32752  24970064  27370048      6.75
07:40:01 AM 216704964 180151844     45.39     34372  25331396  27369700      6.75
07:50:01 AM 216372456 180484352     45.48     35740  25610760  27371348      6.75
08:00:01 AM 216028392 180828416     45.57     37060  25890136  27393748      6.76
08:10:01 AM 214706196 182150612     45.90     38808  27120088  27400288      6.76
08:20:01 AM 213981920 182874888     46.08     42712  27798924  27413000      6.76
08:30:01 AM 213551104 183305704     46.19     44268  28193028  27411516      6.76
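
To correlate the next spike with memory exhaustion, something like the following can be left running (a logging sketch; the awk field positions match the /proc formats shown in this post, and the log path is arbitrary):

# log free memory, page cache and ARC size once a minute
while sleep 60; do
  free_kb=$(awk '/^MemFree/ {print $2}' /proc/meminfo)
  cached_kb=$(awk '/^Cached/ {print $2}' /proc/meminfo)
  arc_bytes=$(awk '/^size / {print $3}' /proc/spl/kstat/zfs/arcstats)
  echo "$(date '+%F %T') free_kb=$free_kb cached_kb=$cached_kb arc_bytes=$arc_bytes"
done >> /tmp/mem_arc.log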

Adjust the kernel's tendency to reclaim the dentry/inode cache:

vfs_cache_pressure
------------------
This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.
Even with this set as low as 1, the cache seemingly keeps growing.
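
To check whether the dentry/inode caches are what is actually growing, the kernel exposes their state directly:

# current vfs_cache_pressure plus dentry and inode cache counters
sysctl vm.vfs_cache_pressure
cat /proc/sys/fs/dentry-state   # nr_dentry, nr_unused, ...
cat /proc/sys/fs/inode-state
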
Since no dirty data is accumulating, the kernel dirty-data parameters need no adjustment here:

# cat /proc/meminfo |grep -i -E "dirt|back"
Dirty:                 0 kB
Writeback:             0 kB
WritebackTmp:          0 kB
==============================================================
dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

If dirty_background_bytes is written, dirty_background_ratio becomes a function
of its value (dirty_background_bytes / the amount of dirtyable system memory).

==============================================================
dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.

==============================================================
dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory).

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.

==============================================================
dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

==============================================================
dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.

==============================================================
dirty_writeback_centisecs

The kernel flusher threads will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.

For now, add a cron script that frees the cache during idle hours.

/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt

drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.

To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
        echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
        echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.

# crontab -e
30 4 * * * /usr/local/bin/free.sh >>/tmp/free.log 2>&1

# cat /usr/local/bin/free.sh
#!/bin/bash
. /root/.bash_profile
. /etc/profile
echo "`date +%F%T` start drop cache."
free
# flush dirty data first: drop_caches only frees clean objects
sync
echo 3 > /proc/sys/vm/drop_caches
echo "`date +%F%T` end drop cache."
free
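
Before relying on cron, one manual run confirms the script works and that the cache is actually released:

# make the script executable and test it once by hand
chmod +x /usr/local/bin/free.sh
/usr/local/bin/free.sh >>/tmp/free.log 2>&1
tail -n 20 /tmp/free.log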

The final parameters are as follows; with them in place the load returned to normal.
The approach:
Reduce the dirty-data ratio and flush dirty data more frequently.
Restrict the ARC to metadata only; do not cache data pages.

sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.dirty_background_bytes=102400000
sysctl -w vm.dirty_bytes=102400000
sysctl -w vm.dirty_expire_centisecs=10
sysctl -w vm.dirty_writeback_centisecs=10
sysctl -w vm.swappiness=0
sysctl -w vm.vfs_cache_pressure=80

# vi /etc/sysctl.conf
vm.zone_reclaim_mode=1
vm.dirty_background_bytes=102400000
vm.dirty_bytes=102400000
vm.dirty_expire_centisecs=10
vm.dirty_writeback_centisecs=10
vm.swappiness=0
vm.vfs_cache_pressure=80
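
After editing /etc/sysctl.conf, reload and spot-check the persisted values:

# apply the settings from /etc/sysctl.conf and verify a few of them
sysctl -p
sysctl vm.dirty_bytes vm.dirty_background_bytes vm.vfs_cache_pressure
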
# cd /sys/module/zfs/parameters/
# cat zfs_arc_max
10240000000

Check the ARC statistics in /proc/spl/kstat/zfs/arcstats: metadata is using less than 2 GB, so a 10 GB cap is plenty for now. If it turns out to be too small, it can be raised later.

meta_size                       4    1952531968

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10240000000
options zfs zfs_dirty_data_max=800000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11

Set primarycache to metadata because Linux already has its own page cache, so there is no point caching the same pages twice.
ZFS has the same double-caching problem as PostgreSQL (shared buffers on top of the OS page cache), unless direct I/O is used.
# zfs set primarycache=metadata zp1
# zfs set primarycache=metadata zp1/data_a0
# zfs set primarycache=metadata zp1/data_a1
# zfs set primarycache=metadata zp1/data_b0
# zfs set primarycache=metadata zp1/data_b1
# zfs set primarycache=metadata zp1/data_c0
# zfs set primarycache=metadata zp1/data_c1
# zfs set primarycache=metadata zp1/data_ssd0
# zfs set primarycache=metadata zp1/data_ssd1
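
Equivalently, the property could be applied to the pool and all descendants in one loop rather than dataset by dataset (a sketch; note it also touches any datasets not listed above):

# set primarycache=metadata on zp1 and all of its descendants
for ds in $(zfs list -H -o name -r zp1); do
  zfs set primarycache=metadata "$ds"
done
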
Also set recordsize to match the database block size:

# zfs set recordsize=16k zp1/data_a0    (WAL filesystem: wal_block_size=16k)
# zfs set recordsize=8k zp1/data_a0     (data filesystem: block_size=8k)
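
Finally, verify what actually applied:

# confirm recordsize and primarycache per dataset
zfs get -r -o name,property,value recordsize,primarycache zp1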

[References]
1. http://blog.163.com/digoal@126/blog/static/163877040201392641033482
2. http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
3. https://github.com/zfsonlinux/zfs/issues/258
4. http://blog.163.com/digoal@126/blog/static/163877040201462204333503
5. https://github.com/spacelama
6. https://github.com/mharsch
7. https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
8. man zfs-module-parameters (rpm -ql zfs)
9. http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/
10. https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2
11. /proc/spl/*
