ZFS case: top shows 100% system CPU (sy), triggered when free memory runs out.
"UPDATE waiting",2015-01-09 01:38:47 CST,979/7,2927976054,LOG,00000,"process 26366 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1117.676 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:36 CST,541/8,2927976307,LOG,00000,"process 25936 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1219.762 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:48 CST,1018/64892,2929458056,LOG,00000,"process 26439 still waiting for ExclusiveLock on extension of relation 686061993 of database 35078604 after 1000.105 ms",
.........
These waits correspond to relation-extension locks on a few objects:
select 686062002::regclass;
          regclass
-----------------------------
 pg_toast.pg_toast_686061993
(1 row)

select relname from pg_class where reltoastrelid=686062002;
     relname
------------------
 tbl_xxx_20150109
(1 row)
Time: 4.643 ms
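To see in real time which sessions are queued on these locks, one can query pg_locks; a minimal sketch (the database name mydb is a placeholder; the 'extend' locktype marks relation-extension locks):

# psql -d mydb -c "
SELECT pid, relation::regclass, mode, granted
FROM pg_locks
WHERE locktype = 'extend';"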
At the same time, dmesg on the system showed:
postgres: page allocation failure. order:1, mode:0x20
Pid: 20427, comm: postgres Tainted: P --------------- 2.6.32-504.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff81173332>] ? kmem_getpages+0x62/0x170
 [<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8144d992>] ? sk_clone+0x22/0x2e0
 [<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470
 [<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460
 [<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490
 [<ffffffffa0207557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
 [<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff81496ac5>] ? ip_rcv+0x275/0x350
 [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff81460588>] ? netif_receive_skb+0x58/0x60
 [<ffffffff81460690>] ? napi_skb_finish+0x50/0x70
 [<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01a7d91>] ? igb_poll+0x981/0x1010 [igb]
 [<ffffffff814b59c0>] ? tcp_delack_timer+0x0/0x270
 [<ffffffff814b3af9>] ? tcp_send_ack+0xd9/0x120
 [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
 [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0
 [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
 [<ffffffff8107d765>] ? irq_exit+0x85/0x90
 [<ffffffff81533b45>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff8116f5f9>] ? compaction_alloc+0x269/0x4b0
 [<ffffffff8116f552>] ? compaction_alloc+0x1c2/0x4b0
 [<ffffffff811799fa>] ? migrate_pages+0xaa/0x480
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff8116f390>] ? compaction_alloc+0x0/0x4b0
 [<ffffffff8116e9ea>] ? compact_zone+0x61a/0xba0
 [<ffffffff8116f01c>] ? compact_zone_order+0xac/0x100
 [<ffffffff8116f151>] ? try_to_compact_pages+0xe1/0x120
 [<ffffffff81133b6a>] ? __alloc_pages_direct_compact+0xda/0x1b0
 [<ffffffff81134055>] ? __alloc_pages_nodemask+0x415/0x8d0
 [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8118845d>] ? do_huge_pmd_anonymous_page+0x14d/0x3b0
 [<ffffffff8114fdb0>] ? handle_mm_fault+0x2f0/0x300
 [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff8152ae5e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8152ffbe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152d375>] ? page_fault+0x25/0x30
This is a log table with 4 indexes. One variable-length column stores fairly long values (hence TOAST storage is used), for example:
DxxxxxxxxxxxxzwLlyyDd7xGd7^7xxwLDxyD@5xHB7^if5^vv4&DJCEL7xxxCFyhsxxxd4x~j2%$BB%ChkzHlzzvxBwqn5^DDCFexzwC@zyLDz
zC~zyDDzyCbAyyh3M~v5^DDCHvBBy%j0%iL4^fJB%K1xxxB%G1wz~h2M%B4%qn5&7xxwyPs!$xJ!Dd7xCb3^DFCGLnzyzlzyP7zyCJ7x)Lx^xxxxxxxy73$rLB&DND
zL5zy~xxyCt4xPj4%DJCE~DzyP#zyLPzyypxxx3&~DB^P1zzC5zye5wzz10MCb3^Gp4^DLCEiNywi$yzvxBwL73&$F7%7xzwG5zyy5wyah4MbzB%C1DzL9zyf5yyG9z
y!1zyLJxyCt4xPP5%nLB&xxxx
&7xxwjHzzi#yyi$yzi$yzmHIPm^K@CbAzzh5MDBCGLxxxxwz~h5M$JB%DxDzGlzyH5zyL7zzylzyC9AyLxxx7xxwC5AyzNxxxxxx5%Pp0
^~d0&6NzwK1wyzN1M%xxxx^P74^DJCGD7yyvBByiF4&Pt0%~d0&6hwwvDByiF4&et5xxxxxx17%$$1^DBCFGfwxzh2M!j1%qv5^DLCF!NzyH5zzvJBy%h4
%aD4&%v4^61zwDnyyK5wyzN0Mxxxx5$71zwCd7xCv0&fj5&(h1%yNc%mf7%71zwxxx%a@4%rpd^a5d$71zwCt5x!l1$~^l^LDx&K1wzmh5MxxxxxxxxxxGr4&Dvd&$jl
%LBx%K1wzPh5M$Ll^yn1&Ht6+fxxxxxxwC@5xqnxxx&~90^fj5+Oh5%71zwDJ7xH~j&yDh$G@5^7xxxx^Gp4@DLCEf1yw7b0~bn0^%^i&HDzyebz
y6)zyLBzyfdyyvxBwH#5%GP5^nvd&$LB^DN5zqHA^%P1^nLlEDL5%$@n^i#4^$J7%nPn+bzF@Ct4xD~1&GB4^add%7xxw!7zybBxyC@5xqn5xxxxx!DLCEvBxxxd6&!vL@7xx
w)dyECn5&)B1zxxxxxOz)D&HND&C9DOOl1yzpD&xxxxxx1^6hzwLrzyf3zxxx&L73+Gp4@DLCEiNywi$yzvxBw$h5%Hdf&(l9%zh0%nHB^D1zz~7zyvzB
yHl6%jl4!!Fg^rLB%DhDzC5xxxxb7j&aH4^)txxxDzvxBw$v0%DJCEzNxxxxzyzr1y~Lzx(vA&(@zyrtCyy5DxxxHl5OHbDO$3DxxxyyN0MD~1%afd&71z
wDd7xj17xxxxx$)7^7N2wq8*=
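To gauge how much of this table has spilled into TOAST, a quick check (sketch; mydb is again a placeholder, the heap and TOAST relation names come from the query above):

# psql -d mydb -c "
SELECT pg_size_pretty(pg_relation_size('tbl_xxx_20150109'))            AS heap_size,
       pg_size_pretty(pg_relation_size('pg_toast.pg_toast_686061993')) AS toast_size;"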
cd /sys/module/zfs/parameters
# grep '' *|sort
l2arc_feed_again:1
l2arc_feed_min_ms:200
l2arc_feed_secs:1
l2arc_headroom:2
l2arc_headroom_boost:200
l2arc_nocompress:0
l2arc_noprefetch:1
l2arc_norw:0
l2arc_write_boost:8388608
l2arc_write_max:8388608
metaslab_debug_load:0
metaslab_debug_unload:0
spa_asize_inflation:24
spa_config_path:/etc/zfs/zpool.cache
zfetch_array_rd_sz:1048576
zfetch_block_cap:256
zfetch_max_streams:8
zfetch_min_sec_reap:2
zfs_arc_grow_retry:5
zfs_arc_max:10240000000
zfs_arc_memory_throttle_disable:1
zfs_arc_meta_limit:0
zfs_arc_meta_prune:1048576
zfs_arc_min:0
zfs_arc_min_prefetch_lifespan:1000
zfs_arc_p_aggressive_disable:1
zfs_arc_p_dampener_disable:1
zfs_arc_shrink_shift:5
zfs_autoimport_disable:0
zfs_dbuf_state_index:0
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:1000000
zfs_dedup_prefetch:1
zfs_delay_min_dirty_percent:60
zfs_delay_scale:500000
zfs_dirty_data_max:10240000000
zfs_dirty_data_max_max:101595342848
zfs_dirty_data_max_max_percent:25
zfs_dirty_data_max_percent:10
zfs_dirty_data_sync:67108864
zfs_disable_dup_eviction:0
zfs_expire_snapshot:300
zfs_flags:1
zfs_free_min_time_ms:1000
zfs_immediate_write_sz:32768
zfs_mdcomp_disable:0
zfs_nocacheflush:0
zfs_nopwrite_enabled:1
zfs_no_scrub_io:0
zfs_no_scrub_prefetch:0
zfs_pd_blks_max:100
zfs_prefetch_disable:0
zfs_read_chunk_size:1048576
zfs_read_history:0
zfs_read_history_hits:0
zfs_recover:0
zfs_resilver_delay:2
zfs_resilver_min_time_ms:3000
zfs_scan_idle:50
zfs_scan_min_time_ms:1000
zfs_scrub_delay:4
zfs_send_corrupt_data:0
zfs_sync_pass_deferred_free:2
zfs_sync_pass_dont_compress:5
zfs_sync_pass_rewrite:2
zfs_top_maxinflight:32
zfs_txg_history:0
zfs_txg_timeout:5
zfs_vdev_aggregation_limit:131072
zfs_vdev_async_read_max_active:3
zfs_vdev_async_read_min_active:1
zfs_vdev_async_write_active_max_dirty_percent:60
zfs_vdev_async_write_active_min_dirty_percent:30
zfs_vdev_async_write_max_active:10
zfs_vdev_async_write_min_active:1
zfs_vdev_cache_bshift:16
zfs_vdev_cache_max:16384
zfs_vdev_cache_size:0
zfs_vdev_max_active:1000
zfs_vdev_mirror_switch_us:10000
zfs_vdev_read_gap_limit:32768
zfs_vdev_scheduler:noop
zfs_vdev_scrub_max_active:2
zfs_vdev_scrub_min_active:1
zfs_vdev_sync_read_max_active:10
zfs_vdev_sync_read_min_active:10
zfs_vdev_sync_write_max_active:10
zfs_vdev_sync_write_min_active:10
zfs_vdev_write_gap_limit:4096
zfs_zevent_cols:80
zfs_zevent_console:0
zfs_zevent_len_max:768
zil_replay_disable:0
zil_slog_limit:1048576
zio_bulk_flags:0
zio_delay_max:30000
zio_injection_enabled:0
zio_requeue_io_start_cut_in_line:1
zvol_inhibit_dev:0
zvol_major:230
zvol_max_discard_blocks:16384
zvol_threads:32
These parameters are documented in:
man /usr/share/man/man5/zfs-module-parameters.5.gz
zpool properties
# zpool get all zp1
NAME PROPERTY VALUE SOURCE
zp1 size 40T -
zp1 capacity 2% -
zp1 altroot - default
zp1 health ONLINE -
zp1 guid 15254203672861282738 default
zp1 version - default
zp1 bootfs - default
zp1 delegation on default
zp1 autoreplace off default
zp1 cachefile - default
zp1 failmode wait default
zp1 listsnapshots off default
zp1 autoexpand off default
zp1 dedupditto 0 default
zp1 dedupratio 1.00x -
zp1 free 39.0T -
zp1 allocated 995G -
zp1 readonly off -
zp1 ashift 12 local
zp1 comment - default
zp1 expandsize 0 -
zp1 freeing 0 default
zp1 feature@async_destroy enabled local
zp1 feature@empty_bpobj active local
zp1 feature@lz4_compress active local
zfs dataset properties
# zfs get all zp1/data_a0
NAME PROPERTY VALUE SOURCE
zp1/data_a0 type filesystem -
zp1/data_a0 creation Thu Dec 18 10:30 2014 -
zp1/data_a0 used 98.8G -
zp1/data_a0 available 34.1T -
zp1/data_a0 referenced 98.8G -
zp1/data_a0 compressratio 1.00x -
zp1/data_a0 mounted yes -
zp1/data_a0 quota none default
zp1/data_a0 reservation none default
zp1/data_a0 recordsize 128K default
zp1/data_a0 mountpoint /data_a0 local
zp1/data_a0 sharenfs off default
zp1/data_a0 checksum on default
zp1/data_a0 compression off local
zp1/data_a0 atime off inherited from zp1
zp1/data_a0 devices on default
zp1/data_a0 exec on default
zp1/data_a0 setuid on default
zp1/data_a0 readonly off default
zp1/data_a0 zoned off default
zp1/data_a0 snapdir hidden default
zp1/data_a0 aclinherit restricted default
zp1/data_a0 canmount on default
zp1/data_a0 xattr sa local
zp1/data_a0 copies 1 default
zp1/data_a0 version 5 -
zp1/data_a0 utf8only off -
zp1/data_a0 normalization none -
zp1/data_a0 casesensitivity sensitive -
zp1/data_a0 vscan off default
zp1/data_a0 nbmand off default
zp1/data_a0 sharesmb off default
zp1/data_a0 refquota none default
zp1/data_a0 refreservation none default
zp1/data_a0 primarycache metadata local
zp1/data_a0 secondarycache all local
zp1/data_a0 usedbysnapshots 0 -
zp1/data_a0 usedbydataset 98.8G -
zp1/data_a0 usedbychildren 0 -
zp1/data_a0 usedbyrefreservation 0 -
zp1/data_a0 logbias latency default
zp1/data_a0 dedup off default
zp1/data_a0 mlslabel none default
zp1/data_a0 sync standard default
zp1/data_a0 refcompressratio 1.00x -
zp1/data_a0 written 98.8G -
zp1/data_a0 logicalused 98.7G -
zp1/data_a0 logicalreferenced 98.7G -
zp1/data_a0 snapdev hidden default
zp1/data_a0 acltype off default
zp1/data_a0 context none default
zp1/data_a0 fscontext none default
zp1/data_a0 defcontext none default
zp1/data_a0 rootcontext none default
zp1/data_a0 relatime off default
Fixing the problem most likely starts with the ARC:
zfs_arc_shrink_shift (int)
    log2(fraction of arc to reclaim)
    Default value: 5.
Description: Semi-regular spikes in I/O latency on an SSD postgres server.
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.

ARC shrink shift
Every second a process runs which checks if data can be removed from the ARC and evicts it. Default max 1/32nd of the ARC can be evicted at a time. This is limited because evicting large amounts of data from ARC stalls all other processes. Back when 8GB was a lot of memory 1/32nd meant 256MB max at a time. When you have 196GB of memory 1/32nd is 6.3GB, which can cause up to 20-30 seconds of unresponsiveness (depending on the record size).
This 1/32nd needs to be changed to make sure the max is set to ~100-200MB again, by adding the following to /etc/system:
set zfs:zfs_arc_shrink_shift=11
(where 11 means 1/2^11, i.e. 1/2048th; 10 means 1/2^10, i.e. 1/1024th, etc. Change depending on the amount of RAM in your system).
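A small helper for picking the shift on Linux, a sketch assuming zfs_arc_max is set to a non-zero value (it targets the ~100-200 MB eviction chunk recommended above, here ~150 MB):

#!/bin/bash
# Find the smallest shift whose eviction chunk (arc_max >> shift)
# falls at or below ~150 MB per pass.
arc_max=$(cat /sys/module/zfs/parameters/zfs_arc_max)
target=$((150 * 1024 * 1024))
s=0
while [ $((arc_max >> s)) -gt $target ]; do
    s=$((s + 1))
done
echo "suggested zfs_arc_shrink_shift=$s (~$((arc_max >> s)) bytes per eviction pass)"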
Combining how the ARC works with the asynchronous dirty-write delay mechanism, the tuning is as follows:
zfs_vdev_async_write_active_min_dirty_percent (int)
    When the pool has less than zfs_vdev_async_write_active_min_dirty_percent dirty data, use
    zfs_vdev_async_write_min_active to limit active async writes. If the dirty data is between min and
    max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
    Default value: 30.

zfs_vdev_async_write_active_max_dirty_percent (int)
    When the pool has more than zfs_vdev_async_write_active_max_dirty_percent dirty data, use
    zfs_vdev_async_write_max_active to limit active async writes. If the dirty data is between min and
    max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
    Default value: 60.

zfs_vdev_async_write_max_active (int)
    Maximum asynchronous write I/Os active to each device. See the section "ZFS I/O SCHEDULER".
    Default value: 10.

zfs_vdev_async_write_min_active (int)
    Minimum asynchronous write I/Os active to each device. See the section "ZFS I/O SCHEDULER".
    Default value: 1.
       |              o---------| <-- zfs_vdev_async_write_max_active
  ^    |             /^         |
  |    |            / |         |
active |           /  |         |
 I/O   |          /   |         |
count  |         /    |         |
       |        /     |         |
       |-------o      |         | <-- zfs_vdev_async_write_min_active
      0|_______^______|_________|
       0%      |      |       100% of zfs_dirty_data_max
               |      |
               |      `-- zfs_vdev_async_write_active_max_dirty_percent
               `--------- zfs_vdev_async_write_active_min_dirty_percent
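As a worked example of the interpolation, with the values used later in this case (min/max dirty percent 10/30, min/max active 1/10), a pool sitting at 20% of zfs_dirty_data_max gets an async-write limit of about 5 (sketch, integer math):

#!/bin/bash
# Linear interpolation of the active async-write limit.
min_pct=10; max_pct=30; min_active=1; max_active=10
dirty_pct=20    # hypothetical current dirty level, percent of zfs_dirty_data_max
limit=$(( min_active + (dirty_pct - min_pct) * (max_active - min_active) / (max_pct - min_pct) ))
echo "active async-write limit at ${dirty_pct}% dirty: $limit"    # -> 5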
# free
             total       used       free     shared    buffers     cached
Mem:     396856808  228812456  168044352   21633868      58744   45380060
# cat /proc/spl/kstat/zfs/arcstats |grep size
size 4 19751851104
# echo 140000000000 > /sys/module/zfs/parameters/zfs_arc_max
Next, set the dirty-data parameters.
Lower zfs_dirty_data_max to 1/5 of the new zfs_arc_max = 28000000000 (can be changed at runtime).
Speed up asynchronous writes by adjusting:
zfs_vdev_async_write_active_min_dirty_percent=10
zfs_vdev_async_write_active_max_dirty_percent=30 (must be kept below zfs_delay_min_dirty_percent)
zfs_delay_min_dirty_percent=60
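A quick sanity check of that ordering constraint against the live module parameters (sketch):

#!/bin/bash
cd /sys/module/zfs/parameters
min=$(cat zfs_vdev_async_write_active_min_dirty_percent)
max=$(cat zfs_vdev_async_write_active_max_dirty_percent)
delay=$(cat zfs_delay_min_dirty_percent)
if [ "$min" -lt "$max" ] && [ "$max" -lt "$delay" ]; then
    echo "OK: $min < $max < $delay"
else
    echo "WARNING: $min/$max/$delay not strictly increasing"
fi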
After changing them at runtime, it is recommended to persist them as module load parameters as well:
# cd /sys/module/zfs/parameters/
# echo 140000000000 >zfs_arc_max
# echo 28000000000 >zfs_dirty_data_max
# echo 10 > zfs_vdev_async_write_active_min_dirty_percent
# echo 30 > zfs_vdev_async_write_active_max_dirty_percent
# echo 60 > zfs_delay_min_dirty_percent
# echo 11 > zfs_arc_shrink_shift
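To confirm the runtime values took effect (sketch):

for p in zfs_arc_max zfs_dirty_data_max \
         zfs_vdev_async_write_active_min_dirty_percent \
         zfs_vdev_async_write_active_max_dirty_percent \
         zfs_delay_min_dirty_percent zfs_arc_shrink_shift; do
    printf '%-50s %s\n' "$p" "$(cat /sys/module/zfs/parameters/$p)"
done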
ZFS module load parameters:
# vi /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=140000000000
options zfs zfs_dirty_data_max=28000000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11
Observation period...
# ps -e --width=1024 -o pid,%mem,rss,size,sz,vsz,cmd --sort rss
rss    RSS     resident set size, the non-swapped physical memory that a task has used (in kilobytes). (alias rssize, rsz).
size   SZ      approximate amount of swap space that would be required if the process were to dirty all writable pages and then be swapped out. This number is very rough!
sz     SZ      size in physical pages of the core image of the process. This includes text, data, and stack space. Device mappings are currently excluded; this is subject to change. See vsz and rss.
vsz    VSZ     virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change. (alias vsize).
06:10:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
06:20:01 AM 219447748 177409060 44.70 23196 22965260 27351828 6.75
06:30:01 AM 219304016 177552792 44.74 24628 23080756 27348820 6.75
06:40:01 AM 218698000 178158808 44.89 26276 23638736 27365764 6.75
06:50:01 AM 218454732 178402076 44.95 27588 23852552 27365664 6.75
07:00:01 AM 218211060 178645748 45.02 28840 24066384 27365736 6.75
07:10:01 AM 218006588 178850220 45.07 30144 24231036 27366528 6.75
07:20:01 AM 217784072 179072736 45.12 31424 24412084 27365496 6.75
07:30:01 AM 217128620 179728188 45.29 32752 24970064 27370048 6.75
07:40:01 AM 216704964 180151844 45.39 34372 25331396 27369700 6.75
07:50:01 AM 216372456 180484352 45.48 35740 25610760 27371348 6.75
08:00:01 AM 216028392 180828416 45.57 37060 25890136 27393748 6.76
08:10:01 AM 214706196 182150612 45.90 38808 27120088 27400288 6.76
08:20:01 AM 213981920 182874888 46.08 42712 27798924 27413000 6.76
08:30:01 AM 213551104 183305704 46.19 44268 28193028 27411516 6.76
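During the observation period it also helps to log ARC size next to free memory, so any later ARC shrink can be matched against memory pressure; a minimal logger sketch (file locations are per ZFSonLinux, log path is arbitrary):

#!/bin/bash
# Log ARC size and free memory once a minute.
while true; do
    arc=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
    memfree=$(awk '$1 == "MemFree:" {print $2}' /proc/meminfo)
    echo "$(date '+%F %T') arc_bytes=$arc memfree_kb=$memfree"
    sleep 60
done >> /tmp/arc_watch.log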
Set the kernel's tendency to reclaim caches.
vfs_cache_pressure
------------------
This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.
# cat /proc/meminfo |grep -i -E "dirt|back"
Dirty: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB
==============================================================
dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

If dirty_background_bytes is written, dirty_background_ratio becomes a function
of its value (dirty_background_bytes / the amount of dirtyable system memory).

==============================================================
dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.

==============================================================
dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory).

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.

==============================================================
dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

==============================================================
dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.

==============================================================
dirty_writeback_centisecs

The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.
==============================================================
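To see why this case pins the byte-based variants rather than the ratios: on a host this size the default percentages translate into huge absolute thresholds (sketch arithmetic; dirty_background_ratio defaults to 10):

#!/bin/bash
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "10% of MemTotal = $((mem_kb * 1024 / 10)) bytes"
# On this ~396 GB host that is roughly 40 GB of dirty data before background
# writeback even starts, hence vm.dirty_background_bytes=102400000 below.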
For now, also add a script that frees the cache during idle hours.
/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt
drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.

To free pagecache:
    echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
    echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
    echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.

crontab -e
30 4 * * * /usr/local/bin/free.sh >>/tmp/free.log 2>&1

# cat /usr/local/bin/free.sh
#!/bin/bash
. /root/.bash_profile
. /etc/profile
echo "`date +%F%T` start drop cache."
free
sync
echo 3 > /proc/sys/vm/drop_caches
echo "`date +%F%T` end drop cache."
free
The final set of parameters:
sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.dirty_background_bytes=102400000
sysctl -w vm.dirty_bytes=102400000
sysctl -w vm.dirty_expire_centisecs=10
sysctl -w vm.dirty_writeback_centisecs=10
sysctl -w vm.swappiness=0
sysctl -w vm.vfs_cache_pressure=80

# vi /etc/sysctl.conf
vm.zone_reclaim_mode=1
vm.dirty_background_bytes=102400000
vm.dirty_bytes=102400000
vm.dirty_expire_centisecs=10
vm.dirty_writeback_centisecs=10
vm.swappiness=0
vm.vfs_cache_pressure=80

# cd /sys/module/zfs/parameters/
# cat zfs_arc_max
10240000000

Looking at the ARC statistics in /proc/spl/kstat/zfs/arcstats, metadata uses less than 2 GB, so 10 GB is roughly enough.
If it turns out to be too small, it can be raised later.

meta_size                       4    1952531968

# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10240000000
options zfs zfs_dirty_data_max=800000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11

Set primarycache to metadata, because Linux has its own page cache and there is no point caching the same data twice.
ZFS has the same double-caching problem as PostgreSQL (shared buffers on top of the OS cache), unless direct I/O is used.
# zfs set primarycache=metadata zp1
# zfs set primarycache=metadata zp1/data_a0
# zfs set primarycache=metadata zp1/data_a1
# zfs set primarycache=metadata zp1/data_b0
# zfs set primarycache=metadata zp1/data_b1
# zfs set primarycache=metadata zp1/data_c0
# zfs set primarycache=metadata zp1/data_c1
# zfs set primarycache=metadata zp1/data_ssd0
# zfs set primarycache=metadata zp1/data_ssd1

Set the record size to match the database block size.
# zfs set recordsize=16k zp1/data_a0    (for WAL: wal_block_size=16k)
# zfs set recordsize=8k zp1/data_a0     (for data: block_size=8k)
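Note that recordsize only affects blocks written after the change; existing files keep their old layout until they are rewritten. A quick verification (sketch):

# zfs get recordsize,primarycache zp1/data_a0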
[References]
2. http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance