Problem

Redisson version: 3.12.5
When using Redisson's lock API, if the connection to Redis is interrupted, the caller hangs forever.

Sample code:

// 1. start the Redis server
// 2. initialize the RedissonClient
RedissonClient redissonClient = ...
// 3. stop the Redis server
// the connection is now broken, and the lock() call blocks forever
redissonClient.getLock(key).lock();

Output:

2020-08-20 00:26:49 [main] INFO  org.redisson.Version - Redisson 3.12.5
2020-08-20 00:26:50 [redisson-netty-2-9] INFO  o.r.c.pool.MasterConnectionPool - 5 connections initialized for localhost/127.0.0.1:16379
2020-08-20 00:26:50 [redisson-netty-2-10] INFO  o.r.c.p.MasterPubSubConnectionPool - 1 connections initialized for localhost/127.0.0.1:16379
2020-08-20 00:26:51 [redisson-timer-4-1] WARN  io.netty.util.HashedWheelTimer - An exception was thrown by TimerTask.
java.lang.NullPointerException: cause
    at io.netty.util.internal.ObjectUtil.checkNotNull(ObjectUtil.java:33)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:606)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111)
    at org.redisson.misc.RedissonPromise.tryFailure(RedissonPromise.java:96)
    at org.redisson.command.RedisExecutor$2.run(RedisExecutor.java:228)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

An error stack trace is printed, but the program neither exits nor makes progress.

Analysis

Following the stack trace leads to:

// org.redisson.command.RedisExecutor
private void scheduleRetryTimeout(RFuture<RedisConnection> connectionFuture, RPromise<R> attemptPromise) {
    ...
    if (attempt == attempts) {
        attemptPromise.tryFailure(exception);
        return;
    }
    ...
}

Debugging shows that exception can be null here under certain conditions, which is why the NPE stack trace is printed.
A closer look at the same method reveals:

// org.redisson.command.RedisExecutor
private void scheduleRetryTimeout(RFuture<RedisConnection> connectionFuture, RPromise<R> attemptPromise) {
    ...
    if (connectionFuture.cancel(false)) {
        if (exception == null) {
            ...
        }
    } else {
        if (connectionFuture.isSuccess()) {
            if (writeFuture == null || !writeFuture.isDone()) {
                ...
            }
            if (writeFuture.isSuccess()) {
                return;
            } // an else branch is missing here, so no exception is created when the write fails
        }
    }
    ...
}

The code only handles the write-success case and does nothing when the write fails. Patch the source by adding:

...
if (writeFuture.isSuccess()) {
    return;
} else if (exception == null) {
    exception = new RedisException("===== Write failed.");
}
...

Run it again. Output:

2020-08-20 00:41:02 [main] INFO  org.redisson.Version - Redisson 3.12.5
2020-08-20 00:41:03 [redisson-netty-2-9] INFO  o.r.c.p.MasterPubSubConnectionPool - 1 connections initialized for localhost/127.0.0.1:16379
2020-08-20 00:41:03 [redisson-netty-2-13] INFO  o.r.c.pool.MasterConnectionPool - 5 connections initialized for localhost/127.0.0.1:16379

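This time there is no error, but the program still hangs, so the problem is not that simple.

The next step is to take a thread dump and see where it is stuck.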
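The usual way to take a thread dump is the JDK's jstack tool (jstack followed by the process id). When attaching a tool is inconvenient, a rough equivalent can be produced from inside the JVM. The following is only a minimal sketch using the standard Thread.getAllStackTraces() call; the class and method names (ThreadDumpSketch, dumpAllThreads) are made up for illustration and are not part of Redisson or Netty.

```java
import java.util.Map;

public class ThreadDumpSketch {

    /** Builds a jstack-like text dump of every live thread in this JVM. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Thread name and state, e.g. "main" state=WAITING
            sb.append('"').append(thread.getName()).append('"')
              .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Calling dumpAllThreads() from a watchdog thread while the main thread is blocked in lock() should show the same kind of WAITING frame that jstack reports, although without the monitor addresses.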

From the dump, the blocked location is:

"main" #1 prio=5 os_prio=0 tid=0x00007f4878011000 nid=0x7ae4 in Object.wait() [0x00007f487f94f000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000da0ca0d0> (a io.netty.util.concurrent.ImmediateEventExecutor$ImmediatePromise)
    at java.lang.Object.wait(Object.java:502)
    at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:247)
    - locked <0x00000000da0ca0d0> (a io.netty.util.concurrent.ImmediateEventExecutor$ImmediatePromise)
    at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:110)
    at org.redisson.misc.RedissonPromise.await(RedissonPromise.java:35)
    at org.redisson.command.CommandAsyncService.get(CommandAsyncService.java:139)
    at org.redisson.RedissonObject.get(RedissonObject.java:90)
    at org.redisson.RedissonLock.tryAcquire(RedissonLock.java:226)
    at org.redisson.RedissonLock.lock(RedissonLock.java:180)
    at org.redisson.RedissonLock.lock(RedissonLock.java:152)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

The main thread is blocked in Object.wait(), called from DefaultPromise.await().
Patch the source to add some debug output:

// io.netty.util.concurrent.DefaultPromise
@Override
public Promise<V> await() throws InterruptedException {
    ...
    synchronized (this) {
        while (!isDone()) {
            incWaiters();
            try {
                System.out.println("before wait: " + this + ", " + this.waiters);
                wait();
            } finally {
                decWaiters();
                System.out.println("after wait: " + this + ", " + this.waiters);
            }
        }
    }
    return this;
}

Run it again. Output:

2020-08-20 00:51:11 [main] INFO  org.redisson.Version - Redisson 3.12.5
2020-08-20 00:51:11 [redisson-netty-2-12] INFO  o.r.c.pool.MasterConnectionPool - 5 connections initialized for localhost/127.0.0.1:16379
2020-08-20 00:51:11 [redisson-netty-2-13] INFO  o.r.c.p.MasterPubSubConnectionPool - 1 connections initialized for localhost/127.0.0.1:16379
before wait: ImmediateEventExecutor$ImmediatePromise@420745d7(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@420745d7(success), 0
before wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(success: 1), 0
before wait: ImmediateEventExecutor$ImmediatePromise@5c09d180(incomplete), 1

The output shows that waiters is 1 when the thread enters wait().
A wait() must be paired with a notify() or notifyAll(), and since the object being waited on is this, DefaultPromise itself must contain the method that wakes the waiting thread.

Searching DefaultPromise turns up:

// io.netty.util.concurrent.DefaultPromise
private synchronized boolean checkNotifyWaiters() {
    if (waiters > 0) {
        notifyAll();
    }
    return listeners != null;
}

Sure enough, there is a notifyAll(), and it is guarded by waiters > 0.
Going back to the NPE stack trace, the call chain is:

// org.redisson.command.RedisExecutor
if (attempt == attempts) {
    attemptPromise.tryFailure(exception);
    return;
}
--->
// org.redisson.misc.RedissonPromise
@Override
public boolean tryFailure(Throwable cause) {
    if (promise.tryFailure(cause)) {
        completeExceptionally(cause);
        return true;
    }
    return false;
}
--->
// io.netty.util.concurrent.DefaultPromise
@Override
public boolean tryFailure(Throwable cause) {
    return setFailure0(cause);
}
->
private boolean setFailure0(Throwable cause) {
    return setValue0(new CauseHolder(checkNotNull(cause, "cause")));
}
->
private boolean setValue0(Object objResult) {
    if (RESULT_UPDATER.compareAndSet(this, null, objResult) ||
        RESULT_UPDATER.compareAndSet(this, UNCANCELLABLE, objResult)) {
        if (checkNotifyWaiters()) {
            notifyListeners();
        }
        return true;
    }
    return false;
}
->
private synchronized boolean checkNotifyWaiters() {
    if (waiters > 0) {
        notifyAll();
    }
    return listeners != null;
}

The debug output above shows that waiters is greater than 0, so after a write failure this chain should eventually reach notifyAll(). Yet the waiting thread is never woken.

You might suspect that the following condition evaluates to false, so notifyAll() is never reached:

// io.netty.util.concurrent.DefaultPromise
private boolean setValue0(Object objResult) {
    if (RESULT_UPDATER.compareAndSet(this, null, objResult) ||
        RESULT_UPDATER.compareAndSet(this, UNCANCELLABLE, objResult)) {
        ...
    }
}

In fact, even with this condition removed, the thread still does not wake up.
At this point it is time to suspect a different kind of problem. (Which one? Take a guess, hah~)

What follows is a long cycle of reading source code and debugging, which I will skip here.

The method to focus on is this one:

// org.redisson.RedissonLock
protected <T> RFuture<T> evalWriteAsync(String key, Codec codec, RedisCommand<T> evalCommandType, String script, List<Object> keys, Object... params) {
    CommandBatchService executorService = createCommandBatchService();
    RFuture<T> result = executorService.evalWriteAsync(key, codec, evalCommandType, script, keys, params);
    if (!(commandExecutor instanceof CommandBatchService)) {
        executorService.executeAsync();
    }
    return result;
}

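The await() method of the result returned here is eventually called, and that call is what blocks the program forever. It corresponds to the last line of the debug output above: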

before wait: ImmediateEventExecutor$ImmediatePromise@5c09d180(incomplete), 1

executorService.executeAsync(), on the other hand, ends up going through the call chain described above because of the write failure.
Patch the source to add more debug output:

// org.redisson.RedissonLock
protected <T> RFuture<T> evalWriteAsync(String key, Codec codec, RedisCommand<T> evalCommandType, String script, List<Object> keys, Object... params) {
    CommandBatchService executorService = createCommandBatchService();
    RFuture<T> result = executorService.evalWriteAsync(key, codec, evalCommandType, script, keys, params);
    System.out.println("======== main result: " + result);
    if (!(commandExecutor instanceof CommandBatchService)) {
        RFuture<BatchResult<?>> rs = executorService.executeAsync();
        System.out.println("======== async result: " + rs);
    }
    return result;
}

// io.netty.util.concurrent.DefaultPromise
private boolean setFailure0(Throwable cause) {
    // To make the problem easier to see, print here as well.
    // The cause message is the one set earlier when fixing the NPE.
    if (cause != null && "===== Write failed.".equals(cause.getMessage())) {
        System.out.println("========== setFailure0: " + this + ", waiters: " + waiters);
    }
    return setValue0(new CauseHolder(checkNotNull(cause, "cause")));
}

Restart. Output:

2020-08-20 01:48:39 [main] INFO  org.redisson.Version - Redisson 3.12.5
2020-08-20 01:48:40 [redisson-netty-2-14] INFO  o.r.c.p.MasterPubSubConnectionPool - 1 connections initialized for localhost/127.0.0.1:16379
2020-08-20 01:48:40 [redisson-netty-2-11] INFO  o.r.c.pool.MasterConnectionPool - 5 connections initialized for localhost/127.0.0.1:16379
before wait: ImmediateEventExecutor$ImmediatePromise@420745d7(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@420745d7(success), 0
before wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(success: 1), 0
======== main result: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@2c0f7678(incomplete)]
======== async result: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@88a8218(incomplete)]
before wait: ImmediateEventExecutor$ImmediatePromise@2c0f7678(incomplete), 1
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@385ecb2e(incomplete), waiters: 0
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@7d5c5241(incomplete), waiters: 0
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@88a8218(incomplete), waiters: 0

Now it gets interesting.

  1. main result (@2c0f7678) and async result (@88a8218) are two different RedissonPromise objects.
  2. main result (@2c0f7678) entered wait() with waiters == 1.
  3. async result (@88a8218) also reached setFailure0() because of the write failure, but its waiters == 0, so it never calls notifyAll().

This is the root cause: wait() and notifyAll() are invoked on two different objects.
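The mismatch can be reproduced without Redisson or Netty. The following is a minimal sketch, not the actual DefaultPromise code: TinyPromise is a hypothetical stand-in that copies only the waiters counter and the guarded notifyAll(), and it uses a timed wait so the demo terminates instead of hanging.

```java
public class MismatchedPromiseSketch {

    /** Hypothetical stand-in for a promise that waits and notifies on its own monitor. */
    static final class TinyPromise {
        private Throwable failure;
        private int waiters;

        /** Waits until fail() is called on THIS object, or until the timeout elapses. */
        synchronized boolean awaitDone(long timeoutMillis) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (failure == null) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) {
                    return false; // nobody completed this object in time
                }
                waiters++;
                try {
                    wait(left);
                } finally {
                    waiters--;
                }
            }
            return true;
        }

        /** Completes this object and wakes only the threads waiting on this object. */
        synchronized void fail(Throwable cause) {
            failure = cause;
            if (waiters > 0) {
                notifyAll();
            }
        }
    }

    /** Returns true if the thread waiting on mainResult was released by the failure. */
    static boolean waiterWokeUp(boolean failSameObject) throws InterruptedException {
        TinyPromise mainResult = new TinyPromise();
        TinyPromise asyncResult = new TinyPromise();
        TinyPromise toFail = failSameObject ? mainResult : asyncResult;
        Thread failer = new Thread(() -> toFail.fail(new RuntimeException("Write failed")));
        failer.start();
        boolean released = mainResult.awaitDone(300);
        failer.join();
        return released;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same object: " + waiterWokeUp(true));        // prints "same object: true"
        System.out.println("different objects: " + waiterWokeUp(false)); // prints "different objects: false"
    }
}
```

In the second case the failure lands on asyncResult, whose waiters is 0, so the thread waiting on mainResult is only released by the timeout. With an untimed wait(), as in the real await(), it would never return.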

Solution

Patch the source:

// org.redisson.RedissonLock
protected <T> RFuture<T> evalWriteAsync(String key, Codec codec, RedisCommand<T> evalCommandType, String script, List<Object> keys, Object... params) {
    CommandBatchService executorService = createCommandBatchService();
    RFuture<T> result = executorService.evalWriteAsync(key, codec, evalCommandType, script, keys, params);
    System.out.println("======== main result: " + result);
    if (!(commandExecutor instanceof CommandBatchService)) {
        RFuture<BatchResult<?>> rs = executorService.executeAsync();
        System.out.println("======== async result: " + rs);
        // The forced casts below may be problematic; refine them as needed.
        rs.onComplete((v, e) -> {
            if (e == null) {
                ((RPromise) result).trySuccess(v);
            } else {
                ((RPromise) result).tryFailure(e);
            }
        });
    }
    return result;
}

Restart. Output:

2020-08-20 02:06:46 [main] INFO  org.redisson.Version - Redisson 3.12.5
2020-08-20 02:06:47 [redisson-netty-2-13] INFO  o.r.c.pool.MasterConnectionPool - 5 connections initialized for localhost/127.0.0.1:16379
2020-08-20 02:06:47 [redisson-netty-2-14] INFO  o.r.c.p.MasterPubSubConnectionPool - 1 connections initialized for localhost/127.0.0.1:16379
before wait: ImmediateEventExecutor$ImmediatePromise@420745d7(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@420745d7(success), 0
before wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(incomplete), 1
after wait: ImmediateEventExecutor$ImmediatePromise@7e11ab3d(success: 1), 0
======== main result: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@2c0f7678(incomplete)]
======== async result: RedissonPromise [promise=ImmediateEventExecutor$ImmediatePromise@88a8218(incomplete)]
before wait: ImmediateEventExecutor$ImmediatePromise@2c0f7678(incomplete), 1
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@68bf51eb(incomplete), waiters: 0
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@2dbdc298(incomplete), waiters: 0
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@88a8218(incomplete), waiters: 0
========== setFailure0: ImmediateEventExecutor$ImmediatePromise@2c0f7678(incomplete), waiters: 1
after wait: ImmediateEventExecutor$ImmediatePromise@2c0f7678(failure: org.redisson.client.RedisException: ===== Write failed.), 0
org.redisson.client.RedisException: ===== Write failed.
    at org.redisson.command.RedisExecutor$2.run(RedisExecutor.java:215)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

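As the output shows, after setFailure0() runs on async result (@88a8218), it also runs on main result (@2c0f7678). The main thread then returns from wait(), prints "after wait … ", waiters drops back to 0, and the write-failed exception is finally thrown to the caller.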
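The idea behind the patch, forwarding the outcome of one future to the other so that whoever waits on the second one is released, can be sketched with the JDK's CompletableFuture. This is only an illustration under that assumption: it does not use Redisson's RFuture/RPromise types, and the names ForwardCompletionSketch and forward are made up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class ForwardCompletionSketch {

    /** Completes target with whatever outcome source ends up with, success or failure. */
    static <T> void forward(CompletableFuture<? extends T> source, CompletableFuture<T> target) {
        source.whenComplete((value, error) -> {
            if (error == null) {
                target.complete(value);
            } else {
                target.completeExceptionally(error);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> asyncResult = new CompletableFuture<>();
        CompletableFuture<String> mainResult = new CompletableFuture<>();
        forward(asyncResult, mainResult);

        // Simulate the write failure reaching only the async result.
        asyncResult.completeExceptionally(new IllegalStateException("Write failed"));

        try {
            mainResult.join(); // returns immediately because the failure was forwarded
        } catch (CompletionException e) {
            System.out.println("main result failed: " + e.getCause().getMessage());
            // prints "main result failed: Write failed"
        }
    }
}
```

Unlike the patch above, the sketch keeps both futures on the same value type, which sidesteps the unchecked casts that the comment in the patch warns about.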
