The Android Watchdog Mechanism

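Early mobile platforms usually added a hardware watchdog (WatchDog) to the device. The software had to write a value to the watchdog hardware at regular intervals to show that it had not failed (informally, "feeding the dog"); if it missed the deadline, the watchdog rebooted the device. Roughly, once the system is running the watchdog counter is started and counts on its own. If the counter is not cleared in time, it overflows, which raises a watchdog interrupt and resets the system.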
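The counting-and-feeding behaviour described above can be modelled in a few lines of plain Java. This is only a toy sketch: the class ToyHardwareWatchdog and its methods are invented for illustration and do not correspond to any real driver or Android API.

```java
/**
 * Toy model of a hardware watchdog counter. All names here are hypothetical
 * and exist only to illustrate the feed / overflow / reset cycle.
 */
class ToyHardwareWatchdog {
    private final int limit;   // ticks allowed between two feeds
    private int counter;       // ticks elapsed since the last feed
    private int resets;        // how many times the "device" was reset

    ToyHardwareWatchdog(int limit) {
        this.limit = limit;
    }

    /** Software calls this periodically to prove it is still alive. */
    void feed() {
        counter = 0;
    }

    /** One tick of the hardware timer; reaching the limit triggers a reset. */
    void tick() {
        counter++;
        if (counter >= limit) {
            resets++;      // stands in for the watchdog interrupt and reboot
            counter = 0;   // counting starts again after the reset
        }
    }

    int resets() {
        return resets;
    }

    public static void main(String[] args) {
        ToyHardwareWatchdog dog = new ToyHardwareWatchdog(3);
        dog.tick(); dog.tick(); dog.feed();   // fed in time: no reset
        dog.tick(); dog.tick(); dog.tick();   // never fed: the counter overflows
        System.out.println("resets = " + dog.resets());
    }
}
```

The software watchdog discussed in the rest of this article inverts the roles: instead of every thread feeding a counter, one watchdog thread actively probes the others.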

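A phone is essentially a very powerful microcontroller: it runs far faster, has far more storage, and hosts many threads with hardware and software working together. Android's SystemServer is a very complex process that runs more than fifty services, which makes it the process most likely to get into trouble. The threads running inside SystemServer therefore need to be monitored.

Working the way a hardware watchdog does, with every thread feeding the dog at intervals, would waste CPU and complicate the program design. Android instead provides the Watchdog class as a software watchdog that monitors the threads in SystemServer. When it detects a problem, Watchdog kills the SystemServer process.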

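What Watchdog does

Watchdog detects two kinds of problems:

  1. Blocked in monitor: the monitor() implementation of a monitored object blocks
  2. Blocked in handler: the message queue of a monitored thread stops processing messages

How it decides that a thread is stuck

MessageQueue.isPolling
Monitor.monitor
---
HandlerChecker checks whether a looper is blocked
monitor checks for deadlock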

How Watchdog works

Figure: how Watchdog works https://img-blog.csdnimg.cn/img_convert/e5c8133c7f86583251c775de4ceae9c0.jpeg

Starting Watchdog

Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the Android services, and that includes initializing and starting Watchdog:

final Watchdog watchdog = Watchdog.getInstance(); // line: 864
watchdog.init(context, mActivityManagerService);

In the second half of startOtherServices() in SystemServer, Watchdog is started from the callback passed to the systemReady interface of AMS (ActivityManagerService):

Watchdog.getInstance().start(); // line: 1852

The Watchdog constructor

super("watchdog");
// Initialize a handler checker for each common thread we want to check.
// The background thread is not checked here.
// The shared foreground thread is the main checker; it is also where the
// monitor checks are dispatched.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
        "foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add a checker for the main thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
        "main thread", DEFAULT_TIMEOUT));
// Add a checker for the shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
        "ui thread", DEFAULT_TIMEOUT));
// Add a checker for the shared I/O thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
        "i/o thread", DEFAULT_TIMEOUT));
// Add a checker for the shared display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
        "display thread", DEFAULT_TIMEOUT));

// Initialize the monitor for Binder threads.
addMonitor(new BinderThreadMonitor());

mOpenFdMonitor = OpenFdMonitor.create();

// See the notes on DEFAULT_TIMEOUT.
assert DB ||
        DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;

The constructor creates several HandlerChecker objects and adds them to Watchdog's own list of checkers.

Handlers monitored by Watchdog

Thread name | Handler | Description | Timeout
foreground thread | FgThread.getHandler() | Foreground thread | 60s
main thread | new Handler(Looper.getMainLooper()) | Main thread | 60s
ui thread | UiThread.getHandler() | UI thread | 60s
i/o thread | IoThread.getHandler() | I/O thread | 60s
display thread | DisplayThread.getHandler() | Display thread | 60s
PackageManager | addThread(mHandler, time) | Thread added by PackageManagerService itself | 10min
PackageManager | addThread(mHandler, time) | Thread added by PermissionManagerService itself | 60s
PowerManagerService | addThread(mHandler, time) | Thread added by PowerManagerService itself | 60s
ActivityManagerService | addThread(mHandler, time) | Thread added by ActivityManagerService itself | 60s

Monitors registered with Watchdog

Built-in monitors:

Monitor | Description | Timeout
BinderThreadMonitor | Checks the Binder threads | 60s
OpenFdMonitor | Checks open file descriptors | 60s

Monitors registered by services:

Monitor | Registration | Lock or call checked in monitor()
TvRemoteService | addMonitor(this) | mLock
ActivityManagerService | addMonitor(this) | this
MediaProjectionManagerService | addMonitor(this) | mLock
MediaRouterService | addMonitor(this) | mLock
MediaSessionService | addMonitor(this) | mLock
InputManagerService | addMonitor(this) | mInputFilterLock, then nativeMonitor(mPtr);
PowerManagerService | addMonitor(this) | mLock
NetworkManagementService | addMonitor(this) | mConnector
StorageManagerService | addMonitor(this) | mVold
WindowManagerService | addMonitor(this) | mWindowMap

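HandlerChecker

public final class HandlerChecker implements Runnable

HandlerChecker checks the state of a handler thread and schedules the monitor callbacks. It works by looking at the MessageQueue of each Handler's looper to decide whether that thread is stuck. The threads in question all run inside the SystemServer process.

Watchdog builds many HandlerCheckers, which fall into two groups:

  • Monitor Checker: checks for deadlocks on Monitor objects. Core system services such as AMS, PKMS, and WMS are all Monitor objects.
  • Looper Checker: checks whether a thread's message queue has been busy for too long. The queues of the foreground thread that Watchdog uses for its monitor checker and of the shared ui, I/O, and display threads are all checked. The queues of some other important threads, such as those of AMS and PKMS, are added to the Looper Checkers when the corresponding objects are initialized.

The two groups have different emphases:

  • Monitor Checker warns us not to hold the object lock of a core system service for a long time, because that blocks many other calls.
  • Looper Checker warns us not to occupy a message queue for a long time, because other messages then go unprocessed.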

The HandlerChecker constructor

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;        // handler of the monitored thread
        mName = name;              // checker name
        mWaitMax = waitMaxMillis;  // timeout before the thread counts as overdue
        mCompleted = true;         // state: no check in flight
    }
}

HandlerChecker::scheduleCheckLocked

Watchdog's run method calls this method. It is the core of HandlerChecker and starts one check of whether the monitored thread is blocked or deadlocked.

public void scheduleCheckLocked() {
    if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread.  Note that we
        // only do this if mCheckReboot is false and we have no
        // monitors, since those would need to be executed at this point.
        mCompleted = true;
        return;
    }

    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

  1. isPolling() is the key test of whether the thread's Looper is idle. If it returns true and the checker has no monitors, the looper is polling for events and the thread is healthy, so mCompleted is set to true and the method returns without posting anything.
  2. If mCompleted is false, a check is already in flight and the method returns.
  3. Otherwise mHandler.postAtFrontOfQueue(this) posts the checker to the front of the queue, and its run method is executed on the monitored thread afterwards.

scheduleCheckLocked mostly matters for mMonitorChecker. A HandlerChecker that has no registered monitors and whose looper is polling is not checked at all. UiThread, for example, is normally in the polling state most of the time.

MessageQueue::isPolling

mHandler.getLooper().getQueue().isPolling() tells whether the thread is stuck.
true: the looper is currently polling for events, so it is not stuck handling a callback.

The method is implemented in MessageQueue. Its comment says that it returns whether the looper thread is currently polling for more work to do, and that this is a good signal that the loop is still alive.

frameworks/base/core/java/android/os/MessageQueue.java

/**
 * Returns whether this looper's thread is currently polling for more work to do.
 * This is a good signal that the loop is still alive rather than being stuck
 * handling a callback.  Note that this method is intrinsically racy, since the
 * state of the loop can change before you get the result back.
 *
 * <p>This method is safe to call from any thread.
 *
 * @return True if the looper is currently polling for events.
 * @hide
 */
public boolean isPolling() {
    synchronized (this) {
        return isPollingLocked();
    }
}

HandlerChecker::run

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

  1. The method walks its own list of monitors and calls monitor() on each. If any monitor blocks, mCompleted stays false.
  2. The for loop detects blocking among the registered monitors, and only mMonitorChecker ever enters it.
  3. Every other HandlerChecker has an empty mMonitors list and skips the loop.

HandlerChecker::getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

  1. Returns the completion state. mStartTime is set in scheduleCheckLocked.
  2. When Watchdog calls this while the check has not completed, the else branch computes the elapsed time and returns the matching state code.

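Thread states

The times below assume the default 60-second timeout.

State | Description
COMPLETED | The check message has been handled; the thread is not blocked
WAITING | The check has been pending for 0 to 29 seconds; keep going
WAITED_HALF | The check has been pending for 30 to 59 seconds; the thread may be blocked, so the current stack traces are saved through AMS and monitoring continues
OVERDUE | The check has been pending for 60 seconds or more; prepare to kill the current process. Reaching this state means the 60-second timeout has expired, and everything that follows deals with the timeout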
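The mapping from elapsed time to state can be checked with a small standalone function. The sketch below mirrors the branch structure of getCompletionStateLocked in plain Java; the class name is made up, and the numeric values other than COMPLETED = 0 are assumptions (only their ordering matters here).

```java
/** Standalone sketch of the state classification; not the framework class. */
class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;      // assumed value
    static final int WAITED_HALF = 2;  // assumed value
    static final int OVERDUE = 3;      // assumed value

    /**
     * @param completed     whether the check has already finished
     * @param latencyMillis time since the check was scheduled
     * @param waitMaxMillis timeout configured for this checker
     */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long timeout = 60_000;  // 60 s, the default used in the table above
        long[] samples = {0, 29_999, 30_000, 59_999, 60_000};
        for (long latency : samples) {
            System.out.println(latency + " ms -> " + classify(false, latency, timeout));
        }
    }
}
```

With a 10-minute timeout, such as the one PackageManagerService passes to addThread, the same function yields WAITED_HALF only after five minutes.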

HandlerThread inheritance

Apart from the main-thread handler, the Handlers passed to these HandlerCheckers all belong to threads created as HandlerThreads.

java.lang.Object
  ↳ Thread implements Runnable
    ↳ HandlerThread extends Thread
      ↳ ServiceThread extends HandlerThread
        ↳ FgThread extends ServiceThread

Threads behind the initial HandlerCheckers

public ServiceThread(String name, int priority, boolean allowIo)

private FgThread() {
    super("android.fg", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private UiThread() {
    super("android.ui", Process.THREAD_PRIORITY_FOREGROUND, false /*allowIo*/);
}

private IoThread() {
    super("android.io", android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
}

private DisplayThread() {
    // DisplayThread runs important work, but it is less important than what runs
    // in AnimationThread, so the priority is set one step lower.
    super("android.display", Process.THREAD_PRIORITY_DISPLAY + 1, false /*allowIo*/);
}

Android thread priorities

frameworks/base/core/java/android/os/Process.java

public static final int THREAD_PRIORITY_DEFAULT = 0; // default thread priority
public static final int THREAD_PRIORITY_LOWEST = 19; // lowest thread priority
public static final int THREAD_PRIORITY_BACKGROUND = 10; // recommended for background threads
public static final int THREAD_PRIORITY_FOREGROUND = -2; // UI thread the user is interacting with; code cannot set this priority, the system adjusts threads to it as needed
public static final int THREAD_PRIORITY_DISPLAY = -4; // also related to UI interaction, but ranks above THREAD_PRIORITY_FOREGROUND
public static final int THREAD_PRIORITY_URGENT_DISPLAY = -8; // highest level for display threads, used for drawing and retrieving input events
public static final int THREAD_PRIORITY_AUDIO = -16; // standard level for audio threads
public static final int THREAD_PRIORITY_URGENT_AUDIO = -19; // highest level for audio threads, above THREAD_PRIORITY_AUDIO
public static final int THREAD_PRIORITY_MORE_FAVORABLE = -1; // slightly more favorable than THREAD_PRIORITY_DEFAULT
public static final int THREAD_PRIORITY_LESS_FAVORABLE = 1; // slightly less favorable than THREAD_PRIORITY_DEFAULT

An application sets a thread priority as shown below. Some levels cannot be set by applications and are assigned by the system.

Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND + Process.THREAD_PRIORITY_LESS_FAVORABLE);

describeBlockedStateLocked

public String describeBlockedStateLocked() {
    if (mCurrentMonitor == null) {
        return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    } else {
        return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                + " on " + mName + " (" + getThread().getName() + ")";
    }
}

This builds the text that describes the blocked handler or monitor.

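Monitor

Monitor is an interface. A class implements it so that Watchdog can call monitor() and check whether that class is deadlocked.

public interface Monitor {
    void monitor();
}

Classes that implement the Watchdog.Monitor interface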

ActivityManagerService
WindowManagerService
PowerManagerService
InputManagerService
MediaSessionService
MediaRouterService
StorageManagerService
NetworkManagementService
NativeDaemonConnector
MediaProjectionManagerService
TvRemoteService

BinderThreadMonitor
OpenFdMonitor

Monitor is an interface implemented by quite a few classes. The list above is what a search of the Android 9.0 source turns up.

Using Watchdog

All of these classes implement the interface and register themselves with Watchdog. AMS, for example:

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
    ......
    public ActivityManagerService(Context systemContext) {
        ......
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
        ......
    }
    ......
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
    ......
}
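Why an empty synchronized block is enough to detect a stuck lock can be seen outside Android as well. The following plain-Java sketch is not framework code: DemoService and monitorReturnsWithin are hypothetical names, and the helper thread stands in for the foreground thread on which Watchdog runs its monitors.

```java
/** Sketch: monitor() can only return once the service lock is available. */
class MonitorProbeSketch {
    /** Same shape as the Watchdog.Monitor interface shown earlier. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service guarded by a single lock. */
    static class DemoService implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) { }  // blocks for as long as someone else holds the lock
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within the wait. */
    static boolean monitorReturnsWithin(Monitor m, long millis) {
        Thread probe = new Thread(m::monitor, "probe");
        probe.setDaemon(true);  // a stuck probe must not keep the JVM alive
        probe.start();
        try {
            probe.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !probe.isAlive();
    }

    public static void main(String[] args) {
        DemoService service = new DemoService();
        System.out.println("lock free: " + monitorReturnsWithin(service, 1000));
        synchronized (service.lock) {
            // While this thread holds the lock, the probe cannot finish.
            System.out.println("lock held: " + monitorReturnsWithin(service, 200));
        }
    }
}
```

If the lock holder never releases the lock, the probe never returns, which is exactly the situation Watchdog reports as "Blocked in monitor".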

Watchdog::addThread

public void addThread(Handler thread) {
    addThread(thread, DEFAULT_TIMEOUT); // 60s
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Threads can't be added once the Watchdog is running");
        }
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

  1. addThread hands the thread's Handler to Watchdog, which creates a new HandlerChecker for it.
  2. The new HandlerChecker is added to the list of checkers.

Watchdog::addMonitor

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the Watchdog is running");
        }
        mMonitorChecker.addMonitor(monitor);
    }
}

  1. The caller passes in a monitor, and Watchdog later calls its monitor() method to decide whether it is blocked.
  2. Every Monitor is added to mMonitorChecker, so mMonitorChecker is the only checker that holds monitors.

Watchdog::run()

This is the core method of Watchdog. It checks for thread deadlocks and blocked loopers, collects information, and kills the system_server process so that the system restarts.

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                // Call scheduleCheckLocked() on every HandlerChecker
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }

            if (!fdLimitTriggered) {
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // All threads are healthy; start the next round
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // Blocked, but for less than 30s; keep monitoring
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    // Blocked for more than 30s; dump some system information,
                    // then keep monitoring for another 30s
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
  1. run() is an infinite loop: it keeps iterating over all HandlerCheckers, calls their check method, waits thirty seconds, and evaluates the state.

  2. Iterate over all HandlerCheckers and call scheduleCheckLocked on each, which records the start time.

    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        hc.scheduleCheckLocked();
    }

  3. Wait 30 seconds.

    // Wait 30 seconds.
    // uptimeMillis is used so that time spent asleep is not counted; while the
    // phone sleeps, the system services sleep as well.
    long start = SystemClock.uptimeMillis();
    while (timeout > 0) {
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        try {
            wait(timeout);
        } catch (InterruptedException e) {
            Log.wtf(TAG, e);
        }
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    }

  4. Evaluate the checker state. This iterates over all HandlerCheckers and takes the largest return value.
    The largest value is one of four cases (times assume the default 60-second timeout):

    • COMPLETED: the check message has been handled; the thread is not blocked
    • WAITING: the check has been pending for 0 to 29 seconds; keep going
    • WAITED_HALF: the check has been pending for 30 to 59 seconds; the thread may be blocked, so the current stack traces are saved through AMS and monitoring continues
    • OVERDUE: the check has been pending for 60 seconds or more; prepare to kill the current process. Everything that follows deals with the timeout

    boolean fdLimitTriggered = false;
    if (mOpenFdMonitor != null) {
        fdLimitTriggered = mOpenFdMonitor.monitor();
    }
    if (!fdLimitTriggered) {
        final int waitState = evaluateCheckerCompletionLocked();
        if (waitState == COMPLETED) {
            // The monitors have returned; reset
            waitedHalf = false;
            continue;
        } else if (waitState == WAITING) {
            // still waiting but within their configured intervals; back off and recheck
            continue;
        } else if (waitState == WAITED_HALF) {
            if (!waitedHalf) {
                // We've waited half the deadlock-detection interval.  Pull a stack
                // trace and wait another half.
                ArrayList<Integer> pids = new ArrayList<Integer>();
                pids.add(Process.myPid());
                ActivityManagerService.dumpStackTraces(true, pids, null, null,
                        getInterestingNativePids());
                waitedHalf = true;
            }
            continue;
        }
        // something is overdue!
        blockedCheckers = getBlockedCheckersLocked();
        subject = describeCheckersLocked(blockedCheckers);
    } else {
        blockedCheckers = Collections.emptyList();
        subject = "Open FD high water mark reached";
    }

  5. fdMonitor

    public boolean monitor() {
        if (mFdHighWaterMark.exists()) {
            dumpOpenDescriptors();
            return true;
        }
        return false;
    }

  6. Collect information

  7. Kill the system process

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
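The schedule, wait, evaluate cycle of the loop can be reproduced in plain Java without any Android classes. In the sketch below a single-thread ExecutorService stands in for a Handler and its Looper; MiniWatchdogSketch, Checker, and checkOnce are hypothetical names, the state values other than COMPLETED = 0 are assumed, and the isPolling shortcut, the monitors, and the kill step are left out on purpose.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Simplified model of one Watchdog round; not the framework implementation. */
class MiniWatchdogSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    /** Stand-in for HandlerChecker: probes one single-threaded executor. */
    static class Checker implements Runnable {
        final String name;
        final ExecutorService worker;  // plays the role of the Handler/Looper
        final long waitMaxMillis;
        private boolean completed = true;
        private long startNanos;

        Checker(String name, ExecutorService worker, long waitMaxMillis) {
            this.name = name;
            this.worker = worker;
            this.waitMaxMillis = waitMaxMillis;
        }

        synchronized void scheduleCheck() {
            if (!completed) {
                return;                 // a probe is already in flight
            }
            completed = false;
            startNanos = System.nanoTime();
            worker.execute(this);       // runs once the worker gets to it
        }

        @Override
        public synchronized void run() {
            completed = true;           // the worker is alive and processed the probe
        }

        synchronized int state() {
            if (completed) {
                return COMPLETED;
            }
            long latency = (System.nanoTime() - startNanos) / 1_000_000L;
            if (latency < waitMaxMillis / 2) {
                return WAITING;
            }
            if (latency < waitMaxMillis) {
                return WAITED_HALF;
            }
            return OVERDUE;
        }
    }

    /** One round: schedule every probe, wait, then return the worst state. */
    static int checkOnce(List<Checker> checkers, long intervalMillis) {
        for (Checker c : checkers) {
            c.scheduleCheck();
        }
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int worst = COMPLETED;
        for (Checker c : checkers) {
            worst = Math.max(worst, c.state());
        }
        return worst;
    }

    public static void main(String[] args) {
        ExecutorService healthy = Executors.newSingleThreadExecutor();
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Block the second worker so that its probe can never run.
        stuck.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
            }
        });
        List<Checker> checkers = Arrays.asList(
                new Checker("healthy", healthy, 400),
                new Checker("stuck", stuck, 400));
        for (int round = 1; round <= 2; round++) {
            System.out.println("round " + round + ": state " + checkOnce(checkers, 250));
        }
        release.countDown();
        healthy.shutdown();
        stuck.shutdown();
    }
}
```

The real loop differs in one important way: it measures elapsed time with SystemClock.uptimeMillis() so that time spent asleep is not counted, whereas this sketch uses System.nanoTime().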

Watchdog::evaluateCheckerCompletionLocked

Evaluates the checker state. It iterates over all HandlerCheckers and takes the largest return value, which is the worst state among them.

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED; // COMPLETED = 0
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
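Taking the maximum works because the state constants are ordered from healthiest to worst. A tiny standalone sketch makes that explicit; the class name and the values 1 to 3 are assumptions, only COMPLETED = 0 comes from the code above.

```java
/** Sketch: the overall state is the worst state of any single checker. */
class WorstStateSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;
    static final String[] NAMES = {"COMPLETED", "WAITING", "WAITED_HALF", "OVERDUE"};

    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);  // a larger value means a less healthy checker
        }
        return state;
    }

    public static void main(String[] args) {
        // One checker at WAITED_HALF is enough to put the whole round there.
        System.out.println(NAMES[worst(COMPLETED, WAITING, WAITED_HALF)]);
    }
}
```

A single slow checker therefore decides the outcome of the round, no matter how many other checkers are healthy.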

Watchdog::getBlockedCheckersLocked

Watchdog::describeCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

private String describeCheckersLocked(List<HandlerChecker> checkers) {
    StringBuilder builder = new StringBuilder(128);
    for (int i=0; i<checkers.size(); i++) {
        if (builder.length() > 0) {
            builder.append(", ");
        }
        builder.append(checkers.get(i).describeBlockedStateLocked());
    }
    return builder.toString();
}

  1. Together these collect the overdue checkers and build the text that describes the blocked or deadlocked threads.
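How the per-checker descriptions are joined into one subject string can be tried out in isolation. The sketch below copies the string format from describeBlockedStateLocked; the Blocked class and describeAll are hypothetical stand-ins, not framework classes.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of how the watchdog subject line is assembled. */
class BlockedSubjectSketch {
    /** Minimal stand-in for an overdue HandlerChecker. */
    static class Blocked {
        final String checkerName;
        final String threadName;
        final String monitorClass;  // null means the handler itself is blocked

        Blocked(String checkerName, String threadName, String monitorClass) {
            this.checkerName = checkerName;
            this.threadName = threadName;
            this.monitorClass = monitorClass;
        }

        String describe() {
            if (monitorClass == null) {
                return "Blocked in handler on " + checkerName + " (" + threadName + ")";
            }
            return "Blocked in monitor " + monitorClass
                    + " on " + checkerName + " (" + threadName + ")";
        }
    }

    /** Joins the descriptions with ", " as describeCheckersLocked does. */
    static String describeAll(List<Blocked> blocked) {
        StringBuilder builder = new StringBuilder(128);
        for (Blocked b : blocked) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(b.describe());
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<Blocked> blocked = Arrays.asList(
                new Blocked("main thread", "main", null),
                new Blocked("ui thread", "android.ui", null));
        System.out.println(describeAll(blocked));
    }
}
```

This subject string is what follows "WATCHDOG KILLING SYSTEM PROCESS:" in the log.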

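Note

The monitor() check targets deadlocks between different threads, while the handler check targets whether a given thread, such as the service's main thread, is blocked. Posting a message can therefore only show that the main thread is blocked; it cannot reveal a stuck lock. If another thread holds a service lock for a long time (for example inside a synchronized block that it never leaves) and the main thread is not itself waiting for that lock, a Message sent to the main thread is still handled normally by its Handler.

How to trigger it

  1. Blocked in Monitor
    Hold the lock used in a Monitor implementation and never release it.
  2. Blocked in handler
    According to the original note, crashing in a Service's onCreate can, over a long period, cause systemServer to restart.

Logs produced

Two kinds of log are common: Blocked in handler and Blocked in monitor.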

Blocked in handler

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......

Blocked in monitor

10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
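When triaging many such logs it helps to split the subject line into its parts. The sketch below is a hypothetical helper, not an Android tool: it assumes the subject has exactly the format produced by describeBlockedStateLocked and that monitor class names contain no whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "Blocked in ..." entries of a watchdog log line. */
class WatchdogSubjectParser {
    /** One "Blocked in ..." entry. */
    static class Entry {
        final String kind;          // "handler" or "monitor"
        final String monitorClass;  // null for handler entries
        final String checker;       // e.g. "main thread"
        final String thread;        // e.g. "main"

        Entry(String kind, String monitorClass, String checker, String thread) {
            this.kind = kind;
            this.monitorClass = monitorClass;
            this.checker = checker;
            this.thread = thread;
        }
    }

    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor (\\S+)) on (.+?) \\(([^)]+)\\)");

    static List<Entry> parse(String line) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(line);
        while (m.find()) {
            String monitorClass = m.group(2);  // null when the entry is a handler entry
            String kind = (monitorClass == null) ? "handler" : "monitor";
            entries.add(new Entry(kind, monitorClass, m.group(3), m.group(4)));
        }
        return entries;
    }

    public static void main(String[] args) {
        String line = "*** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor "
                + "com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)";
        for (Entry e : parse(line)) {
            System.out.println(e.kind + " | " + e.monitorClass + " | " + e.checker + " | " + e.thread);
        }
    }
}
```

For a monitor entry, the class name usually points to the lock to look for in the stack traces; for a handler entry, the stack trace of the named thread is the place to start.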

Watchdog log analysis

Once Watchdog decides that a SystemServer thread is deadlocked, it collects and prints information. The code is in the run function:

while (true) {
    // Execution reaches this point when a deadlock or a blocked message queue was detected.
    // If we got here, that means that the system is most likely hung.
    // First collect stack traces from all threads of the system process.
    // Then kill this process so that the system will restart.
    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

    ArrayList<Integer> pids = new ArrayList<>();
    pids.add(Process.myPid());
    if (mPhonePid > 0) pids.add(mPhonePid);
    // Pass !waitedHalf so that just in case we somehow wind up here without having
    // dumped the halfway stacks, we properly re-initialize the trace file.
    final File stack = ActivityManagerService.dumpStackTraces(
            !waitedHalf, pids, null, null, getInterestingNativePids());

    // Give some extra time to make sure the stack traces get written.
    // The system's been hanging for a minute, another second or two won't hurt much.
    SystemClock.sleep(2000);

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

    IActivityController controller;
    synchronized (this) {
        controller = mController;
    }
    if (controller != null) {
        Slog.i(TAG, "Reporting stuck state to activity controller");
        try {
            Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
            // 1 = keep waiting, -1 = kill system
            int res = controller.systemNotResponding(subject);
            if (res >= 0) {
                Slog.i(TAG, "Activity controller requested to coninue to wait");
                waitedHalf = false;
                continue;
            }
        } catch (RemoteException e) {
        }
    }

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }

    waitedHalf = false;
}
  1. Write the event log

    EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

  2. Dump stack traces

    ArrayList<Integer> pids = new ArrayList<>();
    pids.add(Process.myPid());
    if (mPhonePid > 0) pids.add(mPhonePid);
    // Pass !waitedHalf so that just in case we somehow wind up here without having
    // dumped the halfway stacks, we properly re-initialize the trace file.
    final File stack = ActivityManagerService.dumpStackTraces(!waitedHalf, pids, null, null, getInterestingNativePids());
    // Give some extra time to make sure the stack traces get written.
    // The system's been hanging for a minute, another second or two won't hurt much.
    SystemClock.sleep(2000);

  3. Dump kernel information

    // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');

  4. Collect dropbox information

    // Try to add the error to the dropbox, but assuming that the ActivityManager
    // itself may be deadlocked.  (which has happened, causing this statement to
    // deadlock and the watchdog as a whole to be ineffective)
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            mActivity.addErrorToDropBox("watchdog", null, "system_server", null, null,
                    subject, null, stack, null);
        }
    };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

  5. Kill the system process. Unless a debugger is (or recently was) attached or restarts are not allowed, Watchdog kills its own process.

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) {
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) {
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }
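The dropbox step uses a pattern that is useful on its own: run a report that might hang on a helper thread and wait for it only for a bounded time. A plain-Java sketch of the pattern follows; BoundedReportSketch and runWithDeadline are made-up names, and the sleeping task merely simulates a report that hangs.

```java
/** Sketch: run a possibly hanging report with an upper bound on the wait. */
class BoundedReportSketch {
    /** Runs task on a helper thread; true if it finished within millis. */
    static boolean runWithDeadline(Runnable task, long millis) {
        Thread helper = new Thread(task, "report");
        helper.setDaemon(true);  // a hung report must not keep the JVM alive
        helper.start();
        try {
            helper.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }

    public static void main(String[] args) {
        boolean quick = runWithDeadline(() -> System.out.println("report written"), 2000);
        boolean hung = runWithDeadline(() -> {
            try {
                Thread.sleep(5000);  // simulates a report that hangs
            } catch (InterruptedException ignored) {
            }
        }, 200);
        System.out.println("quick finished: " + quick + ", hung finished: " + hung);
    }
}
```

The caller moves on after the deadline either way, which is why a deadlocked ActivityManager cannot stall the watchdog at this step.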
    

The dalvik.vm.stack-trace-dir property

It refers to /data/anr.

final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
