
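I: Background
1. The story
The day before yesterday I received a crash dump from a .NET program. After some analysis, the root cause turned out to be the abnormal exit of a managed .NET thread (DBG=XXXX), as shown below: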
    0:011> !t
    ThreadCount:      17
    UnstartedThread:  0
    BackgroundThread: 16
    PendingThread:    0
    DeadThread:       0
    Hosted Runtime:   no
                                                                                                           Lock
     DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
       0    1     84d8 000001C0801EAC20    26020 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 STA
       3    2     9d78 000001C0801F8210    2b220 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Finalizer)
       4    4     8760 000001C08466C800  102b220 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Threadpool Worker)
    ...
      44   16     b2fc 000001C08F949450  102b220 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 MTA (GC) (Threadpool Worker)
      46   15     9904 000001C08F9487B0  102b220 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Threadpool Worker)
    XXXX    3     a23c 000001C08F948E00  102b220 Preemptive  0000000000000000:0000000000000000 000001c080266300 -00001 Ukn (Threadpool Worker)
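Because the thread exited abnormally, the CLR knows nothing about it. When a GC is triggered, it still scans this XXXX thread for reference roots. The thread no longer exists, so touching its memory is naturally an access violation. The ScanStackRoots frame in the call stack makes this clear: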
    0:011> .ecxr
    rax=00007ffdbefcc8a0 rbx=000000a42007f5f0 rcx=000000a42187f688
    rdx=0000000000000000 rsi=000000a42007ee60 rdi=000000a42007f100
    rip=00007ffdbec36cbb rsp=000000a42007f828 rbp=000001c08f948e00
     r8=000000a42007f910  r9=000001c08f948e00 r10=00000fffb7da5860
    r11=0555501544555545 r12=ffffffffffffffff r13=0000000000000000
    r14=0000000000000000 r15=00007ffdbec14fb0
    iopl=0         nv up ei pl nz ac pe cy
    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010211
    coreclr!InlinedCallFrame::FrameHasActiveCall+0x13:
    00007ffd`bec36cbb 483b01          cmp     rax,qword ptr [rcx] ds:000000a4`2187f688=????????????????

    0:011> k
      *** Stack trace for last set context - .thread/.cxr resets it
     # Child-SP          RetAddr               Call Site
    00 000000a4`2007f828 00007ffd`bec36c2e     coreclr!InlinedCallFrame::FrameHasActiveCall+0x13 [D:\a\_work\1\s\src\coreclr\vm\frames.h @ 2927]
    01 000000a4`2007f830 00007ffd`bec36aef     coreclr!ScanStackRoots+0x3a [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 121]
    02 000000a4`2007f8a0 00007ffd`bec29627     coreclr!GCToEEInterface::GcScanRoots+0x8f [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 282]
    03 (Inline Function) --------`--------     coreclr!GCScan::GcScanRoots+0x73 [D:\a\_work\1\s\src\coreclr\gc\gcscan.cpp @ 152]
    04 000000a4`2007f8e0 00007ffd`bec14865     coreclr!WKS::gc_heap::background_mark_phase+0xdf [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 37866]
    05 000000a4`2007f990 00007ffd`bed286a0     coreclr!WKS::gc_heap::gc1+0x511 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 22315]
    06 000000a4`2007f9f0 00007ffd`bed391c1     coreclr!WKS::gc_heap::bgc_thread_function+0x68 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39244]
    07 000000a4`2007fa20 00007ffe`3533e8d7     coreclr!<lambda_7303b2ca2c5f80d5f81ddddfcd2de660>::operator()+0xa1 [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1441]
    08 000000a4`2007fa50 00007ffe`363f14fc     kernel32!BaseThreadInitThunk+0x17
    09 000000a4`2007fa80 00000000`00000000     ntdll!RtlUserThreadStart+0x2c
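To be honest, I have seen many crashes of this kind, but most of them involved threads created with new Thread, where intercepting Thread.StartCore with Harmony easily reveals the culprit. This crash is a bit special: the thread did not come from new Thread but was a free-range thread pool (ThreadPool) thread, which makes the analysis considerably harder. Since this is a retrospective, let's properly summarize how to approach this class of problem.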
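For readers who have not used the Harmony approach mentioned above, here is a minimal sketch of what such an interception might look like. It assumes the Lib.Harmony NuGet package is referenced and that the runtime in use has a single private instance method named Thread.StartCore (an implementation detail that can differ between .NET versions). The class name ThreadStartTracer and the Harmony ID string are made up for illustration.

```csharp
using System;
using System.Reflection;
using System.Threading;
using HarmonyLib; // assumption: the Lib.Harmony NuGet package is referenced

public static class ThreadStartTracer
{
    // Looks up the private Thread.StartCore method named in the text.
    // Returns null if this runtime version does not have it.
    public static MethodInfo FindStartCore() =>
        typeof(Thread).GetMethod("StartCore", BindingFlags.Instance | BindingFlags.NonPublic);

    // Installs a prefix so that every Thread.Start logs who created the thread.
    public static bool Install()
    {
        MethodInfo original = FindStartCore();
        if (original == null)
            return false;

        MethodInfo prefix = typeof(ThreadStartTracer).GetMethod(
            nameof(Prefix), BindingFlags.Static | BindingFlags.NonPublic);

        // "example.thread-start-tracer" is an arbitrary, hypothetical Harmony ID
        new Harmony("example.thread-start-tracer").Patch(original, prefix: new HarmonyMethod(prefix));
        return true;
    }

    // Harmony prefix: runs on the creating thread, before the original StartCore
    private static void Prefix(Thread __instance)
    {
        Console.WriteLine($"[TRACE] starting managed thread {__instance.ManagedThreadId}");
        Console.WriteLine(Environment.StackTrace);
    }
}
```

Calling ThreadStartTracer.Install() once at startup would log a managed stack for every thread started through new Thread. Thread pool workers are created by the runtime itself rather than by user code, which is why this technique does not point at the caller in the case discussed here.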
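II: Reproducing the failure
1. The problematic code
For ease of demonstration, we call C++ from C# and, inside the C++ code, use TerminateThread to make the calling thread exit abnormally. First, the C++ code: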
    #include <cstdio>
    #include <Windows.h>

    extern "C"
    {
        __declspec(dllexport) void dowork();
    }

    void dowork()
    {
        DWORD threadId = GetCurrentThreadId();
        // DWORD is unsigned long, so use the l length modifier
        printf("C++: current thread ID (decimal): %lu, hex: 0x%lX\n", threadId, threadId);
        printf("C++: I am about to exit...\n");

        // Kill the calling thread without giving the CLR any chance to clean up
        TerminateThread(GetCurrentThread(), 1);
    }
Next, call the exported dowork function from C#:
    using System;
    using System.Runtime.InteropServices;
    using System.Threading.Tasks;

    namespace Example_1_1
    {
        internal class Program
        {
            static void Main(string[] args)
            {
                DoRequest();
                Console.ReadLine();
            }

            static void DoRequest()
            {
                // Runs on a ThreadPool worker thread
                Task.Run(() =>
                {
                    Console.WriteLine("1. Calling C++ code...");
                    try
                    {
                        dowork();
                        Console.WriteLine("2. C++ code finished...");
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine($"2. C++ code threw an exception: {ex.Message}");
                    }
                });
            }

            [DllImport("Example_1_2", CallingConvention = CallingConvention.Cdecl)]
            public extern static void dowork();
        }
    }
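Finally, run the program and attach WinDbg. Sure enough, there is an XXXX thread:
[Screenshot: WinDbg showing the XXXX thread]
The failure is reproduced. The next step is to find out who made the ThreadPool thread exit abnormally.
III: How to find the original scene
1. Process Monitor
To get to the root of this problem, we need the call stack of whoever called TerminateThread. A simple, brute-force way is Process Monitor: Windows reports a thread exit as an event, and Process Monitor can capture that event together with its call stack. With the idea in place, let's get to it. The configuration screen looks like this:
[Screenshot: Process Monitor configuration]
Next, run the program, attach WinDbg to the process, and look up the ID of the problem thread: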
    0:005> !t
    ThreadCount:      5
    UnstartedThread:  0
    BackgroundThread: 3
    PendingThread:    0
    DeadThread:       1
    Hosted Runtime:   no
                                                                                                           Lock
     DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
       0    1     153c 00000202C603C240    2a020 Preemptive  00000202CA819060:00000202CA81B020 00000202c6088980 -00001 MTA
       3    2      afc 00000202C60F0DB0    2b220 Preemptive  0000000000000000:0000000000000000 00000202c6088980 -00001 MTA (Finalizer)
    XXXX    4     4718 00000202C6057D10  102b220 Preemptive  00000202CA80CF70:00000202CA80E740 00000202c6088980 -00001 Ukn (Threadpool Worker)
       4    5     4420 00000202C605D510  302b220 Preemptive  00000202CA80EB40:00000202CA810760 00000202c6088980 -00001 MTA (Threadpool Worker)

    0:005> ? 4718
    Evaluate expression: 18200 = 00000000`00004718
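The output shows that the thread that exited abnormally has OSID 0x4718, which is 18200 in decimal. Sure enough, Process Monitor lists a Thread Exit event with Thread ID: 18200. Perfect.
[Screenshot: Thread Exit event for Thread ID 18200 in Process Monitor]
Double-click the event and open the Stack tab. It clearly shows that the exit was caused by a call to Example_1_2!dowork:
[Screenshot: Stack tab of the Thread Exit event]
In a real project, seeing the dowork function should tell you what happened. The scope of the investigation shrinks dramatically, and I trust you can easily resolve the problem from there.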
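One detail that is easy to trip over when matching the two tools: the OSID column of !t is hexadecimal, while Process Monitor displays the thread ID in decimal. WinDbg's ? command does the conversion, but when comparing many IDs from logs a tiny helper can be handy. The sketch below is only an illustration; the class name OsidHelper is made up.

```csharp
using System;
using System.Globalization;

public static class OsidHelper
{
    // Parses an OSID as printed by !t (hex digits, no 0x prefix) into a decimal thread ID
    public static uint ToDecimal(string hexOsid) =>
        uint.Parse(hexOsid, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

    // Formats a decimal thread ID the way the OSID column of !t shows it (lowercase hex)
    public static string ToOsid(uint threadId) =>
        threadId.ToString("x", CultureInfo.InvariantCulture);
}
```

For the thread in this example, OsidHelper.ToDecimal("4718") yields 18200, and OsidHelper.ToOsid(18200) yields "4718".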
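2. Hooking with MinHook
Process Monitor is good, but it has one drawback: it cannot show managed stack frames, and there is no way around that. Is there a way to see the managed stack? That would be ideal. The approach is simple: hook kernel32!TerminateThread, and whenever someone calls it, record the ID of the thread being terminated along with the call stack. The complete code: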
    using System;
    using System.Runtime.InteropServices;
    using System.Threading.Tasks;

    namespace Example_1_1
    {
        internal class Program
        {
            static void Main(string[] args)
            {
                // Install the hook before any TerminateThread calls can occur
                TerminateThreadHook.InstallHook();

                Console.WriteLine("Hook installed. Starting test...");
                DoRequest();

                Console.ReadLine();

                // Uninstall only when done: DoRequest returns immediately, so uninstalling
                // right after it could remove the hook before dowork() has run
                TerminateThreadHook.UninstallHook();
            }

            static void DoRequest()
            {
                Task.Run(() =>
                {
                    Console.WriteLine("1. Calling C++ code...");
                    try
                    {
                        dowork();
                        Console.WriteLine("2. C++ code finished...");
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine($"2. C++ code threw an exception: {ex.Message}");
                    }
                });
            }

            [DllImport("Example_1_2", CallingConvention = CallingConvention.Cdecl)]
            public extern static void dowork();
        }

        public static class TerminateThreadHook
        {
            // TerminateThread function signature
            [UnmanagedFunctionPointer(CallingConvention.StdCall)]
            private delegate bool TerminateThreadDelegate(IntPtr hThread, uint dwExitCode);

            private static TerminateThreadDelegate _originalTerminateThread;

            // Keep the detour delegate in a static field so the GC cannot collect it
            // while native code still holds its function pointer
            private static TerminateThreadDelegate _detour;

            private static IntPtr _terminateThreadPtr = IntPtr.Zero;

            public static void InstallHook()
            {
                // 1. Get TerminateThread address from kernel32.dll
                _terminateThreadPtr = MinHook.GetProcAddress(
                    MinHook.GetModuleHandle("kernel32.dll"),
                    "TerminateThread");

                if (_terminateThreadPtr == IntPtr.Zero)
                {
                    Console.WriteLine("Failed to find TerminateThread address.");
                    return;
                }

                // 2. Initialize MinHook
                var status = MinHook.MH_Initialize();
                if (status != MinHook.MH_STATUS.MH_OK)
                {
                    Console.WriteLine($"MH_Initialize failed: {status}");
                    return;
                }

                // 3. Create Hook
                _detour = new TerminateThreadDelegate(HookedTerminateThread);
                var detourPtr = Marshal.GetFunctionPointerForDelegate(_detour);

                status = MinHook.MH_CreateHook(_terminateThreadPtr, detourPtr, out var originalPtr);
                if (status != MinHook.MH_STATUS.MH_OK)
                {
                    Console.WriteLine($"MH_CreateHook failed: {status}");
                    return;
                }

                _originalTerminateThread = Marshal.GetDelegateForFunctionPointer<TerminateThreadDelegate>(originalPtr);

                // 4. Enable Hook
                status = MinHook.MH_EnableHook(_terminateThreadPtr);
                if (status != MinHook.MH_STATUS.MH_OK)
                {
                    Console.WriteLine($"MH_EnableHook failed: {status}");
                    return;
                }

                Console.WriteLine("TerminateThread hook installed successfully!");
            }

            public static void UninstallHook()
            {
                if (_terminateThreadPtr == IntPtr.Zero)
                    return;

                // 1. Disable Hook
                var status = MinHook.MH_DisableHook(_terminateThreadPtr);
                if (status != MinHook.MH_STATUS.MH_OK)
                    Console.WriteLine($"MH_DisableHook failed: {status}");

                // 2. Uninitialize MinHook
                status = MinHook.MH_Uninitialize();
                if (status != MinHook.MH_STATUS.MH_OK)
                    Console.WriteLine($"MH_Uninitialize failed: {status}");

                _terminateThreadPtr = IntPtr.Zero;
                Console.WriteLine("Hook uninstalled.");
            }

            private static bool HookedTerminateThread(IntPtr hThread, uint dwExitCode)
            {
                // Get the calling thread ID and the ID of the thread being terminated
                uint currentThreadId = GetCurrentThreadId();
                uint targetThreadId = GetThreadId(hThread);

                Console.WriteLine($"[HOOK] TerminateThread intercepted!");
                Console.WriteLine($"  Attempting to terminate thread: 0x{targetThreadId.ToString("X")} (ID: {targetThreadId})");
                Console.WriteLine($"  Called from thread ID: {currentThreadId}");

                // Print managed call stack
                Console.WriteLine("\n  [Managed Call Stack]:");
                Console.WriteLine(Environment.StackTrace);

                // Forward to the real TerminateThread
                return _originalTerminateThread(hThread, dwExitCode);
            }

            [DllImport("kernel32.dll")]
            private static extern uint GetCurrentThreadId();

            [DllImport("kernel32.dll")]
            private static extern uint GetThreadId(IntPtr hThread);
        }

        public static class MinHook
        {
            public enum MH_STATUS
            {
                MH_OK = 0,
                MH_ERROR_ALREADY_INITIALIZED,
                MH_ERROR_NOT_INITIALIZED,
                // ... other status codes
            }

            [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
            public static extern MH_STATUS MH_Initialize();

            [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
            public static extern MH_STATUS MH_Uninitialize();

            [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
            public static extern MH_STATUS MH_CreateHook(IntPtr pTarget, IntPtr pDetour, out IntPtr ppOriginal);

            [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
            public static extern MH_STATUS MH_EnableHook(IntPtr pTarget);

            [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
            public static extern MH_STATUS MH_DisableHook(IntPtr pTarget);

            [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
            public static extern IntPtr GetModuleHandle(string lpModuleName);

            [DllImport("kernel32.dll", CharSet = CharSet.Ansi)]
            public static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);
        }
    }
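The output shows that the call was indeed intercepted, and the Environment.StackTrace property displays the managed stack nicely. One small regret is that the unmanaged frames are missing. If you really need them, dbghelp.dll can help, but I won't go into the details here. In short, compare these call-stack logs with the abnormally exited thread in the dump, and the truth will come out.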
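As a rough illustration of how the unmanaged frames could be recorded inside the hook, the sketch below takes a lighter route than dbghelp.dll: it captures raw return addresses with ntdll's RtlCaptureStackBackTrace and prints each as module+offset. It does not resolve symbol names (that would still need dbghelp or a debugger), it is Windows-only, and the unwinder may stop early at frames it cannot walk. The class NativeStack and its method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;

public static class NativeStack
{
    [DllImport("ntdll.dll")]
    private static extern ushort RtlCaptureStackBackTrace(
        uint framesToSkip, uint framesToCapture, [Out] IntPtr[] backTrace, out uint backTraceHash);

    // Pure helper: maps an address to "module+0xoffset" using a list of loaded module ranges
    public static string Describe(ulong address, IEnumerable<(string Name, ulong Base, ulong Size)> modules)
    {
        foreach (var m in modules)
        {
            if (address >= m.Base && address - m.Base < m.Size)
                return $"{m.Name}+0x{address - m.Base:x}";
        }
        // Addresses outside every module are often JIT-compiled managed code
        return $"0x{address:x} (no module; possibly JIT-compiled code)";
    }

    // Captures the current thread's return addresses and describes each one (Windows only)
    public static List<string> Capture(int maxFrames = 32)
    {
        var frames = new IntPtr[maxFrames];
        ushort count = RtlCaptureStackBackTrace(0, (uint)maxFrames, frames, out _);

        // Snapshot the loaded modules of the current process
        var modules = Process.GetCurrentProcess().Modules
            .Cast<ProcessModule>()
            .Select(m => (Name: m.ModuleName, Base: (ulong)m.BaseAddress.ToInt64(), Size: (ulong)m.ModuleMemorySize))
            .ToList();

        var lines = new List<string>();
        for (int i = 0; i < count; i++)
            lines.Add(Describe((ulong)frames[i].ToInt64(), modules));
        return lines;
    }
}
```

Inside HookedTerminateThread, one could print each line of NativeStack.Capture() next to Environment.StackTrace. A module+offset pair such as a frame inside Example_1_2 can then be resolved to a function name in WinDbg with ln once the module's symbols are loaded.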
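IV: Summary
Industrial control is one of .NET's main battlegrounds these days, and it is full of scenarios where C# and C++ interact. A careless mistake on the C++ side can have disastrous consequences for the C# side. I hope the experience shared in this article helps those who come later avoid a few pitfalls.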