stall-detector: Try hard not to crash while collecting backtrace #2420

xemul · 2024-09-04T11:05:16Z

Sometimes the stall-detector signal arrives in the middle of exception handling. If a stall is detected, stack unwinding starts in order to collect the stalled backtrace. Since exception handling unwinds the stack as well, the two unwinders need to cooperate carefully, which is not guaranteed (spoiler: they don't cooperate carefully). In the unlucky case a segmentation fault happens and the app is killed with SEGV.

This patch helps the stall detector bail out if SEGV arrives while it is collecting the backtrace, leaving a minimal yet still detailed enough stall report.
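For illustration, a minimal sketch of the bail-out idea, not the patch itself. Every name below (`bailout_env`, `collecting`, `on_segv`, `fake_unwind`, `collect_guarded`) is hypothetical, the crash is simulated with `raise(SIGSEGV)`, and the sketch passes `savemask=1` to `sigsetjmp` as its own assumption.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstring>
#include <setjmp.h>  // POSIX sigsetjmp / siglongjmp
#include <signal.h>  // POSIX sigaction

// Hypothetical names; only the sigsetjmp/siglongjmp idea comes from the patch.
static sigjmp_buf bailout_env;
static volatile std::sig_atomic_t collecting = 0;

// SIGSEGV handler: jump back to the checkpoint if the fault hit the collector.
extern "C" void on_segv(int) {
    if (collecting) {
        siglongjmp(bailout_env, 1);
    }
    // The fault came from elsewhere: restore the default action and re-raise.
    std::signal(SIGSEGV, SIG_DFL);
    std::raise(SIGSEGV);
}

// Bounded append into a caller-owned, NUL-terminated buffer.
static void append(char* buf, std::size_t cap, const char* s) {
    const std::size_t used = std::strlen(buf);
    if (used + 1 < cap) {
        std::strncat(buf, s, cap - used - 1);
    }
}

// Stand-in for the real unwinder; optionally simulates the crash.
static void fake_unwind(char* buf, std::size_t cap, bool simulate_crash) {
    append(buf, cap, " frame0");
    if (simulate_crash) {
        std::raise(SIGSEGV);
    }
    append(buf, cap, " frame1");
}

// Returns true if the whole backtrace was collected, false if it bailed out.
bool collect_guarded(char* buf, std::size_t cap, bool simulate_crash) {
    assert(buf != nullptr && cap > 0);
    struct sigaction sa, old;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    volatile bool complete = false;
    append(buf, cap, "stall:");
    // savemask=1 so that the jump also unblocks SIGSEGV (assumption of this sketch).
    if (sigsetjmp(bailout_env, 1) == 0) {
        collecting = 1;
        fake_unwind(buf, cap, simulate_crash);
        complete = true;
    } else {
        // Reached via siglongjmp from the handler: finish with a short report.
        append(buf, cap, " <backtrace unavailable>");
    }
    collecting = 0;
    sigaction(SIGSEGV, &old, nullptr);
    return complete;
}
```

Calling `collect_guarded(buf, sizeof(buf), true)` on an empty buffer leaves a truncated but usable report in `buf` instead of killing the process.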

fee-mendes · 2024-09-04T11:10:53Z

src/core/reactor.cc

 static void print_with_backtrace(backtrace_buffer& buf, bool oneline) noexcept {
+    if (sigsetjmp(stall_detector_env, 0)) {
+        buf.append(" ¯\\_(ツ)_/¯\n");


Signed-off-by: Pavel Emelyanov <[email protected]>

michoecho · 2024-09-04T11:21:08Z

Doesn't solve the problem entirely, since SIGSEGV isn't the only possible symptom (you could get an infinite loop, for example), but I guess it prevents a crash in the cases where that's enough (probably the great majority), and doesn't hurt in the others, so why not.

avikivity · 2024-09-11T10:49:08Z

src/core/reactor.cc

+        goto out;
+    }
+    in_stall_detector = true;
+


To be technically correct, we need an std::atomic_signal_fence(std::memory_order_relaxed). This prevents a magical compiler from delaying the write to memory because no one reads it.
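A rough sketch of the pattern being discussed: set a flag, then place a signal fence so that a handler running on the same thread observes the write. The names (`in_collector`, `handler_saw_flag`, `probe_handler`, `set_flag_then_signal`) are hypothetical, and the type of the PR's `in_stall_detector` is not shown, so `std::atomic<bool>` is an assumption. The standard's wording gives a fence with `memory_order_relaxed` no effects, so this sketch conservatively uses `memory_order_seq_cst`; treat that choice as an assumption too.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical stand-ins for the flag and for a handler that inspects it.
static std::atomic<bool> in_collector{false};
static std::atomic<bool> handler_saw_flag{false};
static_assert(std::atomic<bool>::is_always_lock_free,
              "the flag must be usable from a signal handler");

extern "C" void probe_handler(int) {
    // Record what the handler observes for the flag.
    handler_saw_flag.store(in_collector.load(std::memory_order_relaxed),
                           std::memory_order_relaxed);
}

// Sets the flag, fences, then delivers a signal to the same thread.
bool set_flag_then_signal() {
    const auto prev = std::signal(SIGUSR1, probe_handler);
    assert(prev != SIG_ERR);
    in_collector.store(true, std::memory_order_relaxed);
    // Compiler-only barrier (no CPU fence instruction is emitted): asks the
    // compiler not to sink the store below this point.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    std::raise(SIGUSR1);  // the handler runs on this thread before raise() returns
    std::signal(SIGUSR1, SIG_DFL);
    return handler_saw_flag.load(std::memory_order_relaxed);
}
```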

avikivity · 2024-09-11T10:51:01Z

tests/unit/stall_detector_test.cc

+    reactor::test::set_stall_detector_crash_collecting_backtrace();
+    engine().update_blocked_reactor_notify_ms(100ms);
+    spin(500ms);
+}


Did you also reproduce the crash during unwinding? It's not given that siglongjmp is a safe way to unwind. If the unwinder takes a lock, it will leak it (though I'm guessing it doesn't).
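To make the lock concern concrete, a toy sketch of how jumping out of a critical section leaks a lock. Nothing here is taken from the real unwinder: `unwinder_lock`, `locked_step` and `lock_leaked_after` are hypothetical, and a direct `siglongjmp` call stands in for the jump that a SIGSEGV handler would perform.

```cpp
#include <atomic>
#include <cassert>
#include <setjmp.h>  // POSIX sigsetjmp / siglongjmp

static sigjmp_buf leak_env;                                // hypothetical
static std::atomic_flag unwinder_lock = ATOMIC_FLAG_INIT;  // hypothetical lock

// Takes the lock, optionally bails out in the middle, otherwise releases it.
static void locked_step(bool bail_out) {
    const bool already_held = unwinder_lock.test_and_set(std::memory_order_acquire);
    assert(!already_held);
    if (bail_out) {
        siglongjmp(leak_env, 1);  // skips the clear() below
    }
    unwinder_lock.clear(std::memory_order_release);
}

// Returns true if the lock is still held once control is back at the checkpoint.
bool lock_leaked_after(bool bail_out) {
    if (sigsetjmp(leak_env, 0) == 0) {
        locked_step(bail_out);
    }
    const bool was_held = unwinder_lock.test_and_set(std::memory_order_acquire);
    unwinder_lock.clear(std::memory_order_release);  // reset so the sketch can be re-run
    return was_held;
}
```

With `bail_out` set, the next attempt to take the lock finds it held, which is the leak described above; whether the real unwinder takes any lock at all remains an open question in this thread.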

> Did you also reproduce the crash during unwinding?

In labs -- unfortunately, no :(

> It's not given that siglongjmp is a safe way to unwind.

Yes, sure, at this point the situation is already screwed up, and it's questionable whether these tricks make things even worse or not.

Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one), and call them via RTLD_NEXT. Then we can set flags when unwinding is in progress, and just avoid going into the stall detector again (or perhaps: ask the stall detector to run on the exit path of __cxa_throw).

I don't think it will work.

Also, tracing exception throwers is important.

> Perhaps we can override __cxa_throw and whatever function it uses to exit unwinding (but maybe there isn't one)

There isn't one.

Maybe have a blacklist of functions that are known to crash. Every time we see a crash, add the triggering function to the blacklist. In a few short years we'll have a robust filter.

xemul requested review from avikivity and michoecho September 4, 2024 11:05

fee-mendes reviewed Sep 4, 2024

xemul force-pushed the br-avoid-segv-in-stall-detector branch from 6b368ce to ce84a03 Compare September 4, 2024 11:17

avikivity reviewed Sep 11, 2024

xemul Sep 11, 2024

avikivity Sep 16, 2024

avikivity Sep 16, 2024

michoecho Sep 16, 2024

avikivity Sep 17, 2024
