Skip to content

fix: use WaitForExit() instead of WaitForExitAsync to drain stderr#15667

Open
nohwnd wants to merge 3 commits intomicrosoft:mainfrom
nohwnd:fix-stderr-race-condition
Open

fix: use WaitForExit() instead of WaitForExitAsync to drain stderr#15667
nohwnd wants to merge 3 commits intomicrosoft:mainfrom
nohwnd:fix-stderr-race-condition

Conversation

@nohwnd
Copy link
Copy Markdown
Member

@nohwnd nohwnd commented Apr 13, 2026

Problem

When testhost crashes with a stack overflow, vstest sometimes reports just \Test host process crashed\ instead of \Test host process crashed : Stack overflow.. The stderr output from the crash is intermittently lost.

Root Cause

Race condition in \ProcessHelper.cs\ Exited event handler:

  1. Testhost crashes → \Process.Exited\ event fires
  2. Handler calls \WaitForExitAsync(cts.Token)\
  3. Handler calls \�xitCallBack\ → \TestHostManagerCallbacks.ExitCallBack\ reads \ estHostProcessStdError\
  4. But \ErrorDataReceived\ may not have delivered the final stderr data yet

\WaitForExitAsync\ does NOT guarantee that all asynchronous readers (\ErrorDataReceived, \OutputDataReceived) have reached EOF before returning. The parameterless \WaitForExit()\ does.

Fix

Replace \WaitForExitAsync(cts.Token)\ with \�wait Task.Run(() => p.WaitForExit(), cts.Token)\ on .NET 5+.

  • *\WaitForExit()* waits for both process exit AND async reader EOF — no race
  • *\Task.Run* keeps it non-blocking in the event handler
  • *\CancellationTokenSource(500ms)* handles zombie processes (e.g. Selenium WebDriver detach)
  • Matches the pattern already used in the pre-.NET 5 Unix code path

Repro

Reproduced with a stress test (50-200 iterations of crashing a child process and capturing stderr):

  • \WaitForExitAsync: 49/50 captured (1 missed) under load
  • \WaitForExit(): 50/50 captured ✅

The race is load/timing dependent — consistent with flaky CI.

Flaky Test Fixed

\TheRunTestsShouldThrowOnStackOverflowException\ in \TranslationLayerTests/RunTests.cs\

Relates to #3375

Copilot AI review requested due to automatic review settings April 13, 2026 08:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses intermittent loss of stderr output when a child testhost process crashes (notably stack overflow), by ensuring the process’ asynchronous stderr/stdout readers are fully drained before invoking exit callbacks.

Changes:

  • Replace WaitForExitAsync(cts.Token) with Task.Run(() => p.WaitForExit(), cts.Token) on .NET 5+ to ensure async stream readers reach EOF.
  • Update inline documentation in ProcessHelper to explain the race and why WaitForExit() is required.

Comment on lines 200 to 202
#else
await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false);
#endif
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On NET5_0_OR_GREATER, Task.Run(() => p.WaitForExit(), cts.Token) does not actually cancel the underlying WaitForExit() call. If the process gets into the “zombie/hang” state mentioned in the comment, canceling the CTS will only stop awaiting, but the thread-pool thread will remain blocked indefinitely, which can become a resource leak under repeated hangs. Consider using a wait strategy that both (a) avoids blocking a thread when timing out and (b) still guarantees stream drain (e.g., Windows-specific WaitForExit(timeout) + final drain where safe, and/or explicitly terminating the process on timeout similar to the pre-NET5 Unix branch).

Copilot uses AI. Check for mistakes.
Comment on lines +160 to +162
// See https://github.com/microsoft/vstest/issues/3375 for details.
var cts = new CancellationTokenSource(timeout);
#if NET5_0_OR_GREATER
await p.WaitForExitAsync(cts.Token);
#else
#if !NET5_0_OR_GREATER
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CancellationTokenSource is IDisposable. Consider using using var cts = new CancellationTokenSource(timeout); here to avoid leaking timer resources across many process launches (and similarly dispose any CancellationTokenRegistration returned from cts.Token.Register(...) when used).

Copilot uses AI. Check for mistakes.
await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false);
}
#else
await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely bad. It synchronously blocks a threadpool thread for potentially long time.

WaitForExitAsync(cancellationToken) AFAIK does drain stdout/stderr, probably except if the cancellation token gets cancelled, but in this case we will also abort the Task.Run and still have the same issue?

Copy link
Copy Markdown
Member

@Youssef1313 Youssef1313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer we have better understanding of what's going on and then sync with @adamsitnik.

await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false);
}
#else
await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this Task.Run call preserves the original semantics.

I tried to run the following on sharplab.io:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var sw = Stopwatch.StartNew();
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
await Task.Run(() => Thread.Sleep(5000), cts.Token);
Console.WriteLine(sw.Elapsed);

Output is

00:00:05.0168665

So we don't unblock the task when the token is cancelled, it seems.

When testhost crashes with a stack overflow, the Exited event handler
calls WaitForExitAsync and then reads testHostProcessStdError. However,
WaitForExitAsync does not guarantee that all asynchronous output readers
(ErrorDataReceived) have reached EOF before returning. This causes a
race where the 'Stack overflow.' stderr message is sometimes lost,
resulting in a generic 'Test host process crashed' error.

Add a follow-up p.WaitForExit() call after WaitForExitAsync on .NET 5+.
The parameterless WaitForExit() ensures all async readers are fully
drained. Since the process has already exited (WaitForExitAsync returned),
this drain completes near-instantly. If WaitForExitAsync is cancelled
(zombie/hung process via CTS timeout), the catch block skips the drain.

Also dispose the CancellationTokenSource via 'using var'.

Fixes flaky test: TheRunTestsShouldThrowOnStackOverflowException

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd nohwnd force-pushed the fix-stderr-race-condition branch from 8acbd25 to 3573b9c Compare April 13, 2026 11:14
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Apr 13, 2026

🤖 Thanks for the review feedback! Updated the approach:

Before (v1): Replaced WaitForExitAsync entirely with Task.Run(() => p.WaitForExit(), cts.Token) — which as @Youssef1313 correctly pointed out does not actually cancel the blocking WaitForExit(), leaking a threadpool thread on zombie processes.

After (v2): Keep WaitForExitAsync(cts.Token) for the non-blocking, cancellable wait, then follow it with p.WaitForExit() to drain async readers:

await p.WaitForExitAsync(cts.Token);
// Process has exited; drain async readers. Near-instant since process is dead.
// If WaitForExitAsync was cancelled (zombie), we skip this via the catch block.
p.WaitForExit();

This addresses all three concerns:

  • No threadpool thread leak on zombie processes (WaitForExitAsync is properly cancellable)
  • Async readers are fully drained before the exit callback reads stderr
  • CancellationTokenSource is now disposed via using var

Comment on lines +198 to +202
await p.WaitForExitAsync(cts.Token);
// Process has exited; drain async readers. This should be near-instant
// since the process is already dead. If WaitForExitAsync was cancelled
// (zombie/hung process), we skip this via the catch block.
p.WaitForExit();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely confusing, still.

I just realized this is under Exited event, so at this point we are sure that the process in question already exited.

The original code was giving 500ms to drain stdout/stderr by using await WaitForExitAsync(cts.Token);

Now we introduce additional WaitForExit which will wait forever until the stdout/stderr are drained. I don't think it will throw any exceptions if we cannot drain due to a left over grand child process. So this will hang in such case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, copilot not really driving this into the right direction. Thanks for the pointers.

Add a test that verifies vstest.console does not hang when testhost
starts a child process that inherits the stderr pipe handle and
outlives testhost. This simulates the Selenium WebDriver scenario
where Edge Driver (child) keeps the pipe handle open after testhost
exits, causing the parameterless WaitForExit() to block forever.

The protection is the 500ms CancellationTokenSource timeout in
ProcessHelper.cs's Exited handler (WaitForExitAsync(cts.Token)).

Test assets:
- sleeping-child: minimal exe that sleeps for a configurable time
- zombie-child-process: MSTest project that starts sleeping-child
  with UseShellExecute=false (inheriting pipe handles) and passes
  immediately without waiting for the child

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 13, 2026 12:24
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Apr 13, 2026

🤖 Update: Reverted the ProcessHelper.cs change. The PR now adds only the zombie child process protection test.

Why the revert

Investigation revealed that WaitForExitAsync does drain async readers (ErrorDataReceived/OutputDataReceived) — it waits for EOF on both streams. The original v1 fix (Task.Run(() => p.WaitForExit())) was wrong (thread leak, non-cancellable), and v2 (WaitForExitAsync + WaitForExit()) was redundant.

The real cause of the flaky StackOverflowException test is threadpool starvation — confirmed with a local stress test (200 blocking tasks → 15/100 misses). Under starvation, the 500ms CTS cancels WaitForExitAsync before the drain completes. Fixing this properly requires careful balancing of two competing concerns:

  1. Stderr drain (stack overflow message) — needs the drain to complete
  2. Zombie pipe protection (Selenium/WebDriver) — needs a timeout to avoid hanging forever when a child process inherits the pipe

What this PR now does

Adds an integration test that verifies the zombie pipe protection works — vstest.console does NOT hang when testhost starts a child process (simulating Edge Driver) that inherits the stderr pipe handle and outlives testhost. This test must pass before any future changes to the exit handling logic.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comment on lines +203 to +206
// vstest.console should complete well within 60 seconds.
// The child process sleeps for 30s, so if vstest.console waits for the pipe EOF
// it would take at least 30s. A healthy run completes in a few seconds.
stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(60),
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test aims to detect regressions where vstest.console waits for pipe EOF (child sleeps 30s), but the assertion allows up to 60 seconds. If vstest.console blocks until the child exits (≈30s), the test would still pass. Consider tightening the upper bound to something comfortably below the child sleep duration (while allowing normal CI variance) so the regression is actually caught.

Suggested change
// vstest.console should complete well within 60 seconds.
// The child process sleeps for 30s, so if vstest.console waits for the pipe EOF
// it would take at least 30s. A healthy run completes in a few seconds.
stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(60),
// vstest.console should complete well before the child process exits.
// The child process sleeps for 30s, so if vstest.console waits for the pipe EOF
// it would take at least 30s. A healthy run completes in a few seconds, so keep
// the upper bound comfortably below 30s while allowing for normal CI variance.
stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(20),

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +63
var childName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
? "sleeping-child.exe"
: "sleeping-child";
var childPath = Path.Combine(testAssetsDir, "sleeping-child", configDir, tfmDir, childName);

if (!File.Exists(childPath))
{
Assert.Inconclusive($"sleeping-child not found at: {childPath}");
}

// Start the child process with UseShellExecute=false so it INHERITS testhost's
// stdio pipe handles (the default on .NET Core). This is the key mechanism that
// causes the "zombie pipe" issue — the child holds the stderr pipe handle open
// even after testhost exits.
var psi = new ProcessStartInfo
{
FileName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
? childPath
: "dotnet",
UseShellExecute = false,
CreateNoWindow = true,
};

if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
psi.Arguments = $"{childPath} 30000";
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On non-Windows, childName is set to "sleeping-child" but the process is launched via "dotnet" with Arguments="{childPath} 30000". In this repo other exe test assets typically run the produced .dll via dotnet on non-Windows; passing the apphost filename (no extension) to dotnet can fail depending on how the app is built/published. Consider either (a) using "sleeping-child.dll" for non-Windows when invoking dotnet, or (b) executing the apphost directly (FileName=childPath) on non-Windows.

Copilot uses AI. Check for mistakes.
</PropertyGroup>

<PropertyGroup>
<TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks>
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test assets multi-target all of $(NetCoreAppTargetFrameworks) (net8.0;net9.0;net10.0;net11.0). That can significantly increase TestAssets build time. If the scenario only needs coverage for the same subset other TestAssets use (often net8.0/net9.0), consider reducing the TargetFrameworks list to the minimum needed for the acceptance test.

Suggested change
<TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks>
<TargetFrameworks>net8.0;net9.0</TargetFrameworks>

Copilot uses AI. Check for mistakes.
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks>
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This utility asset targets all of $(NetCoreAppTargetFrameworks) (net8.0;net9.0;net10.0;net11.0), which increases TestAssets build time. If it’s only used by the zombie-child-process acceptance test, consider targeting only the TFMs that test actually runs against to keep CI/build costs down.

Suggested change
<TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks>
<TargetFramework>net8.0</TargetFramework>

Copilot uses AI. Check for mistakes.
? childPath
: "dotnet",
UseShellExecute = false,
CreateNoWindow = true,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the test is correct and that it would fail when the bug happens:

  • I think, but please double check, that redirecting stdout/stderr is required to reproduce a potential hang.
  • I'm not sure about CreateNoWindow and if it has anything to do with a potential hang.

{
FileName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
? childPath
: "dotnet",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why are we using dotnet instead of the apphost for non-Windows?
  • childPath under non-Windows seem that it doesn't have the .dll extension so I assume it would fail to run? dotnet expects a .NET assembly, not the native apphost.

@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented Apr 13, 2026

🤖 v3 — proper fix pushed.

Root Cause

WaitForExitAsync(cts.Token) with a 500ms CancellationTokenSource. Under threadpool starvation (common in CI), the CTS fires before the async reader drain completes. WaitForExitAsync does drain ErrorDataReceived/OutputDataReceived, but the 500ms CTS also cancels that drain.

Confirmed with stress test: 200 blocking threadpool tasks → 15/100 misses with the 500ms CTS.

Fix (v3)

// .NET 5+: drain without cancellation, race against zombie pipe timeout
var exitTask = p.WaitForExitAsync(CancellationToken.None);
if (await Task.WhenAny(exitTask, Task.Delay(5_000)) != exitTask)
{
    // Timed out — child process likely holds the pipe handle open
}
  • CancellationToken.None — drain runs to completion, immune to starvation
  • Task.WhenAny + Task.Delay(5s) — zombie pipe guard (child inheriting stderr handle)
  • No thread blocking, no CTS disposal issues
  • Pre-.NET 5 paths also updated to 5s for consistency

Tests

  • New zombie pipe test (VsTestShouldNotHangWhenChildProcessHoldsPipeOpen) — verifies vstest.console completes when a child process holds the stderr pipe open. Passes with both old (500ms) and new (5s) timeout.
  • New test assets (sleeping-child, zombie-child-process) — simulate the Selenium WebDriver scenario.

The Exited handler used WaitForExitAsync(cts.Token) with a 500ms
CancellationTokenSource. Under threadpool starvation (common in CI),
the CTS fires before the async reader drain completes, causing
WaitForExitAsync to throw. The exit callback then reads incomplete
stderr, losing the 'Stack overflow.' message from crashed testhosts.

Replace the cancellable WaitForExitAsync with an uncancellable one
(CancellationToken.None) raced against Task.Delay via Task.WhenAny.
This ensures the drain runs to completion without being interrupted
by the timeout. The timeout (increased from 500ms to 2s) only serves
as a zombie pipe guard — when a child process (e.g. Selenium's Edge
Driver) inherits the stderr handle and prevents pipe EOF.

The Exited handler runs asynchronously (async void) after the process
is killed, so the timeout does not block the abort/Close path. The
abort test verifies this.

Key changes:
- WaitForExitAsync(CancellationToken.None) — drain cannot be cancelled
- Task.WhenAny(exitTask, Task.Delay(2000)) — zombie pipe protection
- Timeout increased from 500ms to 2s — more resilient to starvation
- Pre-.NET 5 paths also use the 2s timeout for consistency
- CTS in pre-.NET 5 Unix path now disposed via 'using var'
- New test: AbortOnTimeoutShouldCompleteQuickly — guards abort speed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd nohwnd force-pushed the fix-stderr-race-condition branch from bc9a95d to b3e44ad Compare April 13, 2026 16:04
Copilot AI review requested due to automatic review settings April 13, 2026 16:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +150 to 156
// On .NET 5+, WaitForExitAsync waits for process exit AND drains all
// asynchronous output/error readers (ErrorDataReceived, OutputDataReceived).
// We run it WITHOUT a CancellationToken so the drain cannot be interrupted
// by threadpool starvation. Instead, we race it against a delay task to
// handle zombie pipes (child process inheriting stdout/stderr handles
// prevents EOF, which would block the drain indefinitely).
//
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation and comments here conflict with the PR title/description: the PR metadata states WaitForExitAsync does not guarantee async stdout/stderr readers reach EOF (hence switching to parameterless WaitForExit via Task.Run), but this code still uses WaitForExitAsync and asserts it drains the readers. Please align the code + comments with the intended fix (or update the PR description/title if the approach changed).

Copilot uses AI. Check for mistakes.
Comment on lines +162 to +166
if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask)
{
// Timed out — a child process likely holds the stderr/stdout pipe
// handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask
// continues in the background and will complete when the child exits.
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exitTask is never awaited/observed. If WaitForExitAsync throws (e.g., process disposed while the background wait is still running in the timeout path), the exception becomes unobserved. Consider awaiting exitTask when it wins the race, and in the timeout path attach a continuation to observe/ignore any faulted state to avoid UnobservedTaskException noise.

Suggested change
if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask)
{
// Timed out — a child process likely holds the stderr/stdout pipe
// handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask
// continues in the background and will complete when the child exits.
var completedTask = await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false);
if (completedTask is exitTask)
{
await exitTask.ConfigureAwait(false);
}
else
{
// Timed out — a child process likely holds the stderr/stdout pipe
// handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask
// continues in the background and will complete when the child exits.
_ = exitTask.ContinueWith(
faultedTask => _ = faultedTask.Exception,
CancellationToken.None,
TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);

Copilot uses AI. Check for mistakes.
/// Selenium WebDriver scenario where Edge Driver (child) keeps the pipe handle open
/// after testhost exits.
///
/// The protection is the 500ms timeout in ProcessHelper.cs's Exited handler.
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment references a 500ms timeout in ProcessHelper's Exited handler, but ProcessHelper now uses zombiePipeTimeout = 2_000. Update this documentation to match the current behavior (or adjust the constant if 500ms is still the intended value).

Suggested change
/// The protection is the 500ms timeout in ProcessHelper.cs's Exited handler.
/// The protection is the 2000ms timeout in ProcessHelper.cs's Exited handler.

Copilot uses AI. Check for mistakes.
const int zombiePipeTimeout = 2_000;
#if NET5_0_OR_GREATER
var exitTask = p.WaitForExitAsync(CancellationToken.None);
if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change really needed? Isn't this effectively similar to just increasing the timeout and keeping the original code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants