fix: use WaitForExit() instead of WaitForExitAsync to drain stderr#15667
fix: use WaitForExit() instead of WaitForExitAsync to drain stderr#15667nohwnd wants to merge 3 commits intomicrosoft:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent loss of stderr output when a child testhost process crashes (notably stack overflow), by ensuring the process’ asynchronous stderr/stdout readers are fully drained before invoking exit callbacks.
Changes:
- Replace
WaitForExitAsync(cts.Token)withTask.Run(() => p.WaitForExit(), cts.Token)on .NET 5+ to ensure async stream readers reach EOF. - Update inline documentation in
ProcessHelperto explain the race and whyWaitForExit()is required.
| #else | ||
| await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false); | ||
| #endif |
There was a problem hiding this comment.
On NET5_0_OR_GREATER, Task.Run(() => p.WaitForExit(), cts.Token) does not actually cancel the underlying WaitForExit() call. If the process gets into the “zombie/hang” state mentioned in the comment, canceling the CTS will only stop awaiting, but the thread-pool thread will remain blocked indefinitely, which can become a resource leak under repeated hangs. Consider using a wait strategy that both (a) avoids blocking a thread when timing out and (b) still guarantees stream drain (e.g., Windows-specific WaitForExit(timeout) + final drain where safe, and/or explicitly terminating the process on timeout similar to the pre-NET5 Unix branch).
| // See https://github.com/microsoft/vstest/issues/3375 for details. | ||
| var cts = new CancellationTokenSource(timeout); | ||
| #if NET5_0_OR_GREATER | ||
| await p.WaitForExitAsync(cts.Token); | ||
| #else | ||
| #if !NET5_0_OR_GREATER |
There was a problem hiding this comment.
CancellationTokenSource is IDisposable. Consider using using var cts = new CancellationTokenSource(timeout); here to avoid leaking timer resources across many process launches (and similarly dispose any CancellationTokenRegistration returned from cts.Token.Register(...) when used).
| await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false); | ||
| } | ||
| #else | ||
| await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false); |
There was a problem hiding this comment.
This is extremely bad. It synchronously blocks a threadpool thread for potentially long time.
WaitForExitAsync(cancellationToken) AFAIK does drain stdout/stderr, probably except if the cancellation token gets cancelled, but in this case we will also abort the Task.Run and still have the same issue?
Youssef1313
left a comment
There was a problem hiding this comment.
I prefer we have better understanding of what's going on and then sync with @adamsitnik.
| await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false); | ||
| } | ||
| #else | ||
| await Task.Run(() => p.WaitForExit(), cts.Token).ConfigureAwait(false); |
There was a problem hiding this comment.
I'm not sure this Task.Run call preserves the original semantics.
I tried to run the following on sharplab.io:
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
var sw = Stopwatch.StartNew();
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(1));
await Task.Run(() => Thread.Sleep(5000), cts.Token);
Console.WriteLine(sw.Elapsed);
Output is
00:00:05.0168665
So we don't unblock the task when the token is cancelled, it seems.
When testhost crashes with a stack overflow, the Exited event handler calls WaitForExitAsync and then reads testHostProcessStdError. However, WaitForExitAsync does not guarantee that all asynchronous output readers (ErrorDataReceived) have reached EOF before returning. This causes a race where the 'Stack overflow.' stderr message is sometimes lost, resulting in a generic 'Test host process crashed' error. Add a follow-up p.WaitForExit() call after WaitForExitAsync on .NET 5+. The parameterless WaitForExit() ensures all async readers are fully drained. Since the process has already exited (WaitForExitAsync returned), this drain completes near-instantly. If WaitForExitAsync is cancelled (zombie/hung process via CTS timeout), the catch block skips the drain. Also dispose the CancellationTokenSource via 'using var'. Fixes flaky test: TheRunTestsShouldThrowOnStackOverflowException Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
8acbd25 to
3573b9c
Compare
|
🤖 Thanks for the review feedback! Updated the approach: Before (v1): Replaced After (v2): Keep await p.WaitForExitAsync(cts.Token);
// Process has exited; drain async readers. Near-instant since process is dead.
// If WaitForExitAsync was cancelled (zombie), we skip this via the catch block.
p.WaitForExit();This addresses all three concerns:
|
| await p.WaitForExitAsync(cts.Token); | ||
| // Process has exited; drain async readers. This should be near-instant | ||
| // since the process is already dead. If WaitForExitAsync was cancelled | ||
| // (zombie/hung process), we skip this via the catch block. | ||
| p.WaitForExit(); |
There was a problem hiding this comment.
This is extremely confusing, still.
I just realized this is under Exited event, so at this point we are sure that the process in question already exited.
The original code was giving 500ms to drain stdout/stderr by using await WaitForExitAsync(cts.Token);
Now we introduce additional WaitForExit which will wait forever until the stdout/stderr are drained. I don't think it will throw any exceptions if we cannot drain due to a left over grand child process. So this will hang in such case.
There was a problem hiding this comment.
yeah, copilot not really driving this into the right direction. Thanks for the pointers.
Add a test that verifies vstest.console does not hang when testhost starts a child process that inherits the stderr pipe handle and outlives testhost. This simulates the Selenium WebDriver scenario where Edge Driver (child) keeps the pipe handle open after testhost exits, causing the parameterless WaitForExit() to block forever. The protection is the 500ms CancellationTokenSource timeout in ProcessHelper.cs's Exited handler (WaitForExitAsync(cts.Token)). Test assets: - sleeping-child: minimal exe that sleeps for a configurable time - zombie-child-process: MSTest project that starts sleeping-child with UseShellExecute=false (inheriting pipe handles) and passes immediately without waiting for the child Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🤖 Update: Reverted the ProcessHelper.cs change. The PR now adds only the zombie child process protection test. Why the revertInvestigation revealed that The real cause of the flaky
What this PR now doesAdds an integration test that verifies the zombie pipe protection works — vstest.console does NOT hang when testhost starts a child process (simulating Edge Driver) that inherits the stderr pipe handle and outlives testhost. This test must pass before any future changes to the exit handling logic. |
| // vstest.console should complete well within 60 seconds. | ||
| // The child process sleeps for 30s, so if vstest.console waits for the pipe EOF | ||
| // it would take at least 30s. A healthy run completes in a few seconds. | ||
| stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(60), |
There was a problem hiding this comment.
This test aims to detect regressions where vstest.console waits for pipe EOF (child sleeps 30s), but the assertion allows up to 60 seconds. If vstest.console blocks until the child exits (≈30s), the test would still pass. Consider tightening the upper bound to something comfortably below the child sleep duration (while allowing normal CI variance) so the regression is actually caught.
| // vstest.console should complete well within 60 seconds. | |
| // The child process sleeps for 30s, so if vstest.console waits for the pipe EOF | |
| // it would take at least 30s. A healthy run completes in a few seconds. | |
| stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(60), | |
| // vstest.console should complete well before the child process exits. | |
| // The child process sleeps for 30s, so if vstest.console waits for the pipe EOF | |
| // it would take at least 30s. A healthy run completes in a few seconds, so keep | |
| // the upper bound comfortably below 30s while allowing for normal CI variance. | |
| stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(20), |
| var childName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) | ||
| ? "sleeping-child.exe" | ||
| : "sleeping-child"; | ||
| var childPath = Path.Combine(testAssetsDir, "sleeping-child", configDir, tfmDir, childName); | ||
|
|
||
| if (!File.Exists(childPath)) | ||
| { | ||
| Assert.Inconclusive($"sleeping-child not found at: {childPath}"); | ||
| } | ||
|
|
||
| // Start the child process with UseShellExecute=false so it INHERITS testhost's | ||
| // stdio pipe handles (the default on .NET Core). This is the key mechanism that | ||
| // causes the "zombie pipe" issue — the child holds the stderr pipe handle open | ||
| // even after testhost exits. | ||
| var psi = new ProcessStartInfo | ||
| { | ||
| FileName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) | ||
| ? childPath | ||
| : "dotnet", | ||
| UseShellExecute = false, | ||
| CreateNoWindow = true, | ||
| }; | ||
|
|
||
| if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) | ||
| { | ||
| psi.Arguments = $"{childPath} 30000"; | ||
| } |
There was a problem hiding this comment.
On non-Windows, childName is set to "sleeping-child" but the process is launched via "dotnet" with Arguments="{childPath} 30000". In this repo other exe test assets typically run the produced .dll via dotnet on non-Windows; passing the apphost filename (no extension) to dotnet can fail depending on how the app is built/published. Consider either (a) using "sleeping-child.dll" for non-Windows when invoking dotnet, or (b) executing the apphost directly (FileName=childPath) on non-Windows.
| </PropertyGroup> | ||
|
|
||
| <PropertyGroup> | ||
| <TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks> |
There was a problem hiding this comment.
These test assets multi-target all of $(NetCoreAppTargetFrameworks) (net8.0;net9.0;net10.0;net11.0). That can significantly increase TestAssets build time. If the scenario only needs coverage for the same subset other TestAssets use (often net8.0/net9.0), consider reducing the TargetFrameworks list to the minimum needed for the acceptance test.
| <TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks> | |
| <TargetFrameworks>net8.0;net9.0</TargetFrameworks> |
| <Project Sdk="Microsoft.NET.Sdk"> | ||
| <PropertyGroup> | ||
| <OutputType>Exe</OutputType> | ||
| <TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks> |
There was a problem hiding this comment.
This utility asset targets all of $(NetCoreAppTargetFrameworks) (net8.0;net9.0;net10.0;net11.0), which increases TestAssets build time. If it’s only used by the zombie-child-process acceptance test, consider targeting only the TFMs that test actually runs against to keep CI/build costs down.
| <TargetFrameworks>$(NetCoreAppTargetFrameworks)</TargetFrameworks> | |
| <TargetFramework>net8.0</TargetFramework> |
| ? childPath | ||
| : "dotnet", | ||
| UseShellExecute = false, | ||
| CreateNoWindow = true, |
There was a problem hiding this comment.
Please verify the test is correct and that it would fail when the bug happens:
- I think, but please double check, that redirecting stdout/stderr is required to reproduce a potential hang.
- I'm not sure about CreateNoWindow and if it has anything to do with a potential hang.
| { | ||
| FileName = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) | ||
| ? childPath | ||
| : "dotnet", |
There was a problem hiding this comment.
- Why are we using
dotnetinstead of the apphost for non-Windows? - childPath under non-Windows seem that it doesn't have the
.dllextension so I assume it would fail to run?dotnetexpects a .NET assembly, not the native apphost.
|
🤖 v3 — proper fix pushed. Root Cause
Confirmed with stress test: 200 blocking threadpool tasks → 15/100 misses with the 500ms CTS. Fix (v3)// .NET 5+: drain without cancellation, race against zombie pipe timeout
var exitTask = p.WaitForExitAsync(CancellationToken.None);
if (await Task.WhenAny(exitTask, Task.Delay(5_000)) != exitTask)
{
// Timed out — child process likely holds the pipe handle open
}
Tests
|
The Exited handler used WaitForExitAsync(cts.Token) with a 500ms CancellationTokenSource. Under threadpool starvation (common in CI), the CTS fires before the async reader drain completes, causing WaitForExitAsync to throw. The exit callback then reads incomplete stderr, losing the 'Stack overflow.' message from crashed testhosts. Replace the cancellable WaitForExitAsync with an uncancellable one (CancellationToken.None) raced against Task.Delay via Task.WhenAny. This ensures the drain runs to completion without being interrupted by the timeout. The timeout (increased from 500ms to 2s) only serves as a zombie pipe guard — when a child process (e.g. Selenium's Edge Driver) inherits the stderr handle and prevents pipe EOF. The Exited handler runs asynchronously (async void) after the process is killed, so the timeout does not block the abort/Close path. The abort test verifies this. Key changes: - WaitForExitAsync(CancellationToken.None) — drain cannot be cancelled - Task.WhenAny(exitTask, Task.Delay(2000)) — zombie pipe protection - Timeout increased from 500ms to 2s — more resilient to starvation - Pre-.NET 5 paths also use the 2s timeout for consistency - CTS in pre-.NET 5 Unix path now disposed via 'using var' - New test: AbortOnTimeoutShouldCompleteQuickly — guards abort speed Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bc9a95d to
b3e44ad
Compare
| // On .NET 5+, WaitForExitAsync waits for process exit AND drains all | ||
| // asynchronous output/error readers (ErrorDataReceived, OutputDataReceived). | ||
| // We run it WITHOUT a CancellationToken so the drain cannot be interrupted | ||
| // by threadpool starvation. Instead, we race it against a delay task to | ||
| // handle zombie pipes (child process inheriting stdout/stderr handles | ||
| // prevents EOF, which would block the drain indefinitely). | ||
| // |
There was a problem hiding this comment.
The implementation and comments here conflict with the PR title/description: the PR metadata states WaitForExitAsync does not guarantee async stdout/stderr readers reach EOF (hence switching to parameterless WaitForExit via Task.Run), but this code still uses WaitForExitAsync and asserts it drains the readers. Please align the code + comments with the intended fix (or update the PR description/title if the approach changed).
| if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask) | ||
| { | ||
| // Timed out — a child process likely holds the stderr/stdout pipe | ||
| // handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask | ||
| // continues in the background and will complete when the child exits. |
There was a problem hiding this comment.
exitTask is never awaited/observed. If WaitForExitAsync throws (e.g., process disposed while the background wait is still running in the timeout path), the exception becomes unobserved. Consider awaiting exitTask when it wins the race, and in the timeout path attach a continuation to observe/ignore any faulted state to avoid UnobservedTaskException noise.
| if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask) | |
| { | |
| // Timed out — a child process likely holds the stderr/stdout pipe | |
| // handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask | |
| // continues in the background and will complete when the child exits. | |
| var completedTask = await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false); | |
| if (completedTask is exitTask) | |
| { | |
| await exitTask.ConfigureAwait(false); | |
| } | |
| else | |
| { | |
| // Timed out — a child process likely holds the stderr/stdout pipe | |
| // handle open (e.g. Selenium WebDriver's Edge Driver). The exitTask | |
| // continues in the background and will complete when the child exits. | |
| _ = exitTask.ContinueWith( | |
| faultedTask => _ = faultedTask.Exception, | |
| CancellationToken.None, | |
| TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously, | |
| TaskScheduler.Default); |
| /// Selenium WebDriver scenario where Edge Driver (child) keeps the pipe handle open | ||
| /// after testhost exits. | ||
| /// | ||
| /// The protection is the 500ms timeout in ProcessHelper.cs's Exited handler. |
There was a problem hiding this comment.
The comment references a 500ms timeout in ProcessHelper's Exited handler, but ProcessHelper now uses zombiePipeTimeout = 2_000. Update this documentation to match the current behavior (or adjust the constant if 500ms is still the intended value).
| /// The protection is the 500ms timeout in ProcessHelper.cs's Exited handler. | |
| /// The protection is the 2000ms timeout in ProcessHelper.cs's Exited handler. |
| const int zombiePipeTimeout = 2_000; | ||
| #if NET5_0_OR_GREATER | ||
| var exitTask = p.WaitForExitAsync(CancellationToken.None); | ||
| if (await Task.WhenAny(exitTask, Task.Delay(zombiePipeTimeout)).ConfigureAwait(false) != exitTask) |
There was a problem hiding this comment.
Is this change really needed? Isn't this effectively similar to just increasing the timeout and keeping the original code?
Problem
When testhost crashes with a stack overflow, vstest sometimes reports just \Test host process crashed\ instead of \Test host process crashed : Stack overflow.. The stderr output from the crash is intermittently lost.
Root Cause
Race condition in \ProcessHelper.cs\ Exited event handler:
\WaitForExitAsync\ does NOT guarantee that all asynchronous readers (\ErrorDataReceived, \OutputDataReceived) have reached EOF before returning. The parameterless \WaitForExit()\ does.
Fix
Replace \WaitForExitAsync(cts.Token)\ with \�wait Task.Run(() => p.WaitForExit(), cts.Token)\ on .NET 5+.
Repro
Reproduced with a stress test (50-200 iterations of crashing a child process and capturing stderr):
The race is load/timing dependent — consistent with flaky CI.
Flaky Test Fixed
\TheRunTestsShouldThrowOnStackOverflowException\ in \TranslationLayerTests/RunTests.cs\
Relates to #3375