Tests failing with docker command 'CreateNetwork' returned with non-zero exit code #5702

Open
radical opened this issue Sep 13, 2024 · 7 comments

@radical
Member

radical commented Sep 13, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=807077
Build error leg or test failing: Aspire.Playground.Tests.*
Pull request: #5684

Full error message: docker command 'CreateNetwork' returned with non-zero exit code 1: command output: Stdout: '' Stderr: 'Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network'

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "docker command 'CreateNetwork' returned with non-zero exit code",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
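
As a rough illustration of how a known-issue entry like the one above could be checked against a build log, here is a minimal sketch. It assumes that ErrorMessage is treated as a plain substring and ErrorPattern as a regular expression applied line by line. The function name `matches_known_issue` and these matching rules are hypothetical and are not taken from the actual build analysis tooling.

```python
import json
import re

# The known-issue JSON from the issue body above.
KNOWN_ISSUE_JSON = """
{
  "ErrorMessage": "",
  "ErrorPattern": "docker command 'CreateNetwork' returned with non-zero exit code",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
"""


def matches_known_issue(issue: dict, log_text: str) -> bool:
    """Return True if any log line matches the issue.

    Assumed rules (hypothetical): ErrorMessage is a substring match,
    ErrorPattern is a regular expression searched within each line.
    """
    message = issue.get("ErrorMessage") or ""
    pattern = issue.get("ErrorPattern") or ""
    for line in log_text.splitlines():
        if message and message in line:
            return True
        if pattern and re.search(pattern, line):
            return True
    return False


issue = json.loads(KNOWN_ISSUE_JSON)
```

Calling `matches_known_issue(issue, log_text)` with the error text from this issue returns True under these assumptions, while an unrelated log returns False.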

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=807077
Error message validated: [docker command 'CreateNetwork' returned with non-zero exit code]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 9/13/2024 6:49:19 AM UTC

Report

| Build | Definition | Test | Pull Request |
| --- | --- | --- | --- |
| 807341 | dotnet/aspire | Aspire.Hosting.Milvus.Tests.MilvusFunctionalTests.Aspire.Hosting.Milvus.Tests.MilvusFunctionalTests.WithDataShouldPersistStateBetweenUsages | #5701 |
| 807292 | dotnet/aspire | Aspire.Playground.Tests.AppHostTests.Aspire.Playground.Tests.AppHostTests.TestEndpointsReturnOk | #5699 |
| 807319 | dotnet/aspire | Aspire.Hosting.SqlServer.Tests.SqlServerFunctionalTests.Aspire.Hosting.SqlServer.Tests.SqlServerFunctionalTests.WithDataShouldPersistStateBetweenUsages(useVolume: False) | |
| 807077 | dotnet/aspire | Aspire.Playground.Tests.AppHostTests.Aspire.Playground.Tests.AppHostTests.TestEndpointsReturnOk | #5682 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 3 | 4 |

@Alirexaa
Contributor

After running it multiple times, I got this error too:

Aspire.Hosting.Dcp.dcpctrl.NetworkReconciler: Error: could not create a network	{"NetworkName": {"name":"aspire-network"}, "Reconciliation": 2, "error": "docker command 'CreateNetwork' returned with non-zero exit code 1: command output: Stdout: '' Stderr: 'Error response from daemon: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network\n'"}

This problem occurs because the app-host doesn't clean up networks after stop/shutdown.

This is the list of networks on my machine.

[Image: docker networks]

cc @davidfowl @danegsta

davidfowl added the area-orchestrator and bug labels Sep 16, 2024
@danegsta
Member

This seems to primarily be an issue with Docker being slower to respond when running multiple tests in parallel. There's a timeout on resource cleanup to avoid leaving zombie processes if cleanup never completes, but it's too aggressive when this many resources are running simultaneously. In the short term we're making some tweaks to improve the reliability of resource cleanup and test runs, which should land today: a longer default (and configurable) duration for the resource cleanup phase, and a retry on network creation if a subnet wasn't available from the default pools.
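
The retry-on-network-creation idea mentioned above can be sketched roughly as follows. This is not DCP's implementation: the `create` callable, the `NetworkCreateError` type, the attempt count, and the linear backoff are all hypothetical choices made for illustration. Only the error text is taken from this issue.

```python
import time
from typing import Callable

# Error text reported by the Docker daemon in this issue.
POOL_EXHAUSTED = "could not find an available, non-overlapping IPv4 address pool"


class NetworkCreateError(RuntimeError):
    """Hypothetical error raised by the create callable when creation fails."""


def create_network_with_retry(
    create: Callable[[], str],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call create() until it succeeds or the attempts are used up.

    Only pool-exhaustion failures are retried, on the assumption that
    cleanup of older networks may free a subnet; other errors propagate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except NetworkCreateError as err:
            if POOL_EXHAUSTED not in str(err):
                raise
            last_error = err
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                sleep(delay_seconds * attempt)
    raise last_error
```

Passing the creation step in as a callable keeps the sketch independent of how the network is actually created, and makes it easy to exercise with a fake that fails a few times before succeeding.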

dbreshears added this to the 9.0 milestone Sep 16, 2024
@danegsta
Member

@radical are we still seeing this issue with the latest DCP updates? We should be better about cleaning up resources as well as retrying to reconnect if we can't initially allocate a network subnet.

@radical
Member Author

radical commented Sep 19, 2024

> @radical are we still seeing this issue with the latest DCP updates?

I haven't seen this being hit in the last few days. We can close the issue after some time, since it still has entries in the last 7 days.

> We should be better about cleaning up resources as well as retrying to reconnect if we can't initially allocate a network subnet.

We are doing an explicit docker network prune -f at the beginning and end of each test run on Helix. Is that no longer needed?
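
The prune-before-and-after pattern described above could look roughly like this. It is a sketch only, not the actual Helix setup: the `pruned_networks` helper and its injectable `run` parameter are hypothetical, and the only command used is the `docker network prune -f` already mentioned.

```python
import subprocess
from contextlib import contextmanager

# The command quoted in the comment above.
PRUNE_COMMAND = ["docker", "network", "prune", "-f"]


@contextmanager
def pruned_networks(run=subprocess.run):
    """Prune unused Docker networks before and after the wrapped block.

    check=False is a deliberate choice here: a failed prune is treated
    as best-effort and should not fail the test run itself.
    """
    run(PRUNE_COMMAND, check=False)
    try:
        yield
    finally:
        # Runs even if the wrapped block raises.
        run(PRUNE_COMMAND, check=False)
```

A test driver would wrap its run in `with pruned_networks(): ...`. The `finally` clause ensures the second prune still happens when the tests fail.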

@danegsta
Member

It doesn't hurt to keep that prune in to be on the safe side; Docker is pretty limiting in the maximum number of default networks you can create.
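
To get a feel for how small that limit is, the following sketch counts the subnets available from Docker's default address pools. The pool list is an assumption based on Docker's commonly documented built-in defaults for local networks. A daemon configured with a custom `default-address-pools` setting will differ, so treat the result as an estimate.

```python
import ipaddress

# Assumed built-in defaults as (base range, prefix length of each subnet).
# This list is an assumption; check the daemon configuration on your host.
ASSUMED_DEFAULT_POOLS = [
    ("172.17.0.0/16", 16),
    ("172.18.0.0/16", 16),
    ("172.19.0.0/16", 16),
    ("172.20.0.0/14", 16),
    ("172.24.0.0/14", 16),
    ("172.28.0.0/14", 16),
    ("192.168.0.0/16", 20),
]


def count_subnets(pools):
    """Count how many subnets of the given size fit in each base range."""
    total = 0
    for base, size in pools:
        network = ipaddress.ip_network(base)
        total += sum(1 for _ in network.subnets(new_prefix=size))
    return total


TOTAL_SUBNETS = count_subnets(ASSUMED_DEFAULT_POOLS)  # 31 under these assumptions
```

Under these assumptions there is room for 31 networks in total, and the default bridge network typically already occupies one of them. A few dozen leaked networks are therefore enough to trigger the error in this issue.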

@radical
Member Author

radical commented Sep 19, 2024

What's the timeout for these networks getting cleaned up by dcp? Just trying to get a sense of how many networks would be too many (IOW, dcp hasn't been able to clean up) at the end of a test run of N minutes.

@danegsta
Member

The current default timeout for cleaning up resources is 2 minutes, but it can also be configured via an environment variable (DCP_SHUTDOWN_TIMEOUT_SECONDS sets the timeout in seconds).
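
One way to apply that setting from a test driver is to build the environment for the child process explicitly. This is a minimal sketch: the helper name `env_with_shutdown_timeout` is hypothetical, and only the variable name DCP_SHUTDOWN_TIMEOUT_SECONDS comes from the comment above.

```python
import os

# Variable name as given in the comment above.
SHUTDOWN_TIMEOUT_VAR = "DCP_SHUTDOWN_TIMEOUT_SECONDS"


def env_with_shutdown_timeout(seconds: int, base_env=None) -> dict:
    """Return a copy of the environment with the DCP shutdown timeout set.

    The caller's environment is left untouched; pass the result as the
    env argument when starting the app host or test process.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    env = dict(os.environ if base_env is None else base_env)
    env[SHUTDOWN_TIMEOUT_VAR] = str(seconds)
    return env
```

For example, `subprocess.run(["dotnet", "test"], env=env_with_shutdown_timeout(300))` would give cleanup five minutes instead of the default two, assuming the process started this way is the one that launches DCP.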


5 participants