Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dns-server] only return answers to the incoming question #6308

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

iximeow
Copy link
Member

@iximeow iximeow commented Aug 13, 2024

As noted in #4051, queries to the internal DNS server would get any records we have for a name in response, rather than only records matching the query incoming query type. That behavior is confusing, but worse, wrong.

Subtly, this is not actually a misbehavior I believe we can observe through trust_dns_resolver: the resolver from that crate includes its own CachingClient. As a side effect of upstream answers going through that caching client, incorrect Answers records are cached and only correct answers actually make it out to us as consumers of trust_dns_resolver.

But as is plenty clear in #4051, dig and other DNS clients can get incoherent answers!

Simple enough to fix: only return answers that are answers to the question we were asked.

As noted in #4051,
queries to the internal DNS server would get any records for a name in
response, rather than only records matching the query incoming query type.
That behavior is confusing, but worse, wrong.

Subtly, this is not actually a misbehavior I believe we can observe
through `trust_dns_resolver`: the resolver from that crate includes its
own `CachingClient`. As a side effect of upstream answers going
through that caching client, incorrect Answers sections are cached and
only correct answers actually make it out to us as consumers of
`trust_dns_resolver`.

But as is plenty clear in #4051, `dig` and other DNS clients can get
incoherent answers!

Simple enough to fix: only return answers that are answers to the
question we were asked.
@iximeow
Copy link
Member Author

iximeow commented Aug 13, 2024

errr... didn't realize i'd broken something in internal-dns with this change. back to draft with it.

@iximeow iximeow marked this pull request as draft August 13, 2024 15:52
@iximeow
Copy link
Member Author

iximeow commented Aug 13, 2024

ah, i think the test failures are in fact a bug in internal-dns that this exposes: we're sending an AAAA query for the domain which has a SRV record, we'd incorrectly respond with the SRV record, that would have the AAAA for the target in Additionals, and so there would be an ipv6 address for ipv6_lookup to return.

seems simple enough to fix?

@davepacheco
Copy link
Collaborator

If I'm understand right, you're saying: the DNS client code is incorrect because it wants to be doing an SRV lookup, but it's doing a AAAA lookup. But this is working by accident because the DNS server is returning the SRV records for the AAAA query. So fixing that would break all of our DNS consumers. Is that right?

And presumably the fix is to fix the DNS client to do an actual SRV lookup, which should continue to return the additional AAAA records?

That all sounds good. The only question I have is whether we need a transitional release where the server still has the old behavior. I think not because the way we do upgrades by shutting down the whole stack, by the time the new DNS servers come up, there can be nothing still using the old client behavior. It'd be worth double-checking that. If wicket or wicketd rely on this DNS client code, for example, it seems like it could be a problem during the mupdate.

@iximeow
Copy link
Member Author

iximeow commented Aug 13, 2024

So fixing that would break all of our DNS consumers. Is that right?

yes, though i wouldn't say all.. definitely would break service discovery over DNS (which is presumably most of it). NTP lookups, for example, seem like that should still work...

And presumably the fix is to fix the DNS client to do an actual SRV lookup, which should continue to return the additional AAAA records?

precisely.

whether we need a transitional release where the server still has the old behavior

i was wondering about that, since fixing resolution in internal-dns could be totally independent of changing the DNS server behavior. comprehensively, i think, i'd want to look for binaries that use our resolver crate and which we might expect to talk to neighboring internal DNS servers?

since the changes are both small i'm leaning towards a staged fix just to be very sure this won't cause an issue in the transition.

@iximeow
Copy link
Member Author

iximeow commented Aug 13, 2024

and with 338b7f6 picked onto this branch, tests all pass again. instead of "fix lookup_ipv6" i went for "remove lookup_ipv6 and change its uses", there.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I thin you'll have to sync up with main to pick up the hickory-dns update. Hopefully that won't break much here.

@@ -268,6 +268,19 @@ async fn handle_dns_message(
let mut additional_records = vec![];
let response_records = records
.into_iter()
.filter(|record| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine but I'm curious why you decided to filter here and not in the store?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly expecting that we would reasonably want related records in other cases. an immediate example might be returning AAAA records as additionals for an A query, or vice versa (admittedly tenuous, since there's not much reason A records would be anywhere.)

maybe more reasonably: this makes it a bit easier to detect (and warn?) about clients doing AAAA queries directly to the name for a SRV record.

if we weren't storing all the records for a name together, and filtering in the store could have saved some work, i'd be strongly in favor of filtering there.

// transparently correct the server misbehavior "server sent AAAA answers to
// an A query": the caching client will cache the extra record, then
// determine that there are no A records to respond with, and send a
// matching response.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "matching response" here? A response indicating no records?

Does this comment mean that even if we'd had this test before, it would have passed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"matching response" is an attempt to nod to the language a reader might find in RFC 1034 - in this case, yes, a response with all zero records matching the query. re-reading the comment, i could have phrased that better...

about the comment generally: if the test hinged on asserting that resolver.ipv4_lookup(name) returned no records when we've added only an AAAA record, it would pass on main. the test as written here, avoiding CachingClient would fail on main. (this was very confusing to discover, as i started by using ipv4_lookup and realizing i couldn't make this test fail as expected..)

assert_eq!(response.header().response_code(), ResponseCode::NoError);
assert_eq!(response.answers(), &[]);
assert_eq!(response.additionals(), &[]);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding a case with an SRV lookup that produces AAAAs with additional records so that we verify that's still working? edit: I guess we test this behavior already in srv_crud()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, exactly, srv_crud() already covers that. i would kind of like to assert that the AAAAs are in Additionals specifically, but i didn't want to go too wild with raw_dns_client_query off the bat.

@@ -479,3 +562,36 @@ async fn dns_records_list(
.map(|z| z.records)
.unwrap_or_else(HashMap::new))
}

/// issue a DNS query of `record_ty` records for `name`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I (we?) use traditional capitalization. I see a lot of this file already doesn't use caps or punctuation, but that seems to be mostly the older one-line comments.

dns-server/tests/basic_test.rs Outdated Show resolved Hide resolved
dns-server/tests/basic_test.rs Outdated Show resolved Hide resolved
let stream = UdpClientStream::<tokio::net::UdpSocket>::new(resolver_addr);
let (mut trust_client, bg) = AsyncClient::connect(stream).await.unwrap();

tokio::spawn(bg);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cancel and and then wait for this after we do the query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we'd get anything for the effort to do that: bg does need to be polled, since that's what drives trust_client's I/O. so if we held bg and cancelled it after trust_client.query(), i'm pretty sure the query would just time out .. :D

so we'd need to thread some signal through that the query is done and we should tear down bg. but dropping trust_client has a really similar outcome in getting DnsExchangeBackground shutdown here.

is there state you're thinking might persist if bg were to leak?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just worried about leaking those tasks (and bg itself) every time we run a query, plus general principles -- it makes me nervous when stuff kicks stuff off in the background and forgets about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fair. i'm less concerned about that here because this is only used in the context of some individual test, and the whole executor gets torn down when the test finishes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants