Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] Deduplicate spans based upon their hashcode #6009

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

cdanis
Copy link

@cdanis cdanis commented Sep 23, 2024

Which problem is this PR solving?

Description of the changes

  • Rename the Zipkin-legacy span "deduper" to a less confusing name
  • Add a real span deduper that deduplicates based on span hashcodes

How was this change tested?

  • Unit tested

Checklist

@cdanis cdanis requested a review from a team as a code owner September 23, 2024 14:33
@dosubot dosubot bot added the enhancement label Sep 23, 2024
@cdanis cdanis force-pushed the dupe-dedupe branch 2 times, most recently from fd74590 to 888ca59 Compare September 23, 2024 16:52
@cdanis
Copy link
Author

cdanis commented Sep 23, 2024

@yurishkuro as discussed! let me know if you want anything else changed

@cdanis cdanis force-pushed the dupe-dedupe branch 2 times, most recently from fa8c498 to 8022f67 Compare September 23, 2024 17:12
Copy link

codecov bot commented Sep 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.80%. Comparing base (b463a46) to head (a5c29f0).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6009   +/-   ##
=======================================
  Coverage   96.79%   96.80%           
=======================================
  Files         348      349    +1     
  Lines       16559    16571   +12     
=======================================
+ Hits        16029    16041   +12     
  Misses        342      342           
  Partials      188      188           
Flag Coverage Δ
badger_v1 8.01% <0.00%> (-0.02%) ⬇️
badger_v2 1.82% <0.00%> (-0.01%) ⬇️
cassandra-4.x-v1 16.59% <0.00%> (-0.03%) ⬇️
cassandra-4.x-v2 1.75% <0.00%> (-0.01%) ⬇️
cassandra-5.x-v1 16.59% <0.00%> (-0.03%) ⬇️
cassandra-5.x-v2 1.75% <0.00%> (-0.01%) ⬇️
elasticsearch-6.x-v1 18.74% <0.00%> (-0.04%) ⬇️
elasticsearch-7.x-v1 18.80% <0.00%> (-0.03%) ⬇️
elasticsearch-8.x-v1 19.01% <0.00%> (-0.02%) ⬇️
elasticsearch-8.x-v2 1.81% <0.00%> (-0.02%) ⬇️
grpc_v1 9.50% <3.84%> (-0.02%) ⬇️
grpc_v2 7.14% <3.84%> (-0.02%) ⬇️
kafka-v1 9.72% <3.84%> (-0.02%) ⬇️
kafka-v2 1.82% <0.00%> (-0.01%) ⬇️
memory_v2 1.82% <0.00%> (-0.01%) ⬇️
opensearch-1.x-v1 18.85% <0.00%> (-0.04%) ⬇️
opensearch-2.x-v1 18.86% <0.00%> (-0.03%) ⬇️
opensearch-2.x-v2 1.81% <0.00%> (-0.02%) ⬇️
tailsampling-processor 0.46% <0.00%> (-0.01%) ⬇️
unittests 95.28% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cdanis cdanis changed the title Deduplicate spans based upon their spanID [bugfix] Deduplicate spans based upon their hashcode Sep 23, 2024
Signed-off-by: Chris Danis <[email protected]>
Signed-off-by: Chris Danis <[email protected]>
Signed-off-by: Chris Danis <[email protected]>
Signed-off-by: Chris Danis <[email protected]>
Signed-off-by: Chris Danis <[email protected]>
Signed-off-by: Chris Danis <[email protected]>
@cdanis
Copy link
Author

cdanis commented Sep 23, 2024

PTAL :D

func (d *spanHashDeduper) groupSpansByHash() {
spansByHash := make(map[uint64][]*model.Span)
for _, span := range d.trace.Spans {
hash, _ := model.HashCode(span)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall there was some issue in the past with hash function requiring proper ordering of tags or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is still the case:

func (s *Span) Hash(w io.Writer) (err error) {
	// gob is not the most efficient way, but it ensures we don't miss any fields.
	// See BenchmarkSpanHash in span_test.go
	enc := gob.NewEncoder(w)
	return enc.Encode(s)
}

You can only compare hashes if the tags are properly sorted. Surprisingly, we don't have an explicit adjuster for that, but some of that sorting happens in ip_tag.go and sort_log_fields.go adjusters. Perhaps it would make sense to pull the sorting explicitly into SortTagsAndLogs adjuster and makes sure it is called before DedupeBySpanHash adjuster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: ElasticSearch storage not deduplicating multiply-archived spans
2 participants