
Streaming data deduplication #265

Open
sridharpattem opened this issue Nov 11, 2018 · 3 comments


@sridharpattem

Hi,
Is it possible to check for duplicates within an unbounded streaming data set, not checking against another static data source but against the data that has streamed so far?

The flow is as follows.

Source Database ---> CDC ---> Kafka ---> Stream Processing (invoke Duke for duplicate check) ---> Target Database

I would like to build the index as data streams in from the CDC, keep adding new data to the index, and at the same time search the index for each incoming message. What is the right way to do this? Or do we always need at least two static data sets to find duplicates?

Thank you.

@larsga
Owner

larsga commented Nov 11, 2018

Yes, this is possible, if you index the records as you process them. The most efficient approach is to take some batch of records (say 10,000 records), index them all, commit the index, then search for duplicates. The API has methods for this.
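
Roughly, the loop looks like this (a sketch only; the config file name and the nextRecordFromStream() stub are placeholders, and you should check the exact method signatures against the Duke version you are using):

```java
// Sketch of batched streaming deduplication with Duke; the Kafka/CDC adapter is a stub.
import java.util.ArrayList;
import java.util.Collection;

import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Processor;
import no.priv.garshol.duke.Record;
import no.priv.garshol.duke.matchers.AbstractMatchListener;

public class StreamingDedup {
  private static final int BATCH_SIZE = 10000;

  public static void main(String[] args) throws Exception {
    Configuration config = ConfigLoader.load("dedup-config.xml"); // your Duke config
    Processor processor = new Processor(config);

    // report matches as they are found
    processor.addMatchListener(new AbstractMatchListener() {
      @Override
      public void matches(Record r1, Record r2, double confidence) {
        System.out.println("Possible duplicate, confidence " + confidence);
      }
    });

    Collection<Record> batch = new ArrayList<Record>();
    Record record;
    while ((record = nextRecordFromStream()) != null) { // hypothetical Kafka/CDC adapter
      batch.add(record);
      if (batch.size() >= BATCH_SIZE) {
        // indexes the batch, commits, then matches it against everything indexed so far
        processor.deduplicate(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty())
      processor.deduplicate(batch);
    processor.close();
  }

  // stub: convert incoming Kafka/CDC messages into Duke Record objects here
  private static Record nextRecordFromStream() {
    return null;
  }
}
```

Because the index is kept between calls, each new batch is matched against all records seen so far, not just the current batch.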

@ashubitm

I have a similar scenario where I have to dedupe records arriving as streams against Couchbase data as quickly as possible.
Is there a Couchbase data source that can use the index and call findCandidateMatches() against Couchbase for quick deduplication?

@uderline

Hi @sridharpattem

I had the same issue with a data flow DB -> NiFi -> Logstash -> Elastic.
I basically made an Elasticsearch plugin. If you want an idea of how to implement Duke in your code, feel free to take a look: https://github.com/minibigio/miniduke/blob/d0b51619cf2f080348a2f17f6c7932ce3617f89c/src/main/java/io/minibig/miniduke/ingest/MinidukeProcessor.java#L143
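
Roughly, the idea there (simplified, not the plugin's exact code; method names may differ between Duke versions) is to search the index for candidates before adding each incoming record:

```java
// Simplified sketch of per-record matching, assuming Duke's Database API
// (index/commit/findCandidateMatches) and Processor.compare().
import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Database;
import no.priv.garshol.duke.Processor;
import no.priv.garshol.duke.Record;

public class PerRecordMatcher {
  private final Configuration config;
  private final Processor processor;
  private final Database database;

  public PerRecordMatcher(String configFile) throws Exception {
    config = ConfigLoader.load(configFile);  // your Duke XML configuration
    processor = new Processor(config);
    database = processor.getDatabase();      // the index Duke maintains (Lucene by default)
  }

  // Call once per incoming message, after converting it to a Duke Record.
  public void process(Record incoming) {
    // compare against everything indexed so far
    for (Record candidate : database.findCandidateMatches(incoming)) {
      double confidence = processor.compare(incoming, candidate);
      if (confidence >= config.getThreshold())
        System.out.println("Possible duplicate, confidence " + confidence);
    }
    // then add the new record so later messages are checked against it
    database.index(incoming);
    database.commit(); // committing on every message is slow; batch the commits in practice
  }
}
```

Searching before indexing also avoids matching each record against itself.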

Good luck!

@larsga I have a question concerning the batch size. If you don't have any idea how many records you are going to receive, what value do you assign? How much does it matter if the batch size is too high?

Thanks
