
RFC: Inserter Service Requirements #68

Open · wants to merge 3 commits into main

Conversation

fuziontech
Member

Draft RFC for requirements, reasoning, and phases of development for an Inserter Service that would be used to power, among other things, the new SQL Interface.

@yakkomajuri left a comment


Not much to say here - I think this makes sense. The one point from a practical perspective is that we've been thinking about this over at Team Pipeline as well. We currently have no bandwidth to support another service, but likely will in a couple of months with the two new joiners.

Happy for you to move fast here and just hook onto the topics and build a Cloud-specific service in a way that unblocks the Experimentation team, but this is something we'll probably take over in the medium-term I suspect? Since it pertains to ingestion / pipeline. We might also bring you over to the dark side instead ;)

Comment on lines +31 to +32
- [ ] We will not materialize columns. We will use the new [`JSON` datatype](https://clickhouse.com/docs/en/sql-reference/data-types/json/)
- [ ] We will not use `Distributed` or `sharded` tables


These two things are not part of the ingestion pipeline anyway. Also, ingestion into tables with a sharded setup works the same as ingestion in a non-sharded one. I suspect you're laying out the requirements for the ClickHouse Cloud setup here rather than for the inserter service?

fuziontech (Member Author)


The schemas will be different between hosted ClickHouse and ClickHouse.cloud. This is saying that we will not be building or inserting into Distributed tables as we do currently. Think specifically about how the current ingestion pipeline inserts into posthog.events_writable. That pattern is not a goal here.
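As an illustrative sketch of the point above (table and column names are hypothetical, not from the RFC), the service would target a plain local MergeTree table with a `JSON` column directly, with no `Distributed` wrapper like `posthog.events_writable` in front of it:

```python
# Hypothetical DDL for the local table the inserter would target.
# Plain MergeTree with a JSON column: no materialized columns,
# and no Distributed engine sitting in front of it.
LOCAL_TABLE_DDL = """
CREATE TABLE events_local (
    uuid UUID,
    team_id Int64,
    properties JSON
) ENGINE = MergeTree()
ORDER BY (team_id, uuid)
"""

def insert_statement(table: str, columns: list[str]) -> str:
    """Build an INSERT that writes straight to the local table."""
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES"
```

The only point of the sketch is the insert target: the local table itself, with no `events_writable`-style indirection.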


It’s time now for us to move on from our friend the `kafka` table engine and move this functionality outside of ClickHouse.

## Why?


agree with every why here, thanks for the section

fuziontech (Member Author)


😎
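To make the "move this functionality outside of ClickHouse" idea concrete, here is a minimal, illustrative sketch of the batching logic an external inserter would own instead of the `kafka` table engine (the threshold is made up; a real service would also flush on a timer and commit Kafka offsets only after a successful insert):

```python
def batches(messages, max_rows=1000):
    """Group consumed messages into insert-sized batches.

    ClickHouse prefers a few large inserts over many small ones,
    which is the main job the `kafka` table engine was doing
    inside the database.
    """
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the remainder on shutdown
```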

@fuziontech (Member Author)

> Happy for you to move fast here and just hook onto the topics and build a Cloud-specific service in a way that unblocks the Experimentation team, but this is something we'll probably take over in the medium-term I suspect? Since it pertains to ingestion / pipeline. We might also bring you over to the dark side instead ;)

I appreciate your ambition to expand the scope of your team by increasing the number of services that it is responsible for. I don't believe a team should simply annex a project because the name of the team warrants it. I would point to Conway's law here and suggest that having this not be owned by the ingestion / pipeline team might be a good idea.

> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

HBS and MIT have also found evidence to support this:

> product developed by the loosely-coupled organization is significantly more modular than the product from the tightly-coupled organization

So in the end I would suggest we aim to reduce the scope of Ingestion / Pipelines into smaller pieces (events, plugins, scheduled/async tasks, etc.). This would help us develop much clearer interfaces between each piece, for the inevitable day we want to untangle the bowl of noodles and replace components in place.

@timgl

timgl commented Nov 8, 2022

> I don't believe a team should simply annex a project because the name of the team warrants it. I would point to Conway's law here and suggest that having this not be owned by the ingestion / pipeline team might be a good idea.

Which other team would own this?

@fuziontech (Member Author)

> Which other team would own this?

Why does a team need to own this? This could be a new team of one (me) doing DB integrations.
I just want to make sure we don't end up back where we were, with unnecessarily large teams.

> Since it pertains to ingestion / pipeline. We might also bring you over to the dark side instead ;)

I don't think this will work.

  • I'm not in team ingestion's timezone
  • I don't think attending your sprint planning or stand-ups would be terribly relevant
  • This will just end up becoming tightly coupled

Let me flip the question back on you: Why should this be owned by team pipelines? Why not infrastructure? Where are the boundaries of what each team is responsible for?

@macobo (Contributor) left a comment


I think there's a real cost to this project:

  • makes self-hosting more costly
  • for months we'll have more errors and work due to this service rather than less
  • little immediate payoff for pipeline robustness - we could be handling typing errors better in plugin-server

@fuziontech (Member Author)

@macobo I totally think your points are reasonable too. This entire thing adds complexity. I hate complexity.

> • makes self-hosting more costly

This would not be shipped with self-hosted for now. It exists entirely to support the SQL interface on Cloud; all normal traffic would still go through the existing pipelines.

> • for months we'll have more errors and work due to this service rather than less

This would be 💯 true if we were cutting the existing pipeline over to this, but this will only support data available in the SQL interface.

> • little immediate payoff for pipeline robustness - we could be handling typing errors better in plugin-server

I think this is actually worth doing by itself, and I consider it somewhat of a requirement for this to work. I really want the interface to be clean between the plugin server and both this service and ClickHouse (or whatever else is consuming from Kafka). Payloads on Kafka should be strongly typed and validated before being produced to a topic (in most cases). This means that before the plugin server produces to Kafka for the inserter service or ClickHouse, it should enforce some sort of schema and validate that the payload complies with it, whether that is JSON Schema or Proto or Avro or whatever. I really believe it will make our lives easier for both development and operations.

I hope this makes you feel a bit better about this all. I don't intend on this replacing our current pipeline anytime soon, maybe in the future it will...but not yet.
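The validate-before-produce idea above can be sketched with a hand-rolled check (in practice this would be JSON Schema, Proto, or Avro as the comment says; the field names and checks here are purely illustrative):

```python
# Illustrative required-field schema; a real deployment would use
# JSON Schema / Protobuf / Avro with a schema registry instead.
REQUIRED_FIELDS = {"uuid": str, "team_id": int, "event": str, "properties": dict}

def validate_event(payload: dict) -> list[str]:
    """Return schema violations; an empty list means the payload may be produced."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

def produce_if_valid(send, payload: dict) -> bool:
    """Only produce payloads that pass validation; reject the rest."""
    if validate_event(payload):
        return False  # a real service would log or dead-letter this
    send(payload)
    return True
```

The point is that the producer side enforces the contract, so every downstream consumer, the inserter service and ClickHouse alike, can trust what it reads off the topic.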
