
feat(alerts): add a regular job to detect anomalies #22762

Merged: 33 commits merged into PostHog:master on Aug 15, 2024

Conversation

@nikitaevg (Contributor) commented Jun 6, 2024

Problem

#14331

Changes

This PR adds an initial version of the alerts notifications job. In the next PRs I'll introduce

  1. Sending notifications if the alert calculation fails
  2. Metrics for the number of alerts and number of anomalous alerts
  3. UI warnings if users change the insight type to one that doesn't support alerts

Does this work well for both Cloud and self-hosted?

Probably

How did you test this code?

Automatic + manual testing

@nikitaevg nikitaevg marked this pull request as ready for review June 8, 2024 16:48
@nikitaevg (Contributor Author)

Hi @mariusandra, PTAL!

@@ -288,7 +288,7 @@ export const insightNavLogic = kea<insightNavLogicType>([
},
})),
urlToAction(({ actions }) => ({
'/insights/:shortId(/:mode)(/:subscriptionId)': (
'/insights/:shortId(/:mode)(/:itemId)': (

Comment on lines 25 to 26
if not insight.query:
    insight.query = filter_to_query(insight.filters)
Contributor Author

This looks a bit dirty; I wonder if there's a better way to do what I want here. I just want to get the aggregated_value for an insight.

IIUC there are two ways to represent an insight: one through filters (old) and one through query (new). When I create an insight locally, the old way is used. But I think it's better to use the new approach, so I convert the filters to a query. This is all mainly based on the compare_hogql_insights.py file.

Collaborator

Yeah, this is correct 👍 . Currently we still have several insights floating around that only have filters (and no query), but the plan is to migrate everything over eventually.
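
For context, a minimal sketch of the conversion discussed above, assuming PostHog's filter_to_query helper (the exact import path is an assumption):

# Hypothetical sketch: normalise an insight to the query (new) representation.
# The import path is assumed from this thread and may differ in the codebase.
from posthog.hogql_queries.legacy_compatibility.filter_to_query import filter_to_query


def ensure_query(insight):
    # Older insights only carry `filters`; newer ones carry `query`.
    # Convert on the fly so downstream code only has to handle `query`.
    if not insight.query:
        insight.query = filter_to_query(insight.filters)
    return insight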

@mariusandra (Collaborator) left a comment

The logic seems reasonable to me, and since it's behind a flag I think we can merge it as is.

However, I do have some concerns about the longer-term plan and would like to get a second opinion from @PostHog/team-product-analytics and also @benjackwhite (Hog question below) and @pauldambra (loves dashboard reload cron jobs)

  • Currently this will run once per hour at x:20 and immediately schedule a query for each alert. Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time. We will need to stagger them somehow (see the staggering sketch after this list). For example, cohort and dashboard reloads run more frequently, but only process the n oldest items per run, leading to eventual "good enough" consistency.
  • The problem with dashboard and cohort calculations is that nobody checks in on them. We periodically discover things have gotten worse when users complain. This will be worse once users start to rely on alerts for their business. We'd need to establish some practices around this, hence all the @ tagging above.
  • Finally, we're hard at work on Hog and our CDP. It would be really cool to hook alerts into this system. @benjackwhite any thoughts on how to build the bridge?
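
To make the staggering concern concrete, here is a hedged sketch (not this PR's implementation) that spreads the checks over the hour with a per-task countdown; check_alert_task is assumed from the discussion below and stubbed here:

from celery import shared_task

STAGGER_WINDOW_SECONDS = 55 * 60  # finish before the next hourly run starts


@shared_task
def check_alert_task(alert_id: int) -> None:
    ...  # stand-in for the per-alert check added in this PR


def schedule_all_alert_checks(alert_ids: list[int]) -> None:
    # Spread alert checks across the hour instead of firing them all at x:20.
    if not alert_ids:
        return
    step = STAGGER_WINDOW_SECONDS / len(alert_ids)
    for i, alert_id in enumerate(alert_ids):
        # countdown delays execution so ClickHouse sees a steady trickle of
        # queries rather than a thundering herd at the top of the hour.
        check_alert_task.si(alert_id).apply_async(countdown=int(i * step))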

<LemonField name="upper" label="Upper threshold">
<LemonField
name="upper"
label="Upper threshold "
Collaborator

nit:

Suggested change
label="Upper threshold "
label="Upper threshold"

Contributor Author

Oh, thanks, done


@mariusandra mariusandra requested review from pauldambra, benjackwhite and a team June 13, 2024 08:38
@benjackwhite (Contributor) left a comment

Love the work here, but given how important and tricky this will be I'd like to consider a more minimal solution with input from @PostHog/team-product-analytics to make sure this is something we can actually scale.

campaign_key = f"alert-anomaly-notification-{alert.id}-{timezone.now().timestamp()}"
insight_url = f"/project/{alert.team.pk}/insights/{alert.insight.short_id}"
alert_url = f"{insight_url}/alerts/{alert.id}"
message = EmailMessage(
Contributor

I think this is definitely not what we should do for a bunch of reasons:

  1. No way to configure rate of delivery, backoffs, etc.
  2. Email only is not the typical way people want to get alerted of this

We are building a new generic delivery system for the CDP (webhooks etc.) which would be the right place to have a destination and I think this could play into that.

I don't want to pour water on the fire that is getting this work done, as it's super cool 😅, but I know that we will immediately have configuration and scaling issues here that I'm not sure we want to support.

I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?

@nikitaevg (Contributor Author) commented Jun 13, 2024

Hmm, I'd argue here.

> No way to configure rate of delivery, backoffs, etc.

It's in my plans to allow changing the frequency of the notifications; you can check the TODO list here.

> Email only is not the typical way people want to get alerted of this

1. Users want email, Slack and webhooks. Why not start with email then?
2. Mixpanel provides email + Slack, Amplitude provides email and webhooks.
3. In my commercial experience, email was the way to notify about alerts.

IMO email is a good starting point: it's cheap, and it's also a necessary communication channel for this.

OK, I misinterpreted this in the first place; you're suggesting that email-only is not a typical way. I can't agree or disagree here, I don't know.

> I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?

I don't quite understand, what do you mean here? A screen of ongoing alerts? I'd argue that notifications are the most important part of the alerts module, and honestly I really wouldn't want to be blocked on the CDP development, especially given how cheap sending emails is. Once the CDP is launched, I don't think it'd be difficult to migrate, right? I'll do it myself when needed. OTOH, if it's planned to launch soon (this month), I could wait.

> I don't want to pour water on the fire that is getting this work done, as it's super cool

No worries at all, thanks for looking at this!


def check_all_alerts() -> None:
    alerts = Alert.objects.all().only("id")
    for alert in alerts:
Contributor

I don't know for sure, but this also feels like a scaling nightmare... We sometimes struggle to keep up with dashboard / insight refreshes in general, and this is another form of refresh, just with a higher demand on reliability. I think this would require strong coordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing; otherwise this will hit scaling issues fast.

Contributor Author

I don't know the internals of PostHog, but in my experience this is the way to do it. I don't have experience with Celery, but I have experience with similar tools; it should scale horizontally pretty easily: add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.
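
As a hedged illustration of the dedicated-queue idea (not this PR's configuration; the queue name, task path and rate are assumptions), Celery can route the alert task to its own queue and cap how fast it is consumed:

from celery import Celery

app = Celery("posthog")

# Route alert checks to their own queue so they can be scaled or throttled
# independently of other Celery traffic. Names are illustrative.
app.conf.task_routes = {
    "posthog.tasks.alerts.check_alert_task": {"queue": "alerts"},
}


@app.task(name="posthog.tasks.alerts.check_alert_task", rate_limit="10/m")
def check_alert_task(alert_id: int) -> None:
    # rate_limit caps how fast each worker consumes alert checks from the queue.
    ...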

Contributor

> I think this would require strong coordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing; otherwise this will hit scaling issues fast.

Just wanted to chime in here. I can take a look at this, but I'm currently busy being on support for this sprint. I'll see what we can do.

Collaborator

Scaling Celery is not the issue, but ClickHouse will struggle and ultimately go down if 1000 simultaneous queries suddenly appear.

Member

> should scale horizontally pretty easily: add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.

Yep, I was going to add that "should" is doing a lot of work in this sentence 😅

@webjunkie I'm too far removed from how the query code and caching interact here.

We already have one set of jobs that (is|should be) staying on top of having insight results readily available. Does this use that cache? We should really overlap them, so we have one set of tasks keeping a cache warm and another that reads the fast-access data in that cache for anomaly detection.

Humans aren't visiting insights once a minute, so we know this will generate sustained load.

We should totally, totally build this feature - it's long overdue.


I'm not opposed to getting a simple version in just for our team or select beta testers so we can validate the flow, but this 100% needs an internal sponsor, since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈).

I would love to be the internal sponsor, but it's both not possible and completely outside of my current wheelhouse.

(These concerns might be addressed elsewhere - I've not dug in here at all 🙈)

Contributor Author

> but ClickHouse will struggle and ultimately go down if 1000 simultaneous queries suddenly appear

Can't I limit the number of Celery queries in flight? I understand this will introduce a throughput problem, but then, if the servers can't process N alerts each hour, maybe more read replicas or more servers are needed. I don't have much experience with column-oriented databases though, so it's just speculation.

> We already have one set of jobs that (is|should be) staying on top of having insight results readily available. Does this use that cache?

🤷 Well, the query_runner has some "cache" substrings in its code, so one could assume so... but I don't know.

> Humans aren't visiting insights once a minute, so we know this will generate sustained load.

Just to clarify, it's once an hour.

> but this 100% needs an internal sponsor, since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈)

I completely agree, and I would be really happy to have a mentor on this task.

BTW, an interesting data point: Mixpanel limits the number of alerts to 50 per project.

Contributor

We will talk within @PostHog/team-product-analytics next week and discuss ownership and so on.

Member

> Humans aren't visiting insights once a minute, so we know this will generate sustained load.
> Just to clarify, it's once an hour.

👍

(Same point, but thanks for the clarification :))

@nikitaevg (Contributor Author) left a comment

Thanks for looking at it!

> Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time.

There's a way to set the maximum number of parallel tasks for Celery; I think that should help spread the load, no?

> This will be worse once users start to rely on alerts for their business. We'd need to establish some practices around this

I completely agree with that; it's not the final solution, just a skeleton. I'll need some help with this, but we need metrics and alerts about the job execution time to notice problems. I understand people will rely on alerts, and it should be reliable.
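
For reference, a minimal sketch of capping Celery parallelism on the worker side; the settings are standard Celery, but the values and queue name are illustrative assumptions:

from celery import Celery

app = Celery("posthog")
app.conf.worker_concurrency = 4          # at most 4 tasks in flight per worker
app.conf.worker_prefetch_multiplier = 1  # don't prefetch a large backlog of alert checks

# Roughly equivalent at the command line:
#   celery -A posthog worker -Q alerts --concurrency 4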



@pauldambra pauldambra removed their request for review June 19, 2024 08:02
@webjunkie webjunkie removed the stale label Jul 2, 2024
@PostHog PostHog deleted a comment from posthog-bot Jul 2, 2024
@webjunkie (Contributor) left a comment

Thanks for the contribution!

I think the general direction and scope of the PR are valid and something we can work with. The areas that need to be improved before this can be merged are the workings of the Celery task and the additional models and fields we need to sufficiently guide the execution.

I wrote up an RFC draft for how the Celery and model architecture could work:
PostHog/meta#216

Let me know if this helps or needs discussion (either here or in Slack).

Resolved review threads: posthog/models/alert.py, posthog/tasks/detect_alerts_anomalies.py, posthog/tasks/scheduled.py
@@ -24,7 +23,6 @@
@patch("ee.tasks.subscriptions.generate_assets")
@freeze_time("2022-02-02T08:55:00.000Z")
class TestSubscriptionsTasks(APIBaseTest):
    subscriptions: list[Subscription] = None  # type: ignore
Contributor Author

Just a redundant field

@nikitaevg nikitaevg requested a review from webjunkie July 2, 2024 19:59
@webjunkie (Contributor) left a comment

This looks fine now considering the scope, but needs work in subsequent PRs as discussed.

Comment on lines 79 to 80
# Note, check_alert_task is used in Celery chains. Celery chains pass the previous
# function call result to the next function as an argument, hence args and kwargs.
Contributor

You can use check_alert_task.si (for an immutable signature) above; then this doesn't happen/matter.

Contributor Author

Yeah, it worked, thanks!
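
A hedged sketch of the difference (generic task names, not this PR's exact code); task_always_eager is set only so the example runs without a broker:

from celery import Celery, chain

app = Celery("example")
app.conf.task_always_eager = True  # run tasks inline so this sketch needs no broker


@app.task
def compute_value(x: int) -> int:
    return x * 2


@app.task
def handle_result(*args, alert_id: int = 0) -> None:
    # With a mutable signature (.s) Celery prepends the previous task's return
    # value to args; with an immutable signature (.si) nothing is passed along.
    print(args, alert_id)


chain(compute_value.s(1), handle_result.s(alert_id=42))()    # expected: (2,) 42
chain(compute_value.si(1), handle_result.si(alert_id=42))()  # expected: () 42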

@nikitaevg (Contributor Author)

@webjunkie Could you please merge this?

@nikitaevg (Contributor Author)

@Twixes it looks like Julian is unreachable; could you please merge this given Julian's approval?

@posthog-bot (Contributor)

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

@nikitaevg (Contributor Author)

@webjunkie, gentle reminder, could you please merge this?

@posthog-bot posthog-bot removed the stale label Aug 6, 2024
# Conflicts:
#	frontend/src/scenes/insights/insightSceneLogic.tsx
@benjackwhite (Contributor) left a comment

A lot of hanging TODOs that I believe should be removed, and a couple of smaller comments.

@@ -4370,6 +4370,7 @@ export type HogFunctionInvocationGlobals = {
>
}

// TODO: move to schema.ts
Contributor

TODO?

Contributor Author

I was planning to do it in a follow-up PR, but yeah, I can fix it here. Done.

@@ -0,0 +1,10 @@
{% extends "email/base.html" %} {% load posthog_assets %} {% block section %}
<p>
Uh-oh, the <a href="{% absolute_uri alert_url %}">{{ alert_name }}</a> alert detected following anomalies for <a href="{% absolute_uri insight_url %}">{{ insight_name }}</a>:
Contributor

Not blocking, but "uh-oh" feels unnecessarily negative. The alert could be a positive thing.

Contributor Author

Removed 👍

check_alert(alert_id)


# TODO: make it a task
Contributor

I don't think this needs to be a task. The .send function by default queues a Celery task for the actual sending.

Contributor Author

It makes sense, thanks, removed


@nikitaevg (Contributor Author)

@webjunkie, a gentle reminder: could you please take a look?

@nikitaevg (Contributor Author)

@webjunkie do you know why all E2E tests might fail with the "Error: missing API token, please run depot login" error after merging with master?

@nikitaevg (Contributor Author)

@benjackwhite could you please review this?

@webjunkie webjunkie dismissed benjackwhite’s stale review August 15, 2024 08:26

Dismissing after Slack discussion (and Ben OOO today)

@webjunkie webjunkie merged commit 5ae29f2 into PostHog:master Aug 15, 2024
90 of 92 checks passed