
feat(alerts): add a regular job to detect anomalies #22762

Merged: 33 commits merged into PostHog:master on Aug 15, 2024

Conversation

@nikitaevg (Contributor) commented Jun 6, 2024

Problem

#14331

Changes

This PR adds an initial version of the alerts notifications job. In the next PRs I'll introduce

  1. Sending notifications if the alert calculation fails
  2. Metrics for the number of alerts and number of anomalous alerts
  3. UI warnings if users change the insight type to one that doesn't support alerts

Does this work well for both Cloud and self-hosted?

Probably

How did you test this code?

Automatic + manual testing

@nikitaevg nikitaevg marked this pull request as ready for review June 8, 2024 16:48
@nikitaevg (Contributor Author)

Hi @mariusandra, PTAL!

@@ -288,7 +288,7 @@ export const insightNavLogic = kea<insightNavLogicType>([
},
})),
urlToAction(({ actions }) => ({
'/insights/:shortId(/:mode)(/:subscriptionId)': (
'/insights/:shortId(/:mode)(/:itemId)': (

Comment on lines 25 to 26
if not insight.query:
    insight.query = filter_to_query(insight.filters)
Contributor Author

This looks a bit dirty; I wonder if there's a better way to do what I want here. I just want to get the aggregated_value for an insight.

IIUC there are two ways to represent an insight: one through filters (old) and one through query (new). When I create an insight locally, the old way is used. But I think it's better to use the new approach, so I convert the filters to a query. This is all mainly based on the compare_hogql_insights.py file.

Collaborator

Yeah, this is correct 👍 . Currently we still have several insights floating around that only have filters (and no query), but the plan is to migrate everything over eventually.
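
For context, a minimal sketch of the conversion discussed above, assuming PostHog's filter_to_query helper (the exact import path is an assumption):

# Hypothetical sketch: normalise an insight to the query (new) representation.
# The import path is assumed from this thread and may differ in the codebase.
from posthog.hogql_queries.legacy_compatibility.filter_to_query import filter_to_query


def ensure_query(insight):
    # Older insights only carry `filters`; newer ones carry `query`.
    # Convert on the fly so downstream code only has to handle `query`.
    if not insight.query:
        insight.query = filter_to_query(insight.filters)
    return insight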

@mariusandra (Collaborator) left a comment

The logic seems reasonable to me, and since it's behind a flag I think we can merge it as is.

However, I do have some concerns about the longer-term plan and would like to get a second opinion from @PostHog/team-product-analytics and also @benjackwhite (Hog question below) and @pauldambra (loves dashboard reload cron jobs)

  • Currently this will run once per hour at x:20 and immediately schedule a query for each alert. Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time. We will need to stagger them somehow (see the staggering sketch after this list). For example, cohort and dashboard reloads run more frequently, but only process the n oldest items per run, leading to eventual "good enough" consistency.
  • The problem with dashboard and cohort calculations is that nobody checks in on them. We periodically discover things have gotten worse when users complain. This will be worse once users start to rely on alerts for their business. We'd need to establish some practices around this, hence all the @ tagging above.
  • Finally, we're hard at work on Hog and our CDP. It would be really cool to hook alerts into this system. @benjackwhite any thoughts on how to build the bridge?
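
To make the staggering concern concrete, here is a hedged sketch (not this PR's implementation) that spreads the checks over the hour with a per-task countdown; check_alert_task is assumed from the discussion below and stubbed here:

from celery import shared_task

STAGGER_WINDOW_SECONDS = 55 * 60  # finish before the next hourly run starts


@shared_task
def check_alert_task(alert_id: int) -> None:
    ...  # stand-in for the per-alert check added in this PR


def schedule_all_alert_checks(alert_ids: list[int]) -> None:
    # Spread alert checks across the hour instead of firing them all at x:20.
    if not alert_ids:
        return
    step = STAGGER_WINDOW_SECONDS / len(alert_ids)
    for i, alert_id in enumerate(alert_ids):
        # countdown delays execution so ClickHouse sees a steady trickle of
        # queries rather than a thundering herd at the top of the hour.
        check_alert_task.si(alert_id).apply_async(countdown=int(i * step))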

<LemonField name="upper" label="Upper threshold">
<LemonField
name="upper"
label="Upper threshold "
Collaborator

nit:

Suggested change
label="Upper threshold "
label="Upper threshold"

Contributor Author

Oh, thanks, done


@mariusandra mariusandra requested review from pauldambra, benjackwhite and a team June 13, 2024 08:38
@benjackwhite (Contributor) left a comment

Love the work here, but given how important and tricky this will be I'd like to consider a more minimal solution with input from @PostHog/team-product-analytics to make sure this is something we can actually scale.

campaign_key = f"alert-anomaly-notification-{alert.id}-{timezone.now().timestamp()}"
insight_url = f"/project/{alert.team.pk}/insights/{alert.insight.short_id}"
alert_url = f"{insight_url}/alerts/{alert.id}"
message = EmailMessage(
Contributor

I think this is definitely not what we should do for a bunch of reasons:

  1. No way to configure rate of delivery, backoffs, etc.
  2. Email only is not the typical way people want to get alerted of this

We are building a new generic delivery system for the CDP (webhooks etc.) which would be the right place to have a destination and I think this could play into that.

I don't want to pour water on the fire that is getting this work done, as it's super cool 😅, but I know that we will immediately have configuration and scaling issues here that I'm not sure we want to support.

I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?

@nikitaevg (Contributor Author) commented Jun 13, 2024

Hmm, I'd argue here.

> No way to configure rate of delivery, backoffs, etc.

It's in my plans to allow changing the frequency of the notifications; you can check the TODO list here.

> Email only is not the typical way people want to get alerted of this

1. Users want email, Slack and webhooks. Why not start with email then?
2. Mixpanel provides email + Slack, Amplitude provides email and webhooks.
3. In my commercial experience, email was the way to notify about alerts.

IMO email is a good starting point: it's cheap, and it's also a necessary communication channel for this.

OK, I misinterpreted this in the first place; you're suggesting that email-only is not a typical way. I can't agree or disagree here, I don't know.

> I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?

I don't quite understand, what do you mean here? A screen of ongoing alerts? I'd argue that notifications are the most important part of the alerts module, and honestly I really wouldn't want to be blocked on the CDP development, especially given how cheap sending emails is. Once the CDP is launched, I don't think it'd be difficult to migrate, right? I'll do it myself when needed. OTOH, if it's planned to launch soon (this month), I could wait.

> I don't want to pour water on the fire that is getting this work done, as it's super cool

No worries at all, thanks for looking at this!


def check_all_alerts() -> None:
    alerts = Alert.objects.all().only("id")
    for alert in alerts:
Contributor

I don't know for sure, but this also feels like a scaling nightmare... We sometimes struggle to keep up with dashboard / insight refreshes in general, and this is another form of refresh, just with a higher demand on reliability. I think this would require strong coordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing; otherwise this will hit scaling issues fast.

Contributor Author

I don't know the internals of PostHog, but in my experience this is the way to do it. I don't have experience with Celery, but I have experience with similar tools; it should scale horizontally pretty easily: add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.
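
As a hedged illustration of the dedicated-queue idea (not this PR's configuration; the queue name, task path and rate are assumptions), Celery can route the alert task to its own queue and cap how fast it is consumed:

from celery import Celery

app = Celery("posthog")

# Route alert checks to their own queue so they can be scaled or throttled
# independently of other Celery traffic. Names are illustrative.
app.conf.task_routes = {
    "posthog.tasks.alerts.check_alert_task": {"queue": "alerts"},
}


@app.task(name="posthog.tasks.alerts.check_alert_task", rate_limit="10/m")
def check_alert_task(alert_id: int) -> None:
    # rate_limit caps how fast each worker consumes alert checks from the queue.
    ...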

Contributor

> I think this would require strong coordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing; otherwise this will hit scaling issues fast.

Just wanted to chime in here. I can take a look at this, but I'm currently busy being on support for this sprint. I'll see what we can do.

Collaborator

Scaling Celery is not the issue, but ClickHouse will struggle and ultimately go down if 1000 simultaneous queries suddenly appear.

Member

> should scale horizontally pretty easily: add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.

Yep, I was going to add that "should" is doing a lot of work in this sentence 😅

@webjunkie I'm too far removed from how the query code and caching interact here.

We already have one set of jobs that (is|should be) staying on top of having insight results readily available. Does this use that cache? We should really overlap them, so we have one set of tasks keeping a cache warm and another that reads the fast-access data in that cache for anomaly detection.

Humans aren't visiting insights once a minute, so we know this will generate sustained load.

We should totally, totally build this feature - it's long overdue.


I'm not opposed to getting a simple version in just for our team or select beta testers so we can validate the flow, but this 100% needs an internal sponsor, since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈).

I would love to be the internal sponsor, but it's both not possible and completely outside of my current wheelhouse.

(These concerns might be addressed elsewhere - I've not dug in here at all 🙈)

Contributor Author

> but ClickHouse will struggle and ultimately go down if 1000 simultaneous queries suddenly appear

Can't I limit the number of Celery queries in flight? I understand this will introduce a throughput problem, but then, if the servers can't process N alerts each hour, maybe more read replicas or more servers are needed. I don't have much experience with column-oriented databases though, so it's just speculation.

> We already have one set of jobs that (is|should be) staying on top of having insight results readily available. Does this use that cache?

🤷 Well, the query_runner has some "cache" substrings in its code, so one could assume so... but I don't know.

> Humans aren't visiting insights once a minute, so we know this will generate sustained load.

Just to clarify, it's once an hour.

> but this 100% needs an internal sponsor, since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈)

I completely agree, and I would be really happy to have a mentor on this task.

BTW, an interesting data point: Mixpanel limits the number of alerts to 50 per project.

Contributor

We will talk within @PostHog/team-product-analytics next week and discuss ownership and so on.

Member

> Humans aren't visiting insights once a minute, so we know this will generate sustained load.
> Just to clarify, it's once an hour.

👍

(Same point, but thanks for the clarification :))

@nikitaevg (Contributor Author) left a comment

Thanks for looking at it!

> Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time.

There's a way to set the maximum number of parallel tasks for Celery; I think that should help spread the load, no?

> This will be worse once users start to rely on alerts for their business. We'd need to establish some practices around this

I completely agree with that; it's not the final solution, just a skeleton. I'll need some help with this, but we need metrics and alerts about the job execution time to notice problems. I understand people will rely on alerts, and it should be reliable.
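
For reference, a minimal sketch of capping Celery parallelism on the worker side; the settings are standard Celery, but the values and queue name are illustrative assumptions:

from celery import Celery

app = Celery("posthog")
app.conf.worker_concurrency = 4          # at most 4 tasks in flight per worker
app.conf.worker_prefetch_multiplier = 1  # don't prefetch a large backlog of alert checks

# Roughly equivalent at the command line:
#   celery -A posthog worker -Q alerts --concurrency 4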



@pauldambra pauldambra removed their request for review June 19, 2024 08:02
@webjunkie webjunkie removed the stale label Jul 2, 2024
@PostHog PostHog deleted a comment from posthog-bot Jul 2, 2024
@webjunkie (Contributor) left a comment

Thanks for the contribution!

I think the general direction and scope of the PR are valid and something we can work with. The areas that need to be improved before this can be merged are the workings of the Celery task and the additional models and fields we need to sufficiently guide the execution.

I wrote up an RFC draft for how the Celery and model architecture could work:
PostHog/meta#216

Let me know if this helps or needs discussion (either here or in Slack).

Resolved review threads: posthog/models/alert.py, posthog/tasks/detect_alerts_anomalies.py, posthog/tasks/scheduled.py
@@ -24,7 +23,6 @@
@patch("ee.tasks.subscriptions.generate_assets")
@freeze_time("2022-02-02T08:55:00.000Z")
class TestSubscriptionsTasks(APIBaseTest):
    subscriptions: list[Subscription] = None  # type: ignore
Contributor Author

Just a redundant field

@nikitaevg nikitaevg requested a review from webjunkie July 2, 2024 19:59
@webjunkie (Contributor) left a comment

This looks fine now considering the scope, but needs work in subsequent PRs as discussed.

Comment on lines 79 to 80
# Note, check_alert_task is used in Celery chains. Celery chains pass the previous
# function call result to the next function as an argument, hence args and kwargs.
Contributor

You can use check_alert_task.si (for an immutable signature) above; then this doesn't happen/matter.

Contributor Author

Yeah, it worked, thanks!
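
A hedged sketch of the difference (generic task names, not this PR's exact code); task_always_eager is set only so the example runs without a broker:

from celery import Celery, chain

app = Celery("example")
app.conf.task_always_eager = True  # run tasks inline so this sketch needs no broker


@app.task
def compute_value(x: int) -> int:
    return x * 2


@app.task
def handle_result(*args, alert_id: int = 0) -> None:
    # With a mutable signature (.s) Celery prepends the previous task's return
    # value to args; with an immutable signature (.si) nothing is passed along.
    print(args, alert_id)


chain(compute_value.s(1), handle_result.s(alert_id=42))()    # expected: (2,) 42
chain(compute_value.si(1), handle_result.si(alert_id=42))()  # expected: () 42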

@nikitaevg (Contributor Author)

@webjunkie Could you please merge this?

@nikitaevg (Contributor Author)

@Twixes it looks like Julian is unreachable; could you please merge this given Julian's approval?

@posthog-bot (Contributor)

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

@nikitaevg (Contributor Author)

@webjunkie, gentle reminder, could you please merge this?

@posthog-bot posthog-bot removed the stale label Aug 6, 2024
# Conflicts:
#	frontend/src/scenes/insights/insightSceneLogic.tsx
@benjackwhite (Contributor) left a comment

A lot of hanging TODOs that I believe should be removed, and a couple of smaller comments.

@@ -4370,6 +4370,7 @@ export type HogFunctionInvocationGlobals = {
>
}

// TODO: move to schema.ts
Contributor

TODO?

Contributor Author

I was planning to do it in a follow-up PR, but yeah, I can fix it here. Done.

@@ -0,0 +1,10 @@
{% extends "email/base.html" %} {% load posthog_assets %} {% block section %}
<p>
Uh-oh, the <a href="{% absolute_uri alert_url %}">{{ alert_name }}</a> alert detected following anomalies for <a href="{% absolute_uri insight_url %}">{{ insight_name }}</a>:
Contributor

Not blocking, but "uh-oh" feels unnecessarily negative. The alert could be a positive thing.

Contributor Author

Removed 👍

check_alert(alert_id)


# TODO: make it a task
Contributor

I don't think this needs to be a task. The .send function by default queues a Celery task for the actual sending.

Contributor Author

It makes sense, thanks, removed


@nikitaevg (Contributor Author)

@webjunkie, a gentle reminder: could you please take a look?

@nikitaevg (Contributor Author)

@webjunkie do you know why all E2E tests might fail with the "Error: missing API token, please run depot login" error after merging with master?

@nikitaevg (Contributor Author)

@benjackwhite could you please review this?

@webjunkie webjunkie dismissed benjackwhite’s stale review August 15, 2024 08:26

Dismissing after Slack discussion (and Ben OOO today)

@webjunkie webjunkie merged commit 5ae29f2 into PostHog:master Aug 15, 2024
90 of 92 checks passed