
Add alerting models, task architecture RFC #216

Open · wants to merge 1 commit into base: main
Conversation

webjunkie

No description provided.

- **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses.

2. **Task Execution**:
- Use Celery to handle periodic checks and notifications.
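For illustration only, a minimal sketch of how the periodic scheduling could be wired up with Celery beat (the task name, module path, and interval are assumptions, not part of the RFC):

```python
from celery import shared_task
from celery.schedules import crontab

@shared_task
def check_all_alerts() -> None:
    # Illustrative entry point: look up enabled alerts and fan out one
    # check_alert task (or the group/chain layout discussed below) per alert.
    ...

# Hypothetical beat schedule entry; in practice this would live in the
# existing Celery configuration (e.g. app.conf.beat_schedule).
BEAT_SCHEDULE = {
    "check-alerts": {
        "task": "posthog.tasks.alerts.check_all_alerts",  # assumed module path
        "schedule": crontab(minute="*/15"),  # e.g. every 15 minutes
    },
}
```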
Member

just checking whether it should be temporal rather than celery (i don't even know how we'd choose 🙈)

alert_id_groups = [alert_ids[i:i+10] for i in range(0, len(alert_ids), 10)]

group_of_chains = group(
    chain(check_alert.chunks(group, 10)) for group in alert_id_groups
)
Member

when we do implement this it definitely needs a comment - i think i understand what it's doing but i bet i'm wrong 🙈

That's my concern as well, it looks a bit difficult, and that's why I suggest for + limit_concurrency

Member

we already have a set of scheduled tasks which try and keep the cache warm and we're not monitoring them well (IMHO as the person who left them not well monitored last time they were changed)

this is effectively the same mechanism for a different reason

we could overlap these mechanisms so that caching overall gets better too
but we should definitely learn from the past and make sure that we know how often these tasks run, when they fail, and have appropriate alerting in place

Worth noting that it might not be a good idea to use the cache for alerts; users probably want the freshest value to be tested, with at most 5-10 minutes of staleness.

I should look into it actually, because alerts that use cached stale values are not very good alerts.

Member

mostly I don't think we should have a cache task processing X per minute and an alert task processing X per minute, since we're going to overload processing infrastructure

or at least we shouldn't ignore that we have both

the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)

we can always allow more, but can't restrict them (AWS has soft and hard limits to protect service, but you can ask them to increase soft limits)

(similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

> the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)

Yeah, I agree with that, Mixpanel allows 50 alerts per client, probably there's a reason for that.

2 and 10 alerts seem a bit small though :) maybe anything above some number should be paid?

> (similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

I'd prefer keeping it in the open source portion, at least while I'm participating - I might require extra approval from my employer to contribute to EE directories.

Author

> Worth noting that it might not be a good idea to use the cache for alerts; users probably want the freshest value to be tested, with at most 5-10 minutes of staleness.

We overhauled the query runners cache recently, so to get fresh data (and fill the cache) you can just use the execution mode CALCULATE_BLOCKING_ALWAYS in the alert task.
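To make that concrete, a rough sketch of what the check task could do; the helper names, import paths, and signatures are assumptions here, only `ExecutionMode.CALCULATE_BLOCKING_ALWAYS` comes from the comment above:

```python
from posthog.hogql_queries.query_runner import ExecutionMode, get_query_runner  # paths assumed

def calculate_fresh_value(insight, team):
    # Sketch: force a blocking calculation so the alert never evaluates a stale
    # cached result, while still writing the fresh result back into the cache.
    runner = get_query_runner(insight.query, team)  # signature assumed
    response = runner.run(execution_mode=ExecutionMode.CALCULATE_BLOCKING_ALWAYS)
    return response.results
```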

alert_check = AlertCheck.objects.create(
    alert=alert,
    calculated_value=calculated_value,
    threshold_met=threshold_met,
Member

since we're adding thresholds... i'd love to be able to add thresholds in the UI that don't alert just so I can get some nice visuals on insights without needing alerts

that has the benefit that we can ship it as soon as it's made and start benefitting and getting user feedback on the UX

Author

Uh, that's a nice idea... So basically a threshold in the UI could be an alert that doesn't notify anyone (yet) and doesn't even need to be checked/calculated. But you could easily turn on notifications for that threshold.

Member

yep, i want to see the threshold on the graph because that's useful but don't need an alert for every threshold

e.g. measuring web vitals, i want to colour the graphs because it's hard to remember the 4 different thresholds (well, 8 because they have good, medium, poor) but I don't need an alert when they change cos they don't always indicate something i need to immediately react to

@pauldambra
Member

it's so cool we're working on this

@nikitaevg left a comment

Thanks for writing this!

- `created`: Time of the check.
- `calculated_value`: Result of the check.
- `anomaly_condition`: Copy of the condition for the alert in case original got updated.
- `threshold_met`: Whether the condition was met.
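For reference while reading the naming discussion below, those fields could map onto a model roughly like this (a sketch only; field types, `on_delete` behaviour, and the `error_message` column mentioned later in the thread are assumptions):

```python
from django.db import models

class AlertCheck(models.Model):
    # Sketch of the proposed fields; not the final schema.
    alert = models.ForeignKey("Alert", on_delete=models.CASCADE, related_name="checks")
    created = models.DateTimeField(auto_now_add=True)        # time of the check
    calculated_value = models.FloatField(null=True)          # result of the check
    anomaly_condition = models.JSONField()                   # copy of the alert's condition
    threshold_met = models.BooleanField()                    # whether the condition was met
    error_message = models.TextField(null=True, blank=True)  # populated when the calculation fails
```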

nit: maybe is_anomaly? My first impression was that threshold_met=true means the value is below the threshold.

Also, we might consider other anomaly conditions, not only threshold.

On second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here. So what about check_result = ANOMALY | NORMAL_VALUE | ERROR?
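A sketch of what that suggestion could look like as Django choices (names are illustrative, not agreed on in this thread):

```python
from django.db import models

class AlertCheckResult(models.TextChoices):
    # Three explicit states instead of an optional boolean.
    ANOMALY = "anomaly"       # the configured condition was met
    NORMAL_VALUE = "normal"   # the condition was not met
    ERROR = "error"           # the calculation failed, outcome unknown

# e.g. on AlertCheck:
# check_result = models.CharField(max_length=16, choices=AlertCheckResult.choices)
```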

Member

maybe trigger_condition and trigger_condition_met ??

> trigger_condition

🤷 sounds good

> trigger_condition_met

I'd prefer having an enum here, IMO it will need more values than 2. You think check_result is not good?

Author

Regarding "anomaly": Is there any background or reasoning for this term? I think it evokes other expectations (anomaly detection) and is not quite what we have here with strict set thresholds. I would actually vote to use another term.

Author

> On second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here.

We need to decide whether to send out a notification (= threshold met) yes or no. That is boolean.
Errors can be stored separately (see error_message).

> Is there any background or reasoning for this term?

It just popped in my head. What term would you prefer? I would say a value that is above or below a user-defined threshold is an anomaly.

> We need to decide whether to send out a notification (= threshold met) yes or no.

  1. What if there's a problem with the alert or insight configuration (for example the user misconfigured something)? We should notify users about that, no? So whether we send out a notification might depend not only on meeting the threshold.
  2. If there's an error with the calculation, what value would the "threshold_met" have? I would say it's neither true nor false, and I wouldn't like to have an optional boolean as it doesn't help readability.
  3. In the future alerts might get other anomaly detection mechanisms, so it won't be only about thresholds. That's why I think it's better not to mention "threshold" here.
  4. Enums are cheap, IMO provide better readability, and are extensible; why do you think it's better to use a boolean?

Sorry, missed this comment last time.

Member
@pauldambra Jul 7, 2024

i've built a couple of similar matchers in the past and they always evolve to support && and || and i always ended up returning something like {matches: bool, explanation: string } - so familiarity with something similar is altering my opinion here (doesn't mean i'm right 😊)

i'd definitely have error as a separate case, if there's an error the condition was neither met nor not met

in other parts of posthog we ended up having error_count and last_error_date so that we could decide whether to carry on trying something


tbh there's a predicate condition and a predicate result

maybe it's better to have it that generic


this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing
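For illustration, the kind of generic predicate result being described might look like this (all names here are made up for the sketch):

```python
from dataclasses import dataclass

@dataclass
class ConditionResult:
    # "predicate result": did the condition match, plus an explanation that can
    # be reused in notifications and debugging.
    matches: bool
    explanation: str

def all_of(results: list[ConditionResult]) -> ConditionResult:
    # Example of how &&-style composition stays easy to express with this shape.
    return ConditionResult(
        matches=all(r.matches for r in results),
        explanation="; ".join(r.explanation for r in results),
    )
```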

Member

but i should leave you @webjunkie to it 😊

@nikitaevg Jul 7, 2024

> if there's an error the condition was neither met nor not met

That worries me a bit, I usually find it difficult to interpret optional booleans. My rule of thumb is that if something is slightly more difficult than yes/no then I default to an enum - it's more explicit and more extensible. I'm not advocating for different error types in the check_result. I'm just suggesting we use an enum to express this absence of an outcome more explicitly: a required enum with three states.

And still, "Enums are cheap, IMO provide better readability, and are extensible". It's better to leave the door open to extensions, especially with databases, isn't it?

Author

One other thing that comes to mind, which might influence the modeling here: We currently would alert each time we check, if the threshold continues to be breached, right?
If we find the threshold is met, we might check previous AlertChecks to see if this was already the case there and if a notification has been sent already.

Regarding the boolean discussion: I'm happy to talk about a specific alternative if you present one.

> this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing

This was also my thinking... anomaly is too strong a word, and we can leave it more generic... But naming is hard 🤷🏻
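A sketch of what that de-duplication could look like around the snippet below; the `targets_notified` field is an assumption, only `AlertCheck`, `threshold_met`, and `send_notification` come from the RFC:

```python
# Only notify if the previous check didn't already notify for an ongoing breach.
previous_check = (
    AlertCheck.objects.filter(alert=alert)
    .exclude(id=alert_check.id)
    .order_by("-created")
    .first()
)
already_notified = bool(
    previous_check and previous_check.threshold_met and previous_check.targets_notified
)

if threshold_met and not already_notified:
    send_notification.s(alert_check.id).delay()
```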

)

if threshold_met:
    send_notification.s(alert_check.id).delay()  # Launch notification task based on check object

I wonder, maybe we don't need to extract it into a separate task? The only benefit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again. I'd say the benefit is negligible assuming that notifications fail rarely 🤷

I'm not saying notifications should be in the same function as alerts though, I might as well extract it for readability

Member

> The only benefit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again

importantly

> we won't calculate the metric value again.

that's a really valuable thing
some queries are inherently very slow, if we can avoid running them too often we should

Member

we absolutely should be able to retry a notification N times but only calculate the value once (or at least <N times since we don't want to retry for 6 hours without checking if the alert has recovered 🙈)

Author

The other reason I put this is that we can easily launch other things based on the trigger, e.g. hand it off to CDP, launch a Slack notification task, and so on. In this way it's prepared to be pluggable.
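A sketch of the separate, retryable notification task (the delivery helper and retry settings are illustrative; the point is that retries never re-run the insight calculation):

```python
from celery import shared_task

@shared_task(bind=True, autoretry_for=(Exception,), retry_backoff=True, max_retries=3)
def send_notification(self, alert_check_id: int) -> None:
    # Load the stored check so retries reuse the already calculated value.
    alert_check = AlertCheck.objects.get(id=alert_check_id)
    # Hand off to whatever channel is configured; a CDP destination, a Slack
    # task, etc. could be plugged in here as described above.
    deliver_alert_notification(alert_check)  # hypothetical delivery helper
```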

Comment on lines +67 to +69
**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for
different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load
and ensure efficient resource utilization.

I wonder maybe limit_concurrency (that you introduced) would work here? Basically that's all we need - limit the number of tasks in flight, no?

If we add limit_concurrency and simply process all alerts in a for loop, IMO it would have the same effect as if we used chains + groups + chunks, but it would be simpler, or am I missing something?

Member

the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scales, the max number of tasks does not.

e.g. 10 per worker is fine with 4 pods, but then under load we scale to 100 pods and we're running 1000 concurrent tasks, not 40

(maybe teaching everyone to suck eggs 🙈)

> the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scales, the max number of tasks does not.

IIUC the limit_concurrency implementation, it looks at the number of tasks in flight and drops the new one if it's above the limit, so I'd say it will keep the limit if we increase the number of pods, right?

> maybe teaching everyone to suck eggs 🙈

I haven't worked with Celery before, so not at all! What an odd expression though.

Member

> What an odd expression though.

🤣

i have a bad habit of speaking in idioms - sorry!

Author

The limit concurrency implementation I did recently acts more as a safeguard. The reason is that if you put 100 tasks in the queue and they hit the limit, they will start to retry and retry but only do so a couple of times and then fail. So this system isn't good for "slam stuff in the queue and wait for it to be done".

Setting up groups and chains allows us to control that there are e.g. 10 chains launched (= 10 checks running at the same time), with each chain checking its alerts sequentially. That's a simple way, I would say, to control how much is done at a time.
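To spell that out, a commented sketch of the approach (a rewrite of the snippet above; `check_alert` is the task from the RFC, the batching arithmetic and names are illustrative):

```python
from math import ceil
from celery import chain, group

MAX_PARALLEL_CHAINS = 10  # upper bound on concurrent alert checks

def schedule_alert_checks(alert_ids: list[int]):
    # Split the alert ids into at most MAX_PARALLEL_CHAINS batches.
    batch_size = max(1, ceil(len(alert_ids) / MAX_PARALLEL_CHAINS))
    batches = [alert_ids[i:i + batch_size] for i in range(0, len(alert_ids), batch_size)]

    # Each batch becomes a chain: its alerts are checked one after the other.
    # The enclosing group runs the chains in parallel, so at most len(batches)
    # checks are in flight at once, no matter how many workers/pods exist.
    group_of_chains = group(
        chain(*[check_alert.si(alert_id) for alert_id in batch]) for batch in batches
    )
    return group_of_chains.apply_async()
```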

Author

You are right this would also break the chain 🤔

I also thought about whether we can limit worker concurrency... the problem is, this concurrency setting is per worker. We might spin up several pods though, and they also might autoscale to even more. Unfortunately we currently cannot control the worker count and autoscaling granularly per worker type.

I can think about it more, but maybe the case that alerts fail from time to time is not that severe for a first version?

I suppose there's a way to create a dedicated pod with an exact number of CPUs for this queue. But yeah, I see it's not that simple. Let's try the chains and see how it goes.

Author

It is indeed possible, but someone would need to dig into the configs and templates in the other repo 😓

Author

I just noticed that our infra team pushed an unrelated adjustment to auto scaling, which now allows for exact worker/pod configuration of maximum instances. I'll test this with some other tasks and we might end up just using the same mechanism here then after all 🎉

Oh, cool!

What do you say we keep the mechanism with chains for now (to unblock development of other features and to merge that old PR) and if the configuration works out I'll adjust it?
