Temporal CV #62

pfistfl · 2020-08-25T16:01:34Z

I currently have a task with a date column.
Since the goal is essentially to predict future values, we need a cross-validation strategy that takes this into account, similar to RollingWindowCV.
As this is a very common use case, we should think about implementing it.

This is implemented in mlr3forecasting, but for forecasting tasks rather than regular Classif|Regr tasks.
Where should such a method live? mlr3spatiotempcv?
How would we go about implementing this?


issue-label-bot · 2020-08-25T16:01:36Z

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.56. Please mark this comment with 👍 or 👎 to give our bot feedback!


pat-s · 2020-08-25T18:58:53Z

A standalone temporal CV would definitely fit here, yes.

There are different approaches to account for this; the most common is probably clustering (with kmeans as the default approach).
Another approach I know of is to use predefined groups of temporal clusters (if the groups are clearly separated).

The latter is already doable; the kmeans clustering can be quickly adapted from spcv-coords.

In the end you want to make sure that observations close in time are declustered, because they are naturally highly correlated with each other. This is the same issue as with observations in space.

And yes, we can port over RollingWindowCV.

pfistfl · 2020-08-25T20:32:43Z

I think what we want here is not clustering, but rather splitting train / test such that
max(train$date) < min(test$date), i.e. we always test how well our algorithms generalize to future settings.
This means the train data grows in each fold, just as in RollingWindowCV.
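A minimal base R sketch of the split described above, assuming distinct time stamps and contiguous, roughly equal-sized blocks. The helper expanding_window_splits() and its folds argument are hypothetical names for illustration; this is not the RollingWindowCV implementation from mlr3forecasting.

```r
# Hypothetical helper, not part of any mlr3 package: expanding-window splits.
expanding_window_splits = function(dates, folds = 3L) {
  stopifnot(folds >= 1L, length(dates) > folds)
  # Row indices sorted from the oldest to the newest observation
  ord = order(dates)
  # Cut the ordered rows into folds + 1 contiguous, roughly equal blocks
  block = cut(seq_along(ord), breaks = folds + 1L, labels = FALSE)
  lapply(seq_len(folds), function(i) {
    list(
      train = ord[block <= i],      # all blocks up to and including block i
      test  = ord[block == i + 1L]  # the next block in time
    )
  })
}

set.seed(1)
dates = as.Date("2020-01-01") + sample(0:99)  # 100 distinct days, shuffled
splits = expanding_window_splits(dates, folds = 3L)

# Every training set ends strictly before its test set begins
for (s in splits) {
  stopifnot(max(dates[s$train]) < min(dates[s$test]))
}
```

With tied dates the strict inequality can fail at a block boundary, so ties would need to be kept within one block.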

I think @mllg wanted this as well, have you already started something there?

pat-s · 2020-08-26T10:01:49Z

Ah ok, this is also an interesting approach!

In RollingWindowCV, if you specify folds, you discard some obs in some folds, is that correct? (Judging from the example and test fold 1.)

I am not sure if folds is a good name here, since folds usually have mutually disjoint test sets. Isn't this more of a bootstrapping approach? Would iters be a better term in this case?

An argument supporting a percentage increase could be interesting?

"have you already started something there?"

Nope, nothing like this exists yet; I have never had such a dataset.
But I'd say it would fit really well into this package.

pfistfl · 2020-08-26T11:56:34Z

Datasets:

The bikesharing dataset is an example of such a dataset, we use it in two gallery posts:
Post: Feature Engineering of Date-Time Variables mlr3gallery#13
I2mlr bikesharing mlr3gallery#64

A resource on time-series cross-validation:
https://robjhyndman.com/hyndsight/tscv/
Here they call it fold, but I do not care a lot about the naming.

mllg · 2020-08-26T12:25:13Z

I believe this would also fit nicely in mlr3. Tasks already have column role "order" which can be used in something like "ResamplingOrderedCV" or "ResamplingOrderedHoldout".

pat-s · 2020-08-28T15:22:27Z

If we already have a dedicated package for spatial and temporal CV stuff, I'd argue it should live there, simply because users might look for it there.

pat-s · 2021-04-30T15:25:28Z

@pfistfl

Coming back to this after a while, I now have a different view on this:

I think it would be neat to have one dedicated package for spatiotemporal tasks and resampling methods, and I think {mlr3spatiotempcv} would be a good fit. I also think that having different task classes is more confusing than helpful, as spatial and temporal tasks share many properties. After all, TaskRegrST and TaskClassifST already have "temporal" in their name.
I see {mlr3forecasting} more on the same level as {mlr3raster}, i.e. taking care of the prediction calls while leaving task and resampling to {mlr3spatiotempcv}.

From a user point of view, task and resampling stuff could then be done with one extension package (i.e. {mlr3spatiotempcv}).
When it comes to prediction/measures/learners, {mlr3raster} (or maybe {mlr3spatial}) and {mlr3forecasting} would come into play.

Thoughts?

pat-s · 2021-05-18T08:26:36Z

Oliveira et al 2021 could be an interesting read.

pat-s · 2021-06-13T15:57:02Z

I would like to postpone the implementation until after the paper has been submitted. Including it before then would require introducing and discussing a somewhat distinct field, which I would like to avoid right now.

ck37 · 2022-11-14T13:35:47Z

I need this kind of method to use mlr3 for EHR-based machine learning - specifically the ability to define training/test/validation sets using date-based splits.

Is it possible for me to provide the splits to mlr3 and use the existing framework? I wasn't able to see how to do that in the documentation so far. It seems like I will need to use tidymodels otherwise.

pfistfl · 2022-11-14T13:49:53Z

Hey @ck37

If you are able to compute the indices yourself, you can already do it, see https://mlr3.mlr-org.com/reference/mlr_resamplings_custom.html.

library(mlr3)
task = tsk("penguins")
task$filter(1:10)  # keep only the first 10 rows

# Instantiate Resampling with user-defined, non-overlapping row ids
custom = rsmp("custom")
train_sets = list(1:5, 6:10)
test_sets = list(6:10, 1:5)
custom$instantiate(task, train_sets, test_sets)

# Inspect the row ids of the first iteration
custom$train_set(1)
custom$test_set(1)
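For date-based splits, the same custom resampling can be fed with indices derived from a date vector. The following is a sketch under assumptions: the toy data, the cutoff dates, and the choice to keep `date` outside the task are all made up for illustration, and the featureless learner only serves as a placeholder.

```r
library(mlr3)

set.seed(1)
n = 60
# Toy data; `date` is kept outside the task and only used to build the splits
date = as.Date("2021-01-01") + sample(0:(n - 1))
dat = data.frame(x = rnorm(n), y = rnorm(n))
task = as_task_regr(dat, target = "y")  # row ids are 1..n, matching `date`

# Assumed cutoffs: train on rows before the cutoff, test on rows from it onwards
cutoffs = as.Date(c("2021-01-31", "2021-02-15"))
train_sets = lapply(cutoffs, function(d) which(date < d))
test_sets = lapply(cutoffs, function(d) which(date >= d))

custom = rsmp("custom")
custom$instantiate(task, train_sets, test_sets)

# Evaluate a placeholder learner on the date-based splits
rr = resample(task, lrn("regr.featureless"), custom)
rr$aggregate(msr("regr.mse"))
```

Each iteration then satisfies max(date[train]) < min(date[test]), which is the property discussed earlier in this thread.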

ck37 · 2022-11-14T13:53:31Z

Ah ok, awesome - appreciate the help & fast response 🙏

issue-label-bot bot added the feature_request label Aug 25, 2020

pat-s removed the feature_request label Oct 20, 2020

pat-s added Priority: Medium Status: Accepted Type: Enhancement labels Oct 30, 2020

alexanderbrenning mentioned this issue May 6, 2021

2D+time versus 3D #127

Closed

pat-s mentioned this issue May 6, 2021

Expand Table listing all resampling methods #110

Closed

2 tasks
