Cannot resume in offline mode due to lack of sys/id field #588

Open
wjaskowski opened this issue May 28, 2021 · 31 comments

Comments

@wjaskowski
Contributor

import neptune.new as neptune
run = neptune.init(mode='offline')
run.sync()
run.wait()
rid = run['sys/id'].fetch()
run = neptune.init(mode='offline', run=rid)
rid = run['sys/id'].fetch()

ends up with:

offline/1b7c5e70-695d-4d1c-8587-a5ca2e3d222c
Traceback (most recent call last):
  File "err4.py", line 5, in <module>
    run.sync()
  File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/run.py", line 453, in sync
    attributes = self._backend.get_attributes(self._uuid)
  File "/home/wojciech/miniconda3/envs/nori/lib/python3.8/site-packages/neptune/new/internal/backends/offline_neptune_backend.py", line 42, in get_attributes
    raise NeptuneOfflineModeFetchException
neptune.new.exceptions.NeptuneOfflineModeFetchException: 

----NeptuneOfflineModeFetchException---------------------------------------------------

It seems you are trying to fetch data from the server, while working in an offline mode.
You need to work in non-offline connection mode to fetch data from the server.

The thing is that I'm not trying to fetch data from the server but from the run, wherever it stores its data.
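
The traceback suggests the offline backend rejects every read, regardless of where the data lives. A toy model of that behaviour, as a sketch only: `ToyOfflineBackend` and `OfflineModeFetchError` are hypothetical stand-ins written for illustration, not Neptune's actual classes, and the local queue is an assumption about how offline writes are held.

```python
class OfflineModeFetchError(Exception):
    """Hypothetical stand-in for the exception named in the traceback."""


class ToyOfflineBackend:
    """Simplified model: writes are kept locally, reads always fail."""

    def __init__(self):
        # Assumption: offline writes are queued locally for a later sync.
        self.queued_operations = []

    def write(self, path, value):
        self.queued_operations.append((path, value))

    def get_attributes(self, run_id):
        # Raised unconditionally, even though the data is held locally.
        raise OfflineModeFetchError("cannot fetch in offline mode")


backend = ToyOfflineBackend()
backend.write("params/lr", 0.01)

try:
    backend.get_attributes("offline/example-run")
except OfflineModeFetchError as exc:
    fetch_error = str(exc)
```

In this model the value is sitting in `queued_operations`, yet any read path raises, which matches the complaint that the error message talks about the server while the request was for local data.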

@Herudaio
Contributor

Herudaio commented Jun 7, 2021

(I've removed my previous comment)

@wjaskowski initially we didn't plan to enable resuming runs in the offline mode. If I may ask, why do you need to resume an offline run? Are you working with a multiprocessing / multi-script setup, or is there a time break between the execution of the script and its resumption?

@wjaskowski
Contributor Author

wjaskowski commented Jun 7, 2021 via email

@Diagrama3

Diagrama3 commented Oct 4, 2022

Switching from spreadsheets to Neptune.ai and How it Pushed...

@Blaizzy
Contributor

Blaizzy commented Oct 4, 2022

Hi @Diagrama3

How can I help you?

@ljstrnadiii

@Blaizzy I would also like to be able to resume an init_project in debug mode for testing purposes. Can this be achieved?

@Blaizzy
Contributor

Blaizzy commented Dec 12, 2022

Hi @ljstrnadiii,

Thanks for reaching out.

Yes, it is.

Example:

import neptune.new as neptune
project = neptune.init_project(mode="debug")

Docs: https://docs.neptune.ai/api/neptune/#init_project

@ljstrnadiii

@Blaizzy , I tried to stop and init_project again in a separate process, but the key was not present.

@Blaizzy
Contributor

Blaizzy commented Dec 13, 2022

@ljstrnadiii by key you mean api_token, right?

If so, you can read more about setting your api_token here:
https://docs.neptune.ai/setup/setting_api_token/

@Blaizzy
Contributor

Blaizzy commented Dec 17, 2022

Hey there!
Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊

@ljstrnadiii

@Blaizzy thanks for checking in. What I want to do is use debug mode in two separate processes:

# in one process
import neptune.new as neptune
project = neptune.init_project(mode="debug")
project['key1'] = 1
project.stop()

# then in another process (a test script)
import neptune.new as neptune
project = neptune.init_project(mode="debug")
assert project['key1'].fetch() == 1  # compare the fetched value, not the field handler
project.stop()

but this is not possible from what I understand (even though it seems some files get written to tmp somewhere).

@Blaizzy
Contributor

Blaizzy commented Dec 19, 2022

In debug mode, no data is stored or sent anywhere.
Docs: https://docs.neptune.ai/api/connection_modes/

For the use case you want to test, currently, you have to log metadata to Neptune servers in async or sync mode.

But I can definitely see your point and I'll submit your comment as a feature request to the product team.
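
Until something like that exists, tests that need state to survive across two processes could use a small file-backed stand-in instead of a debug-mode object. A minimal sketch, assuming a hypothetical `FileBackedStore` class (not part of the Neptune API) that persists assignments to a JSON file in a temporary directory:

```python
import json
import tempfile
from pathlib import Path


class FileBackedStore:
    """Hypothetical test double: a key-value store persisted to a JSON file.

    Not a Neptune class; it only mimics the `obj[key] = value` style so that
    a second process can read what the first one wrote.
    """

    def __init__(self, path):
        self._path = Path(path)
        # Load existing state if an earlier process already wrote the file.
        if self._path.exists():
            self._data = json.loads(self._path.read_text())
        else:
            self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value
        # Write through on every assignment so a later reader sees it.
        self._path.write_text(json.dumps(self._data))

    def __getitem__(self, key):
        return self._data[key]


# "First process": write a value.
store_path = Path(tempfile.mkdtemp()) / "project_state.json"
writer = FileBackedStore(store_path)
writer["key1"] = 1

# "Second process": a fresh object pointed at the same file sees the value.
reader = FileBackedStore(store_path)
assert reader["key1"] == 1
```

Both objects live in one script here only to keep the sketch short; in a real test the path would be passed to the second process, for example through an environment variable.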

@Blaizzy
Contributor

Blaizzy commented Dec 21, 2022

Hey @ljstrnadiii!

Just checking in to see if you still need help with this or if you need help with anything else. Feel free to drop me a message. 😊

@ljstrnadiii

@Blaizzy that is what I thought. We test in debug mode and use a neptune run in debug mode as a fixture where we can and that works well, but for some e2e tests, we can only pass a reference to a neptune run or project location. We have created a tests project in neptune for our e2e tests to keep things isolated a bit.

Thanks for the clarification!

@Blaizzy
Contributor

Blaizzy commented Dec 27, 2022

It's my pleasure :)

You are most welcome @ljstrnadiii!

Your solution is quite interesting, and I would love to learn more about it if you don't mind. I think it could provide us with valuable insight that we can incorporate into the product.

Let me know what you think

@bg4xsd

bg4xsd commented Jan 22, 2023

The ability to resume offline runs is very useful. Many people train their models on commercial GPU servers, which often impose a maximum running time on a single session; Kaggle's limit, for example, is 12 hours, so we have to divide the training work into several parts. Training is faster in offline mode, so offline mode is preferred. When the work is done, the offline training data is uploaded to the Neptune server.

In my code:
run = neptune.init_run(mode="offline", custom_run_id='test-offline', ....)

Neptune writes several offline outputs to the .neptune directory. I use the command:
neptune sync --path .neptune --project aaa/bbb --offline-only

It executes OK, but only the last run is displayed on the website. It seems the last run overwrites the prior one.

@Blaizzy
Contributor

Blaizzy commented Jan 23, 2023

Hi @bg4xsd

Thanks for reaching out and sharing your use case!

I have also passed it as feedback to the product team.

Regarding your code, I notice that you are using the custom_run_id argument in offline mode. Currently, offline runs have no sys/id; consequently, custom_run_id doesn't work.

Each time you run that script and then use the neptune sync CLI command, it will create a separate run.

But I can see your point; thanks to your feedback and others, we can now start thinking of a potential solution to this use case.

@bg4xsd

bg4xsd commented Jan 23, 2023

Hi @Blaizzy ,
Thanks for your quick response.
For university students working in a lab, GPU servers are always in short supply. Training a neural network is time-consuming, and the training process is often terminated by other students, so I think the ability to resume an offline run would be useful and popular :-).
Also, TensorBoard's graphs and tables are ugly and low resolution, so they cannot be used directly in a thesis. Neptune's beautiful diagrams are welcome, and its export function is very easy to use.
For many years I had to draw, compare, and adjust the graphs manually; this year I am going to move from TensorBoard to Neptune.
Come on and have a nice day.

@Blaizzy
Contributor

Blaizzy commented Jan 23, 2023

Most welcome and thank you for your kind words!

I'm happy you enjoy using Neptune as much as we love making it for you :)

@Blaizzy
Contributor

Blaizzy commented Jan 23, 2023

I will let you know here once the feature is released.

Other than that, is there anything else I could help you with?

@bg4xsd

bg4xsd commented Jan 23, 2023

Hi @Blaizzy

Hope to hear from you soon. No more questions for now.

Anyway, thank you again.

@Blaizzy
Contributor

Blaizzy commented Jan 23, 2023

Perfect, have a great week! :)

@wouterzwerink

Hi @Blaizzy! Is this feature still on the radar? We train on cloud instances that somewhat frequently get interrupted. This prevents us from using offline mode, as we cannot resume the same run in offline mode.

@Blaizzy
Contributor

Blaizzy commented Jun 2, 2023

Hi @wouterzwerink

This feature is on the radar. However, at the moment, we don't have an ETA for it.

Could you share the tracebacks for the times your training gets interrupted?

@Blaizzy
Contributor

Blaizzy commented Jun 5, 2023

Hi @wouterzwerink ,

Do you still need help with this?

@bg4xsd

bg4xsd commented Jun 6, 2023

Offline resume is useful for offline logging. Using online mode slows down long training runs. On cloud GPU services such as Kaggle and Google Colab, the training procedure is interrupted every 10~12 hours, so an offline resume function is meaningful.

@Blaizzy
Contributor

Blaizzy commented Jun 6, 2023

@bg4xsd

I understand.

Could you share the tracebacks for the times your training gets interrupted?

@wouterzwerink

@Blaizzy I seem to have missed your question, sorry!
The training interruptions are not due to neptune at all!
The interruptions are from using spot instances. We train with fault tolerance, so the training continues after the interruption. However, to keep neptune fault tolerant, we have to use async mode instead of offline mode.
So I don't need help with this, but thanks for asking! Looking forward to this feature once it is complete
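
For anyone in the same situation, the async-mode workaround comes down to remembering the run ID across interruptions. A minimal sketch of that bookkeeping, with assumptions stated up front: `load_run_id`, `save_run_id`, `init_kwargs`, and the `neptune_state.json` file are hypothetical names, `PROJ-123` is a placeholder ID, and the `with_id` argument of `neptune.init_run` should be checked against your installed client version (the original post uses the older `run=` argument). The Neptune call itself appears only in a comment.

```python
import json
import tempfile
from pathlib import Path


def load_run_id(state_file):
    """Return the saved run ID, or None on the first launch."""
    path = Path(state_file)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("run_id")


def save_run_id(state_file, run_id):
    """Persist the run ID, e.g. next to the training checkpoint."""
    Path(state_file).write_text(json.dumps({"run_id": run_id}))


def init_kwargs(state_file):
    """Build keyword arguments for creating or resuming a run in async mode."""
    kwargs = {"mode": "async"}
    run_id = load_run_id(state_file)
    if run_id is not None:
        # Assumption: the installed client accepts `with_id` for resuming.
        kwargs["with_id"] = run_id
    return kwargs


state_file = Path(tempfile.mkdtemp()) / "neptune_state.json"

# First launch: nothing saved yet, so a new run would be created.
first = init_kwargs(state_file)
# run = neptune.init_run(**first)
# save_run_id(state_file, run["sys/id"].fetch())
save_run_id(state_file, "PROJ-123")  # placeholder standing in for the fetched ID

# After an interruption: the saved ID is picked up and passed along.
resumed = init_kwargs(state_file)
```

Keeping the state file on the same persistent volume as the model checkpoint means both survive a spot-instance interruption together.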

@Blaizzy
Contributor

Blaizzy commented Jul 11, 2023

@wouterzwerink great to hear!

If anything pops up feel free to let me know. I'll be happy to help :)

@pprobst

pprobst commented Feb 15, 2024

I am interested in this feature. It'd be very useful for multi-script programs.

@wouterzwerink

Since it's been a while, I'll add that I'm still very interested in this feature.
