Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add machine readable download progress option #12084

Closed

Conversation

joeyballentine
Copy link

Closes #11508 and obsoletes #11172

TLDR: If you call pip in a subprocess, it is impossible to read progress output because rich disables progress bar outputs in subprocesses by default. I had previously added an option to force the progress bar, but it was suggested to add a machine-readable option instead. This PR adds a new progress_bar option that just prints the download progress to stdout and bypasses rich entirely, allowing a program that controls pip to read stdout and parse download progress.

I'm open to suggestions on how this should be formatted. Here is how it currently looks:

image

@joeyballentine
Copy link
Author

I don't think that windows CI failing is my fault. Not sure what's going on there

@RunDevelopment
Copy link

I think using a common data format for the progress data would be good. Right now, devs would have to implement a custom parser. This is both tedious and error-prone. Using a common data format also means that we can more easily change the output in the future, because we don't have to worry about breaking any custom parsers.

I would propose using JSON. Example:

PROGRESS:{"file":"name.whl","current":1234,"total":87654321}

This would allow devs to simply look for the "PROGRESS:" prefix and then use any old JSON parser to get the data. We would have to guarantee that our JSON doesn't span multiple lines (making it closer to JSONL), but I think that's okay.

I also think that the --progress-bar option for this should be json instead of the generic machine-readable.


Another idea would be to support arbitrary format strings, similar to format strings supported by git. Users could say --progress-bar=format:<fmt-str>.

While flexible, the issue with this approach is documentation. We would have to document all features and placeholders of the format syntax.

@pfmoore
Copy link
Member

pfmoore commented Jun 13, 2023

Another idea would be to support arbitrary format strings

Let's not over-engineer this. There's been (as far as I know) only one person (the OP here) who has ever requested this. It doesn't warrant a complex solution with all of the associated maintenance overheads.

As I said here I still sort of feel this is something that we should look to rich to support - at the very least, it's worth asking them if it's something they would consider.

@joeyballentine
Copy link
Author

I've opened a discussion in the rich repo here: Textualize/rich#3000

I know I'm the only person in the world that wants this, but this is currently blocking me again and it's really frustrating to be unable to do something that on the surface seems so simple

@RunDevelopment
Copy link

It doesn't warrant a complex solution with all of the associated maintenance overheads.

True.

I still sort of feel this is something that we should look to rich to support

I agree and disagree.

I agree in the sense that rich should have an option to force terminal mode. In fact, it does have that option, but it's only accessible to library users (here pip) but not to end users (there is no env variable we could set). See #11172 for that.

However, even if rich supported this (via some mechanism), it wouldn't be with the explicit purpose of being machine-readable. AFAIK, rich doesn't guarantee a specific, strictly-defined text format for its progress output. So rich may choose to change the progress format at some point and thereby break our code. This is far from ideal.

Frankly, I don't think it's rich's job to provide an API for progress.

@joeyballentine
Copy link
Author

joeyballentine commented Jun 13, 2023

Response on the rich discussion:

I'd certainly consider it. If feels like there should be hooks for this, and it should be left up to the developer to decide on the format.

Sounds like regardless, pip will have to do some implementation itself

@joeyballentine
Copy link
Author

updated json version:

image

@joeyballentine
Copy link
Author

Alternatively if you guys don't want to merge this, is there a recommended way I could import pip's internals to either manually invoke the dependency resolver/downloader and/or hook into it to do this same kind of thing just for myself?

I know you don't recommend people use pip's internals and would rather people invoke it via CLI in a subprocess but that's just out of the question for me without this change.

@pfmoore
Copy link
Member

pfmoore commented Jun 13, 2023

No, there isn't any sort of recommended way or using pip's internals. You're basically on your own - you can do it, but it's not supported and you'll have to work out how on your own.

To be clear, I'm not against this - I'm mostly just rather indifferent to it. The code looks clean, and simple enough. It needs tests, and it needs documentation that describes exactly what the output format is (as it'll be a commitment we'll have to support from now on). It also probably shouldn't use print() but should integrate with our logging machinery somehow. Or maybe it shouldn't? I honestly don't know. But I don't want to suggest that if you add those, I'll be any more in favour - it's not that which makes the difference for me, it's simply that it really doesn't seem that useful to be able to read the progress like this. It's still only download progress, there's nothing equivalent to track build progress or time to resolve, etc - so I don't personally see that it's much more useful than a "still working..." spinner. But that's just me, I guess.

@joeyballentine
Copy link
Author

joeyballentine commented Jun 13, 2023

This really just comes down to large downloads seeming to do nothing. I don't want my users with slow internet connection seeing nothing happen for 5 hours while pytorch's 3GB wheel downloads, then giving up because to them it's just frozen. Even with fast internet it still takes quite some time, and there's no way with just running pip in a subprocess to easily track this.

You'd probably cringe if you saw the disaster that is my current workaround for this. I basically start a pip install, regex match the wheel or URL pip logs out that it's downloading, cancel the pip install, then use the pypi API to grab a list of all the wheels, then i find the one that matches the one pip said it was downloading, then i download that myself (so i can track progress) then i tell pip to install that wheel. Separately i also attempted specifying my own download directory and using file watchers to track download progress, but that was less reliable.

Right now I'm rewriting my code so that the backend handles the pip installs itself (because as-is the frontend can't invoke pip installs on remote backends) and now I'm either tasked with reimplementing that same workaround in python or doing something better (which I'd much rather do).

Honestly it's starting to look like the most painless solution is just for me to include a build of my forked pip with my program and just use that instead of trying to integrate this officially.

I'm willing to do the documentation and whatnot required to get this PR in, but I'm just getting the feeling that it isn't worth it, especially since this will probably take time to be released anyway and I'm trying to do this rewrite right now.

@RunDevelopment
Copy link

so I don't personally see that it's much more useful than a "still working..." spinner.

Joey isn't exaggerating. In my hometown, downloading PyTorch takes 2 hours. Simply showing "Still downloading" with no progress indication and not even the total download size is not viable.

Copy link
Member

@pfmoore pfmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, as noted in the discussion, this needs tests and documentation.

For documentation, I mean a section in the user guide explaining the intended use of this feature, the format of the output data, and how to detect it. It also needs to discuss the fact that (apart from the output this feature adds) pip doesn't typically guarantee that its output is stable, and so any code parsing pip's output is inherently tied to particular pip versions.

src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
src/pip/_internal/network/download.py Outdated Show resolved Hide resolved
@pfmoore
Copy link
Member

pfmoore commented Jun 13, 2023

Thanks - that sucks, it's easy for me with a fast connection to forget not everyone has that advantage.

You've persuaded me. As long as none of the other pip committers objects, we get the docs and tests sorted, I'm willing to support this PR. I've added a review.

src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
docs/html/user_guide.rst Show resolved Hide resolved
docs/html/user_guide.rst Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
@pfmoore
Copy link
Member

pfmoore commented Jun 14, 2023

Gonna be honest I'm not sure what to do for the tests here. It seems like to get it to actually do anything progress bar related I have to download a real package, which probably isn't ideal. Also, when it works, it heavily spams the test output. Also, I've realized I can't properly test the error case since the tests run pip in a subprocess.

I think you'd be better trying to unit-test this, with monkeypatching to control whether the code detects the stdout as a console or not. I'm definitely not comfortable with adding a download of a large external package to the test suite, the tests are already way too slow, and deliberately adding a test which is deliberately slow just to trigger a feature that's in itself fast isn't really acceptable.

@joeyballentine
Copy link
Author

joeyballentine commented Jun 15, 2023

In regards to the rate limiting: While looking at the spinners.py file I noticed that it already implements rate limiting via a nice little helper class, so I decided to just reuse that here since it is already the way pip is intended to function. The default value for timing here is also reused.

As for using sys.stdout.write + sys.stdout.flush, it just doesn't work. What does work though, is using print with flush=True. I can't personally explain why, but sys.stdout.write seems to dislike subprocesses the same way rich does. Using the -u flag makes it work. Sorry I didn't see that part until after I pushed this.

I also decided to change the format slightly. instead of being all caps, progress now gets displayed with Progress: {... to more closely match pip's other output.

As for unit tests, I still have to work on those.

@joeyballentine
Copy link
Author

I'm not sure I'm going to be able to get tests for this working. I got the test suite running locally and messed around with one a bunch that was supposed to capture the error state (warning the user that json mode is not for TTYs) but it appeared to only ever log the "Downloading" text part. Based on some logging I did, I'm pretty sure the reason is that it's triggering the machine readable progress code (isatty() returns False so that error isn't being hit) and it's not capturing it either because we write to stdout directly instead of using the logger, or because it's not doing a real download.

I'd appreciate any suggestions, as I'm not really sure what to do about this.

Copy link
Member

@pfmoore pfmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As regards tests, I'm honestly not sure. Yes, this will be hard to test, which is unfortunate, but that's true of a lot of pip features. I'd start with a unit test of pip._internal.network.download._prepare_download, that passed a dummy Response object that behaved like a big, slow, download, and then iterate over the returned chunks, capturing the data pip writes to stdout and confirming that it contains the data you expect. You can do this both for dummy responses that have a length, and ones that don't. For the "is this a TTY" check, unit test get_download_progress_renderer and monkeypatch sys.stdout.isatty to give the result you want. Then check you get the expected return value.

Once you have those two, you should have a good enough feel for how to test this that you can add whatever else feels necessary (if anything).

docs/html/user_guide.rst Outdated Show resolved Hide resolved
docs/html/user_guide.rst Outdated Show resolved Hide resolved
src/pip/_internal/cli/progress_bars.py Outdated Show resolved Hide resolved
@joeyballentine
Copy link
Author

Sorry for the delay. In the time since the last reply I actually worked on my planned implementation using a build of my pip fork. It actually didn't even need the -u flag to work as expected, so I'm not sure what the discrepancy between that and my previous tests was.

Anyway, I'm working on the suggested changes now. The tests still might take me a while to get right, but I'll give them another go.

@arenasys
Copy link
Contributor

arenasys commented Mar 19, 2024

This would be useful to me also.

Limiting to non-tty is an unnecessary complication. I cant think of a situation where the caller explicitly asks for --progress-bar=json and is not wanting JSON progress outputs.

@joeyballentine
Copy link
Author

@arenasys Yeah, I agree with that.

Unfortunately, I've given up on this PR. I just ended up bundling my pip fork with these changes with my project and have been happily using this feature myself for the past year.

After some time, I now think this (machine parseable progress) is something that should really be done thoughout all of pip and not just for the download progress bar. I don't want to personally do that though as this codebase is massive and confusing. And, I couldn't figure out how to write the tests for just this feature, so I don't think i'd be able to do it for the whole codebase.

@arenasys
Copy link
Contributor

Progress reports are the one thing completely missing from pip's output (through pipes), so its very user unfriendly for cases like GUIs using pip and wanting to show the status to the user. As for testing, I would think it should have the same type of testing as the current spinners and progress bars (is there any?). I'll open another PR.

@joeyballentine
Copy link
Author

joeyballentine commented Mar 19, 2024

I don't think there are any tests for that as the progress bars are all handled by "rich" (the progress bar package)

And for what I was saying about the machine readable progress everywhere, I mainly just meant that I dislike the inconsistency. In my program, I had to have multiple different ways of parsing messages depending on if it was a standard pip message or a download json progress output. And pip just doesn't have install progress at all either (like it just says "installing packages..." and that's it)

I agree this is probably the most important thing though, since it straight up doesn't output anything otherwise. I just think as a general feature I'd prefer an entire overhaul of how pip logs to make parsing overall way simpler and more useful

@joeyballentine
Copy link
Author

Closing as #12586 implements this better.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Machine-readable progress outputs
5 participants