{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":648389974,"defaultBranch":"main","name":"text-generation-inference","ownerLogin":"IBM","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-06-01T21:30:29.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/1459110?v=4","public":true,"private":false,"isOrgOwned":true},"refInfo":{"name":"","listCacheKey":"v0:1726486527.0","currentOid":""},"activityList":{"items":[{"before":"dd833706592c5283cd27336185b45a10e902d863","after":null,"ref":"refs/heads/wyfetprkyd","pushedAt":"2024-09-16T11:35:27.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"BugHuntr1","name":null,"path":"/BugHuntr1","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/110328245?s=80&v=4"}},{"before":null,"after":"dd833706592c5283cd27336185b45a10e902d863","ref":"refs/heads/wyfetprkyd","pushedAt":"2024-09-16T11:35:26.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"BugHuntr1","name":null,"path":"/BugHuntr1","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/110328245?s=80&v=4"},"commit":{"message":"Test Commit","shortMessageHtmlLink":"Test Commit"}},{"before":"015070b120f65be25b99bfeaea4205724dab1406","after":"9388f02d222c0dab695bea1fb595cacdf08d5467","ref":"refs/heads/main","pushedAt":"2024-08-27T13:17:50.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"maxdebayser","name":"Maximilien de Bayser","path":"/maxdebayser","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1291418?s=80&v=4"},"commit":{"message":"Update jinja2 dependency to fix vulnerability (#108)\n\n#### Motivation\r\n\r\n[Describe why this change is needed]\r\n\r\n#### Modifications\r\n\r\n[Describe the code changes]\r\n\r\n#### Result\r\n\r\n[Describe how the changes affects existing behavior and how to test it]\r\n\r\n#### Related Issues\r\n\r\n[Resolves #123]\r\n\r\nSigned-off-by: Vaibhav Jain ","shortMessageHtmlLink":"Update jinja2 dependency to fix vulnerability (#108)"}},{"before":"1c33e9d9d2d75bb9f9d2bc1728247b0707264f04","after":null,"ref":"refs/heads/set-onnx-version","pushedAt":"2024-08-05T17:24:45.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"}},{"before":"572e03f72146d0991436b11272a68aecc128d855","after":"015070b120f65be25b99bfeaea4205724dab1406","ref":"refs/heads/main","pushedAt":"2024-08-05T17:23:55.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"},"commit":{"message":"fix: update/pin dependencies to get ONNX runtime working again (#107)\n\n#### Motivation\r\n\r\nInternal regression tests are failing when using the ONNX Runtime with\r\nan error indicating a dependency issue with ONNX Runtime and cuDNN:\r\n```\r\nShard 0: 2024-07-31 19:38:04.423164988 [E:onnxruntime:Default, provider_bridge_ort.cc:1745 TryGetProviderInfo_CUDA] /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.9: cannot open shared object file: No such file or directory\r\n```\r\n\r\nI found that ORT 1.18.1 started to build against cudnn 9 (included in\r\nthe 
- 2024-08-01 · Travis Johnson force-pushed `set-onnx-version` ("deps: pin optimum too...")
- 2024-08-01 · Travis Johnson pushed 1 commit to `set-onnx-version` ("deps: pin optimum too...")
- 2024-08-01 · Travis Johnson pushed 1 commit to `set-onnx-version` ("deps: pin transformers to prevent breakage from 4.41")
- 2024-08-01 · Travis Johnson pushed 1 commit to `set-onnx-version` ("deps: hold back numpy from 2.0")
- 2024-07-31 · Travis Johnson pushed 2 commits to `set-onnx-version` (head: "cleanup: remove transformers version override")
override"}},{"before":"b01988b85e678f9ab4a0aa031644a06bb3609645","after":"59910607a558792c03d3ec7e3994375a9f81b224","ref":"refs/heads/set-onnx-version","pushedAt":"2024-07-31T21:02:56.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"},"commit":{"message":"TEMP: push image for branch\n\nSigned-off-by: Travis Johnson ","shortMessageHtmlLink":"TEMP: push image for branch"}},{"before":null,"after":"b01988b85e678f9ab4a0aa031644a06bb3609645","ref":"refs/heads/set-onnx-version","pushedAt":"2024-07-31T21:02:13.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"},"commit":{"message":"fix: set onnxruntime version to 1.18.0\n\nIn 1.18.1, the runtime packages are built against cudnn 9. PyTorch does\nnot use cudnn 9 until 2.4.0, so we hold back onnxruntime for now\n\nSigned-off-by: Travis Johnson ","shortMessageHtmlLink":"fix: set onnxruntime version to 1.18.0"}},{"before":"4569285f15a9d28c06f6bbe0c01fd74619524f9a","after":null,"ref":"refs/heads/offline-conversion","pushedAt":"2024-07-31T18:20:53.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"}},{"before":"5b5938e1b3f6f4aabee0ea302a001d0c0c9576dc","after":"572e03f72146d0991436b11272a68aecc128d855","ref":"refs/heads/main","pushedAt":"2024-07-31T18:20:50.000Z","pushType":"pr_merge","commitsCount":1,"pusher":{"login":"tjohnson31415","name":"Travis Johnson","path":"/tjohnson31415","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/7907693?s=80&v=4"},"commit":{"message":"fix: fast tokenizer conversion should happen offline (#106)\n\n#### Motivation\r\n\r\nThe server is launched with `HF_HUB_OFFLINE=1` and is meant to treat\r\nmodel files as read-only; however, the fast tokenizer conversion\r\nhappening in the `launcher` does not follow this (if a `revision` is not\r\npassed). 
- 2024-07-31 · Travis Johnson deleted branch `improve_seq_len_messages`
- 2024-07-30 · Travis Johnson created branch `offline-conversion` (head commit: "fix: fast tokenizer conversion should happen offline")

**2024-06-28 · Maximilien de Bayser merged into `main`: Improve log messages around the max sequence length (#103)**

Motivation: the existing messages were confusing to users.

Modifications: in the router, the error message was rephrased to make it more understandable for users who aren't familiar with the internals. In the server we now print the maximum possible sequence length as limited by the model's sequence length; the existing print showed how many output tokens fit into memory if you pass max_sequence_length input tokens, and vice versa. I don't know what I was thinking when I wrote that.

Related: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/958 (Signed-off-by: Max de Bayser / Maximilien de Bayser)
- 2024-06-28 · Maximilien de Bayser pushed 2 commits to `improve_seq_len_messages` (head: "Merge branch 'main' into improve_seq_len_messages")
- 2024-06-28 · Jeff Fialho deleted branch `708-log-sequence-model`

**2024-06-28 · Jeff Fialho merged into `main`: get_max_sequence_length() warning if user MAX_SEQUENCE_LENGTH > model MAX_SEQUENCE_LENGTH (#105)**

- Modify `get_max_sequence_length()` to warn if the user-defined `MAX_SEQUENCE_LENGTH` is greater than the model's config value
- Consolidate multiple return statements into a single return at the end of the function
- Combine logging into a single `info!()` call
- Introduce a `result_max_sequence_length` variable to hold the final value
- Add a concise docstring for function documentation

(Signed-off-by: Jefferson Fialho; Co-authored-by: Joe Runde)
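The actual check in #105 lives in the Rust launcher; the Python sketch below only illustrates the compare-and-warn logic, and all names (`user_max`, `model_max`) and the choice of which value wins are assumptions for illustration.

```python
# Illustrative Python sketch of the check added in PR #105; the real code is Rust
# (launcher/src/main.rs) and the names/behaviour here are assumptions.
import logging

logger = logging.getLogger(__name__)


def get_max_sequence_length(user_max: int | None, model_max: int) -> int:
    """Return the effective max sequence length, warning when the user-supplied
    value exceeds the model's configured maximum."""
    if user_max is None:
        result = model_max
    else:
        if user_max > model_max:
            logger.warning(
                "MAX_SEQUENCE_LENGTH=%d exceeds the model's configured maximum of %d",
                user_max, model_max,
            )
        result = user_max  # keeping the user value here is an illustrative choice
    logger.info("Using max sequence length: %d", result)
    return result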
max_batch_weight"}},{"before":"a98f04c880a1731b796db0903542d8fa07e79ef0","after":"01c6ec6bbc93b3cce45518db4337bc257dbf3e6d","ref":"refs/heads/improve_seq_len_messages","pushedAt":"2024-06-27T17:50:00.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"maxdebayser","name":"Maximilien de Bayser","path":"/maxdebayser","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/1291418?s=80&v=4"},"commit":{"message":"Simplify error message\n\nRemove mention to `max_batch_weight` so as not to confuse users.\n\nSigned-off-by: Maximilien de Bayser ","shortMessageHtmlLink":"Simplify error message"}},{"before":"c3f6fc3411166925a0b83379c9e2e8eb7d420780","after":"fe973b63297c4124c87e0860046bb4459886f038","ref":"refs/heads/708-log-sequence-model","pushedAt":"2024-06-26T21:53:14.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"fialhocoelho","name":"Jeff Fialho","path":"/fialhocoelho","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14850636?s=80&v=4"},"commit":{"message":"Add warning to `get_max_sequence_length()` if user max_sequence_length > model max_sequence_length\n\nSigned-off-by: Jefferson Fialho ","shortMessageHtmlLink":"Add warning to get_max_sequence_length() if user max_sequence_lengt…"}},{"before":"c26539097eae46b8e6f94fd3aeff8213e1edb272","after":"c3f6fc3411166925a0b83379c9e2e8eb7d420780","ref":"refs/heads/708-log-sequence-model","pushedAt":"2024-06-26T21:48:26.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"fialhocoelho","name":"Jeff Fialho","path":"/fialhocoelho","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14850636?s=80&v=4"},"commit":{"message":"Add warning to `get_max_sequence_length()` if user max_sequence_length > model max_sequence_length","shortMessageHtmlLink":"Add warning to get_max_sequence_length() if user max_sequence_lengt…"}},{"before":null,"after":"c26539097eae46b8e6f94fd3aeff8213e1edb272","ref":"refs/heads/708-log-sequence-model","pushedAt":"2024-06-26T21:33:31.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"fialhocoelho","name":"Jeff Fialho","path":"/fialhocoelho","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/14850636?s=80&v=4"},"commit":{"message":"Fix logic for determining the number of cache blocks (#98)\n\n#### Motivation\r\n\r\nWhen we deploy spec decoding in prod., we are frequently seeing the\r\nservers running out of free blocks. We have determined that this is due\r\nto two issues:\r\n1. The constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid\r\nrunning into memory pressure due to speculation - we need to able ensure\r\nthat we do not speculate on batches that may have a small \"size\" but\r\nvery large weight.\r\n2. The computation of the number of blocks is very wrong in most cases. \r\n\r\n#### Modifications\r\n\r\n1. I have introduced an additional constraint that says we should only\r\nspeculate on batches with weight up to 75% of the weight limit. This\r\nshould ensure that we never speculate when we are close to the memory\r\nlimits.\r\n2. I have written new code to calculate the number of KV cache blocks.\r\nThis calculation uses the memory scaling coefficients that we have\r\nlearned at startup. In particular, it uses to the learned coefficients\r\nto figure out what % of the memory capacity needs to be set aside for\r\ncache blocks.\r\n3. 
- 2024-06-20 · Maximilien de Bayser created branch `improve_seq_len_messages` (head commit: "Improve log messages around the max sequence length"; same description as PR #103 above)

**2024-05-31 · Nick Hill merged into `main`: Fix logic for determining the number of cache blocks (#98)**

Motivation: when we deploy speculative decoding in production, we frequently see the servers running out of free blocks. We have determined that this is due to two issues: (1) the constraint on `SPECULATOR_MAX_BATCH_SIZE` is not enough to avoid memory pressure from speculation; we need to be able to ensure that we do not speculate on batches that have a small "size" but a very large weight; and (2) the computation of the number of blocks is very wrong in most cases.

Modifications:
1. I have introduced an additional constraint that we should only speculate on batches with weight up to 75% of the weight limit. This should ensure that we never speculate when we are close to the memory limits.
2. I have written new code to calculate the number of KV cache blocks. This calculation uses the memory scaling coefficients learned at startup; in particular, it uses the learned coefficients to figure out what percentage of the memory capacity needs to be set aside for cache blocks.
3. In that calculation, I use the next-token coefficient rather than the prefill coefficient, since during the next-token phase the KV cache blocks typically make up a relatively large share of total memory consumption and we need to handle this worst case. However, this means that during prefill steps we may not have enough memory left over for the auxiliary data structures needed for a forward pass. There isn't really a clean way to handle this other than rewriting the router logic to be block-aware, but what we can do is recommend that the user increase the batch safety margin to a level that ensures prefills will not run OOM; I've added a print statement to provide this guidance.
4. I now load the speculator before learning the memory scaling model, since it also needs to be taken into account when measuring the amount of free memory.

Result: these changes, together with setting `BATCH_SAFETY_MARGIN=35`, seem to result in robust behaviour for both `llama3-8b` and `granite-20b`; we no longer need to manually set the number of KV cache blocks in the latter case. (Signed-off-by: Thomas Parnell)
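To make the arithmetic in #98 concrete, here is a hedged sketch of the two ideas it describes: reserving memory for a worst-case next-token step via a learned coefficient and handing the remainder to the KV cache, plus the 75%-of-weight-limit speculation guard. Every name (`next_token_coeff`, `max_batch_weight`, `block_bytes`, ...) is invented for illustration; the real calculation in the TGIS server is more involved.

```python
# Hedged sketch of the cache-block sizing described in PR #98; names and the
# exact formula are assumptions, not the server's actual implementation.

def estimate_num_cache_blocks(
    free_memory_bytes: int,   # memory left after loading the model and speculator
    next_token_coeff: float,  # learned bytes per unit of batch weight, next-token phase
    max_batch_weight: int,    # the router's batch weight limit
    block_bytes: int,         # size of one KV cache block in bytes
) -> int:
    # Reserve what a worst-case next-token step needs for non-KV tensors,
    # then give the remainder to the paged KV cache.
    reserved = next_token_coeff * max_batch_weight
    kv_budget = max(free_memory_bytes - reserved, 0)
    return int(kv_budget // block_bytes)


def should_speculate(batch_weight: int, max_batch_weight: int) -> bool:
    # Extra constraint from PR #98: only speculate while the batch weight is
    # at most 75% of the weight limit, to stay clear of memory pressure.
    return batch_weight <= 0.75 * max_batch_weight
```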
- 2024-05-30 · Nick Hill deleted branch `fix-paged-memory-dtype`

**2024-05-30 · Nick Hill merged into `main`: fix: move parameter validation before fit_memory_scaling_model (#101)**

The launch of `fit_memory_scaling_model` uses the values of `quantize` and `dtype_str`, so those should be validated and defaulted before it is run. Before this change, if `dtype_str` was set to `None` it would be passed to `fit_memory_scaling_model` as `None`, resulting in an error:

```
Shard 1: Process SpawnProcess-33:
Shard 1: Traceback (most recent call last):
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
Shard 1:     self.run()
Shard 1:   File "/opt/tgis/lib/python3.11/multiprocessing/process.py", line 108, in run
Shard 1:     self._target(*self._args, **self._kwargs)
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/paged.py", line 37, in fit_memory_scaling_model
Shard 1:     model = get_model(
Shard 1:             ^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 39, in get_model
Shard 1:     dtype = get_torch_dtype(dtype_str)
Shard 1:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1:   File "/opt/tgis/lib/python3.11/site-packages/text_generation_server/utils/dist.py", line 64, in get_torch_dtype
Shard 1:     dt = getattr(torch, dtype_str, None)
Shard 1:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Shard 1: TypeError: attribute name must be string, not 'NoneType'
```

After this change, a value is always set before calling `fit_memory_scaling_model`. (Signed-off-by: Travis Johnson)
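A minimal sketch of the ordering fix in #101. Only the ordering (validate and default first, then fit) reflects the actual change; the default dtype, the function signatures, and the stub body are assumptions.

```python
# Ordering sketch for PR #101; signatures and the default value are hypothetical.

def serve(model_name: str, dtype_str: str | None = None, quantize: str | None = None):
    # Validate and default parameters first ...
    if dtype_str is None:
        dtype_str = "float16"  # hypothetical default; the server's real default may differ
    # (quantize would be validated here as well)

    # ... so that fit_memory_scaling_model never receives dtype_str=None, which
    # previously crashed in get_torch_dtype() with
    # "TypeError: attribute name must be string, not 'NoneType'".
    fit_memory_scaling_model(model_name, dtype_str, quantize)


def fit_memory_scaling_model(model_name: str, dtype_str: str, quantize: str | None):
    ...  # stand-in for the server's actual memory-scaling routine
```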
(Showing the 30 most recent events; older activity is paginated.)