
Tesla P40 does not support flash_attn: pipeline inference reports an error #432

Open

love94me opened this issue Aug 26, 2024 · 3 comments

@love94me
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    '/home/workspace/model/internlm-xcomposer2d5-7b',
    backend_config=TurbomindEngineConfig(tp=4),
    model_name='yinhe-vl-chat',
)
image = load_image('/home/workspace/123.png')

response = pipe(('describe this image', image))
print(response.text)
```

How can I turn this off so that flash_attn acceleration is not used?

```
(.venv) root@e9c6a6e513d9:/home/workspace#  cd /home/workspace ; /usr/bin/env /home/workspace/.venv/bin/python /config/extensions/ms-python.debugpy-2024.8.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 44773 -- /home/workspace/test.py
/home/workspace/.venv/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-08-26 18:06:01,470 - modelscope - INFO - PyTorch version 2.2.1+cu118 Found.
2024-08-26 18:06:01,475 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-08-26 18:06:01,553 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 678b20b607a9b7a970b699aea2b84212 and a total number of 980 components indexed
Set max length to 16384
Dummy Resized
Device does not support bfloat16. Set float16 forcefully
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR]  Assertion fail: /lmdeploy/src/turbomind/kernels/attention/attention.cu:35
```
@yuhangzang
Collaborator

You can check the definition in config.json and configuration_internlm_xcomposer2.py.

Note that you may run into out-of-memory problems if you do not use flash-attention.
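
For reference, a minimal sketch of what that setting looks like at the transformers level (paths taken from this issue; that InternLM-XComposer's remote code honours the `attn_implementation` kwarg is an assumption, not something verified here):

```python
# Sketch only: assumes transformers >= 4.36, where from_pretrained accepts
# attn_implementation; 'eager' selects the plain PyTorch attention path.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    '/home/workspace/model/internlm-xcomposer2d5-7b',
    torch_dtype=torch.float16,    # the P40 has no bfloat16 support (see log above)
    attn_implementation='eager',  # do not use flash_attn / SDPA kernels
    trust_remote_code=True,
)
```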

@love94me
Author

> You can check the definition in config.json and configuration_internlm_xcomposer2.py.
>
> Note that you may run into out-of-memory problems if you do not use flash-attention.

I have changed both config.json and configuration_internlm_xcomposer2.py to the following, but the problem persists. The error only appears when I run pipeline inference.

`"attn_implementation": "eager",`

@yuhangzang
Collaborator

You may set a breakpoint here and check the value of config.attn_implementation.
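
One way to run that check outside a debugger (a sketch assuming the standard transformers config API; the attribute name follows this thread):

```python
# Sketch: confirm the value that actually gets loaded from the edited config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    '/home/workspace/model/internlm-xcomposer2d5-7b',
    trust_remote_code=True,
)
print(getattr(config, 'attn_implementation', None))  # expect 'eager' after the edit
```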
