
Gliner on CPU with multiple cores #155

Open · vijayendra-g opened this issue Jul 16, 2024 · 14 comments
@vijayendra-g commented Jul 16, 2024

I want to use GLiNER on CPU. The medium model takes anywhere between 18 and 20 minutes to extract entities from a given text.
My questions are:

  1. Does GLiNER support multiple cores on a single CPU, and multiple CPUs? Will there be an improvement in performance?
  2. Assuming the answer to question 1 is yes: if I were to increase the number of cores and CPUs, what sort of time improvement could we expect? Has anyone tried this? (See the thread-count sketch below for one way to experiment.)
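Since GLiNER runs on top of PyTorch, the number of CPU cores it uses is governed by torch's thread settings rather than by GLiNER itself. A minimal sketch for experimenting with core counts (the thread values here are illustrative, not a recommendation):

import time

import torch
from gliner import GLiNER

# Intra-op threads: cores used inside a single op (e.g. a matmul).
torch.set_num_threads(8)
# Inter-op threads: cores used across independent ops; must be set
# before any parallel work starts.
torch.set_num_interop_threads(1)

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = "The Football Association standardized the rules of football in 1863 in England."
labels = ["year", "country", "features"]

start = time.time()
entities = model.predict_entities(text, labels)
print(f"{time.time() - start:.2f}s with {torch.get_num_threads()} threads")

In general, single-process CPU inference stops scaling well before very high core counts, so it is worth measuring a few values rather than assuming a linear speedup across all 128 logical CPUs.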
@urchade (Owner) commented Jul 16, 2024

Hi, 20 min for a single text?

@vijayendra-g (Author) commented Jul 16, 2024

Yes, it's a small paragraph.

@urchade (Owner) commented Jul 16, 2024

This should take seconds.

@vijayendra-g (Author) commented Jul 16, 2024

Can you please review my code?

import time

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

start = time.time()
text = """
The history of the football is as rich and varied as the game itself. Ancient civilizations, including the Chinese, Greeks, and Romans, played early forms of football. These rudimentary games involved kicking a ball, albeit with different rules and objectives. The modern football, as we know it, took shape in the 19th century, primarily in England. The establishment of standardized rules by the Football Association in 1863 marked a significant milestone, paving the way for the football to become a global icon."""

labels = ["year", "country", "features"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

end = time.time()
print(end - start) # time in seconds

In my case:

  • The text length is 1,200 words and I am trying to extract 4–5 entities.
  • Machine details: RAM 32 GB.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.55

@BeanWei commented Jul 17, 2024

I'm also trying to run in a CPU environment. I tried the sample code you provided, and it returned results in about 0.8 seconds. However, my model is loaded manually by specifying a local directory; could that be why it's faster? By the way, from what I've observed, running on CPU is very resource-intensive: when I run predict_entities multiple times in my local environment, CPU usage stays pinned at 100%, with no progress so far. So I have a similar question about running GLiNER in a resource-constrained environment.
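One way to keep GLiNER from saturating every core in a constrained environment is to cap the thread pools before the model is loaded. A minimal sketch (the cap of 4 threads is an arbitrary choice for illustration):

import os

# These env vars are read once when the numeric libraries initialize,
# so set them before importing torch.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import torch
from gliner import GLiNER

torch.set_num_threads(4)  # explicit cap on torch's own intra-op pool

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

entities = model.predict_entities("Alice flew from Paris to Tokyo.", ["person", "location"])
for entity in entities:
    print(entity["text"], "=>", entity["label"])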

@vijayendra-g (Author) commented Jul 17, 2024

@BeanWei If I understand you correctly, you mean this:

model = GLiNER.from_pretrained("/home/.../gliner_medium-v2")

I made that change, and it still runs in minutes. The rest of the script is unchanged from my earlier comment:

import time

from gliner import GLiNER

model = GLiNER.from_pretrained("/home/.../gliner_medium-v2")

start = time.time()
text = """..."""  # same football-history paragraph as above

labels = ["year", "country", "features"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

end = time.time()
print(end - start)  # time in seconds

@polodealvarado commented Aug 7, 2024

Hello @vijayendra-g.

I tried to replicate your results and also got run times in the minutes.
On the other hand, I tried it with ONNX, but it takes more time than plain torch (about two times slower).

Has anyone tested it with ONNX?

@polodealvarado commented

I fixed the code and now it works for me.
I got a 30% improvement in speed.

@vijayendra-g (Author) commented Aug 8, 2024

@polodealvarado What is the code fix? How much time does GLiNER-medium take now? Please specify the GLiNER version as well.

@psydok commented Aug 19, 2024

@polodealvarado @vijayendra-g I have encountered the same problem: ONNX runs 2x slower than the normal model.
It was said that this thread has the needed fix, but that doesn't seem to be it. Can you tell me what kind of fixes are needed? How did you find the bug?
I don't often work with ONNX, and I don't understand what could be wrong with the current code.

The quantized model also degrades a lot in quality, more than the degradation one would normally expect from quantization.

@polodealvarado commented Aug 19, 2024

@psydok @vijayendra-g My problem arose with the sequence length. I realized that with a sequence longer than 512 tokens the ONNX model takes a lot of time, so I simply shortened the input (see the sketch below).

However, as you said @psydok, there is a significant degradation in the model's performance with the quantized versions (up to 100% in some cases).
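A minimal sketch of that shortening approach: chunking long input on whitespace words as a rough, tokenizer-free proxy for the token budget. The 300-word chunk size is an assumption chosen to stay safely under 512 tokens, and the helper predict_in_chunks is hypothetical, not part of the GLiNER API:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
labels = ["person", "location"]

def predict_in_chunks(text, labels, max_words=300):
    # Split on whitespace and process fixed-size word windows so each
    # call stays under the model's sequence limit.
    words = text.split()
    entities = []
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i + max_words])
        # Note: "start"/"end" offsets in the results are relative to
        # each chunk, not to the full text.
        entities.extend(model.predict_entities(chunk, labels))
    return entities

long_text = "..."  # any document longer than 512 tokens
for entity in predict_in_chunks(long_text, labels):
    print(entity["text"], "=>", entity["label"])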

@psydok commented Aug 19, 2024

@polodealvarado
Thank you for your answer!
I ran an experiment: I limited the sequence to 512. The model's response time did improve, but the timings are about the same with ONNX as without it...
I checked it like this (labels=["person", "location"]):

# model = GLiNER.from_pretrained("my_models/gliner_multi", load_onnx_model=True, load_tokenizer=True)
%%timeit
entities = model.predict_entities(text[:512], labels, threshold=0.25)

onnx: 79.4 ms
without onnx: 72.9 ms

%%timeit
entities = model.predict_entities(text[:384], labels, threshold=0.25)

onnx (opset_version=14): 69.1 ms
without onnx: 65.4 ms

That is, there is still no increase in speed... I converted the model following the guide in this repository: https://github.com/urchade/GLiNER/blob/main/examples/convert_to_onnx.ipynb

@urchade (Owner) commented Aug 19, 2024

text[:512] takes the first 512 characters, not tokens.

You can change the maximum size by setting model.config.max_len = 512
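Putting both points together, a small sketch that raises max_len and trims the input on words rather than characters (the word cut is still only a rough proxy for the true token count):

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi")
labels = ["person", "location"]
text = "Barack Obama visited Berlin. " * 100  # deliberately long input

# Raise the sequence limit from its default (384, per the next comment).
model.config.max_len = 512

# text[:512] would cut 512 characters; cutting on words tracks the
# token budget more closely, though only the tokenizer knows exactly.
short_text = " ".join(text.split()[:384])

entities = model.predict_entities(short_text, labels, threshold=0.25)
for entity in entities:
    print(entity["text"], "=>", entity["label"])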

@psydok commented Aug 19, 2024

The default value is model.config.max_len = 384.
I tried more models from onnx-community; either the quality is terrible, or it is effectively just the original model...
https://huggingface.co/onnx-community/gliner_multi
