
Be able to cache embeddings and load them #946

Open
orionw opened this issue Jun 17, 2024 · 12 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@orionw
Contributor

orionw commented Jun 17, 2024

For most users, being able to cache their embedded docs and/or provide a cached embedding file is probably overkill.

However, there are many situations where it would be helpful to have an option to cache them. For example, experiments where you alter the query/document set for speedups (as I'm doing now), or testing the effect of different prefixes/instructions over the same dataset.

I typically use pyserini to cache the index so that we can quickly search over it later, but that doesn't integrate nicely with mteb. I think it would be fairly straightforward to implement this: (1) take in a flag for whether to cache the embeddings, writing them to a file that corresponds to the dataset and model name, and (2) provide an option to read in a cached embedding file.

I don't have bandwidth for this right now, but if anyone does it would be an excellent addition.
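
For concreteness, here is a rough sketch of steps (1) and (2) as standalone helpers. Every name below is hypothetical, and nothing like this exists in mteb today:

import os

import numpy as np


def embedding_cache_path(dataset_name, model_name, cache_dir="embedding_cache"):
    # (1) One cache file per (dataset, model) pair, as proposed above.
    safe_model = model_name.replace("/", "__")
    return os.path.join(cache_dir, f"{dataset_name}__{safe_model}.npy")


def save_embeddings(embeddings, dataset_name, model_name):
    path = embedding_cache_path(dataset_name, model_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    np.save(path, embeddings)


def load_embeddings(dataset_name, model_name):
    # (2) Read back a previously cached embedding file.
    return np.load(embedding_cache_path(dataset_name, model_name))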

@orionw orionw added good first issue Good for newcomers enhancement New feature or request labels Jun 17, 2024
@tenzu15

tenzu15 commented Jun 17, 2024

Hey @orionw,

I would like to try this if possible!

@orionw
Contributor Author

orionw commented Jun 17, 2024

Awesome @tenzu15! It would be great to be able to pass two flags in the mteb.run command, something like cache_embeddings: bool = True and cached_embedding_file: str.

This would need to be implemented in the RetrievalEvaluator class for now; if it's useful for other tasks, we can implement it there as well. Also cc'ing @KennethEnevoldsen, who may have opinions on where this should be added / what the names should be.

But feel free to start, @tenzu15. If you have any questions, make a draft PR and cc me!
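
For concreteness, usage with the proposed flags might look like this (hypothetical: neither flag exists in mteb at the moment, and the cache file name is just an example):

from sentence_transformers import SentenceTransformer

import mteb

model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)

# First run: encode and cache. Later runs: point at the cached
# file to skip re-encoding the corpus entirely.
results = evaluation.run(
    model,
    cache_embeddings=True,
    cached_embedding_file="NFCorpus__all-MiniLM-L6-v2.npy",
)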

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jun 18, 2024

@orionw, wouldn't it be better to implement a more general model wrapper for this so that it works for all tasks?

class ModelWrap:
    def __init__(self, model):
        self.model = model

    def encode(self, sentences, **kwargs):
        embeddings = self.model.encode(sentences, **kwargs)
        self.store_embeddings(sentences, embeddings)
        return embeddings

    def store_embeddings(self, sentences, embeddings):
        # Persist the (sentence, embedding) pairs; left abstract in this sketch.
        ...
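
A minimal disk-backed version of that sketch, assuming a sentence-transformers-style encode and numpy for storage (hypothetical class, not part of mteb):

import hashlib
import os

import numpy as np


class CachedModel:
    def __init__(self, model, cache_dir="embedding_cache"):
        self.model = model
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, sentence):
        # One .npy file per sentence, keyed by a hash of its text.
        key = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".npy")

    def encode(self, sentences, **kwargs):
        # Embed only the sentences that are not already cached on disk.
        missing = [s for s in sentences if not os.path.exists(self._path(s))]
        if missing:
            for s, emb in zip(missing, self.model.encode(missing, **kwargs)):
                np.save(self._path(s), emb)
        return np.stack([np.load(self._path(s)) for s in sentences])

Because the wrapper only touches encode, the same wrapped model object could be passed to any mteb task unchanged.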

@isaac-chung
Collaborator

There's some background discussion related to the topic from #354 (comment) as well.

@orionw
Contributor Author

orionw commented Jun 18, 2024

+1 @KennethEnevoldsen, I think a wrapper is a great idea and even simpler to implement.

@KennethEnevoldsen
Contributor

It sounds like we settled on a wrapper here, in which case I don't think it is something that should live within mteb. Let me know if you disagree, and I will re-open the issue.

@orionw
Contributor Author

orionw commented Sep 9, 2024

Personally, I think it'd be nice to have this as full functionality in MTEB so you can cache things. Maybe it's just my research, but not having to recompute the embeddings would save a lot of time, and I frequently store them with Pyserini instead. If this were in MTEB, it would also allow us to put the indexes on HF so people can just grab and use them.

If no one else finds it useful we can leave it unimplemented, but I personally would find it very useful.

@KennethEnevoldsen
Contributor

Will leave this open then.

Definitely think public caches are important - the Scandinavian embedding benchmark implements one for results. Is there a reason such an approach would not work here? Debugging and error analysis, I presume?

@orionw
Contributor Author

orionw commented Sep 9, 2024

Thanks! I wasn't aware of the Scandinavian embedding benchmark cache - do you mind linking?

@KennethEnevoldsen
Contributor

The cache for results is here:

https://github.com/KennethEnevoldsen/scandinavian-embedding-benchmark/tree/main/src/seb/cache

It is implemented as part of the package: if you try to re-run an already-run model, it will simply use the cache.
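
The pattern, roughly (a hypothetical sketch of the idea, not seb's actual code; run_fn stands in for whatever evaluation call produces a result):

import json
import os


def load_or_run(model_name, task_name, run_fn, cache_dir="cache"):
    # Results committed to the repo are reused; anything missing is
    # computed once and written back for the next run.
    path = os.path.join(cache_dir, model_name, f"{task_name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = run_fn(model_name, task_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result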

@orionw
Contributor Author

orionw commented Sep 10, 2024

These are the cached results, right? I don't see any embeddings, but maybe I missed them.

@KennethEnevoldsen
Contributor

Yeah, only results, so no embeddings.
