
[Feature Request]: Add additional metrics to METRIC_REGISTRY for Retrieval and support for different K-s #11030

Open
hkristof03 opened this issue Feb 20, 2024 · 5 comments
Labels: enhancement (New feature or request), triage (Issue needs to be triaged/prioritized)

Comments

@hkristof03
Contributor

Feature Description

Hi, thanks for this awesome library.

I found a few things that seem inconsistent (at least to me) in the evaluation across modules.

The RetrievalEvaluator only supports Hit Rate and MRR (and Cohere rerank relevancy, which is a paid service).

class RetrieverEvaluator(BaseRetrievalEvaluator):

METRIC_REGISTRY: Dict[str, Type[BaseRetrievalMetric]] = {
    "hit_rate": HitRate,
    "mrr": MRR,
    "cohere_rerank_relevancy": CohereRerankRelevancyMetric,
}

At the same time, BeirEvaluator supports NDCG, MAP, Recall, and Precision at different K values.

BEIR itself supports additional metrics as well:

https://github.com/beir-cellar/beir/blob/f062f038c4bfd19a8ca942a9910b1e0d218759d4/beir/retrieval/evaluation.py#L94

So my questions are:

  1. For the RetrievalEvaluator, why do you only support MRR and Hit Rate, and not NDCG, MAP, Recall, and Precision at different K values?
  2. For BEIR, why don't you support the additional metrics?
  3. Are you planning to add these metrics to the METRIC_REGISTRY?

Thanks!

Reason

Only basic metrics are supported for retrieval, independent of the number K of retrieved documents.

Value of Feature

Additional metrics would give greater insight into retrieval and ranking quality.

@hkristof03 added the enhancement (New feature or request) and triage (Issue needs to be triaged/prioritized) labels on Feb 20, 2024

dosubot bot commented Feb 20, 2024

🤖

Thank you for your detailed questions. Let's address them one by one:

  1. The RetrievalEvaluator currently supports MRR (Mean Reciprocal Rank) and Hit Rate because these are the metrics that have been implemented in the METRIC_REGISTRY. The METRIC_REGISTRY is a dictionary that maps metric names to their corresponding classes, and the resolve_metrics function uses this registry to create instances of the metrics for evaluation. Additional metrics such as NDCG (Normalized Discounted Cumulative Gain), MAP (Mean Average Precision), Recall, and Precision at different K values are not currently supported simply because they have not been implemented in the METRIC_REGISTRY.

  2. The BEIR evaluator supports a different set of metrics because it targets a different context: it evaluates retrieval models on the BEIR benchmark, which spans a variety of tasks and datasets, and the metrics it exposes are the ones relevant to that benchmark.

  3. As for plans to add these additional metrics to the METRIC_REGISTRY, the available context does not say, so I cannot tell you whether such plans exist.

To include these additional metrics, new classes would need to be created for each metric, similar to the HitRate, MRR, and CohereRerankRelevancyMetric classes. These new classes would need to inherit from BaseRetrievalMetric and implement the compute method to calculate the metric. Once these classes are created, they can be added to the METRIC_REGISTRY dictionary.
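
To make that concrete, here is a rough sketch of such a class. It is only an illustration: the import path and the exact compute signature are assumptions based on the existing HitRate/MRR implementations, and PrecisionAtK with its k field is a hypothetical metric, not something that exists in the library.

from typing import Any, List, Optional

# Assumed import path; it differs between llama_index versions.
from llama_index.core.evaluation.retrieval.metrics_base import (
    BaseRetrievalMetric,
    RetrievalMetricResult,
)


class PrecisionAtK(BaseRetrievalMetric):
    """Hypothetical Precision@K: fraction of the top-k retrieved ids that are relevant."""

    metric_name: str = "precision_at_k"
    k: int = 10

    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        if not expected_ids or not retrieved_ids:
            raise ValueError("Both expected_ids and retrieved_ids are required")
        top_k = retrieved_ids[: self.k]
        hits = sum(1 for doc_id in top_k if doc_id in set(expected_ids))
        return RetrievalMetricResult(score=hits / self.k)


# Registering it would then be a one-line change in the metrics module:
# METRIC_REGISTRY["precision_at_k"] = PrecisionAtK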

I hope this answers your questions. If you have any further queries, feel free to ask.


@hkristof03 changed the title from "[Feature Request]:" to "[Feature Request]: Add additional metrics to METRIC_REGISTRY for Retrieval and support for different K-s" on Feb 20, 2024
@aoezdTchibo

aoezdTchibo commented Apr 10, 2024

It is a pity that there are no explicit plans to integrate further metrics like MAP@K or NDCG@K, as I now have to implement these metrics locally myself, although I would like to use the RetrievalEvaluator out of the box...

EDIT: @hkristof03 If it suits your use case, you could use the InformationRetrievalEvaluator from the sentence-transformers framework.
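
For reference, usage of that evaluator looks roughly like the sketch below (this reflects my understanding of the sentence-transformers API; the metric-selection arguments and the return type of the call vary between versions, and the model name and toy data are just placeholders):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Queries and corpus are plain dicts keyed by id; relevant_docs maps each
# query id to the set of corpus ids that count as relevant for that query.
queries = {"q1": "What is retrieval augmented generation?"}
corpus = {
    "d1": "RAG combines a retriever with a generator ...",
    "d2": "Some unrelated text.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[5, 10],
    map_at_k=[10],
    precision_recall_at_k=[1, 5, 10],
    mrr_at_k=[10],
)

model = SentenceTransformer("all-MiniLM-L6-v2")
scores = evaluator(model)  # NDCG@K, MAP@K, Precision/Recall@K, MRR@K over the toy corpus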

@hkristof03
Contributor Author

@aoezdTchibo I solved it by taking the node ids (keys) and embeddings (values) from the "embedding_dict" returned by the vector store's ".to_dict()" method, building the node id to doc id map from "text_id_to_ref_doc_id", then creating my own faiss index and evaluating retrieval with the retrieval metrics from Torchmetrics. It sounds complicated, but it is not that much work in practice. I did have to write several workarounds for this library to be able to test the RAG system's retrieval component (and its subcomponents) on several datasets; for experimentation and extensive evaluation, LlamaIndex is not yet well set up.
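
A rough sketch of that approach is below. Assumptions: the vector store's .to_dict() exposes "embedding_dict" and "text_id_to_ref_doc_id" as described above, query_embeddings is a float32 array of shape (num_queries, dim) produced elsewhere, and relevant_doc_ids[i] is the set of ground-truth doc ids for query i; all variable names are illustrative.

import faiss
import numpy as np
import torch
from torchmetrics.retrieval import RetrievalMRR, RetrievalNormalizedDCG

# Pull node ids and embeddings out of the vector store dump.
store_dict = vector_store.to_dict()
embedding_dict = store_dict["embedding_dict"]
node_to_doc = store_dict["text_id_to_ref_doc_id"]
node_ids = list(embedding_dict.keys())
embeddings = np.asarray([embedding_dict[n] for n in node_ids], dtype="float32")

# Build a flat inner-product index over the node embeddings.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Retrieve the top-k nodes per query and flatten the results into the
# (preds, target, indexes) format that Torchmetrics expects.
k = 10
sim_scores, neighbors = index.search(query_embeddings, k)

preds, targets, indexes = [], [], []
for q_idx, row in enumerate(neighbors):
    for rank, node_pos in enumerate(row):
        doc_id = node_to_doc[node_ids[node_pos]]
        preds.append(float(sim_scores[q_idx][rank]))      # similarity as the relevance score
        targets.append(doc_id in relevant_doc_ids[q_idx])  # ground-truth relevance
        indexes.append(q_idx)                               # groups rows by query

preds = torch.tensor(preds)
targets = torch.tensor(targets)
indexes = torch.tensor(indexes)

print("MRR:", RetrievalMRR()(preds, targets, indexes=indexes))
print("NDCG@10:", RetrievalNormalizedDCG(top_k=k)(preds, targets, indexes=indexes))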

@aoezdTchibo

@hkristof03 Thanks for the link to Torchmetrics! I hadn't heard of it before and it looks very promising.

@AgenP
Contributor

AgenP commented May 13, 2024

Hey @hkristof03,

Quick note in case you are still interested:

My PR adding MRR@K and HitRate@K options was recently merged; they are enabled through a "use_granular_..." attribute.

So we now have more native flexibility for our evals.

Hopefully you find it valuable 💪
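
For anyone landing here later, usage presumably looks something like the sketch below. The attribute names "use_granular_hit_rate" / "use_granular_mrr", the import path, and the behaviour shown are my assumptions about the "use_granular_..." flag mentioned above, so check the merged PR for the real API.

from llama_index.core.evaluation.retrieval.metrics import MRR, HitRate

# Assumed: the granular flag scores partial matches within the retrieved
# list instead of returning a single binary hit / reciprocal-rank value.
hit_rate_at_k = HitRate(use_granular_hit_rate=True)  # assumed attribute name
mrr_at_k = MRR(use_granular_mrr=True)                # assumed attribute name

result = hit_rate_at_k.compute(
    expected_ids=["doc_1", "doc_2"],
    retrieved_ids=["doc_3", "doc_1", "doc_4"],
)
print(result.score)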
