Better threshold for fuzzy search? #80

Open
kberry opened this issue Jul 25, 2022 · 3 comments
Labels: change (Changing program's behavior), enhancement (New feature or request), question (Further information is requested), search (Searching)

Comments

kberry (Contributor) commented Jul 25, 2022

`texdoc foobar` finds `float.pdf` and other float and hep-float docs.

I see there are three characters in common, but it seems like too fuzzy of a match. But maybe I think that only because I'm a human (at least, I think I am; sometimes it seems a fuzzy question :). Just thought I'd mention ...

wtsnjp (Member) commented Jul 26, 2022

Yes, I agree that "foobar" and "float" look quite different to me as well, but their Levenshtein distance is just three: (1) replace the first "o" with "l", (2) delete "b", and (3) replace "r" with "t".

A distance of three is not very significant for long strings, but it is already too large for short strings, as in this case.

One solution is to use a normalized Levenshtein distance. Let $q$ be a query and $c$ be a candidate match for $q$. There are several ways to normalize the distance, but one of the simplest is:

$$d = \frac{\operatorname{Levenshtein}(q, c)}{\max(\operatorname{len}(q), \operatorname{len}(c))}$$

We can then use the value of such a normalized Levenshtein distance variant as the threshold.
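To make the formula concrete, here is a minimal Python sketch of the normalization. It is only an illustration, not texdoc's actual code; the function names `levenshtein` and `normalized_distance` are made up for this example, and returning 0.0 for two empty strings is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # replace ca with cb, or keep
        prev = cur
    return prev[-1]


def normalized_distance(q: str, c: str) -> float:
    """d = Levenshtein(q, c) / max(len(q), len(c)).

    Assumption for this sketch: two empty strings give 0.0.
    """
    longest = max(len(q), len(c))
    return levenshtein(q, c) / longest if longest else 0.0


print(levenshtein("foobar", "float"))          # 3
print(normalized_distance("foobar", "float"))  # 0.5
```

For the pair from this issue, the raw distance of 3 becomes a normalized distance of 3/6 = 0.5, so half of the longer string has to be edited.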

There is one small thing to consider: we already have a config item, `fuzzy_level`, which was originally expected to hold the Levenshtein distance itself and can only take an integer value. We might want to maintain backward compatibility for it.

wtsnjp changed the title from "texdoc foobar finds float" to "Better threshold for fuzzy search?" on Aug 2, 2022
wtsnjp added the enhancement (New feature or request) label on Aug 2, 2022
wtsnjp added the change (Changing program's behavior) label on Feb 18, 2023
wtsnjp (Member) commented Feb 19, 2023

We need a good formula to convert an integer value of `fuzzy_level` into a threshold for the normalized Levenshtein distance. We want `fuzzy_level` to work roughly the same as before for keywords whose lengths are around six.

wtsnjp added the question (Further information is requested) and search (Searching) labels on Feb 19, 2023
wtsnjp (Member) commented Feb 25, 2023

Perhaps just using `fuzzy_level / 10` as the threshold is OK.
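A small Python sketch of what this mapping would imply, under some assumptions: the comparison is inclusive (normalized distance <= threshold), the distance is normalized by the longer of the two strings, and the helper names are hypothetical. The `fuzzy_level` values used below are picked only for illustration.

```python
def normalized_threshold(fuzzy_level: int) -> float:
    """Proposed mapping from the integer config value to a normalized threshold."""
    return fuzzy_level / 10


def max_allowed_edits(fuzzy_level: int, longest_len: int) -> int:
    """Largest raw distance d satisfying d / longest_len <= fuzzy_level / 10.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return fuzzy_level * longest_len // 10


def is_match(distance: int, q_len: int, c_len: int, fuzzy_level: int) -> bool:
    """Inclusive comparison (an assumption of this sketch)."""
    return 10 * distance <= fuzzy_level * max(q_len, c_len)


# How many edits each level tolerates for keywords of length 6 and 12.
for level in (3, 5, 7):
    print(level, max_allowed_edits(level, 6), max_allowed_edits(level, 12))

# "foobar" vs "float": raw distance 3, lengths 6 and 5.
print(is_match(3, 6, 5, fuzzy_level=5))  # True  (3/6 = 0.5 <= 0.5)
print(is_match(3, 6, 5, fuzzy_level=4))  # False (0.5 > 0.4)
```

Under these assumptions, "foobar" versus "float" sits exactly on the boundary at a level of 5, so whether the comparison is strict or inclusive decides if that pair is still reported.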
