Wrestling with Lucene.NET: Ordering search results by metadata

This post is something of a continuation of last month's post, which covered the use of Lucene for the same use case at an earlier stage: https://www.elbisch.ch/2019/05/31/full-text-search-for-database-entities-with-lucene-net/

 

Lucene is a search engine, but it can do more than just search. Calculating scores that represent the relevancy of one search result in comparison to another, and thus ordering the results, is just as much a core feature of the framework as indexing and searching. And as with any other aspect of Lucene, the defaults for sorting and scoring work well for many use cases, but almost every aspect can be customized if what Lucene delivers by default is not what is wanted. However, some scenarios are considerably harder to achieve than others, and I want to showcase one of them in this post.

 

In this post I describe a solution I developed for a use case where the main factor for sorting is metadata. That metadata consists of numerical values, each corresponding to an indexed field; the higher the value, the more the field is worth. A field's "worth" should then be factored into the sort score.

 

More concretely, I index a number of employees along with their technological skills. Each skill has a name (e.g. "Java" or "Entity Framework") and a skill grade ranging from 1 (Junior) to 5 (Senior Expert). The search function should allow searching for employees with certain skills and should order the results by skill grade, descending, so that the most proficient employees appear first. However, since Lucene is a full-text search engine and many skills have similar names (e.g. "java" might match the skills "Java", "Java SE", "Javascript" and so on), there may be multiple matching skills whose skill grades influence the sort order. Different kinds of searches can be executed explicitly or implicitly, namely exact queries (Term- and PhraseQueries) as well as Wildcard- and FuzzyQueries.

Concept

Before heading straight into the code, it's worth concisely spelling out the goal: how the scores should be calculated, and which factors should influence them.

 

The concept for this use case is: when an employee has a skill that matches the search term, the score should be increased by the skill grade. During a search for "java", a junior Java developer would thus get a score of 1, while a senior Java expert would get a score of 5. This is done for every matching skill, so an employee with two matching skills, one with skill grade 5 and one with skill grade 3, would get a score of 8 (= 5 + 3).

 

Depending on which type of Query matched a skill, an additional factor is applied which boosts exact queries and penalizes inexact ones.

Creating a custom Similarity

One might expect that the way Lucene calculates scores by default is simple. It is not. The scoring formula consists of many terms, which in turn are resolved by mostly non-trivial calculations. A detailed description of the scoring function can be found here: https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/search/Similarity.html

 

There are several terms which should not play a role in this use case's scoring, such as the query norm, the length norm, and the inverse document frequency. In order to cancel them out of the formula, they need to be set to a constant 1. What should remain are the term frequency (how often a matching term appears in the document) and the boost (used to boost matches by skill grade and to apply the query-type boosts).
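With those terms fixed to 1, the practical score of a single matching clause collapses to term frequency times boost, summed over all matching clauses. As a sketch, with a hypothetical helper that is not part of Lucene's API:

```csharp
// With Idf, Coord, QueryNorm and LengthNorm all returning a constant 1,
// the remaining per-clause score is just term frequency times boost.
// Hypothetical helper for illustration only.
static float SimplifiedClauseScore(float termFrequency, float boost)
    => termFrequency * boost;

// A skill token matching once in a clause boosted by skill grade 5
// contributes SimplifiedClauseScore(1f, 5f), i.e. 5, to the total.
```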

 

This can be done pretty easily by implementing a custom Similarity class:

public class CustomCountingSimilarity : DefaultSimilarity
{
    //Inverse document frequency: rare terms should not score higher.
    public override float Idf(long docFreq, long numDocs)
    {
        return 1f;
    }
 
    //Term frequency: kept as-is, so every occurrence counts.
    public override float Tf(float freq)
    {
        return freq;
    }
 
    //Coordination factor: don't reward documents matching more clauses.
    public override float Coord(int overlap, int maxOverlap)
    {
        return 1f;
    }
 
    //Query norm: don't normalize scores across queries.
    public override float QueryNorm(float sumOfSquaredWeights)
    {
        return 1f;
    }
 
    //Length norm: don't penalize longer fields.
    public override float LengthNorm(FieldInvertState state)
    {
        return 1f;
    }
}

That Similarity needs to be handed to the IndexWriter when indexing and to the IndexSearcher when searching:

var config = new IndexWriterConfig(
    LuceneVersion.LUCENE_48, Analyzer)
{
    Similarity = new CustomCountingSimilarity()
};
 
_indexWriter = _indexWriter ?? new IndexWriter(Directory, config);
var indexSearcher = new IndexSearcher(indexReader)
{
    Similarity = new CustomCountingSimilarity()
};

Indexing metadata

Generally, only the data that should be searchable needs to be indexed, which would imply that only textual data is written to the index. In this case, however, Lucene should be able to use non-textual metadata for scoring, and in order to gain access to that metadata it is necessary to write it into the index, as there is nowhere else Lucene could draw data from.

 

Each skill of an employee is stored in a separate, isolated Field of the employee's Document (see the previous blog post for some notes on that design choice). The Fields' values are the only place where any data can be stored, and in order to keep a skill's name and its skill grade connected there is, as far as I could figure out, no other way than to write the skill grade into the same Field as the skill's name. I surround the numerical skill grade with square brackets and separate it from the skill name by a space. Since a WhitespaceTokenizer is used, this additional data won't influence the matching: the original tokens for the skill's name are still present unaltered; there is just one additional token carrying the skill grade.
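To illustrate the tokenization, here is a sketch using Lucene.NET's WhitespaceAnalyzer (the field name and value are examples): the stored value "Java SE [3]" is split into the tokens "Java", "SE" and "[3]", so a search for "java" still matches the name tokens.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);

using (var stream = analyzer.GetTokenStream(
    "Skill", new StringReader("Java SE [3]")))
{
    var termAttr = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();

    //Emits the tokens "Java", "SE" and "[3]" - the grade marker is
    //just one more token and leaves the name tokens unaltered.
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAttr.ToString());
    }
    stream.End();
}
```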

 

Thus, the indexing process for the "Skill" fields now looks like this:

foreach (var skill in employee.Skills)
{
    //Append the skill grade in [...] markup; default to 1 if none is set.
    var skillGrade = skill.SkillGrade ?? 1;
    var fieldValue = $"{skill.Knowledge.Name} [{skillGrade}]";
 
    //TextField is the non-obsolete Lucene.NET 4.8 equivalent of an
    //analyzed, stored Field.
    var field = new TextField(
        IndexField.Skill.ToString(),
        fieldValue,
        Field.Store.YES);
 
    doc.Add(field);
}

Writing metadata-boosted queries

Now for the toughest bit: telling Lucene that matches on skills should be dynamically boosted by the skill grade. Lucene offers two ways of boosting: index-time boosting and query-time boosting. The former allows applying a boost to a Field or to entire Documents. However, the Field boosts are combined into a fixed boost for the entire Document; Field boosting cannot be made to apply only when a match occurred on the respective Field. The boost is always applied, so that's not what is required here. Query boosting doesn't fit either: by the time it has been determined that a Field matched the Query, the Query has already been executed and its boost applied.

 

So is it impossible? No! Here's the workaround I came up with: as the number of possible skill grades is limited (1, 2, 3, 4, 5) and there is always only one skill grade per Field, it should work to "copy" the Query for the "Skill" field once per skill grade, i.e.

  • Skill:java [5]^5 (^ = Boost)
  • Skill:java [4]^4
  • Skill:java [3]^3
  • Skill:java [2]^2
  • Skill:java [1]^1

As there can be at most one skill grade per Field, at most one of these pseudo-queries will ever match.

 

But which type of Query should be chosen? Consider that when searching for e.g. "java", there could be a skill indexed as "Java SE [3]" which should still match, even though there is a non-matching token between the matching token and the skill grade token. Furthermore, the original or "core" query can also be a Phrase-, Wildcard- or FuzzyQuery, not only a TermQuery.

 

The answer: SpanNearQuery! With such a query it can be expressed that the search term "java" needs to match a token of a skill's name, and that somewhere near that token there needs to be another token matching the skill grade. A SpanNearQuery is composed of multiple SpanQueries. In this case two SpanQueries are needed: the first one representing the "core" query (i.e. "java needs to occur!") and the second one representing a term match for the skill grade (i.e. "[4]" needs to occur).
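As a minimal sketch (the field name, slop value and grade are illustrative), the combined query for a grade-4 Java match could look like this:

```csharp
//Core query: the token "java" must occur in the Skill field.
var core = new SpanTermQuery(new Term("Skill", "java"));

//Grade query: the marker token "[4]" must occur nearby.
var grade = new SpanTermQuery(new Term("Skill", "[4]"));

//Combine: "[4]" must appear within 2 tokens after "java" (in order);
//a match is boosted by the skill grade.
var javaGrade4 = new SpanNearQuery(
    new SpanQuery[] { core, grade }, 2, true)
{
    Boost = 4f
};
```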

 

The last challenge is how the different types of queries can be altered so that they become SpanQueries.

  • For TermQueries this is easy: instead of a TermQuery, a SpanTermQuery is used.
  • For the PhraseQuery another SpanNearQuery can be used (SpanNearQuery inherits from SpanQuery). Every word of the phrase is put into a SpanTermQuery, and these are then combined into a SpanNearQuery with slop 0. The slop expresses how many other tokens may appear in between.
  • For the Fuzzy- and WildcardQuery a newer addition to Lucene.NET is required (available from Lucene.NET 4.8.0): the SpanMultiTermQueryWrapper, whose sole purpose is to express queries inheriting from MultiTermQuery as SpanQueries.
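TermQuery as SpanQuery (a minimal sketch; the boost value is illustrative and not taken from the original code):

```csharp
//SpanTermQuery is a drop-in span replacement for TermQuery;
//the boost value here is an assumption for illustration.
var termSpanQuery = new SpanTermQuery(new Term(field, searchString))
{
    Boost = 2f
};
```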

PhraseQuery as SpanQuery:

var phraseWords = searchString.Split(' ');
var spanQueries = new List<SpanQuery>();
 
foreach (var phraseWord in phraseWords)
{
    spanQueries.Add(new SpanTermQuery(new Term(field, phraseWord)));
}
 
return new SpanNearQuery(spanQueries.ToArray(), 0, true)
{
    Boost = 2f
};

WildcardQuery as SpanQuery:

var wildcardQuery = new WildcardQuery(term);
 
var wildcardSpanQuery = 
    new SpanMultiTermQueryWrapper<WildcardQuery>(wildcardQuery)
{
    Boost = 0.5f
};

FuzzyQuery as SpanQuery:

var fuzzyQuery = new FuzzyQuery(
    new Term(field, searchString),
    _configuration.FuzzyQueryMaxEdits,
    _configuration.FuzzyQueryPrefixLength);
 
var spanFuzzyQuery =
    new SpanMultiTermQueryWrapper<FuzzyQuery>(fuzzyQuery)
{
    Boost = 0.1f
};

Final code snippet: combining the SpanQueries into SpanNearQueries with their respective boosts. Since in the bigger picture I query more fields than just the "Skill" field, I created a method which returns all Queries for a specific field and overrode that method for the Skill field, which behaves differently.

/// <summary>
/// Returns one or more queries which are to be conducted on the
/// specified field, based on the given search string and its
/// modifier.
///
/// This override adds special queries if the field is the "Skill"
/// field. Special queries are formulated which get a boost
/// according to the skill grade which a matching skill has
/// so that skills with a higher proficiency get a higher Lucene
/// score.
/// </summary>
/// <param name="field">Which Document Field to search in.</param>
/// <param name="searchStringPart">The search string with a
/// modifier indicating the type of search.</param>
protected override IEnumerable<Query> GetQueriesForField(
    string field,
    AnalyzedSearchStringPart searchStringPart)
{
    if (!field.Equals(EmployeeIndex.IndexField.Skill.ToString(), 
        StringComparison.Ordinal))
    {
        return base.GetQueriesForField(field, searchStringPart)
            .ToList();
    }
 
    //The "core" queries are the queries defined on the
    //textual fields.
    var coreQueries = new List<SpanQuery>();
 
    //Instead of using the basic Query subtypes, get transformed
    //variants as SpanQueries which then are able to be closely
    //combined with other queries on the same field.
    switch (searchStringPart.SearchModifier)
    {
        case SearchModifier.Phrase:
            coreQueries.Add(GetPhraseQueryAsSpanQuery(
                searchStringPart.Text, field));
            break;
        case SearchModifier.Wildcard:
            coreQueries.Add(GetWildcardQueryAsSpanQuery(
                searchStringPart.Text, field));
            break;
        case SearchModifier.None:
            coreQueries.AddRange(GetDefaultQueriesAsSpanQueries(
                searchStringPart.Text, field));
            break;
    }
 
    var newQueries = new List<Query>();
 
    var availableSkillGradeValues = _skillGradeRepository.Get()
        .Select(s => s.Value)
        .Distinct();
 
    /*
     * Repeat each core query once per available skill grade and
     * combine it with a SpanTermQuery for that SkillGrade into a
     * SpanNearQuery, which thus contains both the core query (which
     * can be a Term-, Fuzzy-, Phrase- or WildcardQuery) and a query
     * for the SkillGrade. As there is only one SkillGrade per Skill,
     * always only one of these repeated queries matches.
     *
     * These combined SpanNearQueries get boosted by the
     * respective SkillGrade.
     */
    foreach (var coreQuery in coreQueries)
    {
        //BooleanQuery grouping the repeated core queries for
        //each SkillGrade to one query.
        var skillGradeBoostedBooleanQuery = new BooleanQuery();
 
        //Add core query in combination with each SkillGrade.
        foreach (var skillGradeValue in availableSkillGradeValues)
        {
            if (skillGradeValue <= 0) continue;
 
            var spanQueryClauses = new List<SpanQuery>();
 
            //First part is the core query.
            spanQueryClauses.Add(coreQuery);
 
            //Second part is the TermQuery for the SkillGrade
            //with [...] markup.
            spanQueryClauses.Add(new SpanTermQuery(
                new Term(field, $"[{skillGradeValue}]")));
 
            //Combine queries with specified max slop.
            var spanNearQuery = new SpanNearQuery(
                spanQueryClauses.ToArray(), 
                _maxSlopBetweenSkillNameAndSkillGrade, 
                true)
            {
                //Boost query by the SkillGrade multiplied by the
                //boost for the core query.
                Boost = skillGradeValue * coreQuery.Boost
            };
 
            skillGradeBoostedBooleanQuery.Add(
                spanNearQuery, Occur.SHOULD);
        }
 
        newQueries.Add(skillGradeBoostedBooleanQuery);
    }
 
    return newQueries;
}

Checking the results

And does it work as it should? Given that Lucene can often feel like a black box with unknown inner workings, this question is always worth asking. Looking at the results in the UI, it actually looks very good:

However, inspecting how the scores are calculated using IndexSearcher.Explain shows that it's not yet quite what it should be:

However, the discrepancies don't lead to a wrong ordering; it still bothers me, of course. So back to the drawing board!