Full-text search for database entities with Lucene.NET

Apache Lucene, originally written in Java, has become a go-to library whenever search functionality is needed; it has by now been ported to many other languages and used in many other products, such as Elasticsearch. The C# port Lucene.NET is modelled very closely after the Java original. It offers a vast number of options to tailor the search behaviour to one's specific needs, but that flexibility comes at the price of increased complexity: a lot of terms and concepts need to be understood before the framework can be used for anything beyond the commonly described standard use-case. One such more specialized Use-Case for Lucene is described in this blog post. And since Lucene.NET deliberately resembles the original as closely as possible, the findings of this post may well apply to other languages and environments too.

 

The code and concepts in this post describe usage in an ASP.NET MVC app with Lucene.NET version 3.0.3. As the Lucene.NET API documentation is currently unavailable, refer to the corresponding Java API documentation for primary research: https://lucene.apache.org/core/3_0_3/api/all/index.html

The Use-Case

This post describes how Lucene.NET was used in a very specific Use-Case: searching for different kinds of database entities by looking for textual matches in the entities' properties as well as in other entities related to the searched entity. To clarify, here's an excerpt of the relational model, showing the "Employee" entity along with the properties which shall be searchable:

For the user, the search should be as simple to use as possible. The user interface consists of a single search text field and a set of filters which narrow down the viable employees. If the user enters a search term, e.g. "Java" (without double quotes, probably meaning someone searches for employees who know about Java), it's up to the search algorithm to interpret the search as best as possible and show the most relevant results. However, it should also be possible to be more precise, e.g. to enter "Java SE" surrounded by double quotes to make sure that only "Java SE" matches, but not "Java" by itself.

Indexing entities

The way Lucene, and probably most search engines, works is that a so-called Index is generated first.

Index

An Index is a set of files created by Lucene, containing all the data which should be searchable, stored in a way that maximizes search efficiency.

Lifecycle & data flow

An important consideration is when the Index should be created, deleted and updated. Since the data stored in the Index is, in this case, duplicated and decoupled from the original, living database data, it needs to be ensured that the two are always in sync, as the search results found in the Index will be correlated back to the database data.

 

In this case the following strategy is used:

  1. A separate Index is created for each searchable entity, in this case, one Index for Employee data, one for Project data, as the user can search for either Employees or Projects.
  2. The Indexes are deleted and freshly created upon every start of the (web) application. This avoids inconsistencies if the Document format (see below) changed, e.g. due to an update to the app.
  3. Whenever any of the searchable entities or their related entities are updated in the database, the respective Lucene Index is updated as well.

DRY - Don't repeat yourself: Generic classes

Multiple different entities can be searched for. The logic is 90% the same for each of those entities, which led to the choice of using a lot of generic classes. Especially in conjunction with Dependency Injection (using Ninject) it is then very easy to combine generic classes and interfaces with entity-specific implementations where required, as the wiring sketch below shows.
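
As an illustration, here is a minimal sketch of what that wiring could look like; the Ninject module itself is hypothetical, only the interface and class names correspond to the code shown later in this post:

public class LuceneModule : NinjectModule
{
    public override void Load()
    {
        //One Index instance per entity type for the entire
        //application lifetime (cf. the Singleton note further below).
        Bind<ILuceneIndex<Employee>>()
            .To<EmployeeIndex>()
            .InSingletonScope();
 
        //The generic classes can be bound as open generics
        //and instantiated per request.
        Bind(typeof(ILuceneWriter<>)).To(typeof(LuceneWriter<>));
        Bind(typeof(ILuceneQueryBuilder<>)).To(typeof(LuceneQueryBuilder<>));
        Bind(typeof(ILuceneReader<>)).To(typeof(LuceneReader<>));
    }
}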

Adding and removing entities to/from the Index

For this job the generic class LuceneWriter was created. The process involves a lot of entity-specific configuration, such as how to map the entities to their representations in the Index, or where on the file system to create the Index. The LuceneWriter therefore works with an entity-specific implementation of the interface ILuceneIndex, which contains those configurations.

 

The following terms are of importance when writing to a Lucene Index:

Directory

Specifies where and how to store the Index files.

Document

The representation of one searchable entity in the Index. For an entity to be stored in the Index it needs to be converted into a Document first. A Document is made up of Fields. A Document corresponds to e.g. one Employee instance.

Field

Part of a Document. A key/value pair, where the key is a string describing what kind of information it stores and the value is a part of the searchable text of this Document. The in-memory equivalent is a property: e.g. an Employee might have a "FirstName" property, in which case its Document would have a Field with key "FirstName" and the actual first name as value.

 

For each Field of a Document it can be specified whether its value should be stored in the Index (otherwise it is only available while the Document is still in memory) and whether its contents should be analyzed (cf. below) or stored unaltered.
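
For example, with the Field constructor that is also used in the code below, an analyzed name Field and an unanalyzed ID Field look like this:

//Stored and analyzed: the value is tokenized and searchable
//word by word.
doc.Add(new Field("FirstName", "Jane",
    Field.Store.YES, Field.Index.ANALYZED));
 
//Stored but not analyzed: the value is kept as one exact,
//unaltered token.
doc.Add(new Field("Id", "42",
    Field.Store.YES, Field.Index.NOT_ANALYZED));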

 

Analyzer

Before any text is stored in an Index, it is analyzed, meaning that it is reformatted depending on which Analyzer is used. Some Analyzers might discard irrelevant stopwords such as "the" or "in", others may convert all text to lowercase first. Additionally, the Analyzer specifies which Tokenizer to use.

Tokenizer

The Tokenizer is responsible for splitting the text of Fields into small Tokens. Tokens are the smallest sensible pieces of text which Lucene can search through individually. A standard way to tokenize text is to make a Token of every bit of text separated by whitespace characters (essentially words), but in some scenarios it may be preferable to keep multiple words linked, e.g. in product names or codes.
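
To make these two concepts concrete, here is a small sketch of how a TokenStream produced by an Analyzer can be consumed; it uses the CaseInsensitiveWhitespaceAnalyzer shown further below:

var analyzer = new CaseInsensitiveWhitespaceAnalyzer();
var stream = analyzer.TokenStream(
    "Skill", new StringReader("Java SE"));
var termAttribute = stream.GetAttribute<ITermAttribute>();
 
while (stream.IncrementToken())
{
    //Prints "java", then "se": whitespace-tokenized and lowercased.
    Console.WriteLine(termAttribute.Term);
}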

Term

Simple key/value pairs used for various purposes. When updating the Index, a Term is used to specify which Document to update, by naming the Field of a Document that should be used for identification along with the concrete ID value. If a Document exists whose ID Field has the same value as specified in the Term, it is replaced; otherwise the Document is simply added as new.
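
In code, an update by Term looks like this (the same pattern appears in the LuceneWriter below):

//Replaces the Document whose "Id" Field equals "42", or simply
//adds the new Document if no such Document exists yet.
writer.UpdateDocument(new Term("Id", "42"), newDocument);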

Here is the LuceneWriter class mentioned above:

public class LuceneWriter<T> : ILuceneWriter<T> where T: class
{
    private readonly ILuceneIndex<T> _index;
 
    public LuceneWriter(ILuceneIndex<T> index)
    {
        _index = index;
    }
 
    public void AddOrUpdateRange(IEnumerable<T> entities)
    {
        if (entities is null)
        {
            throw new ArgumentNullException();
        }
 
        var writer = _index.IndexWriter;
 
        foreach (var entity in entities)
        {
            var indexTerm = new Term(
                _index.IdFieldKey, 
                _index.GetEntityId(entity).ToString());
            var newDocument = _index.GetDocumentFromEntity(entity);
            writer.UpdateDocument(indexTerm, newDocument);
        }
        writer.Commit();
    }
 
    public void Remove(T entity)
    {
        if (entity is null)
        {
            throw new ArgumentNullException();
        }
 
        var writer = _index.IndexWriter;
        var indexTerm = new Term(
            _index.IdFieldKey, 
            _index.GetEntityId(entity).ToString());
        writer.DeleteDocuments(indexTerm);
        writer.Commit();
    }
 
    public IDictionary<int, string> GetIndexedFieldNames()
    {
        return _index.FieldValues
            .ToDictionary(
                x => x, 
                x => _index.GetFieldKeyByFieldId(x));
    }
}

This class has a very close dependency on the entity-specific LuceneIndex classes, of which an example is shown below. Otherwise, most code should be relatively easy to follow with the explanations given above. The last method in this class is used to communicate to the client which Fields can be searched, in order to offer the possibility to restrict the search to one Field only instead of all Fields.

The EmployeeIndex class, an example implementation of the ILuceneIndex interface:

public class EmployeeIndex : ILuceneIndex<Employee>
{
    private static IndexWriter _indexWriter;
 
    public EmployeeIndex(
        IConfigurationManager configurationManager)
    {
        var directoryPath = configurationManager
            .GetAppSetting("LuceneDirectoryPath") + 
                           "/EmployeeIndex";
 
        if (System.IO.Directory.Exists(directoryPath))
        {
            System.IO.Directory.Delete(directoryPath, true);
        }
 
        Directory = FSDirectory.Open(directoryPath);
        Analyzer = new CaseInsensitiveWhitespaceAnalyzer();
 
        //Singleton pattern for IndexWriter to ensure only one
        //IndexWriter exists per index.
        _indexWriter = _indexWriter ?? new IndexWriter(
            Directory, 
            Analyzer, 
            IndexWriter.MaxFieldLength.UNLIMITED);
    }
 
    public IndexWriter IndexWriter => _indexWriter;
 
    public Directory Directory { get; }
 
    public Analyzer Analyzer { get; }
 
    public string IdFieldKey => IndexField.Id.ToString();
 
    public IEnumerable<string> FieldKeys => 
        Enum.GetNames(typeof(IndexField));
 
    public IEnumerable<int> FieldValues =>
        Enum.GetValues(typeof(IndexField))
            .Cast<IndexField>().Cast<int>().ToList();
 
    public Document GetDocumentFromEntity(Employee entity)
    {
        if (entity is null)
        {
            throw new ArgumentNullException();
        }
 
        var employee = entity;
 
        if (employee.Skills is null ||
            employee.UserCourses is null ||
            employee.UserCertificates is null ||
            employee.SchoolEducations is null)
        {
            throw new ArgumentException(
                "Related entity was not eagerly loaded and " +
                "could not be loaded lazily either.");
        }
 
        var doc = new Document();
 
        doc.Add(new Field(
            IndexField.Id.ToString(), 
            employee.Id.ToString(), 
            Field.Store.YES, Field.Index.NOT_ANALYZED));
 
        doc.Add(new Field(
            IndexField.FirstName.ToString(), 
            employee.FirstName ?? "", 
            Field.Store.YES, Field.Index.ANALYZED));
 
        doc.Add(new Field(
            IndexField.LastName.ToString(), 
            employee.LastName ?? "", 
            Field.Store.YES, Field.Index.ANALYZED));
 
        employee.Skills.ToList().ForEach(s =>
        {
            var field = new Field(
                IndexField.Skill.ToString(),
                s.Knowledge.Name + " {" + s.Id + "}",
                Field.Store.YES,
                Field.Index.ANALYZED);
 
            //Boost any matches that will happen on this field
            //by the Skill's SkillGrade (x1-x5)
            field.Boost = s.SkillGrade.Value > 0 ? 
                s.SkillGrade.Value : 1;
 
            doc.Add(field);
        });
 
        employee.UserCourses.ToList().ForEach(c =>
            doc.Add(new Field(
                IndexField.Course.ToString(),
                c.Course.Name + " {" + c.Id + "}",
                Field.Store.YES, Field.Index.ANALYZED)));
 
        employee.UserCertificates.ToList().ForEach(c =>
            doc.Add(new Field(
                IndexField.Certificate.ToString(),
                c.Certificate.Name + " {" + c.Id + "}",
                Field.Store.YES, Field.Index.ANALYZED)));
 
        employee.SchoolEducations.ToList().ForEach(e =>
            doc.Add(new Field(
                IndexField.Education.ToString(),
                $"{e.Institute} {e.Specialism} {e.Degree}" + 
                " {" + e.Id + "}",
                Field.Store.YES, Field.Index.ANALYZED)));
 
        return doc;
    }
 
    public string GetFieldKeyByFieldId(int fieldId)
    {
        if (!Enum.IsDefined(typeof(IndexField), fieldId))
        {
            throw new ArgumentException(
                $"There's no Enum value corresponding to " +
                $"value {fieldId}.");
        }
 
        return ((IndexField) fieldId).ToString();
    }
 
    public int GetEntityId(Employee entity)
    {
        if (entity is null)
        {
            throw new ArgumentNullException();
        }
 
        return entity.Id;
    }
 
    public enum IndexField
    {
        Id,
        Education,
        Skill,
        Course,
        Certificate,
        FirstName,
        LastName
    }
}

A couple of notes on this one:

  • This class deletes the Index upon instantiation and creates it anew. It is therefore intended to be used as a Singleton, meaning it should only be instantiated once when the app starts and then be reused for the rest of the app's uptime. All other classes can be instantiated per request.
  • The Analyzer used here is a customized one - Lucene.NET offers a WhitespaceAnalyzer, but not one which is also case-insensitive. See below for this class. For your Use-Case, however, one of the built-in Analyzers might do; cf. the API documentation for the available Analyzers: https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html
  • A Document can have multiple Fields with the same key. This is very important for this Use-Case, as the various names of Skills, Certificates and so on, of which an Employee may have more than one, should be kept and searched separately, and not concatenated into one long text.
  • Every Field which represents a related entity is suffixed with the related entity's database ID. This makes it easy to load the entity from the database again later.
  • It is possible to boost a Field. That means that if a search term matches on a boosted Field, the Document is regarded as more relevant and will thus appear higher up in the search results. In this case, the "Skills" Field is regarded as the most interesting and is therefore boosted.
  • A nested Enum is used to specify all available Field names. This is simply to avoid any "Magic Strings".

The custom Analyzer which I used:

public class CaseInsensitiveWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(
        string fieldName, TextReader reader)
    {
        var result = new WhitespaceTokenizer(reader);
        return new LowerCaseFilter(result);
    }
 
    public override TokenStream ReusableTokenStream(
        string fieldName, TextReader reader)
    {
        //Reuse the previously created Tokenizer where possible.
        //Note that the LowerCaseFilter must be re-applied to the
        //reused Tokenizer as well, otherwise reused streams would
        //no longer be case-insensitive.
        var tokenizer = (Tokenizer)PreviousTokenStream;
        if (tokenizer == null)
        {
            tokenizer = new WhitespaceTokenizer(reader);
            PreviousTokenStream = tokenizer;
        }
        else
        {
            tokenizer.Reset(reader);
        }
 
        return new LowerCaseFilter(tokenizer);
    }
 
    public override int GetPositionIncrementGap(string fieldName)
    {
        return 1;
    }
}

The override of GetPositionIncrementGap is required because multiple Fields with the same name are used in the Documents. By default, multiple Fields of the same name are treated as if there were one Field of that name with all the values concatenated, separated only by a space. With this override, however, every Field value is treated as an isolated piece of text.
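
For example, with two separate "Skill" Fields on one Document:

doc.Add(new Field("Skill", "Java",
    Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Skill", "Script",
    Field.Store.YES, Field.Index.ANALYZED));
 
//Without the position increment gap, the exact phrase
//"java script" could match across the boundary between the two
//Skill values; with the gap, each value is matched in isolation.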

On application startup, the LuceneIndex classes are instantiated anew as Singletons, deleting any old Indexes and creating new ones. Then, for each Index, all entities are loaded and written to it using the LuceneWriter's AddOrUpdateRange method, roughly as in the following sketch.
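
A sketch of that startup code; the kernel and repository names are assumptions:

//E.g. in Global.asax.cs, Application_Start:
var writer = kernel.Get<ILuceneWriter<Employee>>();
 
//Instantiating the EmployeeIndex (as a dependency of the writer)
//has already deleted and recreated the Index files, so all that
//remains is writing the current database state to the Index.
writer.AddOrUpdateRange(employeeRepository.GetAll());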

 

As the application uses the Repository Pattern with a UnitOfWork class, there's a central place all changes pass through. Whenever an indexed entity is updated or removed, it is passed to the respective method of the LuceneWriter to ensure the database and the Indexes stay synchronized, as sketched below.
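
A rough sketch of that central place; the UnitOfWork members shown here are assumptions, only the ILuceneWriter<T> calls correspond to the class shown above:

public class UnitOfWork
{
    private readonly ILuceneWriter<Employee> _employeeWriter;
 
    public UnitOfWork(ILuceneWriter<Employee> employeeWriter)
    {
        _employeeWriter = employeeWriter;
    }
 
    //Hypothetical hook, called after a database save succeeded,
    //so that database and Index cannot drift apart.
    private void SyncEmployeeIndex(
        IEnumerable<Employee> addedOrUpdated,
        IEnumerable<Employee> deleted)
    {
        _employeeWriter.AddOrUpdateRange(addedOrUpdated);
 
        foreach (var employee in deleted)
        {
            _employeeWriter.Remove(employee);
        }
    }
}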

 

That is already quite a bit of code, and even more thought went into it, and we have not even started on what we actually want to use Lucene for: searching stuff!

Searching entities

As a logical counterpart to the LuceneWriter class from above, a LuceneReader class takes care of any reading actions on the Lucene Indexes. The only reading action of interest here is searching the data stored in the Index.

Building Queries

Simply handing a search string to Lucene and searching with that does not offer quite the amount of control anticipated for this solution. More precisely, the search string can have the following properties, which should be handled accordingly:

  • If a simple word is given, a "best fit" search is made. This includes trying to find exact matches for the search string, but, if the search term is long enough for it to make sense, also conducting a Wildcard-Query and a Fuzzy Query. A Fuzzy Query is an approximate search which, e.g. if "architect" is given as the search string, may also find the German equivalent "Architekt".
  • If the search term is enclosed in double quotes, only exact matches should be regarded as relevant. The enclosed term can contain more than one word.
  • If the search term contains asterisks, only a WildcardQuery is executed.
  • If the search string consists of multiple space-separated words (except when enclosed in double quotes), each word is searched for separately, and all searches must yield a hit in order for a Document to be added to the results.
  • As mentioned above, it is possible to specify whether all Fields should be searched or only one specific one.
  • It is possible to discard employees using search filters. While the filtering is not done with Lucene, it's important to search only through the relevant Documents: the result count limit is applied during the Lucene search, so there would be too few results if the Documents supplied by Lucene were unfiltered. Thus, when the search method is called, a list of IDs corresponding to the viable Documents is passed along.

Query

A Query is Lucene's way of specifying how exactly the searching should be done. A Query is characterised by the type of Query subclass that is used, e.g. TermQuery for exact matches or WildcardQuery for wildcard searches. Additionally, a Term needs to be given which specifies which Field of the Documents should be searched, along with the search term to search for.

BooleanQuery

The BooleanQuery is a special kind of Query which can combine multiple Queries into one. For each Query added to a BooleanQuery it can be specified whether it must, should or must not yield a hit in order for the BooleanQuery to return a Document as a result.
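
A minimal illustration of those semantics:

//At least one of the SHOULD clauses in the nested query has to
//match, because the nested query itself is added with MUST.
var alternatives = new BooleanQuery();
alternatives.Add(
    new TermQuery(new Term("Skill", "java")), Occur.SHOULD);
alternatives.Add(
    new WildcardQuery(new Term("Skill", "*java*")), Occur.SHOULD);
 
var query = new BooleanQuery();
query.Add(alternatives, Occur.MUST);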

With those not-so-unusual requirements for the search, it's already a rather complex query tree which needs to be assembled. Here is an illustration of the query for when someone searches for [architect web] (without the brackets) on all Fields and applies a filter:

 
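Expressed in Lucene's query string notation (roughly what Query.ToString() prints; abbreviated to a single Field here, with hypothetical filter IDs, and assuming both words exceed the configured minimum lengths for the Wildcard- and Fuzzy-Queries), the assembled tree amounts to something like:

+(+(Skill:architect^10.0 Skill:*architect* Skill:architect~0.5)
  +(Id:3 Id:17 Id:42))
+(+(Skill:web^10.0 Skill:*web* Skill:web~0.5)
  +(Id:3 Id:17 Id:42))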

Building these Queries is anything but trivial and requires a decent amount of source code, which is why this functionality was moved into a separate class:

public class LuceneQueryBuilder<T> : 
        ILuceneQueryBuilder<T> where T : class
{
    private readonly ILuceneIndex<T> _index;
    private readonly ILuceneConfiguration _configuration;
 
    public LuceneQueryBuilder(
        ILuceneIndex<T> index,
        ILuceneConfiguration configuration
    )
    {
        _index = index;
        _configuration = configuration;
    }
 
    public Query GetQuery(
        string searchString, 
        int? fieldIdToSearch, 
        IEnumerable<int> documentIdsToSearch)
    {
        var query = new BooleanQuery();
 
        IEnumerable<string> fieldsToSearch;
        if (fieldIdToSearch is null)
        {
            fieldsToSearch = _index.FieldKeys.ToList();
        }
        else
        {
            fieldsToSearch = new List<string>
            {
                _index.GetFieldKeyByFieldId(
                    (int)fieldIdToSearch)
            };
        }
 
        var searchStringParts = 
            GetAnalyzedSearchStringParts(searchString);
 
        if (!searchStringParts.Any())
        {
            return null;
        }
 
        foreach (var searchStringPart in searchStringParts)
        {
            var partSubQueries = new List<Query>();
 
            if (searchStringPart.SearchModifier == 
                SearchModifier.Phrase)
            {
                partSubQueries.AddRange(GetPhraseQueries(
                    searchStringPart.Text, fieldsToSearch));
            }
 
            if (searchStringPart.SearchModifier == 
                SearchModifier.Wildcard)
            {
                partSubQueries.AddRange(GetAllTerms(
                        searchStringPart.Text, fieldsToSearch)
                    .Select(term => new WildcardQuery(term)));
            }
 
            if (searchStringPart.SearchModifier == 
                SearchModifier.None)
            {
                partSubQueries.AddRange(GetAllTerms(
                        searchStringPart.Text, fieldsToSearch)
                    .Select(term =>
                    {
                        var termQuery = new TermQuery(term);
                        termQuery.Boost = 10; //Boost exact matches
                        return termQuery;
                    }));
            }
 
            //For search string without modifier, add additional
            //WildcardQuery if string length is long enough
            //to avoid too broad searches like *C* when user
            //quickly searches for users with Skill "C"
            if (searchStringPart.SearchModifier == 
                SearchModifier.None &&
                searchStringPart.Text.Length >= 
                _configuration.MinLengthForWildCardQuery)
            {
                //Add asterisks for Wildcard-Query.
                partSubQueries.AddRange(GetAllTerms("*" + 
                    searchStringPart.Text + "*", fieldsToSearch)
                    .Select(term => new WildcardQuery(term)));
            }
 
            //For search string without modifier, add additional
            //FuzzyQuery if string length is long enough
            //to avoid too broad searches like Java --> Jira
            if (searchStringPart.SearchModifier == 
                SearchModifier.None &&
                searchStringPart.Text.Length >= 
                _configuration.MinLengthForFuzzyQuery)
            {
                partSubQueries.AddRange(GetAllTerms(
                        searchStringPart.Text, fieldsToSearch)
                    .Select(term => new FuzzyQuery(
                        term,
                        _configuration.MinFuzzyQuerySimilarity,
                        _configuration.FuzzyQueryPrefixLength)));
            }
 
            //Combine all queries into one query so that at
            //least one of these queries must match.
            var partSearchQuery = new BooleanQuery();
            var partQuery = new BooleanQuery();
            partSubQueries.ForEach(q => partSearchQuery.Add(
                q, Occur.SHOULD));
            partQuery.Add(partSearchQuery, Occur.MUST);
 
            //If only specific documents should be searched,
            //add Term queries on the ID field
            if (documentIdsToSearch != null)
            {
                var idQueries = documentIdsToSearch
                    .Select(id => new TermQuery(new Term(
                        _index.IdFieldKey, id.ToString())));
 
                var idQuery = new BooleanQuery();
 
                idQueries.ToList().ForEach(q => idQuery.Add(
                    q, Occur.SHOULD));
                partQuery.Add(idQuery, Occur.MUST);
            }
 
            //Overall query is a combination of all search
            //string parts' combined queries (including the
            //optional ID restriction).
            query.Add(partQuery, Occur.MUST);
        }
 
        return query;
    }
 
    private List<AnalyzedSearchStringPart> 
        GetAnalyzedSearchStringParts(string searchString)
    {
        var stringParts = new List<AnalyzedSearchStringPart>();
 
        searchString = searchString.ToLowerInvariant();
 
        //If the number of double quotes is uneven, drop the last
        //one so the pairwise extraction below cannot fail.
        if (searchString.Count(c => c == '"') % 2 == 1)
        {
            var lastQuote = searchString.LastIndexOf('"');
            searchString = searchString.Remove(lastQuote, 1);
        }
 
        //Extract any exact phrases
        while (searchString.Contains("\""))
        {
            var start = searchString.IndexOf("\"", 
                StringComparison.Ordinal) + 1;
            var end = searchString.IndexOf("\"", start, 
                StringComparison.Ordinal);
 
            var part = searchString.Substring(
                start, end - start);
 
            if (!string.IsNullOrEmpty(part))
            {
                stringParts.Add(new AnalyzedSearchStringPart
                {
                    Text = part,
                    SearchModifier = SearchModifier.Phrase
                });
            }
 
            //cut out part including double quotes.
            searchString = searchString.Replace(
                $"\"{part}\"", "");
 
        }
 
        //Split by spaces and analyze each bit
        foreach (var part in searchString.Split(' '))
        {
            var trimmed = part.Trim();
            if (trimmed.Length == 0)
            {
                continue;
            }
 
            //If part contains asterisks and anything else than
            //asterisks, it's a Wildcard part
            //Only-asterisks ("*", "**"...) are ignored.
            if (trimmed.Contains("*") && trimmed
                    .Replace("*", "").Length > 0)
            {
                stringParts.Add(new AnalyzedSearchStringPart
                {
                    Text = trimmed,
                    SearchModifier = SearchModifier.Wildcard
                });
            }
 
            //Everything else is a regular search term
            else
            {
                stringParts.Add(new AnalyzedSearchStringPart
                {
                    Text = trimmed,
                    SearchModifier = SearchModifier.None
                });
            }
        }
 
        return stringParts;
    }
 
    private IEnumerable<Query> GetPhraseQueries(
        string searchString, IEnumerable<string> fieldNames)
    {
        var queries = new List<PhraseQuery>();
 
        foreach (var fieldName in fieldNames)
        {
            var phraseWords = searchString.Split(' ');
            var phraseQuery = new PhraseQuery();
 
            foreach (var phraseWord in phraseWords)
            {
                phraseQuery.Add(new Term(fieldName, phraseWord));
            }
 
            queries.Add(phraseQuery);
        }
 
        return queries;
    }
 
    //Shorthand to get Terms for each field.
    private IEnumerable<Term> GetAllTerms(
        string searchString, IEnumerable<string> fieldNames)
    {
        return fieldNames.Select(fieldName => 
            new Term(fieldName, searchString));
    }
}

Notes:

  • The class AnalyzedSearchStringPart is a small helper class I defined myself for easier handling of the analyzed bits of the search string.
  • When the search string is enclosed in double quotes, a PhraseQuery needs to be used, not a TermQuery, as the latter cannot match multiple Tokens (words) at once. Note that it's not possible to pass the entire search string to the PhraseQuery as one Term; each word must be added separately to the same Query.

Interpreting the search results

After the overall Query is assembled, it's not hard to proceed to the actual searching. All it takes is instantiating an IndexSearcher and handing it the Query along with the number of results it should deliver.
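
In its minimal form (assuming query was built as shown above), that looks like this:

using (var searcher = new IndexSearcher(_index.Directory))
{
    //Ask for the top 10 results; ScoreDocs holds the internal
    //Document numbers along with their scores.
    TopDocs topDocs = searcher.Search(query, 10);
 
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        Document document = searcher.Doc(scoreDoc.Doc);
    }
}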

 

Much trickier, however, is understanding what you actually get back from the IndexSearcher. Most noteworthy is perhaps that the Documents, and thus the entities which matched, are not actually returned; only the IDs of the matching Documents are. This is probably because deserializing the Documents is a rather expensive operation.

 

Lucene offers functionality to highlight the Tokens (i.e. the parts of the original text) onto which a search string matched, e.g. by wrapping that part in some HTML markup. However, this is really intended for long texts, where it's sensible to include a handful of words before and after the matched word (like the preview you see in a Google search). In this case, though, it should always be the entire Field content that is returned. For that, some additional experimenting was required, as well as figuring out some more Lucene terms...

Fragment

An extract of a Field's text, usually encompassing multiple Tokens.

Fragmenter

Logic unit which splits a Field's text into Fragments.

Formatter

The component which adds some markup around the matched word(s). A typical implementation uses HTML markup.

QueryScorer

Determines a score for each Token. The higher the score, the better the Token matches the search text. Tokens with a score of zero did not match at all.

Highlighter

Combines the functionalities of Fragmenter, Formatter and Scorer: it determines the scoring Tokens, highlights them using the specified Formatter, and adds the surrounding Fragment.

The key to always getting the entire Field's contents was mainly to use the so-called NullFragmenter, which basically just makes one big Fragment out of the entire text. Using the Highlighter's "low-level API" method GetBestTextFragments, I was able to get all Fields of the current Document along with their score and markup. Discarding all Fields with a score of zero, I got the highlighted pieces I wanted to show back to the user in the frontend.

 

As mentioned above, I also appended a database ID to some Fields. Here in the LuceneReader, this ID is extracted again and stored, along with the other information relevant to me, in a separate set of DTOs.

 

Here's the last code snippet, the LuceneReader:

public class LuceneReader<T> : ILuceneReader<T> where T : class
{
    private readonly ILuceneIndex<T> _index;
    private readonly ILuceneQueryBuilder<T> _queryBuilder;
    private readonly ILuceneConfiguration _configuration;
 
    public LuceneReader(
        ILuceneIndex<T> index,
        ILuceneQueryBuilder<T> queryBuilder,
        ILuceneConfiguration configuration)
    {
        _index = index;
        _queryBuilder = queryBuilder;
        _configuration = configuration;
    }
 
    public LuceneSearchResult<T> Search(
        string searchString,
        bool doLoadAllResults,
        int? fieldIdToSearch = null,
        IEnumerable<int> documentIdsToSearch = null)
    {
        var searchResultItems = 
            new List<LuceneSearchResultItem<T>>();
        int totalHits;
        var query = _queryBuilder.GetQuery(
            searchString, 
            fieldIdToSearch, 
            documentIdsToSearch);
 
        if (query is null)
        {
            throw new ArgumentException(
                $"No meaningful query could be derived from " +
                $"the given searchString: {searchString}");
        }
 
        using (var indexReader = IndexReader.Open(
            _index.Directory, true))
        //Search on the already opened reader instead of opening
        //the Directory a second time.
        using (var indexSearcher = new IndexSearcher(
            indexReader))
        {
            //Force calculating of scores.
            indexSearcher.SetDefaultFieldSortScoring(true, true); 
            
            TopDocs searchResult;
 
            if (doLoadAllResults)
            {
                //Hack to load all results: set result count
                //to number of Documents in Index.
                searchResult = indexSearcher.Search(
                    query, 
                    indexReader.NumDocs());
            }
            else
            {
                searchResult = indexSearcher.Search(
                    query, 
                    _configuration.DefaultResultCountPerEntity);
            }
 
            totalHits = searchResult.TotalHits;
 
            searchResult.ScoreDocs.ToList().ForEach(x =>
            {
                var document = indexSearcher.Doc(x.Doc);
                var searchResultItem = GetResultItemFromDocument(
                    document, query, x.Score);
 
                searchResultItems.Add(searchResultItem);
            });
        }
 
        return new LuceneSearchResult<T>
        {
            SearchResultItems = searchResultItems,
            TotalResultCount = totalHits
        };
    }
 
    //Extract the relevant information about the search.
    private LuceneSearchResultItem<T> GetResultItemFromDocument(
        Document document, Query query, float score)
    {
        var scorer = new QueryScorer(query);
        var formatter = new SimpleHTMLFormatter(
            "<span class=\"highlighted-match\">", "</span>");
 
        //Makes sure always the entire content of a
        //field is fetched.
        var fragmenter = new NullFragmenter();
        var highlighter = new Highlighter(formatter, scorer)
        {
            TextFragmenter = fragmenter
        };
 
        var searchResultItem = new LuceneSearchResultItem<T>
        {
            Id = int.Parse(document.Get(_index.IdFieldKey)),
            Document = document,
            Score = score
        };
 
        //For each field check if there were any search matches
        //on that field.
        document.GetFields().ToList().ForEach(field =>
        {
            var fieldKey = field.Name;
            var fieldValue = field.StringValue;
            var tokenStream = _index.Analyzer.TokenStream(
                fieldKey, new StringReader(fieldValue));
 
            //For each field get value with HTML markup around
            //matches. Ignore fields with a score of 0 (no match)
            //As NullFragmenter anyway only creates one Fragment
            //per Field, maxNumberOfFragments can be set to 1.
            var matches = highlighter.GetBestTextFragments(
                    tokenStream, fieldValue, false, 1)
                .Where(m => m.Score > 0)
                .Select(m => GetMatchFromFragment(m, fieldKey));
 
            searchResultItem.Matches.AddRange(matches);
        });
 
        return searchResultItem;
    }
 
    //Create a LuceneMatch instance for every field which had
    //one or more matches.
    private LuceneMatch GetMatchFromFragment(
        TextFragment fragment, string fieldKey)
    {
        var matchText = fragment.ToString();
 
        var match = new LuceneMatch
        {
            FieldKey = fieldKey,
            MatchText = matchText
        };
 
        //If field value contains an entity ID, extract it,
        //parse it and store it in separate property.
        var endsWithCurlyBracesWithNumbersRegex = 
            new Regex(@".*\s[{]\d*[}]");
        if (endsWithCurlyBracesWithNumbersRegex.IsMatch(matchText))
        {
            var position = matchText
                .LastIndexOf("{", StringComparison.Ordinal);
            match.MatchText = matchText
                .Substring(0, position).Trim();
 
            var idAsString = matchText
                .Substring(position).Trim('{', '}');
            var couldParse = int.TryParse(
                idAsString, out var parsedId);
 
            match.FieldEntityId = couldParse ? parsedId : 
                null as int?;
        }
 
        return match;
    }
}

Closing statement

Obviously, some bits of code have not been shown here, namely all the interfaces which were required for Dependency Injection and/or polymorphic use of the entity-specific implementations, as well as some small auxiliary classes with naught but a few simple properties.

 

I also built an additional SearchService on top of this Lucene implementation, which takes care of loading the corresponding entities from the database and applying the filters that don't require Lucene, roughly along the lines of the sketch below.
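
All member names in this sketch are assumptions; the actual service is not part of this post:

public class EmployeeSearchService
{
    private readonly ILuceneReader<Employee> _reader;
    private readonly IEmployeeRepository _repository;
 
    public EmployeeSearchService(
        ILuceneReader<Employee> reader,
        IEmployeeRepository repository)
    {
        _reader = reader;
        _repository = repository;
    }
 
    public IEnumerable<Employee> Search(
        string searchString, EmployeeFilter filter)
    {
        //Apply the non-Lucene filters first to get the IDs of
        //the viable Documents (hypothetical repository call)...
        var viableIds = _repository.GetIdsMatching(filter);
 
        //...let Lucene search through those Documents only...
        var result = _reader.Search(
            searchString, false, null, viableIds);
 
        //...and finally load the matching entities.
        return result.SearchResultItems
            .Select(item => _repository.GetById(item.Id));
    }
}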

 

At this point I can say that I have grown pretty fond of Lucene.NET. However, it was quite arduous to get to this solution, as it required a decent amount of experimenting and combining the findings of other blog posts into something that fulfills my requirements. That surprised me somewhat, given that I don't deem my Use-Case to be that exotic.