Full-text Searching with CosmosDB

While I settled on CosmosDB as the final destination for the document database in my solution, I did early work on the application using the MongoDB Docker container. I was happy with how easy it was to write a search method for MongoDB: define some full-text search indexes, then query them. The search itself was trivial, since MongoDB did all the hard work. Something like:

// Requires a text index on the collection behind _collection.
public async Task<List<T>> Search<T>(string search)
{
    var query = Builders<T>.Filter.Text(search);
    var cursor = await _collection.FindAsync(query);
    return await cursor.ToListAsync();
}

However, when you try to create those same full-text search indexes on CosmosDB, you will find that they are not supported. So, what’s the analogous solution in Azure?

Azure does support the idea of full-text indexes, but at a larger scale. Azure Cognitive Search allows you to index a CosmosDB collection for full-text search, as well as filtering, sorting, suggestions and facets. The concept is much the same: a full-text index is defined in Cognitive Search, and it is applied to a specific data source. An indexer process is configured and triggered every time a change is projected to CosmosDB. This isn’t quite automatic, so let’s look at the process:

Define the data source, index, and an indexer in Cognitive Search. The process of creating data sources, indexes and indexers is well-documented.

Ensure that the defined data source includes a high-water-mark change detection policy. We are going to disable periodic indexing and use an on-demand approach instead, so we need to ensure that the indexing is incremental.

"dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "_ts"
},
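The on-demand half of this step comes down to leaving the schedule off the indexer definition. A minimal sketch of such an indexer follows; the three names are hypothetical placeholders, not the ones from my solution. As far as I understand it, an indexer created without a schedule runs once on creation and after that only when something asks it to run.

```json
{
  "name": "mymodel-indexer",
  "dataSourceName": "mymodel-datasource",
  "targetIndexName": "mymodel-index"
}
```

Because the data source carries the high-water-mark policy on `_ts`, each of those on-demand runs should only pick up documents changed since the previous run.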

I should note that the instructions given in the documentation use REST API calls from Postman or VS Code. This is not currently necessary, as the Azure Portal supports all the necessary interface elements for defining a data source, index, and indexer for use with CosmosDB MongoDB.

Once the search components have been defined, we can glue it all together:

Run the indexer and verify that your documents appear as intended in the search index. Note that the search index doesn’t need to contain the whole document. Only the fields defined in the index are populated, and only those marked as retrievable are returned in search results. I was able to use the same read model to query the search index, with the understanding that not all fields would be available for use when retrieving search results.

Verify that your searches work. The Cognitive Search resource has a search explorer where you can choose an index and run queries against it.

Update your write activities on searchable collections to also run the indexer. The code below is from an Event Sourced system which has a distinct read and write side. A projection is a persistent subscription to the event store that duplicates changes into the document database. I’ve updated the Projection constructor to optionally allow passing a search endpoint and indexer to run.

    public Projection(
        IMongoClient mongo, 
        Projector projector, 
        SecretClient secrets, 
        string? searchEndpoint = null, 
        string? indexer = null)
    {
        _mongo = mongo;
        _projector = projector;
        _secrets = secrets;
        _searchEndpoint = searchEndpoint;
        _indexer = indexer;
    }

Then, after completing the projection to the document database:

        if (!string.IsNullOrWhiteSpace(_searchEndpoint) && !string.IsNullOrWhiteSpace(_indexer))
        {
            // run the indexer if both the endpoint and the indexer name have been provided
            var key = await _secrets.GetSecretAsync("CognitiveSearch-ApiKey");
            var indexClient = new SearchIndexerClient(new Uri(_searchEndpoint), 
                new AzureKeyCredential(key.Value.Value));
            await indexClient.RunIndexerAsync(_indexer);
        }
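One thing to watch for when every projection triggers the indexer: to my knowledge, the service rejects a run request with a 409 Conflict while a previous run of the same indexer is still in progress, and the SDK surfaces that as a RequestFailedException. The sketch below is one way to tolerate that case. IndexerRunner and TryRunIndexerAsync are hypothetical names of my own choosing, not part of the SDK, and the meaning I attach to the 409 status is an assumption you should confirm against your own service.

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;

public static class IndexerRunner
{
    // Hypothetical helper: asks the service to run the indexer and reports
    // whether the request was accepted.
    public static async Task<bool> TryRunIndexerAsync(
        SearchIndexerClient client,
        string indexerName)
    {
        try
        {
            await client.RunIndexerAsync(indexerName);
            return true;
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            // Assumption: 409 Conflict means a run is already in progress.
            // Any other failure still propagates to the caller.
            return false;
        }
    }
}
```

Usage would be `await IndexerRunner.TryRunIndexerAsync(indexClient, _indexer);` in place of the direct RunIndexerAsync call. A skipped request is not necessarily lost work: since indexing is incremental, a document that misses the in-flight run should be picked up the next time the indexer runs.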

That’s the read side taken care of. Every projection to the read side will result in the new or updated document being indexed and very quickly available to search. “Very quickly” in this case means that the human delay between triggering persistence of the document and issuing the search query is more than enough for the indexer to do its work. There is _some_ delay, but in practical terms it is real-time.

I’m not fond of having to pull the API key out of the secret vault every time, but RBAC access to the search endpoint is in public preview and not yet supported by the SDK. At some point we will presumably be able to provide an access token instead of the administrative key, and so reduce the permissions the application holds on the search service.

Now we simply need to replace the MongoDB search code with something that queries the Cognitive Search index:

[HttpGet("search")]
public async Task<IReadOnlyList<ReadModels.MyModel>> Search([FromQuery] string search)
{
    var endpoint = _configuration["Azure:Cognitive:SearchEndpoint"];
    var index = _configuration["Azure:Cognitive:SearchIndexName:MyModel"];
    var key = await _secrets.GetSecretAsync("CognitiveSearch-ApiKey");
    var credential = new AzureKeyCredential(key.Value.Value);
    var searchClient = new SearchClient(new Uri(endpoint), index, credential);
    var results = await searchClient.SearchAsync<ReadModels.MyModel>(search);
    var list = await results.Value.GetResultsAsync().Select(result => result.Document).ToListAsync();
    return list.AsReadOnly();
}

That is the end-to-end solution. Any time an aggregate is persisted to the event store, the resulting projection will also run the indexer and index the newly updated document. Because our search index schema is a subset of our read model, we can use the same model classes, with the caveat that not all fields will be available from the results of a search.

Some final notes:

While the Azure Portal API for importing data supports a connection string using ApiKind=MongoDb, this is not enabled by default. It is necessary to join the public preview (which is linked under the Cognitive Search documentation referring to CosmosDB MongoDB API) for now to enable ApiKind in the connection string. Once it is enabled, however, you should be able to translate the instructions provided in the Cognitive Search documentation for use in the Azure Portal.
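For reference, a data source definition using that connection string looks roughly like the sketch below. The name and the bracketed values are placeholders to fill in with your own account, key, database, and collection; the connection string layout is my reading of the Cognitive Search documentation for the MongoDB API, so check it against the current docs before relying on it.

```json
{
  "name": "mymodel-datasource",
  "type": "cosmosdb",
  "credentials": {
    "connectionString": "AccountEndpoint=https://<account>.documents.azure.com;AccountKey=<key>;Database=<database>;ApiKind=MongoDb"
  },
  "container": {
    "name": "<collection>"
  },
  "dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "_ts"
  }
}
```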

This is obviously a more involved solution than having the full-text index stored directly in the database, but I think the benefits outweigh the costs. A search index is a relatively inexpensive thing (you get 15 of them for $100 USD/mo) and provides full-text search of documents of arbitrary complexity. Depending on your search tier, you may also have unlimited numbers of documents in your index. This is an extremely scalable, customizable, and easy-to-use search that is useful in many applications. Beyond full-text queries, features such as filtering, sorting, suggestions, and facets are there when your application needs them, and the initial setup is really very easy.

