ElasticSearch Online ReIndex With Writes C#

We just deployed an application using ElasticSearch (Elastic.co) as its sole backend full-text search provider. We migrated away from SQL Full Text Search for reasons I will post about later. ElasticSearch runs on top of Lucene, but adds a REST API and cluster goodness on top.

Quick setup:
You need Java, with the Java Runtime (JRE) in your path. I cheat and update the elasticsearch batch files to point at a downloaded copy.
You need to run elasticsearch. On Windows/Linux, you just run it from the command line. Later you can register it as a background service.
Install a management plugin. I use a plugin from mobz; running “plugin install mobz/elasticsearch-head” in the elasticsearch bin folder should do the trick.
The port is 9200 and plugins run under the “_plugin” path. If you get it up and running you should see this screen, minus the indexes.

ElasticSearch-Head-Home

Terms
Index == a table, in SQL terms.
Mapping == a way to tell ElasticSearch what a field means and how to store it. In SQL we just have data types; in ElasticSearch you have much more control than “this is a string”.
Alias == a view, in SQL terms. This is important later.
Replica == a backup copy of an index’s data.
Shard == a way to split one really large index into smaller indexes. Keep in mind you need to know the routing details to update a document if you shard.
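To give a taste of that extra control, here is a sketch of creating an index with a mapping. The index name “datav1”, type “company”, and field “name” are just made up for illustration; the multi-field stores the string both analyzed for search and raw for exact matching:

[code]
PUT /datav1
{
  "mappings": {
    "company": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
[/code]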

The first problem you run into with ElasticSearch is “How do I change my mappings?”. Easy: you make a new index and point everything at it. In SQL Server, SSMS will do this for you by creating a new table, moving the data, dropping the old table, and renaming the new one. ElasticSearch doesn’t do this for you, so you have to code it yourself. The second problem is how to handle this while the data is live. ElasticSearch doesn’t have the concept of “locks” in the SQL Server sense. Besides, relying on locking data to perform an update is so 2004 of an application. Large ElasticSearch instances contain billions of documents, which are hard to lock or file-copy all at once.

Source Code C# (If you want to cut to the chase)
https://github.com/joshbartley/ElasticSearch_ReIndex

ElasticSearch-Single-Index
You always start with one index.

You get some data into ElasticSearch, then you find out your mapping is wrong and you need to change it. Four records is easy; four million would be harder. As a first step you should version your indexes: dates, numbers, a random word generator, it doesn’t matter. Use whatever makes the most sense to you.

You create your new secondary index and add a lowercase mapping. The particular mapping change doesn’t matter for this post; it’s just an example.

ElasticSearch-Second-Index-ReIndex

If you move all the data now, you may miss the records that were being put into “datav1” during the re-index. Here is where aliases come into play. You can name them however you like; I named mine *_R(ead) and *_R(ead)W(rite). Note that you cannot write through an alias that points at more than one index.

ElasticSearch-Read-Write-Alias
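The aliases can be set up with one request against the _aliases endpoint. This sketch assumes the old and new indexes are named “datav1” and “datav2” (the second name is my assumption; the post only shows “datav1”):

[code]
POST /_aliases
{
  "actions": [
    { "add": { "index": "datav1", "alias": "data_r"  } },
    { "add": { "index": "datav1", "alias": "data_rw" } },
    { "add": { "index": "datav2", "alias": "data_rw" } }
  ]
}
[/code]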

The “data_rw” alias only exists to let your application know where to write the data. You have to code your application to look up the alias, grab its indexes, and write to both. In this example I am not using the bulk API, as I am only dealing with two records per operation. Anyone want to PR a change to the C# client?

[csharp]
private static void WriteSecondaryObjects(ElasticClient client)
{
    // Ask the alias which indexes it points at, then write to each one.
    var indexes = client.GetAlias(x => x.Name("data_rw"));

    foreach (var index in indexes.Indices)
    {
        client.Index(new Company() { Name = "Mega Acme Corp" }, idx => idx.Index(index.Key));
        client.Index(new Company() { Name = "Global World Domination Corp Acme LLC" }, idx => idx.Index(index.Key));
    }
}
[/csharp]

After I write the two records, the document count goes from 2 to 4 on the first index, and from 0 to 2 on the second. Notice the “data_r” alias still points at the old index; that is because the new index doesn’t have all the data yet.


The following code I grabbed from a StackOverflow answer and updated to the latest NEST (the C# client for ElasticSearch): http://stackoverflow.com/a/34867857/32963

[csharp]
public static void Reindex(ElasticClient client, string aliasName, string currentIndexName, string nextIndexName)
{
    Console.WriteLine("Reindexing documents to new index…");

    // Open a scan/scroll over everything in the current index, 100 docs per page.
    var searchResult = client.Search<object>(s => s
        .Index(currentIndexName)
        .AllTypes()
        .From(0)
        .Size(100)
        .Query(q => q.MatchAll())
        .SearchType(Elasticsearch.Net.SearchType.Scan)
        .Scroll("2m"));

    if (searchResult.Total <= 0)
    {
        Console.WriteLine("Existing index has no documents, nothing to reindex.");
    }
    else
    {
        var page = 0;
        IBulkResponse bulkResponse = null;
        do
        {
            var result = searchResult;
            searchResult = client.Scroll<object>(new Time("2m"), result.ScrollId);
            if (searchResult.Documents != null && searchResult.Documents.Any())
            {
                // ThrowOnError is a small helper from the linked answer that
                // throws if the response reports a failure.
                ThrowOnError(searchResult, "reindex scroll " + page);
                bulkResponse = (IBulkResponse)ThrowOnError(client.Bulk(b =>
                {
                    // Copy each hit into the new index, preserving type and id.
                    foreach (var hit in searchResult.Hits)
                    {
                        b.Index<object>(bi => bi
                            .Document(hit.Source)
                            .Type(hit.Type)
                            .Index(nextIndexName)
                            .Id(hit.Id));
                    }

                    return b;
                }), "reindex page " + page);
                Console.WriteLine("Reindexing progress: " + (page + 1) * 100);
            }

            ++page;
        }
        while (searchResult.IsValid && bulkResponse != null && bulkResponse.IsValid
               && searchResult.Documents != null && searchResult.Documents.Any());
        Console.WriteLine("Reindexing complete!");
    }

    // Atomically point the alias at the new index and drop the old one.
    Console.WriteLine("Updating alias to point to new index…");
    client.Alias(a => a
        .Add(aa => aa.Alias(aliasName).Index(nextIndexName))
        .Remove(aa => aa.Alias(aliasName).Index(currentIndexName)));
}

[/csharp]

NEST does have its own ReIndex function, but it creates the new index itself and requires a mapping at create time, which is not what we need here. Once the code above runs, it will copy over all the old data. Meanwhile, if you have data being pumped into ElasticSearch, the application should be writing it to both indexes.

elasticsearch-reindex

At this point, you would use feature toggles/flags to test out the new index’s mappings. If all is well, switch over the read alias.

elasticsearch-reindex-complete
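The read-alias switch is a single atomic _aliases request, so readers never see a moment with no index behind the alias. Again, “datav1”/“datav2” are my assumed index names:

[code]
POST /_aliases
{
  "actions": [
    { "remove": { "index": "datav1", "alias": "data_r" } },
    { "add":    { "index": "datav2", "alias": "data_r" } }
  ]
}
[/code]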

DONE!

If you made it this far and want to know what the rollback plan is: well, we didn’t delete anything, so just swap the alias back over. Ideally, if you used the feature toggle/flag, you have already reduced the risk of a rollback. We are testing this out on Thursday, so we will see how it works with half a million records 🙂
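The rollback itself is just the alias swap in reverse, assuming the same hypothetical “datav1”/“datav2” index names:

[code]
POST /_aliases
{
  "actions": [
    { "remove": { "index": "datav2", "alias": "data_r" } },
    { "add":    { "index": "datav1", "alias": "data_r" } }
  ]
}
[/code]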

