Elasticsearch as a Database: Yes or No?


Elasticsearch is regarded primarily as a search engine and log ingestion system.

Most people advocate using something like MySQL/PostgreSQL/Mongo as the primary database and Elasticsearch as an indexing backend.

Elasticsearch can, however, be used as a database, obviating the need for a primary database altogether.

The motivation behind this is as follows:

  1. Reduce storage costs: The moment you use your search engine as a database, you avoid data duplication and bring down your storage costs by a good percentage.
  2. Reduce data consistency issues: If you commit a document to your primary data store, you need something that watches its transaction logs to feed that document to Elasticsearch. In cases of failover, data inconsistencies can emerge between the two. If you only have one database, this problem vanishes.
  3. Reduce dependencies: You need fewer dependencies, which means fewer attack vectors.

However, there are several issues you need to deal with that a conventional database lets you take for granted.

  1. Refresh Interval: It may seem strange, but once a document is created or updated in Elasticsearch, it does not become immediately visible to search requests. It only shows up in search results after the next index refresh, which by default happens about once a second. While this may not seem like a big problem, it does pose issues in applications where more than one user interacts with the same document.
  2. Transactions: The holy grail of databases. Elasticsearch DOES NOT support transactions. Individual document updates are, fortunately, atomic and consistent.
  3. Unique Fields: Elasticsearch does not support unique constraints. You can, for example, insert two users into the users index with the same phone number. There is no inbuilt way to prevent this from happening.

With these three seemingly major issues in mind, I discuss some workarounds and programming patterns that help manage them.

Use Search Requests Only for Search:

This may sound like a no-brainer, but when you use Elasticsearch as a database, there is a tendency to use the /_search APIs in routine database operations.

The search APIs are designed to provide search results. They should not be used as part of CRUD operations.

Here’s a simple example to elaborate:

You create a simple users document like so:

PUT /users/_doc/1
{"name" : "John Doe"}

You now wish to update this document with a phone number, provided that it doesn’t already have one. You retrieve the document using the /_search API with an ids query.

GET /users/_search
{"query" : {"ids" : {"values" : ["1"]}}}

The response, simplified here to just the document source, is:

{"name" : "John Doe"}

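It is possible that, between the time you created the document and the time you retrieved it through the /_search endpoint, another administrator has already added a phone number to it. Because search only sees changes after a refresh, the result can still show the document without a phone number.

You may then end up updating a document that already has a phone number.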

There are obviously ways to deal with this (use a scripted update, check version numbers when you update, and so on). But if you simply avoided using a /_search request to fetch the document and used a GET by ID request instead (GET /users/_doc/1), you would receive the real-time version of the document. Very likely you would already see the new phone number and never need to execute the update at all.
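To make the difference concrete, here is a minimal sketch in Python. It is a toy in-memory model, not an Elasticsearch client: the ToyIndex class and its method names are made up for illustration, and it only assumes the behaviour described above (a GET by ID is real-time, a search only sees data as of the last refresh).

```python
class ToyIndex:
    """In-memory stand-in for one index. NOT an Elasticsearch client."""

    def __init__(self):
        self._live = {}        # latest written version of each document
        self._searchable = {}  # snapshot taken at the last refresh

    def index(self, doc_id, source):
        # A write is stored at once but is not yet visible to search.
        self._live[doc_id] = dict(source)

    def get(self, doc_id):
        # Models a GET by ID: always returns the latest version.
        return self._live.get(doc_id)

    def search_by_id(self, doc_id):
        # Models a /_search with an ids query: only sees refreshed data.
        return self._searchable.get(doc_id)

    def refresh(self):
        # Models the periodic refresh that makes writes searchable.
        self._searchable = {k: dict(v) for k, v in self._live.items()}


users = ToyIndex()
users.index("1", {"name": "John Doe"})
users.refresh()

# Another administrator adds a phone number; no refresh has happened yet.
users.index("1", {"name": "John Doe", "phone": "555-0100"})

stale = users.search_by_id("1")  # still the old version
fresh = users.get("1")           # already has the phone number
print("phone" in stale, "phone" in fresh)  # prints: False True
```

Code that decides "does this user need a phone number?" from the search result would issue a redundant update; code that reads by ID would not.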

Write All Code in the “Retry-able” Sense:

Since Elasticsearch does not support transactions, you need to start thinking in terms of multi-phase commits.

I won’t go into the details of a two-phase or multi-phase commit, but I will introduce a simple flow of events that deals with updating multiple documents without the need for transactions.

Let’s assume a simple case where you want to create a Blog Post and also a bunch of Tags at the same time.

In the SQL world, you’d do this in a transaction, so that the post and its tags are either all created or not created at all.

With Elasticsearch, this is more involved.

Suppose you send a /_bulk request that carries both create operations (one post and, to keep things simple, one tag) in a single call.

We have four possibilities:

  1. Both documents successfully created.
  2. Post created, tag not created.
  3. Post not created, tag created.
  4. Both documents failed.

Since there is no transaction support, your code has to be prepared for every one of these four scenarios.
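As a rough sketch of how these outcomes can be told apart, the Python below inspects a bulk response. A /_bulk response carries an items list with one entry per action, each with its own status code. The response dict here is hand-written and trimmed for illustration (real responses carry more fields), and the function names and the posts/tags index names are assumptions, not part of any client library.

```python
def created_ok(bulk_response):
    """Map (index, id) -> True if that bulk action returned a 2xx status."""
    outcome = {}
    for item in bulk_response["items"]:
        # Each item is a one-key dict: {action_name: result}.
        action, result = next(iter(item.items()))
        key = (result["_index"], result["_id"])
        outcome[key] = 200 <= result["status"] < 300
    return outcome


def classify(bulk_response, post_key, tag_key):
    """Name which of the four scenarios a post + tag bulk call ended in."""
    ok = created_ok(bulk_response)
    post_ok, tag_ok = ok[post_key], ok[tag_key]
    if post_ok and tag_ok:
        return "both created"
    if post_ok:
        return "post created, tag not created"
    if tag_ok:
        return "post not created, tag created"
    return "both failed"


# Hand-written example response: the post succeeded, the tag was rejected.
response = {
    "errors": True,
    "items": [
        {"create": {"_index": "posts", "_id": "p1", "status": 201}},
        {"create": {"_index": "tags", "_id": "t1", "status": 429}},
    ],
}

scenario = classify(response, ("posts", "p1"), ("tags", "t1"))
print(scenario)  # prints: post created, tag not created
```

Note that this only works if the response actually arrives. If the call times out, you do not know which scenario you are in, which is why the recovery logic has to re-read the documents rather than trust the response.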

The only way to manage this is to make all your operations recoverable and retry-able.

Make sure every operation in your code has a path to recovery. In the example above, assume that the Post was created but the tag creation failed. To recover from this, your system has to be able to attempt the creation of that tag again at a later time.

To make that possible, follow this pattern:

  1. When you commit the Post, store the ids of its tags on the Post document.
  2. At the end of your bulk call, load the Post.
  3. If the Post exists -> load the tags whose ids are listed in the Post document.
  4. If any of those tags do not exist -> recreate them.
  5. If the Post does not exist -> recreate the Post, and create a new set of tags associated with it.
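The steps above can be sketched roughly as follows. This is Python against a plain dict that stands in for Elasticsearch, so the recover function, the posts/tags key layout, and the choice to store each tag's name next to its id on the Post (so the tag can be rebuilt) are all assumptions made for the example.

```python
import uuid


def recover(store, post_id, body, tag_names):
    """Bring `store` to a state where the post and all its tags exist.

    `store` is a dict of {(index, id): document} standing in for
    Elasticsearch. The function is safe to call repeatedly.
    """
    post = store.get(("posts", post_id))          # step 2: load the post
    if post is None:
        # Step 5: the post is missing, so recreate it with a NEW set of
        # tag ids. Tags left over from the failed attempt stay orphaned.
        tags = {uuid.uuid4().hex: name for name in tag_names}
        post = {"body": body, "tags": tags}       # step 1: ids live on the post
        store[("posts", post_id)] = post
    # Steps 3 and 4: recreate every tag the post refers to that is missing.
    for tag_id, name in post["tags"].items():
        store.setdefault(("tags", tag_id), {"name": name})
    return post


# Scenario: post created, tag not created.
store_a = {("posts", "p1"): {"body": "Hello", "tags": {"t1": "search"}}}
recover(store_a, "p1", "Hello", ["search"])
print(("tags", "t1") in store_a)  # prints: True

# Scenario: post not created, tag created. The old tag t1 is left behind.
store_b = {("tags", "t1"): {"name": "search"}}
post_b = recover(store_b, "p1", "Hello", ["search"])
print(len(store_b))  # prints: 3
```

In the second scenario the store ends up with three documents: the new Post, its new tag, and the orphaned tag from the first attempt.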

You’re probably wondering what happens if the Post didn’t get created the first time but the Tags did, since we then recreate the Post later on AND also create a new set of Tags.

In such cases, you just have to let go of the tags created by the first attempt. Not having transactions comes at a cost.

Storing the tag ids on the Post is what makes the operation recoverable.

In practice, you should always have a state attribute on every document, and use it to store the “State”, that is, the next operation permitted on that document.

The state can be used to recover and retry operations in most cases.
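A minimal sketch of this idea in Python is shown below. The state names (create_tags, publish, done), the NEXT_STATE table, and the resume function are all hypothetical; the point is only that the state field is advanced after an operation succeeds, so a retry picks up exactly where the last attempt stopped.

```python
# Hypothetical workflow: each state names the next permitted operation.
NEXT_STATE = {"create_tags": "publish", "publish": "done"}


def resume(doc, handlers):
    """Run the remaining operations for `doc`, advancing its state each time.

    If a handler raises, the state is left unchanged, so calling resume()
    again retries that same operation. Handlers should be idempotent.
    """
    while doc["state"] != "done":
        operation = doc["state"]
        handlers[operation](doc)
        doc["state"] = NEXT_STATE[operation]
    return doc


calls = []
attempts = {"create_tags": 0}


def create_tags(doc):
    attempts["create_tags"] += 1
    if attempts["create_tags"] == 1:
        raise RuntimeError("simulated failure on the first attempt")
    calls.append("create_tags")


def publish(doc):
    calls.append("publish")


handlers = {"create_tags": create_tags, "publish": publish}
post = {"title": "Hello", "state": "create_tags"}

try:
    resume(post, handlers)
except RuntimeError:
    pass
state_after_failure = post["state"]  # still "create_tags"

resume(post, handlers)  # the retry completes the remaining steps
print(state_after_failure, post["state"], calls)
```

In a real system the state change would itself be a single-document update, which Elasticsearch applies atomically.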

Try To Reduce Associations

Elasticsearch functions best when you can keep all data related to an object in one document.

This is very hard in practice, but if you can design your data that way, it obviates the need for transactions and state management.

In the example above, keeping all of a Post’s Tags in the Post document solves a lot of problems. Obviously, this cannot be done for something like a comments system, as you could have many thousands of comments.

All in all, Elasticsearch can be used as a database, as long as you are careful about these problems. I will be writing more on this in the future.

Written by

Post-Graduate in Clinical Pathology, Lab Director - Pathofast, Computer Vision Enthusiast, Founder algorini.com
