Purpose-built databases on AWS

The world has changed. Businesses need to innovate faster and, in an effort to be more agile than their competition, are breaking up their monolithic landscapes into distributed (micro)services. Each service can then be developed independently by a team that owns every aspect of its development, security and operations, allowing for a faster release cycle than ever before. But such decentralization comes with its own set of challenges.

To successfully decouple services it is often necessary to give each of them its own database, as a shared database would couple them in a way that defeats the purpose of splitting them up in the first place. This is challenging, but it also opens a new door: it is now possible to select a database type that fits the needs of each service on a case-by-case basis. This is where purpose-built, specialized databases come in, allowing us to optimize for the needs of the service. But picking one from the large offering available is a challenge in itself.

In this post we will look at the benefits and challenges of using purpose-built databases in the cloud, and at how to select one from the offering of the largest cloud provider: Amazon Web Services.

Benefits of purpose-built databases

The obvious benefit of using a purpose-built database is performance. Databases are often optimized for a particular way of accessing the data. It is possible to query more data faster if the type of database matches the access pattern of the service. Take for example a service that works more with relationships between entities than the entities themselves. Such a service could benefit from a graph database that is optimized specifically for storing and retrieving relationships.

Another benefit, closely linked to performance, is scalability. As a rule of thumb you trade flexibility for scalability: the more flexible you are in how you store or retrieve your data, the harder it becomes to scale the database. Services that do not require this flexibility can benefit from simpler databases that are easier to scale. For example, many transactional (OLTP) systems benefit greatly from key-value databases, which scale more easily than their relational counterparts. Think of an online shopping website using a simple key-value database, trading query flexibility for faster read/write speeds.
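The trade-off above can be sketched in a few lines of Python. The toy store below (all names hypothetical) shows why single-key access is easy to make fast and to shard, and which querying flexibility gets traded away:

```python
# Toy key-value store for shopping carts: one key maps to one record.
# Reads and writes touch a single key, which is trivial to make fast
# and to distribute across nodes.

class KeyValueCartStore:
    """Carts keyed by customer id; one key = one record."""

    def __init__(self):
        self._items = {}

    def put(self, customer_id, cart):
        self._items[customer_id] = cart      # O(1) write by key

    def get(self, customer_id):
        return self._items.get(customer_id)  # O(1) read by key

store = KeyValueCartStore()
store.put("cust-42", {"sku-1": 2, "sku-7": 1})
cart = store.get("cust-42")

# The flexibility we traded away: a cross-record question such as
# "which carts contain sku-7?" requires a full scan or a separately
# maintained index, where SQL would give us an ad-hoc query.
```

The same shape of trade-off is what makes managed key-value services scale so well: the access pattern is known in advance, so the data can be partitioned accordingly.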

A less obvious benefit that purpose-built databases can bring is improved development speed and a shorter release cycle. This is harder to quantify, but the argument is that teams can better match the data model to the service. Instead of making the service fit the database, the database can be chosen to fit the way the service works. Graph databases are again a good example, but another one could be the use of document databases. If a service mostly works with independent and flexible records that can differ from one another, then a flexible document database will be a better fit than the rigid relational alternative.

Challenges

The use of purpose-built databases also brings a set of challenges. Some result from distributing your data across services, others from using multiple different database technologies. The biggest challenges here are data consistency, centralized analytics and the skill requirement.

When using multiple databases in your application landscape, consistency can become an issue. When state is spread over multiple services it becomes harder to ensure that there are no contradictions. This is a valid concern and should be designed for. Luckily there are ways to solve this, such as making services completely independent, or creating a separate service that guards consistency by performing compensating (trans)actions, thereby implementing eventual consistency.
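A compensating action can be illustrated with a miniature saga. This is a toy sketch with hypothetical service calls, not a production pattern; real implementations also have to deal with retries, idempotency and partial failures:

```python
# Saga-style sketch: if a later step fails, earlier steps are undone
# by explicit compensating actions, yielding eventual consistency.

def reserve_stock(order):
    order["stock_reserved"] = True           # step 1 (inventory service)

def release_stock(order):
    order["stock_reserved"] = False          # compensating action for step 1

def charge_payment(order):                   # step 2 (payment service)
    if order["amount"] > order["credit_limit"]:
        raise RuntimeError("payment declined")
    order["paid"] = True

def place_order(order):
    reserve_stock(order)
    try:
        charge_payment(order)
    except RuntimeError:
        release_stock(order)                 # undo the earlier step
        order["status"] = "cancelled"
        return order
    order["status"] = "confirmed"
    return order

ok = place_order({"amount": 50, "credit_limit": 100})
bad = place_order({"amount": 500, "credit_limit": 100})
```

The point is that no distributed transaction spans both services; consistency is restored after the fact rather than enforced at every instant.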

Another challenge is consuming data for analytics. Analysts create valuable insights by combining data across the whole of the business. This data is now stored across multiple databases, some of which perform poorly when used for analytics. The solution will depend highly on the analytical requirements, but a popular approach is to feed the data into a data lake and create curated datasets per business concern. The goal is to have a single location where all your data comes together for analytical purposes. Unsurprisingly, once again a purpose-built database can be picked, this time optimized for large amounts of data and the flexibility that analyses require.

The last challenge we will cover is the added complexity that comes from introducing multiple types of databases. Developing against different types of databases is usually not the issue, but the added overhead of installing, configuring, maintaining and optimizing is not to be taken lightly. It is mainly this complexity that has historically formed a barrier for using purpose-built databases. This has changed, and cloud providers are doing a great job at removing this overhead by offering managed databases.

The cloud

Purpose-built databases are not necessarily new. NoSQL became a popular alternative to SQL around 10 years ago, and specialized databases like those used for graphs are decades old. Such databases however always had a barrier to entry. You had to buy and configure hardware and invest in knowledge on how to install, configure and operate them. Scaling up was an involved process and scaling out required more capital investment.

Cloud vendors have lifted many of these barriers since then. It started with virtual machines that you could rent per hour, removing the need for capital investment. Managed databases quickly became a thing, so that you no longer had to provision and maintain your virtual machine, operating system, database installation and updates. Now the modern cloud offers serverless solutions: abstracted databases that you no longer have to maintain, scale or operate in any way.

It is therefore not surprising that the cloud is an integral part of modern development. Furthermore, developing in the cloud grants more advantages. Less or even no maintenance is required due to managed and serverless offerings. Deployments are quicker due to new development practices such as Infrastructure-as-Code. Nowadays developers can focus on building cool stuff (read: adding business value to the product) and spend less time provisioning, configuring and maintaining databases. In a new age where the ability to rapidly innovate is paramount to success these benefits mean a lot.

Offering on AWS

Now that we understand why we would choose a purpose-built database and what challenges to expect, it is time to look at the choices AWS gives us. This is a non-trivial task, as you can choose from over 16 (!) combinations of databases and database engines. For the sake of discussion, and to make reasoning about them easier, we will only look at services in isolation (10 in total) and group them together. This grouping is an oversimplification and is meant to reflect how we experience and think about these databases. Hopefully it will help you make sense of the large number of choices you have and prove to be a good starting point for your own analysis. The groups we will look at are Caching, Warehousing, Business-critical, Transactional processes, and Safe & production-grade.

Caching

If you are looking to speed up your queries or decrease the load on another database, you are looking for a cache. The category of database here is the in-memory database. These are not limited to caching as such; some are used quite handily for quickly changing aggregate data such as leaderboards.

ElastiCache is the caching service offered by AWS. It is available in two flavors: Redis and Memcached. An easy way to differentiate them is to think of Memcached as a key-value cache that is best used for simple caching solutions. Redis, on the other hand, could even serve as a secondary database due to its disaster recovery capabilities and advanced data types. Prefer Redis when building leaderboards or creating caching layers that need more than simple value retrieval.
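To show why Redis fits leaderboards, here is a pure-Python stand-in that mimics what Redis's sorted-set commands ZADD and ZREVRANGE do. No server is needed for this sketch; a real client such as redis-py would issue the equivalent commands against the ElastiCache endpoint:

```python
# Stand-in for a Redis sorted set: members with scores, always
# retrievable in rank order. Redis keeps this ordering server-side.

scores = {}

def zadd(member, score):
    """Like Redis ZADD: insert or update a member's score."""
    scores[member] = score

def zrevrange(start, stop):
    """Like Redis ZREVRANGE: members by descending score, inclusive."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[start:stop + 1]

zadd("alice", 320)
zadd("bob", 150)
zadd("carol", 410)

top_two = zrevrange(0, 1)  # the two highest-scoring players
```

With Redis the ranking is maintained incrementally on every write, which is exactly what a frequently updated leaderboard needs.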

Warehousing

Warehousing is the category of analytics on the scale of petabytes. When building a warehouse you are looking for a database that can store and query enormous amounts of data. Data in a warehouse can be hot (queried often) or cold (queried infrequently).

On AWS you will primarily be looking at Redshift in this category, but for cold data Athena + S3 is worth considering as well. Use Athena when your data is stored primarily on S3 and your analytical requirements are not very complex. The trade-off here is higher latency when querying the data. The upside is that Athena is pay-per-use and much cheaper than Redshift for warehouses that store data more than they query it. In large production environments it is entirely possible for S3, Athena and Redshift to co-exist, as they complement each other.
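A back-of-the-envelope calculation makes the trade-off concrete. The figures below are illustrative assumptions only (Athena has historically been priced around $5 per TB scanned, and a small always-on Redshift node at roughly $0.25 per hour; check current pricing before deciding):

```python
# Hypothetical light analytical workload: 2 TB scanned per month.
athena_price_per_tb = 5.00   # assumed pay-per-scan rate, USD
monthly_tb_scanned = 2
athena_monthly = athena_price_per_tb * monthly_tb_scanned

# An always-on single-node cluster bills every hour, queried or not.
redshift_hourly = 0.25       # assumed rate for a small node, USD
redshift_monthly = redshift_hourly * 24 * 30

print(athena_monthly, redshift_monthly)  # 10.0 vs 180.0
```

The gap shrinks as the query volume grows, which is why the break-even point, not the sticker price, should drive the choice.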

Business-critical databases

Highly performant, resilient and scalable. Business-critical databases have to do many things right. This is the category where database engines start to cost more than the hardware they run on.

Aurora is a cloud-native relational database built on a storage architecture designed specifically for the cloud. It is pricier than RDS, but for the most important processes in your organization the price is very likely worth it.

Quantum Ledger Database (QLDB) is an immutable and cryptographically verifiable ledger store. This may seem like a surprising entry in this category, but business-critical does not always mean maximum performance. Sometimes it means maximum trust. QLDB provides this kind of trust: use it when you need to maintain an immutable trail of changes and guarantee that the data has not been tampered with.
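The core idea behind such a verifiable ledger can be sketched with a simple hash chain, where every entry commits to its predecessor. QLDB's actual journal is more sophisticated than this; the sketch only illustrates why tampering is detectable:

```python
# Each record stores a hash over its own content plus the previous
# record's hash. Changing any historical entry breaks the chain.
import hashlib
import json

ledger = []

def append_entry(entry):
    prev = ledger[-1]["hash"] if ledger else ""
    payload = json.dumps(entry, sort_keys=True) + prev
    ledger.append({"entry": entry,
                   "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify():
    """Recompute the chain; a tampered entry invalidates its hash."""
    prev = ""
    for record in ledger:
        payload = json.dumps(record["entry"], sort_keys=True) + prev
        if hashlib.sha256(payload.encode()).hexdigest() != record["hash"]:
            return False
        prev = record["hash"]
    return True

append_entry({"account": "a-1", "delta": +100})
append_entry({"account": "a-1", "delta": -40})
valid_before = verify()              # chain intact
ledger[0]["entry"]["delta"] = +1000  # tamper with history
valid_after = verify()               # chain broken
```

This is what "verifiable" buys you: anyone holding the chain of hashes can prove the history was not rewritten, without trusting the party that stores it.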

Transactional processes

When your database mostly handles large volumes of operations related to transactions in the business sense of the word, you are dealing with a transactional (OLTP) store. Think of shopping carts for web shops or ATM systems for banks. Such databases often scale really well at the cost of not being good at analytics. This category of storage is characterized by data being inserted and retrieved in a very specific manner.

DynamoDB is an evolution of a database developed by Amazon to solve the issues they experienced with scaling their relational databases. Describing it in detail with the nuance it needs would require us to write a book (do check out The DynamoDB Book if you are interested). At the risk of oversimplifying, the short summary is as follows: DynamoDB is a NoSQL database that is meant to run and scale extremely well with practically no maintenance, making it a prime choice for many processes. But these benefits come at the cost of flexibility. The limitations it imposes on your schema make it difficult to design for, and in certain cases it is objectively a bad choice. A rule of thumb is that DynamoDB should be evaluated if you can describe 80%+ of the queries that you will need for the data beforehand.
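The "describe your queries beforehand" rule follows from DynamoDB's key-based access model. The dictionary below is a toy stand-in (a real application would use boto3 against a table); the point it illustrates is that every efficient read must go through the key schema you designed up front:

```python
# Toy stand-in for a DynamoDB-style table: items addressed by a
# partition key plus a sort key. Key names below are hypothetical.

table = {}  # (partition_key, sort_key) -> item

def put_item(pk, sk, item):
    table[(pk, sk)] = item

def query(pk):
    """All items under one partition key: the predictable, cheap read."""
    return [item for (p, _s), item in sorted(table.items()) if p == pk]

put_item("CUSTOMER#42", "PROFILE", {"name": "Alice"})
put_item("CUSTOMER#42", "ORDER#2021-01", {"total": 30})
put_item("CUSTOMER#42", "ORDER#2021-02", {"total": 55})

customer_data = query("CUSTOMER#42")  # profile plus orders in one read

# "All orders over $50 across all customers" has no key to query by:
# exactly the kind of question you must anticipate in your key design,
# or answer via a secondary index or a full (expensive) scan.
```

If most of your access patterns fit this key-shaped mold, DynamoDB rewards you with near-effortless scaling; if they do not, the design fights you.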

Timestream can be used to quickly gain value from data that is time-based. It is a prime candidate for ingesting data from IoT solutions that measure something continuously. The database is serverless and efficient when querying by, you guessed it, time. As of the time of writing Timestream is still in Preview and can only be used after registration.

Another database that is included here due to its specific access pattern is Neptune. If your data is best described by relationships between entities, that is, your primary entity is the relationship, then Neptune fits the bill. Neptune is a so-called graph database and is extremely efficient when modelling relationships. It is notably the only non-serverless option in this category, meaning that you will have to provision and monitor capacity.
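The access pattern that makes a graph database shine can be hinted at with a plain adjacency list, where a traversal simply follows edges instead of joining tables. A real Neptune client would express this as a Gremlin or SPARQL query; all names here are made up:

```python
# A tiny "follows" graph as an adjacency list: node -> outgoing edges.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["dave"],
}

def reachable(start, depth):
    """Everyone within `depth` hops of `start` (breadth-first)."""
    frontier, seen = {start}, set()
    for _ in range(depth):
        frontier = {n for u in frontier
                    for n in follows.get(u, [])} - seen
        seen |= frontier
    return seen

friends_of_friends = reachable("alice", 2)
```

In a relational schema the same two-hop question needs a self-join per hop, which degrades quickly as the traversal deepens; a graph store walks the edges directly.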

Safe & production-grade

This is the category of safe choices. Picking a database from this list will never be an outright wrong choice, but it might not be the optimal one. Warehousing databases will outperform this category when it comes to large storage and big queries, business-critical databases will provide more of everything and transactional databases will scale much better.

RDS (Relational Database Service) is the managed relational database offering of AWS. Pick your virtual machine and your database engine and off you go. Relational databases are well-known and understood, and RDS is a safe pick for one. If your requirements aren’t too harsh or if you’re just getting started then RDS will do the job.

DocumentDB is the document store in the lineup. It is often compared to MongoDB because Amazon attempts to be compatible with the Mongo API, but it is a different product. If your data is lean on relationships and consists of mostly self-contained documents, then DocumentDB is a fine choice. It also serves as a fallback for when you prefer a NoSQL database but find DynamoDB too limiting.
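The flexibility of the document model can be shown with a toy "collection". A real client would use pymongo against a DocumentDB endpoint; this stand-in just highlights that documents in one collection need not share a schema:

```python
# A "collection" of schemaless documents: each record is self-contained
# and may carry different fields than its neighbors.

products = []

products.append({"sku": "sku-1", "name": "Kettle", "wattage": 2000})
products.append({"sku": "sku-2", "name": "Novel",
                 "author": "B. Writer", "pages": 320})

def find(predicate):
    """Filter documents with an arbitrary predicate (cf. Mongo's find)."""
    return [doc for doc in products if predicate(doc)]

books = find(lambda d: "author" in d)
```

A relational table would force both products into one rigid column set (or into separate tables); the document model lets each record describe itself.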

How to choose

To help you choose from the many databases that are available, we created a decision tree. Please remember that this is an oversimplification. The tree does however communicate which questions you should be asking yourself when picking a database.

Decision tree

Summary

Less is more, but only as long as we are not talking databases. In the age of distributed computing and agile systems there are great advantages to having not only more, but more diverse databases. This allows us to optimize the storage to the needs of each service and enrich the customer experience. Cloud providers are seizing the opportunity and compete with purpose-built databases that fill this need. By leveraging these purpose-built databases we can improve performance, increase productivity and greatly help in moving to a more agile environment.

Picking a purpose-built database is in itself a challenge. Understanding the pros and cons is important, as is knowing the choices that you have. We have shown a simplified overview of the databases that AWS offers and how to choose between them. In reality, choices are rarely as simple as decision trees make them out to be. Experience, skill and time are required to make a good selection, but the benefits of purpose-built databases are worth it.

Ilia Awakimjan

After obtaining his Master's degree, Ilia Awakimjan has been employed since 2017 as a Software Engineer specializing in AWS at Profit4Cloud. Ilia holds the AWS Certified DevOps Professional, AWS Certified Security Specialty and AWS Certified Networking Specialty certifications.