With the popularization of Kubernetes and other cloud solutions, it is now natural to deploy your applications on cloud platforms. Most enterprise software is built using SOA and microservice architectural patterns to maximize the benefits of distributed, highly available platforms. But have you thought about running your DBMS on such a system? I mean, everything should work fine. Right?
Well, it turns out that running a database in a containerized environment isn't as trivial a task as it seems at first. Many of the most widely used commercial database solutions are older than your junior engineers. Some are even older than your senior developers – the first Oracle release dates back to 1979! Nobody was thinking about containerizing a database or distributing it over a large number of computers back then. It is challenging to adapt such a large system to support modern clustering, sharding or replication methods. But it's 2019, containers are broadly used – they aren't some alpha tech anymore – and database companies are adapting their products to fulfill the needs of modern infrastructure.
It's a pretty common workflow to run a database in a container during development. It keeps your machine nice and tidy, speeds up development environment setup and makes everything easily reproducible. You can write automation files and store them in VCS alongside your source code. But is this approach good for a full-blown production environment? We usually see single-instance containerized databases on small projects, which is relatively OK if you are careful enough. You must make sure you persist your data and do regular backups (as you would with a standard database deployment). Keep in mind that most orchestration tools are built for running stateless applications, while databases are stateful. You will need to take that into consideration when trying to run multiple instances of your database.
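As an illustration of that development workflow, a minimal Compose file for a throwaway MongoDB with persisted data might look like the sketch below. The service and volume names here are our own assumptions, not anything prescribed by a tool.

```yaml
# Sketch of a single-instance development database in a container.
# Service name "db" and volume name "mongo-data" are illustrative.
version: "3"
services:
  db:
    image: mongo:3.6
    ports:
      - "27017:27017"
    volumes:
      - mongo-data:/data/db   # named volume so data survives container restarts
volumes:
  mongo-data:
```

Storing a file like this in VCS next to the source code is what makes the environment reproducible for the whole team.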
And what happens when your application grows and the need for high availability emerges? You will have to modify your database infrastructure to provide failover, load balancing and the other benefits of running a multi-instance database cluster. Are containers still a good fit for such scenarios? The answer is, as always: it depends. The database is one of the most critical pieces of your application ecosystem and it's very important that it works predictably. That implies you need a robust infrastructure to host the database instances. You also need stable storage, as databases are heavily stateful applications that rely on disk. Any kind of network storage will impact database performance and produce more network traffic compared to traditional deployments on VMs or physical machines. Considering all this, the most appropriate way of deploying a containerized database is using Kubernetes StatefulSets.
They are adapted for running stateful applications:
- they ensure that storage is stable and persistent
- all pods are labeled with an ordinal name (stable identification)
- pods are created one at a time instead of all at once
- pod rescheduling is stable and persistent
- mounting of persistent volumes is automatic
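The stable identity is easy to picture. With a StatefulSet named `mongo` and a headless service also named `mongo` (the names used in the manifests later in this article), each pod gets a predictable DNS name of the form `<statefulset>-<ordinal>.<service>`. The loop below is just a sketch that prints those names; it is not a kubectl command:

```shell
# Each StatefulSet pod gets a stable ordinal identity; with a headless
# service, that becomes a predictable DNS name per pod.
name=mongo
svc=mongo
for i in 0 1 2; do
  echo "${name}-${i}.${svc}"
done
# prints mongo-0.mongo, mongo-1.mongo, mongo-2.mongo (one per line)
```

A plain ReplicaSet, by contrast, names its pods with random suffixes, so there is no stable address to put in a replica-set configuration.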
Going MEAN in containers
If you are reading this article, you are probably into web application development and you know that MEAN is shorthand for the MongoDB-Express.js-Angular-Node.js stack. It is very popular among web developers because you can build your app very fast and go to production in no time. We will focus only on the M part of the stack and leave EAN for some other time.
To run MongoDB on Kubernetes in a stateful manner, we will use Kubernetes StatefulSets. We already mentioned that StatefulSets are Kubernetes API objects used to manage stateful applications. The difference between a StatefulSet and a ReplicaSet is that every pod gets a unique and predictable (ordered) name, which is important in our case because we have to distinguish every node in the Mongo replica set (not to be confused with a Kubernetes ReplicaSet). Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling. One thing to keep in mind is that you are responsible for providing persistent volumes backed by the requested storage class.
To set up a MongoDB replica set, you need three things: a StorageClass to provision persistent volumes, a headless service and a StatefulSet. Cluster administrators define the StorageClasses we can use, and we only need to know which one suits our needs. Usually there will be only one class if we use a distributed file system like GlusterFS or Ceph. A headless service is a special kind of Service resource, created by setting .spec.clusterIP to "None". That creates a service without a single service IP or load-balancing capabilities. In our case this is important because we want to access each Mongo node individually. Below are example service and StatefulSet definitions. A StorageClass definition isn't provided because we assume you have set up dynamic provisioning for your cluster, which included a StorageClass definition.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongo
  labels:
    name: mongo
spec:
  ports:
    - port: 27017
      targetPort: 27017
  clusterIP: None
  selector:
    role: mongo
```
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: "mongo"
  replicas: 3
  selector:
    matchLabels:
      role: mongo
  template:
    metadata:
      labels:
        role: mongo
        environment: test
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: mongo
          # pinned to 3.6: --smallfiles/--noprealloc were removed in later versions
          image: mongo:3.6
          command:
            - mongod
            - "--replSet"
            - rs0
            - "--smallfiles"
            - "--noprealloc"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongo-persistent-storage
              mountPath: /data/db
        - name: mongo-sidecar
          image: cvallance/mongo-k8s-sidecar
          env:
            - name: MONGO_SIDECAR_POD_LABELS
              value: "role=mongo,environment=test"
  volumeClaimTemplates:
    - metadata:
        name: mongo-persistent-storage
        annotations:
          volume.beta.kubernetes.io/storage-class: "glusterfs"
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
```
The StatefulSet definition is pretty straightforward. The only difference from a ReplicaSet/Deployment configuration is that it has a volumeClaimTemplate. Maybe the most interesting part is the mongo-sidecar container. If you have ever tried to configure a MongoDB cluster manually, you know it is a pretty tedious task, and it's easy to miss a step and then have to start everything from scratch. mongo-k8s-sidecar is an attempt to simplify those steps for you. It's a Node.js application that performs automatic configuration no matter how many replicas we put in the replica set. There is no need for further knowledge of its code; the documentation is sufficient for you to work with the provided container image. You can find more information here.
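To get a sense of what the sidecar automates, here is a sketch that assembles the `rs.initiate()` command you would otherwise run by hand. The host names assume the `mongo` StatefulSet and headless service defined above; the kubectl line at the end is shown only as a comment, since it needs a live cluster:

```shell
# Build the rs.initiate() config that the sidecar would otherwise
# generate, using the stable per-pod DNS names from the headless service.
members=""
for i in 0 1 2; do
  members="${members}{_id: ${i}, host: \"mongo-${i}.mongo:27017\"}, "
done
cfg="rs.initiate({_id: \"rs0\", members: [${members%, }]})"
echo "$cfg"

# You would then run this inside one member, e.g.:
# kubectl exec mongo-0 -c mongo -- mongo --eval "$cfg"
```

Every scaling event would require redoing this by hand, which is exactly the tedium the sidecar removes.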
And that’s it. All you have to do now is:
```shell
kubectl apply -f service.yaml && kubectl apply -f statefulset.yaml
```
and a 3-node Mongo replica set will be created.
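Applications inside the cluster can then connect through the stable per-pod DNS names. Below is a sketch of building the replica-set connection string; the host names follow the manifests above, and the database name `test` is just a placeholder:

```shell
# Assemble a replica-set connection string from the stable pod DNS names.
hosts=$(for i in 0 1 2; do printf "mongo-%d.mongo:27017," "$i"; done)
echo "mongodb://${hosts%,}/test?replicaSet=rs0"
# → mongodb://mongo-0.mongo:27017,mongo-1.mongo:27017,mongo-2.mongo:27017/test?replicaSet=rs0
```

Listing all members plus `replicaSet=rs0` lets the driver discover the current primary on its own, so a failover does not require changing the application configuration.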
So, am I good to go with containers?
In conclusion, before even attempting to put a production database in a container, ask yourself: do I really NEED containers for it? The idea seems tempting and easy, but there are many challenges that can bite you back. If you are not hosting your own cloud environment, your cloud provider will usually offer managed database solutions you can use. For example, Amazon has RDS, Google has Cloud SQL, and if you need a non-relational database you can use Amazon DynamoDB. If you have your own data center, you'll probably be better off dedicating VMs to running database services. The only scenario in which running a production database in a container is a good option is when you really don't have any other choice. And in that case, make sure that you really know what you are doing. And be ready to react fast when problems occur (and be sure that they will). Remember: there is no better way to wake up in the morning than a malfunctioning production database.