Multi-tenancy.. do you really need structured data?

One of the big architectural decisions you have to make in multi-tenant software is how you store and partition your data. In many cases you will choose for a RDMBS such as Sql Server, MySQL or Oracle. The choice for an RDMBS is well founded; you get powerful querying, transactions, recovery, backups, indexing and management capabilities. Sure, a RDMBS might not easily scale-out but by scaling up you can come a long way.

Once you go the RDBMS route you have to figure out a database schema that will suit your application. Of course it has to support multi-tenancy. And since you have only a single instance of your application your application must take care of dealing with querying the correct data. In your database design you basically have three options:

- A single database for each tenant

- A shared database, multiple-schema for each tenant

- A shared database, shared schema

This all has been well explained in this MSDN document, and that is not where I want to focus at. Instead, I want to share some design struggles I had with this.

I recently have been quite busy figuring out which path to take. The difficult notion here is how to deal with tenant-specific customizations. For example, different tenants might have the same business entity extended with different attributes. This does not align with a relation server, which only supports a fixed database schema. So you either have to design a very generic and flexible schema, in which case all variability is handled by your application layer. But this tends to lead to awkward and inflexible querying and an unclear schema with names like attribute1, attribute2, attribute3, and a lot of meta-data. The second option is to modify the schema at runtime, which is obviously only possible if each tenant has its own schema or database, a shared schema is not possible. The runtime modification of database schemas seemed like a sensible approach. Just use DDL for schema modifications and introspection and some application layer doing the translation work. But this quickly turns out quite complex and error-prone. For example, what to do with schema updates in your application? And how to upgrade existing tenant data to a revised schema?

Dealing with these problems I figured that the real problem might be that I am locked in this ‘ it has to fit in SQL’ mindset. When I came to think of it, a lot of these customer extensions just end up in some forms or reports. They do not interfere with the core functionality of your application. Why, then, should you store this data in a very structured way which causes all the associated hassle? Why not use, let’s say, the semi-structured XML data column in SQL server to store all this tenant-specific, and possibly ambiguous, data? The world-wide web is pretty much semi-structured and ambiguous, but it works pretty well in the end, doesn’t it? And if it works for the web, why should it not work for me?

My current approach therefore is to have an explicit distinction between the structured and semi-structured data I deal with in my application. As it turned out, the structured data is very fixed among the different tenants. This allows me to adopt a shared database, shared schema approach. This might, or might not be the best way to go security-wise (tenants should never ever be able to see each others data), but at least I do have the option. This distinction also leads to a much more elegant and robust design with less complexity.

So, the next time you are struggling with a db schema, rethink it over; does it really need to be structured? Should it really be indexable and queryable? Or does it allow for a semi-structured data approach? The choice for less rigidity and more flexibility might save you a lot of troubles down the road.