Data Architecture – Too much normalizing?

I’m slowly breaking out of the mentality that all data must be normalized as much as possible. As I work through creating an event sourced system from an existing application, I find that a lot of data ends up in the same place even though it’s not the same data. Let’s look at a simple example, a country with provinces. In a relational database, you’d probably have a lookup table for each of these, and have a foreign key into the province table so we know which province belongs to which country. And then we’d have an address table that has a foreign key to the country and province table. Compare this to a JSON document that simply stores all the provinces as an array of objects:

    {
      "Country": {
        "Id": "CA",
        "Name": "Canada",
        "Provinces": [
          {
            "Id": "AB",
            "Name": "Alberta"
          }
          // ...
        ]
      }
    }
The address, then, can’t even store any kind of reference id to the province; it must store some minimal amount of information about the province within the address document itself:

    {
      "StreetAddress": "123 Fake Street",
      "City": "Calgary",
      "Province": {
        "Abbreviation": "AB",
        "Name": "Alberta"
      },
      "PostalCode": "H0H 0H0"
    }

So, really, what’s wrong with this? I guess the main criticism is that we don’t have a central location to update the name of the province. But that’s easy enough to deal with:

    public static async Task UpdateMultipleItems<T>(
        this IMongoCollection<T> collection,
        Expression<Func<T, bool>> query,
        Func<T, Task> update) where T : IId
    {
        var writes = new List<ReplaceOneModel<T>>();
        foreach (var item in collection.AsQueryable().Where(query))
        {
            // Apply the caller's mutation, then queue a replacement by id.
            await update(item);
            writes.Add(new ReplaceOneModel<T>(Builders<T>.Filter.Eq(e => e.Id, item.Id), item));
        }

        // BulkWriteAsync throws if given an empty list, so skip the call when nothing matched.
        if (writes.Count > 0)
            await collection.BulkWriteAsync(writes);
    }

When we receive an event that a province name was updated, we simply update all documents that contain that province. If this is something we plan to do regularly, an index would certainly help.
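As a sketch of how that event handler might look: the `ProvinceNameUpdated` event and `AddressDocument` types below are assumptions for illustration, not part of the original code, and this obviously needs a live MongoDB connection to run.

```csharp
// Sketch only: assumes an Address document embedding a Province, and an
// event carrying the province abbreviation and its new name.
public async Task Handle(ProvinceNameUpdated evt, IMongoCollection<AddressDocument> addresses)
{
    // Rewrite every address document that embeds the renamed province.
    await addresses.UpdateMultipleItems(
        a => a.Province.Abbreviation == evt.Abbreviation,
        a => { a.Province.Name = evt.NewName; return Task.CompletedTask; });
}

// If renames happen often, an index on the embedded field keeps the query cheap:
await addresses.Indexes.CreateOneAsync(
    new CreateIndexModel<AddressDocument>(
        Builders<AddressDocument>.IndexKeys.Ascending(a => a.Province.Abbreviation)));
```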

I think it’s important to stick with DDD principles here. If something isn’t an aggregate root, it shouldn’t have its own documents. The province, here, is not an aggregate root as it doesn’t even exist outside the context of its containing country. So, we’ll never see provinces as more than a property of some other aggregate root. Given the size of a province, it seems easiest just to store its value inline.

Ok, so most small entities can be dealt with in this fashion. We do have _some_ need for normalizing, though. Consider the case where there is a relationship between two aggregate roots. In this case, we can simply store a reference id for this property, and use a lookup to get the associated document. But why not take some of what we’ve learned above? For example, instead of merely storing the id value, also store a name or description value as well. And you’re not limited to a single field. Perhaps there’s a LastUpdated field or similar that you’d want to retrieve without loading the entire linked document. Yes, you will have to use the same technique as above to update that field when it changes, but in a lot of cases, you won’t need anything more than a text identifier until a user actually triggers loading of the entire document.
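As a sketch of that idea (the aggregate and field names here are illustrative, not from the original), an order referencing a customer aggregate might carry a few denormalized fields alongside the id, so that a list view never needs to load the full customer document:

    {
      "Id": "order-1001",
      "Customer": {
        "Id": "cust-42",
        "Name": "Jane Doe",
        "LastUpdated": "2015-06-01T00:00:00Z"
      }
    }

When the customer’s name changes, the same bulk-update technique rewrites the embedded copy in every referencing order.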

I believe this to be a sound approach. We are already working with eventually consistent databases, so a slight delay in updating subdocuments shouldn’t have a profound impact. I’ll need to work with much larger datasets before I have any basis for comparison, but there are benefits to working this way:

  • Documents remain logically separated. We don’t link things together simply because one entity has the text we want to use. An address belonging to one type of entity is not necessarily the same thing as an address belonging to another type of entity, and indeed may have different business rules.
  • We gain speed at the expense of space. This is generally a tradeoff most people are willing to make these days.
  • The document for any given aggregate root is human-readable. It is not necessary to perform multiple lookups to obtain the necessary information.

The flip side of this is that we do repeat ourselves, at least superficially. In the address example, there are two sets of POCOs that represent addresses. That is not itself an indication that it’s wrong, but you may need to further consider whether those addresses are, in fact, aggregate roots themselves. However, if they’re not, then I’d continue to argue that the values should be stored inline. I’ll look into the performance implications of this position and follow up. For now, though, it would seem that we are normalizing too much, and much clarity is to be gained by duplicating storage of similar information.
