I've spent a good deal of time in the last year re-architecting a web application. Mostly, the issues came about as a result of heavy use. Building a single-server MVC application is pretty easy, and there are lots of frameworks to help. Early in your career, that's what a web application is, and it works pretty well. Until it doesn't. There comes a time when simply throwing more resources at the problem doesn't yield great returns. A lot of pain can be avoided if you plan for scalability from the start instead of watching your servers struggle with the load and listening to your clients complain. I'm going to present a roadmap for moving your webapp into a distributed architecture that can easily scale out to meet demand.
Most of you are familiar with so-called "n-Tier" architecture. MVC is an example of n-Tier architecture. It has a presentation layer, a logic layer and a data layer. And generally, it all runs on the same server. But it doesn't have to. If we can separate these tiers onto different servers, we can improve multi-user performance significantly. Before presenting the roadmap, I'm going to talk about some of the issues I've encountered in the past. Maybe some of these will be familiar:
1. Your views are not very dynamic
If the server is rendering every view, every page load burns server resources on work the browser could be doing. Modern client-side frameworks can take that rendering off the server entirely.
2. You do too much processing in the controllers
Following along the theme of making your website processing use the client browser instead of server resources, why are you doing all that processing in the controller anyway? Let's go back to the idea that websites are lightweight and fast, and do minimal processing. Let's stick with a static website for serving the UI and assets, and look at using APIs instead.
3. Too much of your processing is synchronous
If you've ever analyzed a profiler trace from a busy website, you'll find that a lot of the reason it's so slow is that it's waiting for system resources. Waiting for other threads. Waiting for database locks. Of course this is a symptom of poorly written code, but part of the reason that it's poorly written is that it's poorly organized.
4. Long-running processes slow down other clients
This is basically a combination of 2 & 3. If your web server can be brought to a screeching halt by the heavy processing requirements of a single user, then something is seriously wrong. Your website should always be responsive. Long-running, slow processes are sometimes unavoidable, but everyone else needs the webserver to be snappy.
When we look at all these symptoms, a vision begins to take shape. We need to offload the server-side processing to some kind of computation farm, and let the webserver continue to be responsive. In short, we need to distribute the load better, and take advantage of asynchronous processing. So, with that in mind, let's have a look at how we can take an application designed for a single server, and re-architect it into a distributed system. Here's my recipe:
1. The presentation layer will be served up as a static website. I'd suggest React, but be sure to research the available web UI frameworks before settling on one.
2. The presentation layer will interact with one or more REST APIs to synchronize with the back-end model. I'd suggest .NET Core here.
3. The REST APIs themselves will offload processing to a back-end worker process on one or more servers. They'll send and receive messages through a message queue.
4. The back-end worker processes will read messages from the queue and do the required processing, sending responses back to the API.
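The shape of this recipe can be sketched end-to-end. This is a minimal illustration in Python (standing in for the .NET stack), with an in-process `queue.Queue` playing the role of the message broker; in a real deployment the queues would be RabbitMQ or Azure Service Bus and the worker would run on its own server:

```python
import queue
import threading

# Stand-in for the message broker (RabbitMQ, Azure Service Bus, ...).
requests = queue.Queue()
responses = queue.Queue()

def worker():
    """Back-end worker: reads messages off the queue, does the heavy
    processing, and sends a response back for the API to return."""
    while True:
        message = requests.get()
        if message is None:  # shutdown signal
            break
        # Placeholder for the real processing work.
        responses.put({"id": message["id"], "result": message["payload"].upper()})

def api_handler(payload):
    """REST API layer: offloads work to the queue and waits for the
    reply, keeping the web tier itself thin and responsive."""
    requests.put({"id": 1, "payload": payload})
    return responses.get(timeout=5)

threading.Thread(target=worker, daemon=True).start()
result = api_handler("hello")   # → {'id': 1, 'result': 'HELLO'}
requests.put(None)              # stop the worker
```

The point of the sketch is the separation: the API process never does the heavy lifting itself, so it stays snappy no matter how slow the work is.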
I'll go a little more in-depth on the message queue later. It's the glue between the presentation layer and the logic layer. Before that, though, there are a few conveniences of a single-server architecture that don't survive the move to a distributed system:
- Authentication. The nice thing about a single-server architecture is it's easy to identify your users.
- Application-level caching. The nice thing about a single-server architecture is that you can just stick it in server memory (maybe a session, or application cache)
- Filesystems. The nice thing about a single-server architecture is you can just write files to the server disk.
Let's start with the big one: authentication. This might at first glance seem to be a bit difficult to overcome. You need a front-end (who doesn't know who you are) to be able to convince the back-end that you're authorized to perform an action. OAuth is a good way to go with this. It's easy to write your own OAuth server, and have your website pass bearer tokens around. You can validate that the OAuth token came from your server without too much trouble. It's reasonably secure, and with a bit of work, you can harden your security to avoid things like replay attacks (where a man in the middle "replays" a request or extracts the bearer token from it). Definitely put some thought into the design here. You want to come up with a design that allows multiple applications to authenticate through your OAuth server. Your front-end will be responsible for ensuring that the bearer token it is using is current, and silently acquire a new one with the refresh token when it is close to expiring.
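That "silently acquire a new one" responsibility can be captured in a small token manager. A sketch in Python, where `request_tokens` is a hypothetical callable hitting your OAuth server's token endpoint; the refresh margin and the fake endpoint are illustrative, not prescriptive:

```python
import time

class TokenManager:
    """Keeps a bearer token fresh. `request_tokens` is a hypothetical
    callable that hits the OAuth server's token endpoint and returns
    (access_token, refresh_token, expires_in_seconds)."""

    REFRESH_MARGIN = 60  # refresh when fewer than this many seconds remain

    def __init__(self, request_tokens):
        self._request_tokens = request_tokens
        self._access = None
        self._refresh = None
        self._expires_at = 0.0

    def bearer(self):
        """Return a current access token, silently refreshing it when
        it is close to expiring."""
        if time.time() >= self._expires_at - self.REFRESH_MARGIN:
            access, refresh, ttl = self._request_tokens(self._refresh)
            self._access, self._refresh = access, refresh
            self._expires_at = time.time() + ttl
        return self._access

# Fake token endpoint, for illustration only.
def fake_endpoint(refresh_token):
    return ("access-" + str(refresh_token), "next-refresh", 3600)

tm = TokenManager(fake_endpoint)
token = tm.bearer()   # first call acquires a token
same = tm.bearer()    # still fresh: no second round-trip
```

Every API call then asks the manager for the current token instead of caching one itself, so expiry is handled in exactly one place.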
Application-level caching comes up frequently. Many web developers make some use of the HTTP session on the server-side to cache any number of things. However, in a distributed architecture, your HTTP session has to live somewhere other than server memory. Redis is a good solution here. You can serialize your session to a Redis cache, which makes it available to any webserver handling your request. The key point here is "serialize". An HTTP session on a single-server architecture lives in server memory, and isn't required to be serializable. You may need to rework some things to allow it to be serialized.
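The serialization requirement looks roughly like this. A sketch in Python: a plain dict stands in for Redis here (in production this would be a Redis client, with `set`/`get` going over the network), and the session keys are made up for illustration:

```python
import json

# Stand-in for a Redis instance; in production this would be a Redis
# client, and the dict operations below would be network calls.
store = {}

def save_session(session_id, session):
    # The session must be serializable -- anything a single-server app
    # kept as live objects in memory has to survive this round-trip.
    store[session_id] = json.dumps(session)

def load_session(session_id):
    raw = store.get(session_id)
    return json.loads(raw) if raw is not None else {}

save_session("abc123", {"user": "alice", "cart": [42, 7]})
# Any webserver in the farm can now rehydrate the same session.
session = load_session("abc123")
```

Anything in the session that doesn't survive `json.dumps` (open connections, live object graphs) is exactly the "rework" the paragraph above is warning about.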
It seems pretty obvious to put file storage on a network share, but when designing for a single-server architecture this option is frequently passed over, with good reason: the local disk is more reliable and always available. You are going to need to handle both brief and long-running network outages when using shared file storage. This is a downside to a distributed system: the services we rely on aren't always going to be available. No matter how stable and reliable your network, there will be an outage. Your retry logic in all cases should make a reasonable attempt to reconnect before giving up. Your users shouldn't get errors because of a network blip.
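Retry-with-backoff logic of that sort is simple to centralize. A minimal sketch in Python; the flaky `write_to_share` function is a simulated file-share write, and the attempt count and delays are arbitrary starting points, not recommendations:

```python
import time

def with_retries(operation, attempts=5, base_delay=0.1):
    """Run `operation`, retrying with exponential backoff on failure.
    Gives the network a reasonable chance to recover before giving up,
    so users don't see errors from a momentary blip."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky file share: fails twice, then succeeds.
calls = {"n": 0}
def write_to_share():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("network blip")
    return "written"

result = with_retries(write_to_share, base_delay=0.01)
```

Wrapping every file-share access in something like `with_retries` keeps the "network blip" handling out of your business logic.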
Those are the basic concerns with a distributed system. There's probably a fair bit of work to do if you already have a legacy application that you're rearchitecting, simply because it's easy to make assumptions when everything is on the same server. Start with the above, and put the application in a web farm behind a reverse proxy, and make sure the application still works. Don't cheat and use "sticky sessions" or anything like that. Make sure your client can migrate between servers in the farm without issue.
Once your application is running in a web farm, you can start the heavy lifting. For a non-trivial application, you can expect to spend quite a bit of time with the re-architecture. This roadmap is biased in favor of CQRS and microservices. It also is biased against containers. There are many equally good designs, but this one has proven itself with time. So, here's the roadmap I suggest:
1. Factor out your controller logic into discrete commands and queries. CQRS tells us that "commands" are read-write operations, and that "queries" are read-only operations. We might not do anything with that distinction right now, but it's good to have the option down the road.
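The command/query distinction can be as lightweight as two kinds of plain data objects. A sketch in Python (the .NET equivalent would be simple POCOs); `RenameCustomer` and `GetCustomerName` are hypothetical examples, not from the original application:

```python
from dataclasses import dataclass, asdict

# CQRS: commands are read-write operations, queries are read-only.
# Keeping both as plain serializable objects means they can later
# travel over a message bus unchanged.

@dataclass(frozen=True)
class RenameCustomer:       # a command: mutates state
    customer_id: int
    new_name: str

@dataclass(frozen=True)
class GetCustomerName:      # a query: only reads state
    customer_id: int

# Serializable out of the box -- ready for the message queue later.
wire_format = asdict(RenameCustomer(customer_id=7, new_name="Acme Ltd"))
```

Note that nothing here dispatches or executes anything yet; the payoff of the distinction comes later, when commands and queries can be routed and scaled differently.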
2. Your commands and queries should be serializable. They will be sent over the message bus. I put them into the domain model in their own namespace.
3. Each command and query needs its own handler. The handler should be part of a "service" which handles commands and queries for a specific domain. "Service" is kind of an ethereal term here. I've made my services into class libraries that are easily incorporated into any kind of executable. For now, I combine multiple services into a single executable that runs as a Windows service. There are certainly other options in the long term. I am planning to put each service into an Azure Function or WebJob, which will allow for easy scale-out/scale-in configuration. Containers are another option for the same reason.
4. You'll probably notice when you start task 3 that there's a lot of boilerplate code that can easily be factored out into a base service class. For example, all services will require dependency injection, initialization code, and a basic bootstrapper. A ServiceBase, then, might contain methods to do all of these: ConfigureContainer(), Initialize(), Run().
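The shape of that base class might look like the following. This is a sketch in Python rather than C#, with a toy dependency container; the method names mirror the ConfigureContainer(), Initialize(), Run() trio described above, and `BillingService` is a hypothetical example:

```python
class ServiceBase:
    """Boilerplate shared by every back-end service: dependency
    container setup, initialization, and a bootstrapper. Subclasses
    plug in only their domain-specific handling."""

    def __init__(self):
        self.container = {}
        self.started = False

    def configure_container(self):
        # Register shared dependencies; subclasses register their own.
        self.container["config"] = {"queue": "default"}

    def initialize(self):
        self.started = True

    def run(self):
        # The bootstrapper: same sequence for every service.
        self.configure_container()
        self.initialize()
        return self.handle()

    def handle(self):
        raise NotImplementedError  # each service supplies its own

class BillingService(ServiceBase):
    def handle(self):
        return "billing service ready"

status = BillingService().run()
```

Each new service then becomes a small subclass plus its handlers, rather than another copy of the wiring code.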
5. Your services will be receiving messages across a message queue. Each handler, then, is a consumer of a particular kind of message. We'll talk a bit more about options below.
6. You may find it necessary to rate-limit your message handlers somewhat. For example, if you are receiving thousands of authentication requests per second, you may well find that your database runs out of available connections, causing a denial of service for the rest of the application.
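One simple way to rate-limit a handler is to cap concurrency at the size of the database connection pool. A sketch in Python using a bounded semaphore; the pool size of 4 and the request count are arbitrary for the demonstration:

```python
import threading

# Cap concurrent database work at the connection pool size, so a burst
# of messages can't exhaust connections for the rest of the application.
MAX_DB_CONNECTIONS = 4
db_slots = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)
peak = {"in_flight": 0, "max": 0}
lock = threading.Lock()

def handle_auth_request(_message):
    with db_slots:  # blocks here if the pool is saturated
        with lock:
            peak["in_flight"] += 1
            peak["max"] = max(peak["max"], peak["in_flight"])
        # ... database work would happen here ...
        with lock:
            peak["in_flight"] -= 1

# Simulate a burst of 50 authentication requests.
threads = [threading.Thread(target=handle_auth_request, args=(i,))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak["max"] never exceeds MAX_DB_CONNECTIONS
```

Excess messages simply wait their turn at the semaphore instead of piling onto the database, which degrades gracefully rather than denying service.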
7. Now that your controller logic has migrated to an actual service layer, your newly unburdened web application can be exposed as an API written in .NET Core.
8. Your web UI can be rewritten in a client-side scripting language, and hosted on a static website.
I'm going to spend some time talking about message queues, since that's what I've been working on lately, and what actually inspired this post. I thought it would be better to talk about it in the context of a re-architecture, though, so you can see how the pieces fit together.
There are a lot of options for message queueing. You can use a dedicated message queue server, such as RabbitMQ. Some databases have a queue facility. Redis has a queue facility. Azure has at least one option for message queueing (Service Bus). I'd highly advise against rolling your own message queuing code though. There are some good libraries available that greatly simplify the task, and abstract away a lot of the low-level features required to make it work. I've used three of them: EasyNetQ, NServiceBus, and most recently MassTransit. All of these libraries do a decent job of making the code easy to write. MassTransit and NServiceBus support multiple transports. EasyNetQ is bound to RabbitMQ. NServiceBus costs money for a commercial license. I started with EasyNetQ, but am currently considering MassTransit as a replacement.
RabbitMQ is a decent message queue that supports clustering and high-availability. For the life of me, though, I've never been able to get a cluster to work reliably in the face of network outages (netsplits). I looked into CloudAMQP as a hosted solution, but I'm finding that not having access to the actual server configuration is making it difficult to reproduce my RabbitMQ solution in the cloud. EasyNetQ was a pretty good solution though. ServiceBase had a blocking collection of ServiceWorkers, and upon receiving any type of message, it would determine the type, and use MediatR to find the correct command handler. Rate-limiting was pretty easy to achieve in this fashion. The code to find the right command handler obviously required reflection, and was kind of ugly. But it got the job done. If you can get RabbitMQ working to your satisfaction, a solution along those lines is probably fine, and has the advantage of being (mostly) free. You'll probably want an HA cluster, though, and that will cost some money.
I prefer the approach that NServiceBus and MassTransit take, though. You register consumers for a particular type of message, and those consumers will automatically be called. I would expect that something similar to the ugly reflection code I used would be required to make this work, but at least that detail is well-hidden. For example, MassTransit allows you to implement IConsumer<TMessage>, and configure an endpoint as implementing specific consumers. Ultimately, your message handler is a single class implementing IConsumer<TMessage>. When correctly configured, a client application sending a message of type TMessage will ultimately end up in your handler code. MassTransit's documentation to this effect, though, is somewhat confusing and sparse. NServiceBus was better documented and easier to implement.
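The IConsumer<TMessage> pattern itself is worth seeing in miniature. This is a sketch of the pattern in Python rather than MassTransit's actual C# API: one consumer class per message type, wired up by an explicit registry standing in for endpoint configuration. All the class and message names here are illustrative:

```python
# The consumer-per-message-type pattern, sketched in Python.

class UserRegistered:
    """An example message type."""
    def __init__(self, email):
        self.email = email

class Consumer:
    """Base class standing in for IConsumer<TMessage>."""
    def consume(self, message):
        raise NotImplementedError

class UserRegisteredConsumer(Consumer):
    """Handles exactly one message type."""
    def consume(self, message):
        return f"welcome email queued for {message.email}"

class Bus:
    """Toy endpoint: routes each message to the consumer registered
    for its type, the way a configured receive endpoint would."""
    def __init__(self):
        self._consumers = {}

    def register(self, message_type, consumer):
        self._consumers[message_type] = consumer

    def publish(self, message):
        return self._consumers[type(message)].consume(message)

bus = Bus()
bus.register(UserRegistered, UserRegisteredConsumer())
outcome = bus.publish(UserRegistered("a@example.com"))
```

The appeal is exactly what the paragraph describes: the client just publishes a `UserRegistered`, and the routing from message type to handler is someone else's (well-hidden) problem.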
As MassTransit and NServiceBus both support multiple transports (notably, RabbitMQ and Azure Service Bus), you can experiment with different transports and find one to your liking. RabbitMQ looks good in terms of features, but seems to require a fair bit of knowledge to configure correctly in a cluster. I'm currently experimenting with Azure Service Bus, but I haven't made a final decision yet. Given the cost of RabbitMQ (in either time or money or both), I'm leaning away from it at the moment. Azure Service Bus is probably implemented as a cluster on Azure, but that detail is hidden from the end-user. Cost-wise it seems to be reasonable. If the performance and security features are sufficient, I'll probably stick with that.
I'll fully admit that this is a novice design. However, I've seen vast improvements in performance, reliability and scalability since rearchitecting like this. Further refinements are necessary I'm sure, but that's part of the joy of architecture. Today's design shows its flaws, and tomorrow's design addresses them: ever in search of perfection. I hope you've found this discussion worth thinking about. As always, I welcome your feedback and questions.