Granular deployment strategy

Artem Volobuev
9 min read · Feb 13, 2024


Introduction

Many of us who write code, whether it is a bug fix or a new feature, don't really know how that code will be deployed. Sure, we know that some pipeline will deliver it to production. But how often do we think about the risks a new deployment brings? Is it safe enough to simply swap the old hosts for hosts running the new code? If you struggle with this question, it is definitely worth reading on. And if you've never thought about it, keep reading: you will learn many new things.


From a risk management standpoint, there are many options for how we could deploy our code. I'm going to tell you about one very specific way of doing it, which I call the "granular deployment strategy with a parameter". But before that, let's review the common options.

Understanding Deployment Strategies

As I already said, there are many deployment strategies, each with its pros and cons. Let's name a few of them to set the context:

Big Bang — a funny name, but it speaks for itself. This method dates back to the days when software had to be installed by the user manually: the user shuts down the old version of a program and starts the new one. You can already see some problems here:

  1. Downtime. Your system is down and no one can reach it until the new version is running. If your customers are fine with maintenance windows, then perhaps it is a way to go.
  2. The impact in case of failure is high. Since you replace the old version entirely, if the core logic is broken in the new version, all your users will be affected.
  3. During an incident, the rollback strategy is to run the deployment steps in reverse. This means users cannot use your system while the rollback is happening. In some cases, a rollback can be very complicated: for instance, if your application uses a database, you need a strategy for rolling back the data too.

There are positive things in this strategy though:

  1. Infrastructure cost is low. You don't need to introduce any new pieces of infrastructure; you deploy onto the same hosts the old version of the system used.

Yeah, what a joke, there is only one positive thing. This strategy doesn't suit you if your system is an online service or anything similar where uptime is crucial for clients.

Blue-Green — a better option than the Big Bang strategy, but it still has some drawbacks. You have two versions deployed simultaneously: the old version on a blue host and the new one on a green host. Once the new version is ready and all validation steps have passed, we add the green host to a load balancer (yes, you need to expand your infrastructure with a load balancer). From that moment production traffic goes to both versions, old and new. We validate again to make sure we are ready to take the blue host out of the load balancer.

It obviously has more pros than cons, so let's start with the pros:

  1. No downtime. Since at any moment either the green or the blue host is serving requests, your clients can interact with the system without interruption.
  2. The client impact in case of any failure is smaller, since only half of the traffic will be affected.
  3. Rollback is easy. To stop the client impact, you take the green host out of the load balancer and put the blue host back in. Simply put, you run the rollout steps backwards.

From the cons standpoint we have:

  1. Infrastructure cost is higher. Remember the load balancer I mentioned: you have to pay for it, and you have to pay for the old and new versions running side by side.
  2. Maintenance and operating effort is higher. The more moving parts your system has, the more effort you need to spend; it is the law.

Canary — an upgrade of the Blue-Green strategy. Imagine you have an old (blue) host and a new (green) host. This time traffic goes to both nodes, but not in a 50/50 proportion: say, 90% to the blue host and 10% to the green one. We decrease risk by routing only a tiny portion of traffic to the new host. Once we are sure there are no issues, we change the proportion: 50/50, then 20/80, and in the end 0/100, which means the old host receives no traffic at all while the new one serves 100% of it.
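To make the splitting concrete, here is a minimal sketch of what a weighted routing decision could look like in code. It is only an illustration under assumptions: the class and method names are ours, and real load balancers implement weighted routing internally rather than with a random draw per request.

import java.util.concurrent.ThreadLocalRandom;

class WeightedRouter {
    // Hypothetical helper: send a request to the green (new) host with
    // probability greenShare, e.g. 0.10 for a 90/10 split.
    static boolean routeToGreen(double greenShare) {
        return ThreadLocalRandom.current().nextDouble() < greenShare;
    }
}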

What are the positive things about the Canary strategy?

  1. No downtime. As with the Blue-Green strategy, Canary has no cutover during deployment. The old host keeps working until we route 100% of traffic to the new host.
  2. Client impact is the smallest possible. We can set the initial traffic share for the new host as small as we want, so the risk of impact is bounded by this number.
  3. Rollback is easy. It works the same way as in a Blue-Green deployment, with the only difference that you route 100% of traffic back to the old host.

As for disadvantages, I would say that they are the same as in the Blue-Green strategy.

However, one small question remains: how exactly do we balance the traffic?

The Granular Approach — What and Why?

There are many ways to balance traffic between hosts in a set proportion. I won't cover all of them; let's highlight a couple.

Canary deployment via adding an extra host to the fleet. Let's say we have a fleet of 3 hosts and we are doing a release. We do a canary deployment with 1 additional host running the new version. In total, we have 4 hosts running and connected to the load balancer. Without thinking too long, we can say this is a canary with a 75/25 proportion, where 25% of traffic goes to the new version. Very good, but not enough. What if we want a 90/10 traffic proportion? In that case, we need 9 hosts with the old version and only 1 with the new version, which can be very expensive. The arithmetic is spelled out in the sketch below.
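Here is a tiny sketch of that arithmetic (the helper name is ours, and it assumes the load balancer spreads requests evenly across all attached hosts):

class CanaryMath {
    // Share of traffic (in %) hitting the canary hosts under even distribution.
    static double canaryShare(int oldHosts, int canaryHosts) {
        return 100.0 * canaryHosts / (oldHosts + canaryHosts);
    }
    // canaryShare(3, 1) -> 25.0: a 75/25 split with 4 hosts
    // canaryShare(9, 1) -> 10.0: a 90/10 split already needs 10 hosts
}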

There is a way to make it as granular as you want: balance traffic at the code level rather than at the host level. To do that, we should start by understanding what our traffic is, that is, what unites all requests to our service. For instance, it could be a user id.

Canary deployment with a parameter. As soon as our traffic has something in common (e.g. a user id), we can build a mechanism that lets us specify a precise share of traffic that should go through the new functionality a release brings.

As said above, we need something common to every request that comes from users. Assuming there is some id, usually a UUID, we can partition incoming requests by comparing the id, or only a part of it, with some parameter. We can construct a rule that says: if the parameter is greater than the id, this request goes to the new functionality; otherwise it goes the old way.

Real-World Scenario

Canary deployment with a parameter seems like a robust basis for a safe, granular deployment strategy. Let's look at it more closely with a meaningful example.

We have a system for processing orders in a chain of restaurants. The main business entity is an order. An order has a unique id represented as a UUID. One of the services in our system stores orders in a database. We don't need to know any other details.

At some point in this service's life, we decided to add new functionality: we want to validate some fields in orders before storing them. We implemented it and now want to test it on a very limited number of orders. The idea is simple: only ~10% of orders should go through the new logic. Finally, let's write some code. It is Java, but it should look similar in other languages:

if (isAllowedForNewFunctionality(order.getId())) {
    // new code path: validate the order before storing it
} else {
    // old code path: store the order without validation
}

Yes, that’s simple! But what does isAllowedForNewFunctionality do? It should return true if that particular order is in that 10% of orders that should go through validation.

boolean isAllowedForNewFunctionality(UUID orderId) {
    // Take the second group of the UUID: four hex digits, 0000..FFFF.
    final var fourDigitHex = orderId.toString().split("-")[1];
    // LIMIT_HEX comes from configuration (explained below).
    final var limit = Integer.parseInt(LIMIT_HEX, 16);
    return limit >= Integer.parseInt(fourDigitHex, 16);
}

In isAllowedForNewFunctionality we take a four-digit group from our id. The magic is simple: the structure of a UUID is xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, a sequence of hexadecimal digits divided by dashes. We take the second group, the one right after the first dash, parse it as an integer, and compare it with the limit. LIMIT_HEX is the limit we set in our configuration; it expresses the percentage of orders allowed through the new functionality as a four-digit hex number (i.e. 0000 is 0% while FFFF is 100%). One question remains: how do we convert a percentage into that representation for the configuration?

int percentage = 10; // Example percentage

// Decimal value for FFFF
int maxValue = 65535;

// Calculate the decimal value corresponding to the percentage
int decimalValue = (int) ((percentage / 100.0) * maxValue);

// Convert the decimal value to a hexadecimal string
String hexValue = String.format("%04X", decimalValue);

System.out.println(hexValue); // => 1999

We take the maximum decimal value, 65535, and solve a simple proportion to find the value corresponding to the percentage we are looking for. Then we convert it back to a hexadecimal value. At this point, we can set LIMIT_HEX to 1999 in the configuration, limiting the number of orders that go through the new functionality to roughly 10%.

With further iterations, we can increase the percentage of traffic that goes to the new functionality until it reaches 100%.
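Putting the pieces together, here is a minimal self-contained sketch of the whole mechanism. Treat it as an illustration under assumptions: the class name and the demo UUID are ours, and in a real service LIMIT_HEX would be read from external configuration so the percentage can be changed without redeploying.

import java.util.UUID;

public class GranularRollout {
    // Normally read from configuration; 0x1999 = 6553 of 65535, i.e. ~10%.
    static final String LIMIT_HEX = "1999";

    static boolean isAllowedForNewFunctionality(UUID orderId) {
        // The second group of the UUID: four hex digits, 0000..FFFF.
        final var fourDigitHex = orderId.toString().split("-")[1];
        final var limit = Integer.parseInt(LIMIT_HEX, 16);
        return limit >= Integer.parseInt(fourDigitHex, 16);
    }

    // Convert a human-friendly percentage into the LIMIT_HEX representation.
    static String percentageToHex(int percentage) {
        int decimalValue = (int) ((percentage / 100.0) * 65535);
        return String.format("%04X", decimalValue);
    }

    public static void main(String[] args) {
        System.out.println(percentageToHex(10)); // => 1999

        // A hypothetical order id used only for this demo.
        UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        // Its second group "e89b" is 59547 > 6553, so this order takes the old path.
        System.out.println(isAllowedForNewFunctionality(id)); // => false
    }
}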

Implementation Insights

As you can see, we solved the main problem: we can control the share of orders (i.e. traffic) with impressive accuracy. When rolling new functionality out, you can start with very small numbers like 1% (or even less) to make sure you are not affecting users, or at least to keep the blast radius of an issue small. However, there are a few things I should add:

  • Keep in mind that the distribution of UUIDs affects the share of traffic reaching the new functionality. With a small number of orders, the actual share may deviate noticeably from 10%; the more orders you have, the closer the traffic gets to the configured limit.
  • In our example, we used a hexadecimal parameter because our unique identifier is a UUID. You are free to use a different base for the limit parameter if your traffic has a different common attribute: decimal if your ids are decimal, or even ASCII codes if needed.
  • Whichever base you use for the limit parameter, you will need to solve the same problem of converting it to and from a percentage; see the decimal sketch after this list.
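For example, a decimal variant might look like the sketch below. It is only an illustration under assumptions: the names are ours, and it presumes your ids are numeric and reasonably uniformly distributed in their last digits.

class DecimalRollout {
    // Number of buckets (out of 10000) that take the new path:
    // 1000 corresponds to 10%, 100 to 1%, 1 to 0.01%.
    static final int LIMIT = 1000;

    static boolean isAllowedForNewFunctionality(long userId) {
        // The last four decimal digits play the role of the UUID hex group.
        return (userId % 10_000) < LIMIT;
    }
}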

Comparing With Other Methods

There are other ways to do a granular rollout of new functionality, but they don't give the same level of control over traffic. It is possible to build a solution by configuring a load balancer, though that may not be a trivial task. With that approach you cannot separate traffic by an attribute; you can only say that this share of traffic goes to hosts with the new functionality and that share goes to the old hosts. Configuring the load balancer also requires infrastructural changes, which may increase maintenance costs.

A granular deployment strategy is particularly beneficial when you can define an attribute on which to separate traffic. On the other hand, this method does not suit cases where traffic is very heterogeneous.

Conclusion

A granular deployment strategy with a parameter:

  • requires only minor code changes and no new infrastructure;
  • brings very precise control over traffic;
  • can be easily modified based on business requirements;
  • significantly reduces risks.

Next time you have to decide how to deploy a new feature, give the granular deployment strategy with a parameter a chance; it will make your developer life easier and less stressful.
