Hybrid cloud scaling: coping with the unpredictable hordes during launch

We were recently involved in adding Tripwire Interactive’s FPS Killing Floor 2 to PS Plus, a subscription service offering free games to more than 26 million subscribed gamers.

Prior to this, Killing Floor 2 had a core player base on PC and PS4 in the thousands; it was very exciting to help accommodate the huge new audience for this great game.

THE CHALLENGE

Predicting the impact of granting access to millions of PS4 owners is as challenging as it is terrifying! It’s fair to expect that a game of this quality will see a huge uptake; the hugely successful Rocket League is a great example of how much this kind of promotion can boost a game’s popularity.

The traditional approach to handling a sudden increase in demand has been to over-prepare and over-provision. It’s hard to think of many things more damaging than gamers simply not being able to play when they want to. This is hugely amplified during a launch but can be disastrous in normal operation as well.

So how many players should you plan for? Whether you go for 10,000, 80,000 or even 100,000, it’s very rarely correct.

Evidently, the result is going to be one of three things:

  • You overestimate. Expecting more players than actually arrive means spending too much on capacity.
  • You underestimate. Expecting fewer players than actually arrive means not having enough capacity to serve them.
  • You get your estimates right. You should consider a career as a sorcerer (please send us your CV).

THE SOLUTION

Multiplay’s Hybrid Scaling technology uses both bare metal and multiple cloud providers to remove this guesswork. Typically, we achieve this by doing the following:

  1. Use our best estimates and modelling to provision a level of bare metal hardware. In normal, day-to-day operation, this should handle the majority of your gameplay sessions.
  2. Configure multiple cloud providers (instance types, availability zones, etc.) to be able to start up and provision capacity on demand.
  3. Set thresholds at which we trigger this. Essentially, we always try to maintain a pool of unused capacity; if the pool drops below this threshold, VMs are requested without any human interaction.
  4. Monitor the initial period of the launch, around 2–3 days, and add more bare metal where needed. This hits the sweet spot on cost, ensuring we only use as much cloud as we need for peaks and unexpected trends.
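The threshold check in step 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the platform's actual code: the names `RegionCapacity`, `vms_to_request`, `min_free` and `instances_per_vm`, and the numbers used, are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RegionCapacity:
    total_instances: int      # server instances created (bare metal + cloud)
    allocated_instances: int  # server instances currently in use by players


def vms_to_request(capacity: RegionCapacity, min_free: int, instances_per_vm: int) -> int:
    """Return how many cloud VMs to request to restore the unused pool.

    min_free (target size of the unused pool) and instances_per_vm (how many
    server instances one VM hosts) are hypothetical tuning values.
    """
    free = capacity.total_instances - capacity.allocated_instances
    shortfall = min_free - free
    if shortfall <= 0:
        return 0  # the unused pool is at or above the threshold: do nothing
    # Round up so the pool is fully restored, not just partially.
    return math.ceil(shortfall / instances_per_vm)


# Example: 1,000 instances created, 950 in use, and we want 100 spare.
busy = RegionCapacity(total_instances=1000, allocated_instances=950)
print(vms_to_request(busy, min_free=100, instances_per_vm=8))  # prints 7
```

Run on a schedule per region, a check like this is what lets VMs be requested without any human interaction; the bare metal from step 1 simply counts towards `total_instances`.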

The intended result of this system is, to be frank, a boring launch regardless of the influx of players. We love boring launches. Boring launches for us mean that the players are happy.

In the moments when games are growing exponentially and the stream of players seems endless, it’s very easy to be blinded by the success. This could leave you with thousands of machines that you no longer need if the player base moves on to other things.

Multiplay has experienced this first hand in the form of some hugely popular games whose lifespans were far shorter than they deserved.

To tackle this, we wait for a set period of time for a VM to become empty and ensure that it’s shut down as quickly as is safe to do so. We then keep it around for another predetermined time for re-use, after which it goes away for good. This cycle of provisioning and deprovisioning is constant and highly effective, ensuring you only have capacity online for the time it’s needed.
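One way to picture this scale-down cycle is as a small state machine. The sketch below is illustrative only: the state names, the `next_state` function and the default timers (`empty_wait_s`, `retain_s`) are hypothetical, and it assumes the first timer starts once a VM has no sessions left on it.

```python
from enum import Enum


class VmState(Enum):
    IN_USE = "in_use"    # hosting at least one game session
    EMPTY = "empty"      # no sessions; waiting before shutdown
    STOPPED = "stopped"  # shut down but kept around for re-use
    DELETED = "deleted"  # deprovisioned for good


def next_state(state, occupied, elapsed_s, empty_wait_s=600, retain_s=3600):
    """Advance one VM through the scale-down cycle.

    occupied  -- True if the VM hosts sessions or has been reclaimed for re-use
    elapsed_s -- seconds the VM has spent in its current state
    The two timer defaults are hypothetical values, not real settings.
    """
    if state is VmState.DELETED:
        return VmState.DELETED  # a deleted VM never comes back
    if occupied:
        return VmState.IN_USE   # re-use cancels any pending shutdown
    if state is VmState.IN_USE:
        return VmState.EMPTY    # last session ended: start the first timer
    if state is VmState.EMPTY and elapsed_s >= empty_wait_s:
        return VmState.STOPPED  # safe to shut down
    if state is VmState.STOPPED and elapsed_s >= retain_s:
        return VmState.DELETED  # retention window expired
    return state                # still waiting on a timer
```

Keeping a stopped VM around for a while is a trade-off: it can be brought back quickly if demand returns, and it is deleted once the retention window passes.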

THE RESULT

In the case of the Killing Floor 2 launch, the first two days went extremely well: peak concurrent users reached around 17,000 on day one and 50,000 on day two.

It was really gratifying to see this uptake and it was particularly exciting to see our platform being used for another hugely popular game and working flawlessly.

The above graph shows our scaling for the first two days of this release. The graph shows the total number of server instances (copies of the executable) running for one of our busier regions. The light green line shows the total number of “Allocated” or in-use instances. The dark green (and mostly obscured) line shows the total number of started/running instances. The blue line is the total created capacity of server instances.

As you can see, as the day progresses we constantly create additional capacity (as shown by the yellow line at the bottom) and stay just ahead of the curve. Our blue “Total” line covers the line showing running servers for the majority of the time; this is because all created capacity is online and either in use or in a “hot standby” state. The lines separate as capacity is shut down during the quieter times of day.

Our cloud partners quickly provisioned new VMs when requested throughout the incredibly rapid initial growth, and there was no instance of our system failing to provide a server when needed.

Following this initial period, the trend from the first two days was entered into our modelling tool to give us a quantity of additional bare metal needed. Whilst our hybrid scaling platform takes care of the initial, unpredictable demand, we’re quick to act to ensure we keep costs down.

This graph shows our split between Bare Metal, Google Cloud Platform and Amazon EC2. The majority of our burst capacity was provisioned into GCP, with EC2 serving Oceania and South America.

As well as the capacity provided by these cloud providers, we also gain a huge amount of resilience from using both different companies and multiple availability zones within each region.

An example of how important this can be came when, yet again, we proved that cloud is not infinite. Whilst scaling very quickly on the second evening, we used up all available capacity within one availability zone (AZ). The benefit of having multiple AZs configured for the region is clear in this example: our system detected this, we disabled the AZ and scaled into an alternate one.
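The failover described above can be sketched as a simple loop over the zones configured for a region. This is an assumption-laden illustration, not our real provider integration: `ZoneExhausted`, `provision_with_failover`, `fake_request_vm` and the zone names are hypothetical, and the sketch disables an exhausted zone automatically, whereas in the incident above that step was taken by us.

```python
class ZoneExhausted(Exception):
    """Raised by the (hypothetical) provider call when a zone has no capacity."""


def provision_with_failover(zones, disabled, request_vm):
    """Request a VM from the first enabled zone that still has capacity.

    zones      -- ordered list of zone names configured for the region
    disabled   -- set of zone names to skip; exhausted zones are added to it
    request_vm -- callable taking a zone name; raises ZoneExhausted when full
    """
    for zone in zones:
        if zone in disabled:
            continue
        try:
            return zone, request_vm(zone)
        except ZoneExhausted:
            disabled.add(zone)  # stop sending requests to the exhausted zone
    raise ZoneExhausted("no enabled zone has capacity")


def fake_request_vm(zone):
    # Stand-in for a real cloud API call: "zone-a" is full, others succeed.
    if zone == "zone-a":
        raise ZoneExhausted(zone)
    return f"vm-in-{zone}"


disabled_zones = set()
result = provision_with_failover(["zone-a", "zone-b"], disabled_zones, fake_request_vm)
print(result, disabled_zones)  # ('zone-b', 'vm-in-zone-b') {'zone-a'}
```

A real system would also need to re-enable a zone once the provider has capacity again; that is left out here for brevity.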

In conclusion, the benefits that hybrid scaling offered in this case are clear. We were able to deliver game servers to hundreds of thousands of players without them noticing any impact from what was, at times, a 2500% increase in demand.

If you’d like to learn more about our platform or ask any questions, please feel free to get in touch!
