With CommandFest just over two weeks away, we wanted to dive into some of the engineering behind how SpellTable works and how we're scaling. As some of you may know, our video layer is peer-to-peer: your video feed goes straight to your opponents and no one else. The main reason we do it this way is cost. We simply couldn't afford to run SpellTable with hosted video yet, although we do plan to offer it in the future. The peer-to-peer approach has some drawbacks: more CPU is needed to encode your stream, and more upload bandwidth is required to send it to each opponent. It also means we can't support large player counts or a spectator mode.
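To see why upload bandwidth becomes the limit, here's a back-of-the-envelope sketch. The per-stream bitrate is an assumption for illustration, not SpellTable's actual encoding settings:

```python
# Upload cost of a peer-to-peer video mesh: every player sends a full
# copy of their stream to each opponent.
STREAM_KBPS = 600  # assumed bitrate of one outgoing video stream (illustrative)

def upload_kbps(num_opponents: int, stream_kbps: int = STREAM_KBPS) -> int:
    """In a full mesh, you upload one copy of your stream per opponent."""
    return num_opponents * stream_kbps

# A four-player game means uploading to three opponents:
print(upload_kbps(3))  # 1800 kbps of sustained upload
```

The encoding CPU and upload both grow with the number of opponents, which is why large games and spectators don't fit this model.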
So if video is peer-to-peer, what is there to scale? There are two pieces of infrastructure behind the video streams that we must scale. If you've ever had someone unable to see your video (or you couldn't see theirs), or had your video flicker in and out, it's probably because of one of these two things.
The Relay Server
Most users (about 80%) will never need a relay server. Unfortunately, some users are stuck behind a firewall or a tricky network configuration where a direct connection simply isn't possible. In that case, we route their video through a server in the cloud that both players can reach. This server does not have access to your video feed; it simply takes it in and sends it right back out. Because we never need to decode the video, this server is really cheap to run: all we pay for is bandwidth and a small instance cost. Bandwidth costs $0.01 per GB, and we've pushed over 4 terabytes of relay traffic this month through just one relay server. That server is overloaded and we need to scale it. If you've noticed delay in your feed, this is the likely culprit.
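The "takes it in and sends it right back out" behavior can be sketched in a few lines. Real WebRTC relays speak the TURN protocol; this toy version just forwards UDP packets between two known peer addresses, and never looks inside them:

```python
import socket

# Toy illustration of what a relay does: bytes from one peer go straight
# back out to the other peer. No decoding happens, which is why the only
# real cost is bandwidth.

def relay_once(sock: socket.socket, peer_a, peer_b) -> None:
    """Forward a single packet to whichever peer did not send it."""
    data, sender = sock.recvfrom(65536)
    destination = peer_b if sender == peer_a else peer_a
    sock.sendto(data, destination)  # opaque bytes in, same opaque bytes out
```

Because the relay never decrypts or decodes the stream, it can't see your video; it only moves encrypted packets between the two endpoints.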
Fortunately, scaling relay servers is relatively easy. They are stateless, so we can simply launch a bunch of them and use some DNS tricks to round-robin between them. We can even use GeoDNS to make sure you connect to a server near you. This project is almost done, and we plan to launch additional servers in the USA, Germany, Australia, and Asia before CommandFest.
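The two DNS tricks compose like this. The region names and hostnames below are made up for illustration, not our real relay fleet:

```python
import itertools

# GeoDNS narrows the candidate set to relays near the client;
# round-robin then spreads load across the relays in that region.
RELAYS_BY_REGION = {
    "us": ["relay-us-1.example.com", "relay-us-2.example.com"],
    "eu": ["relay-de-1.example.com"],
    "apac": ["relay-au-1.example.com", "relay-sg-1.example.com"],
}

# One round-robin cycle per region, so each region's relays share the load.
_cycles = {region: itertools.cycle(hosts) for region, hosts in RELAYS_BY_REGION.items()}

def pick_relay(client_region: str) -> str:
    """Return the next nearby relay, falling back to the US pool."""
    return next(_cycles.get(client_region, _cycles["us"]))

print(pick_relay("us"))  # relay-us-1.example.com
print(pick_relay("us"))  # relay-us-2.example.com
```

In production both steps happen inside DNS itself rather than in application code, which is what makes stateless relays so easy to add and remove.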
We could simply pay for a service that does all of this for us. In fact, early on we used just such a service, Twilio NAT. It worked great, but it cost us $0.40 per gigabyte. Recall that our own solution costs $0.01, so Twilio is 40 times more expensive, and that's just for North America; other parts of the world cost $0.50 or even $0.60 per GB. If we were still on Twilio's service, the month of May would end up costing somewhere around $2,000.
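The arithmetic behind that comparison, using the post's figures and assuming (conservatively) that all of this month's ~4 TB of traffic is billed at the North American rate:

```python
# Self-hosted vs. managed relay bandwidth on ~4 TB of monthly traffic.
# Regional Twilio rates of $0.50-$0.60/GB would push the total higher,
# toward the ~$2,000 figure for a month with more than 4 TB of traffic.
TRAFFIC_GB = 4 * 1024        # 4 TB expressed in GB
SELF_HOSTED_RATE = 0.01      # our cloud provider's bandwidth price per GB
TWILIO_NA_RATE = 0.40        # Twilio's North American price per GB

print(f"self-hosted: ${TRAFFIC_GB * SELF_HOSTED_RATE:,.0f}")  # self-hosted: $41
print(f"Twilio:      ${TRAFFIC_GB * TWILIO_NA_RATE:,.0f}")    # Twilio:      $1,638
```

At this traffic volume the bandwidth bill is the whole story, which is why running our own relays is worth the operational effort.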
The Peering Server
While video is peer-to-peer, the data required to connect those video streams is not. This data consists of session information describing how to connect to another player, and it's where a peering server comes in: it takes the information from all parties and exchanges it between them. Our initial solution was a forked open-source peering server, which worked well but has some downsides. First, it keeps all connections in memory, so if the server ever has a hiccup or we need to deploy, all connections are lost and every client must reconnect with fresh IDs. To compound this, our peering server is hosted on a service called Heroku, which automatically restarts it at random, unpredictable times. Needless to say, this is not ideal.
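Here's a sketch of that exchange step and why in-memory state is fragile. The names and data shapes are made up for illustration; a real WebRTC peering (signaling) server exchanges SDP offers/answers and ICE candidates:

```python
# A minimal in-memory peering server: hold each player's session info
# until the other players ask for it.
sessions: dict[str, dict[str, str]] = {}  # game_id -> {player_id: session_info}

def publish(game_id: str, player_id: str, session_info: str) -> None:
    """Record how to reach this player."""
    sessions.setdefault(game_id, {})[player_id] = session_info

def exchange(game_id: str, player_id: str) -> dict[str, str]:
    """Return every other player's session info so direct connections can start."""
    game = sessions.get(game_id, {})
    return {p: info for p, info in game.items() if p != player_id}

publish("game42", "alice", "offer-from-alice")
publish("game42", "bob", "answer-from-bob")
print(exchange("game42", "alice"))  # {'bob': 'answer-from-bob'}

# The catch: `sessions` lives in process memory. A deploy or a Heroku
# dyno restart wipes it, and every player must reconnect with fresh IDs.
```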
Scaling this solution ourselves would require a persisted database and sharding. Sharding is just a fancy way of saying we break our databases and connections down into smaller parts. In this case, however, we've decided to buy rather than build, and we'll soon be moving our peering solution to Firebase. This will let us scale peering to 200,000 simultaneous connections before we need to worry about sharding, and while we're confident CommandFest will be enjoyed by many, we do not anticipate 200,000 participants. Firebase is not the most affordable option, but we're super grateful to our patrons and supporters, who have made it possible not to worry about this particular cost.
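For the curious, this is roughly what the build-it-ourselves sharding path would look like. The shard count and connection IDs are illustrative, not a design we've committed to:

```python
import zlib

# Split connection records across several smaller stores, chosen by a
# stable hash of the connection ID so the same ID always lands on the
# same shard.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate databases

def shard_for(connection_id: str) -> int:
    """crc32 is stable across runs, unlike Python's builtin hash()."""
    return zlib.crc32(connection_id.encode()) % NUM_SHARDS

def store(connection_id: str, session_info: str) -> None:
    shards[shard_for(connection_id)][connection_id] = session_info

def load(connection_id: str) -> str:
    return shards[shard_for(connection_id)][connection_id]

store("conn-abc", "session-data")
```

Moving to Firebase defers all of this until we'd outgrow roughly 200,000 simultaneous connections, which is a comfortable margin for now.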
Thanks for reading and joining us on this journey!