Viximo thrives on reliable services, both internal and external. Downtime, even when caused by a 3rd party, means fewer customers and real costs, not just for us but for all of our partners. That is why we've invested a little blood, sweat, and tears into automated monitoring for the partner systems that we rely on.
Back in the old days...
It's hard to believe it was only a little over a year ago that we first automated monitoring of the 3rd-party games that run on our platform. At the time, we were still fairly inefficient at detecting outages. If a partner happened to detect an outage, it was typically long after it had started, leaving many users with a cryptic error page. If users created support tickets about the outage, the lack of a dedicated customer support engineer meant we were often backlogged by several days. If no employees at Viximo or its partners played the game on a given day, it was possible no one would realize it was down.
The nature of our 3rd-party game integrations makes detecting outages a bit more difficult. Like many apps on Facebook, games are integrated via an iframe on the page. Simple failures such as non-200 status codes are easy to detect, but games can fail in much more subtle ways:
- Failures isolated to certain geographic zones (US, Germany, Spain, etc.)
- Issues encountered only within Flash
- JavaScript errors or failure to load certain files on the page
- Intermittent failures due to capacity issues
Without direct insight into the actual application and the servers it's running on, these problems are only magnified.
Getting the job done right...
The tools we now use are insufficient by themselves, but together they can detect the majority of the outages that our partners experience by tackling the problem from various angles.
Monitis is a hosted monitoring solution that we use primarily for its uptime service. It offers geographic locations that map well to Viximo's social network integrations and provides a suite of advanced notifications / callback hooks.
Airbrake collects errors generated within our application. It also allows us to track JavaScript errors generated in the browser, particularly within partner pages.
Zendesk is a customer support management application we use to track data about support issues. Zendesk has a great API and management interface that makes it easy to analyze and categorize the various issues per game.
Nagios is a system monitoring application that is used, in this case, to monitor application and user behavior through Updawg. Viximo also uses Nagios for other internal systems monitoring.
Using the above tools, we've defined a series of triggers and notification channels that can quickly alert us to issues so that they can be resolved promptly. The first set of triggers below will automatically mark a game as down and place it in maintenance mode. To prevent blips and false alarms from taking apps down too often, these triggers must fail a certain number of times consecutively from a subset of our 5 geographic locations around the world.
- HTTP status code - Any 2xx status code is considered success; all other codes are considered a failure. This trigger catches most outages, since the web server will typically return a 500 status code when down.
- Content matching - Any response that does not include a particular set of content specific to that game is considered a failure. This catches instances where the game is returning a 2xx HTTP status code even though it failed to process.
- Timeout - If the URL takes more than 10 seconds to process, the game is marked as down. Disabling apps as a result of a timeout helps ensure that the game-playing experience stays tolerable for active players. Other players get funneled to other games while the performance problem is investigated.
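To make the three automatic triggers and the consecutive-failure rule concrete, here is a minimal Python sketch. It is an illustration only, not the code we run: the names `ProbeResult`, `probe_failed`, and `DownDetector` are hypothetical, and the default thresholds (3 consecutive failures from at least 2 locations) are assumptions chosen for the example. Only the 2xx rule, the content match, and the 10-second timeout come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TIMEOUT_SECONDS = 10.0  # the 10-second limit described above


@dataclass
class ProbeResult:
    """One uptime check from one geographic location (hypothetical shape)."""
    location: str
    status_code: Optional[int]  # None when no response was received
    body: str
    elapsed_seconds: float


def probe_failed(result: ProbeResult, required_content: List[str]) -> bool:
    """Apply the three automatic triggers to a single check."""
    # Timeout trigger: no response, or a response slower than the limit.
    if result.status_code is None or result.elapsed_seconds > TIMEOUT_SECONDS:
        return True
    # HTTP status trigger: anything outside 2xx is a failure.
    if not 200 <= result.status_code < 300:
        return True
    # Content trigger: every game-specific snippet must be present.
    return any(snippet not in result.body for snippet in required_content)


class DownDetector:
    """Tracks failure streaks per location to filter out blips."""

    def __init__(self, consecutive_failures: int = 3, min_locations: int = 2):
        # Both thresholds are assumed values for illustration.
        self.consecutive_failures = consecutive_failures
        self.min_locations = min_locations
        self.streaks: Dict[str, int] = {}

    def record(self, result: ProbeResult, required_content: List[str]) -> bool:
        """Record a check; return True if the game should be marked down."""
        if probe_failed(result, required_content):
            self.streaks[result.location] = self.streaks.get(result.location, 0) + 1
        else:
            self.streaks[result.location] = 0  # any success resets the streak
        failing = sum(
            1 for streak in self.streaks.values()
            if streak >= self.consecutive_failures
        )
        return failing >= self.min_locations
```

Because a single success resets a location's streak, an intermittent blip in one region never takes a game down on its own; only sustained failures seen from several locations do.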
The remaining triggers use the available notification channels to alert folks at Viximo when they're failing. This gives us the ability to manually test the game and validate that it is in fact down prior to actually marking it as such in our system.
- User activity - Visits, transactions, and revenue are compared to historical averages. If any value deviates too far from the average, an alert is generated.
- Customer support rate - If the rate of customer support issues for a game deviates too far from the historical average, an alert is generated. While this data is available for automated monitoring, this is still a manual process that our customer support engineer performs.
- Error rate - If the total count for a particular type of error exceeds a predetermined threshold, an alert is generated. Again, while this data is available for automated monitoring, this is still a manual process that our engineers must perform.
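The "deviates too far from the average" checks above could be expressed in many ways. The Python sketch below shows one common option, measuring the distance from the historical mean in standard deviations. This is an assumption for illustration: the function name `deviates_from_history` and the default limit of 3 standard deviations are hypothetical, and the rule we actually apply is not specified here.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_history(
    history: Sequence[float], current: float, max_sigmas: float = 3.0
) -> bool:
    """Return True if `current` is too far from the historical average.

    `max_sigmas` is an assumed tolerance, expressed in standard deviations.
    """
    if len(history) < 2:
        return False  # not enough data to judge a deviation
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != average
    return abs(current - average) / spread > max_sigmas


# Example with made-up daily visit counts for one game.
daily_visits = [100, 110, 90, 105, 95]
print(deviates_from_history(daily_visits, 98))  # within the normal range
print(deviates_from_history(daily_visits, 20))  # a sharp drop
```

The same function works for visits, transactions, revenue, or support ticket counts, since each is just a series of historical values compared against today's number.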
Communication and recovery...
Once one of the above triggers is activated, a variety of communication channels are available to alert a targeted set of people at Viximo. The type of channel used depends on the severity of the trigger. They include:
- SMS - This is used for the most reliable triggers, such as those that automatically mark a game as down. Typically the integration manager is notified of these events.
- E-mail - All triggers that are activated will send an e-mail to a mailing list at Viximo that includes folks from the integration, engineering, and product teams. This ensures that those with knowledge of the game can address any issues should they come up.
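The routing described above is simple enough to sketch in a few lines of Python. The trigger names and the `channels_for` function are hypothetical, and treating only the three automatic triggers as SMS-worthy is an assumption made for the example.

```python
from typing import List

# Assumed names for the triggers that automatically mark a game as down.
AUTO_DOWN_TRIGGERS = {"http_status", "content_match", "timeout"}


def channels_for(trigger: str) -> List[str]:
    """Pick the notification channels for an activated trigger."""
    channels = ["email"]  # every activated trigger e-mails the mailing list
    if trigger in AUTO_DOWN_TRIGGERS:
        channels.append("sms")  # only the most reliable triggers send SMS
    return channels
```

Keeping SMS for the high-confidence triggers means a text message can be treated as a real outage, while noisier signals stay in e-mail.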
Beyond these two channels, there are also communication channels for users on the Viximo network:
- Maintenance page - When a game is marked as down within the Viximo system, a maintenance page is automatically displayed in place of the game. This includes a friendly error message and points users to other games that can be played.
- Announcement - Administrators have the ability to create announcement popups that will be displayed to any user attempting to play a Viximo game. This also allows us to quickly alert users when there are problems in a game that we're investigating.
Once a game outage has been validated, the partner is typically contacted via e-mail and phone so that they are aware of the issue. An integration engineer is made available in cases where logs and error data are needed to help explain what's occurring.
Moving forward...
Like most software development, monitoring is an ongoing process that can be constantly tweaked and improved to detect and resolve outages more accurately and quickly. There are still some triggers that we could automate, such as detecting increases in customer support requests and application errors. We could also begin to take advantage of Monitis's full-page website monitoring tools, which allow every link on the game's page to be validated instead of just the main page itself.
While we strive to make the content available to our users as reliable as possible, there will always be unexpected outages. The best thing we can do is make that experience as painless as possible for both our users and our partners. Having automated monitoring in place for our partners has put us on the right path towards that goal. ...