our love fest with
gaming, blogged

Blogs

Lessons Learned during an AWS Stack Upgrade


Thanks to a short lull in activity, we had the time to upgrade a busy Rails stack to be more ops friendly. While the upgrade was a success in itself, we probably got more value from what we learned during the transition.

  • The faster we made our stack, the less it mattered.
  • Client latency is real and must be accounted for.
  • Nginx mitigates starvation by eating client latency.
  • AWS ELB appears to have an invisible floor on average request time under high load.
    • Varies by time of day. For us AWS East is around 100ms from 9-5 EST.

Stack Upgrade

The stack had a reasonable amount of legacy-ness to purge, and we were looking forward to a simpler, cleaner environment with maybe a modest performance gain from the removal of a software load balancing tier. Without going into too many unnecessary details, here are the software components of the old and new stacks, with traffic flowing from left to right.

Old Stack
Users → ELB → nginx/haproxy → Apache/Passenger/Rails
New Stack
Users → ELB → Apache/Haproxy/Thin/Rails

Expectations

  • We expected overall request time reported by CloudWatch to drop 5-20ms.
  • We expected Request Queue time as reported by NewRelic to be slightly more variable and increase slightly.
  • Deploy time should drop from 30m to 5m.
NewRelic, pre-upgrade

We’re pretty big fans of NewRelic: a moderate amount of money plus minimal developer time will get you some amazing application visibility. Here’s a pre-upgrade overview of our application according to NewRelic.



The multi-color graph on the left represents the average time spent (ms) in various tiers for all monitored requests over the specified period. All the tiers except for Request Queuing are accurate and work out of the box. Request Queuing is the time delta between a timestamp HTTP header inserted by the front end and the start of the Rails controller. For us, the header is introduced by Apache and generally shows the time spent in the Passenger Global Queue. Also shown are the throughput (requests per minute) and the Apdex, which for us is based on the percentage of requests served in under 250ms.
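
To make the Request Queuing measurement concrete, here is a minimal sketch of the time-delta calculation. The header name `X-Request-Start` and the `t=<microseconds since epoch>` value format are assumptions for illustration, and `queue_time_ms` is a hypothetical helper, not NewRelic's agent code.

```python
import time


def queue_time_ms(headers, controller_start=None, header_name="X-Request-Start"):
    """Milliseconds between the front-end timestamp header and controller start.

    Assumes the header value looks like "t=<microseconds since epoch>".
    Returns None when the header is missing or cannot be parsed.
    """
    raw = headers.get(header_name)
    if raw is None:
        return None
    if raw.startswith("t="):
        raw = raw[2:]
    try:
        stamped_us = int(raw)
    except ValueError:
        return None
    if controller_start is None:
        controller_start = time.time()  # seconds since epoch
    delta_ms = controller_start * 1000.0 - stamped_us / 1000.0
    # Clock skew between tiers can produce a negative delta; clamp it to zero.
    return max(delta_ms, 0.0)
```

Because the number is just a difference between two clocks, anything that holds a request between the header stamp and the controller shows up as "queue" time, whether or not a real queue is involved.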

NewRelic, post-upgrade

Immediately after the upgrade, the Request Queue time shot up from 5ms to 20ms or so while overall request time according to CloudWatch remained the same. We ran through our tests and checks and everything passed, so we figured our expectations were mostly met. Instead of getting a slight speed boost, though, the gain from the dropped tier was exactly offset by the extra queuing time.

Checking back later that night (EST) was the closest I’ve been to panic in a long time.



Request Queue time was steadily climbing even while overall throughput was dropping. A quick check of the app servers verified they were functioning normally: CPU and RAM were available and there was no swapping activity. And once again, CloudWatch indicated current request times were identical to those the day before, and the day before that.



CloudWatch times are in GMT as opposed to EST for NewRelic. Latency stands for the full request time, not a single packet round trip time. The 5/3 17:00 spike was post-deployment testing.

So Many Questions

After determining the app servers were not cratering and that users were unaffected, we had time to sit back and try to figure out what was going on.

  1. Why was the Request Queue time so high?
  2. How did sitting in a queue for 50-100ms in the new architecture not result in 50-100ms increase in overall request time?
  3. Or another way to phrase #2, what is the shared limiting factor keeping the overall request times identical between the different stack variations?
  4. What is the request time hump at 12am GMT?

Remote Traffic and Latency

The easiest to answer turned out to be #4, the jump in request times around 12am GMT. A quick rollup of a few Apache access logs revealed that 37% of the traffic in that period originated from Brazil, as opposed to the normal 1-5%. As the east coast users drop off around 5pm, the Brazilian users jump on.
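
A rollup like this can be sketched in a few lines. This is an illustration rather than the script we ran: `country_share` and the `country_of` lookup are hypothetical names, a real lookup would sit on top of a GeoIP database, and the sketch assumes the first field of each log line holds the client address (behind an ELB, that means logging the X-Forwarded-For value rather than the load balancer's own address).

```python
from collections import Counter


def country_share(log_lines, country_of):
    """Percentage of requests per country for an iterable of access log lines.

    country_of is a caller-supplied function mapping an IP string to a
    country code; it stands in for a real GeoIP lookup.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[country_of(fields[0])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100.0 * n / total for country, n in counts.items()}
```

Running it once over a quiet hour and once over the hump makes a shift in traffic origin easy to see.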

That brings to the foreground a foundational aspect of web traffic: latency. Web latency here means single packet round trip time, and it differs significantly depending on distance. The CloudWatch request time pattern is based on client location. The post-upgrade NewRelic graph also shows the effects of latency, but the pre-upgrade NewRelic graphs mysteriously do not.

Nginx Buffering

A bit of research later, we found that by default nginx will act as a buffer between the client and the server, effectively shielding your app servers from client latency. By omitting nginx in our post-upgrade stack we lost this feature and reduced overall capacity. As expected, reintroducing nginx trimmed the Request Queue time back down.
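
A back-of-the-envelope model shows why buffering matters for capacity. It assumes each request ties up one app worker for the application time plus however long the worker spends talking to whatever sits directly in front of it. All numbers below are made up for illustration and `worker_capacity_rps` is a hypothetical helper, not a measurement of our stack.

```python
def worker_capacity_rps(workers, app_ms, client_io_ms):
    """Requests per second a worker pool can sustain.

    Each request occupies one worker for app_ms of application time plus
    client_io_ms spent reading the request and writing the response.
    """
    return workers * 1000.0 / (app_ms + client_io_ms)


# Without a buffering proxy, the worker waits on a slow, distant client.
unbuffered = worker_capacity_rps(workers=10, app_ms=50, client_io_ms=150)

# With a buffering proxy, the worker hands off over a fast local link.
buffered = worker_capacity_rps(workers=10, app_ms=50, client_io_ms=1)

print(f"unbuffered: {unbuffered:.0f} req/s, buffered: {buffered:.0f} req/s")
```

Under these assumed numbers the same ten workers handle several times more traffic once something else absorbs the client's latency, which matches the direction of what we saw in Request Queue time.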



The Glass Floor

With great fanfare we pulled up CloudWatch, only to be sorely disappointed. Again the CloudWatch Latency graph had not budged. Three different traffic patterns through various components, various problems and various fixes, yet no change in overall average user request time.


If one is to believe that the NewRelic Request Queue time of 50-100ms was real, then the only explanation is that queuing or delay is happening somewhere upstream, in the black box that is ELB. The best outside evidence we could find supporting this theory is a comment by Shlomo Swidler implying there is a floor under heavy load.

Http Keep-Alive

HTTP keep-alive was created to save round trips between a client and a server and is enabled by default in HTTP 1.1. Requests pipelined over a kept-alive connection are answered first in, first out. Consider the situation where 4 requests arrive within 5ms of each other on one connection, the first taking 100ms to complete in the backend app server and the remaining three completing in 30ms. Because responses must go back in order, none of the four is returned to the client in less than 100ms.
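
The first in, first out effect in that scenario can be sketched as a running maximum. This is a simplified model that treats all requests as arriving at time zero; `fifo_delivery_times` is a hypothetical helper name.

```python
from itertools import accumulate


def fifo_delivery_times(backend_done_ms):
    """Time at which each response can leave an in-order pipeline.

    A response cannot be sent before every earlier response on the same
    connection, so each delivery time is the running maximum of the
    backend completion times.
    """
    return list(accumulate(backend_done_ms, max))


print(fifo_delivery_times([100, 30, 30, 30]))  # [100, 100, 100, 100]
print(fifo_delivery_times([30, 100, 30, 30]))  # [30, 100, 100, 100]
```

One slow request at the head of the line sets the delivery time for everything queued behind it, which is the kind of behavior that could hide backend speedups from the client.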

Being HTTP 1.1 compliant, the ELB uses keep-alive. Disabling keep-alive on our side of the ELB did not affect the round trip time. A buried keep-alive pipeline inside the black box that is the ELB could explain the glass floor, but is only one of many possible answers.

Conclusion

Back to the original points.

  • The faster we made our stack, the less it mattered.
  • Client latency is real and must be accounted for.
  • Nginx mitigates starvation by eating client latency.
  • AWS ELB has an invisible floor on average request time.


Client latency can reduce capacity of your app tier but can be mitigated by nginx. Any changes in our stack to drop request time were offset by an unknown floor in the ELB.
...

Off Facebook is different

As it becomes more difficult and expensive to attract new users to games on Facebook, distributing social games on other social networks is becoming more attractive every day. Game producers can see much better margins, better pricing power and closer-knit communities on these smaller, more focused destinations. These destinations tend to concentrate on a particular community, whether focused on geography (StudiVZ, Tuenti, Nasza Klasa), ethnicity (QuePasa, Black Planet) or interests (Gaia, Smallworlds). The differences between these cozier specialty networks and Facebook do not stop at these areas of focus. Players have different expectations of service level and interaction with other players. The companies and technologies underlying these specialty communities also have their own assumptions and special needs. Knowing and leveraging the advantages of these sites can make your launch into this rich set of users pay back handsomely.

Specialty sites have tight communities of dedicated players. The gamers all know each other and have healthy (and sometimes unhealthy!) rivalries that span multiple games. Part of the reason for this is that there are tens of games on these sites, not the thousands of games that are on Facebook. Gamers talk on game-dedicated forums and across other forums on the site. Heavy users on these sites often know the sites’ customer service people by name, and vice-versa. They tend to get personal service from staff. This expectation is especially true for the “whales” - big spenders - on games on the site. This is a great advantage, since these big spending players tend to play several games, and represent a potential audience for any new games launched on these sites.

One upshot of this closeness of community is that the user bases of specialty sites insist on “fairness”. Cheaters are called out by name on support tickets and user forums. It is expected that the site or the game will punish or ban misbehaving users. Also, this close-knit community makes certain practices that treat different users in different ways, like A/B testing price or item award behaviors, more difficult. This is because users share this information freely within a very closed community.

Specialty sites can represent a great value in terms of price of user acquisition. Often the per-user acquisition cost is a fraction of the Facebook rate, if not in some cases free. Given the smaller total number of games, cross-promotion works much better, since the number of possible other games to play is smaller. The price for this dramatically different acquisition cost is significant coordination with the site itself.  These sites need high-touch support for promotions, game launch timing, featured game placement, etc. While there is the possibility of buying ads on these sites, timing and availability vary much more widely.

There are also significant differences in the underlying integration technology of these sites. Off-Facebook sites have very different integration APIs for the game canvas, viral channel distribution formats, timing and permissions on notifications. There is even a wide range of API formats and features within the “standardized” OpenSocial-based sites. A big advantage here is that many of these sites not only LOVE viral messages, but base in-site reward systems on notifications, leaderboard achievements, etc. So they need to work almost as well as your currency system. Specialty sites have very different security rules about friend graph access and other user information, particularly when EU privacy laws are in effect. There is also a dizzying array of payment options, pre-existing in-site currencies, based on user demographics, country, site policy and other factors. Users on many of the most lucrative culturally- or geographically-focused sites also have an expectation to play in their native language, which implies its own complex set of issues to manage for existing and updated content. Stay tuned for posts dedicated exclusively to game currency and localization coming soon!

There is a lot of opportunity to get high-margin, engaged users on specialty social networks. To take advantage of this very rich vein of opportunity requires adapting to the very specific needs of these focused networks. Viximo helps our game development partners address and exploit these differences by handling them ourselves, shielding the game team from the vast majority of the work involved. Viximo consolidates customer service requests within and across sites and passes just the root-cause issues to the game partner. We answer most game play questions, maintaining site-specific and game-specific FAQs and all payments-related requests. Viximo coordinates with each of the Publishers’ Support and Forum organizations to communicate with the user base in a way that is specific to that specific community. We assign a dedicated Product Manager to every publisher who focuses on all Viximo-distributed game support. These Product Managers also work directly with the business people at publisher sites to optimize launch dates, promotions and leverage cross-promotion across games on a site. We become a trusted partner on these sites, as a continuing source of high-quality games, and get preferential placement and pricing of promotions. ...

Supercharge your sales!

5 Lessons Learned

Let’s face it. We all love getting a deal. It’s a well-known fact that many retailers just mark up their prices so they can reduce them later on, but I’ll admit I get a small thrill out of seeing that “40% off!” sticker on a price tag. Even the message on my grocery receipt that tells me I saved 23% for being a “member” makes me feel like a smart, savvy shopper.

Well, social games are no different. Since only a small percentage of social gamers will ever spend money on a particular game, sales can be a great way to drive more paying users. Here at Viximo, we’ve developed a variety of tools to help us run sales and promotions in a fun and engaging way. This is a short list of what we’ve learned along the way.

1. Create a sense of urgency.  

We’ve experimented with various timeframes for sales, ranging from minutes to days. The result? We’ve shown we can create just as much - or more - lift during a 15-minute sale than a full day or even multi-day promotion. The reason? Social games are played in short increments (a typical session lasts just 2-5 minutes), so there is a relatively small window in which players make purchase decisions. A discount that’s only available for 15 minutes can therefore drive more of these impulse buys. If a player sees that something will be available all week, they’re much more likely to enter into the “Oh, I’ll do it later” mindset.

2. Mix it up.

While we try to run sales with some regularity, we’ve found that you can’t be too predictable. If you start doing special discounts every Friday starting at 5:00, chances are your players will catch on and wait until the end of the week to spend their money. With this in mind, we’ve developed a number of different ways to run sales, and try to make each one unique. Just a couple examples:

  • Collectible sales - These multi-day sales reward players for coming back to a game each day with a higher discount on their purchases. Each day they also receive a special prize that they “collect” in a banner above the game. 
  • Flash sales - These minutes-long sales can happen at any time throughout the day, with little advanced warning. 

3. Be relevant.

By far, we’ve seen the most success when we run sales tied to holidays or events. And because we treat every site in our network separately, we can make it relevant to the local culture. For instance, this spring we’ve done promotions for Carnival on Orkut (Brazil), St. Patrick’s Day on Yahoo (US), and Easter on VZNet (Europe).

4. Test your limits.

It’s funny how sometimes you can get more people to buy something when you have a low discount vs. a high one. Without getting too much into the psychology of purchase decisions, I’d say it’s all about the perceived value. Sometimes, “75% off” means “Please, take this crap off my hands”, while “40% off” means “Wow, that’s a bargain”. We’ve tirelessly tested different combinations of discount, messaging, and timing on each of our games and sites to maximize conversion rate and overall revenue.

5. Make it personal.

We’ve built a robust targeting engine to help us run specific promotions to a selected portion of our audience, based on behavioral data. Why? For starters, we can drive the behaviors we want without giving money away - after all, why offer a discount to someone who was probably going to buy anyway? We can also get specific with our messaging when we’re talking to a small group of similar users, driving a higher CTR (up to 60% higher!) on behaviorally-targeted promotions.

Sales are just another example of how we add value to our partners’ already kick-ass games. And through a process of development, marketing, and constant testing, we’re learning more every day. ...

The Challenge of OpenSocial

Let's Go Open

When you’re living in the world of Facebook, life is relatively uncomplicated. With one set of APIs to develop against, it’s not hard to understand why many fall into the trap of building their apps as though the only site they will ever have to integrate against is Facebook. But, what does one do to move beyond Facebook? Does the promise of OpenSocial, as a unifying off-Facebook API, truly deliver?

Developing a successful cross-network application comes down to a couple of core principles:

1) Can one successfully monetize against the participating demographics?
2) Can one successfully capture the attention of new and engaged users via viral channels?
3) Can users easily connect with their friends in the game?

Fundamentally, these three questions shouldn't be that hard to answer; however, it becomes increasingly difficult when the solution to each question varies widely from site to site. To solve these troublesome issues, OpenSocial was formed as a counter-balance to Facebook's dominance. Rather than requiring developers to build ten custom integrations for ten social networks, they could build one integration against OpenSocial and be able to launch themselves on any number of OpenSocial compliant networks.

As a developer who spends his days integrating off of Facebook, I would love to tell you that the promise was delivered. I would love to tell you that integrating on the top ten OpenSocial sites is more like integrating against a single common API. It's not.

It's About Money

A great game can monetize almost anywhere, that is something that we at Viximo have seen time and time again. The principles that drive an excellent game on Facebook can carry over onto any number of other networks worldwide. All that being said, much of the hard work comes long after the game development has finished and network integration has begun.

On a social network that does not have a site-wide economy (think Facebook Credits), the complexity can be immense. Finding the proper payment providers for a particular locale is not as simple as picking Visa, Mastercard, or American Express. In some countries, users prefer a more managed experience through a PayPal-esque provider, where they feel protected through a respected middleman. In other countries, a user may prefer to use their mobile or landline phones for payment. To top it all off, there may be scores of competing payment providers, each with differing levels of credibility that can directly affect your bottom line.

Even on a site that does have a site-wide economy, your troubles aren't over. In all of my integrations, I have yet to see a network that has chosen to mirror the Facebook APIs for managing their economies. It can often take weeks to iron out the complexities of how their economy works with their foreign currency, what fees, if any, need to be taken into account, and how their payment flow integrates both client and server side.

Some sites, such as Hi5, have attempted to standardize the model for payment processing on OpenSocial networks; however, most sites choose to allow game developers to manage their own economies. While this can have its advantages, it can also make it very difficult to get launched on these new networks where one lacks the expertise to know the proper payment options, price-points, and payment experience. This can put even the most successful viral launch in jeopardy, as a poor initial payment experience may permanently detract future purchasers.

Let's Go Viral

After going live on a new social network, the ability to acquire users cheaply is essential to maintaining strong margins. Like Facebook, most social networks provide some mechanisms for users to send gifts to their friends, share updates about their adventures in your game, and invite their friends to join in the fun. Determining how the social network intends for you to perform these actions though can be remarkably challenging.

Every site tends to have different limitations on throttling (the number of times, per day, a user can perform a certain action), different limitations on the length of messages, varying implementations for images and other "eye catching" assets, and different degrees of support for parameter passing. Each of these issues can be tedious and time consuming to assess, implement, and test. In addition, tracking the ever changing nature of the APIs can leave one's head spinning. Some social networks choose to keep up with the ever evolving OpenSocial spec, while others have chosen to do an initial implementation of the specification followed by expansion with custom APIs that are specific to their site.

As Facebook continues to evolve and improve their social-gaming features, so shall sites around the world as they continue to do their best to keep up with best-of-breed practices and features. Could OpenSocial be the answer to creating a common off-Facebook API that reduces stress and headaches for developers who want to launch their games around the world? Perhaps. Is OpenSocial the answer today, or anytime in the near future? Absolutely not.

So, What Am I to Do?

Because of all these issues, Viximo is a perfect solution for a developer looking to diversify their social network portfolio. Rather than writing ten custom integrations and adding the necessary staff to maintain this work going forward, Viximo takes care of this for you. From user acquisition and monetization, to integration optimization, Viximo abstracts away all the complexity of moving off of Facebook and lets you, the developer, focus on making an awesome game. That’s the magic of Viximo. ...

Time Better Spent

We had a chance to connect with a number of mobile app developers at the local Meetup hosted by RaizLabs last night at Microsoft's NERD center in Cambridge. There's clearly a buzz around mobile app dev happening in Boston and I really like this city because it's compact and the developers who come to learn about the tech have become a tight group. So tight, in fact, that I was hit up by several folks after we demo'd our sample app and they all asked the same question: "I liked the demo, but what do you guys do?"

I'll take the hit for that because I've been demoing a sample app that showcases the capabilities of our Social Zone SDK for Android and iOS, but eyes focus on the demo, not the technology behind it. The lesson here is to focus on what the devs care about most: time. I know that mobile app developers can connect their games and apps to preferred social networks like Facebook, Tuenti, or Google+ to expand their product reach and improve discovery with a single line of code from our SDK, but the message got away from me and I’m seeing it in other demos (and webinars) too.

That's "my bad" for falling into an age old demo trap and a reminder to all that it's too easy to do.  We built a slick app to showcase capabilities of our SDK, but what we offer is a white label solution so the slick demo app is really irrelevant to developers.  Now, in answer to the folks who stopped me after DrinkOnTap last night and any others who have stayed up late to add a social graph to their apps, I'll just say this: "We provide social hooks for mobile apps, and you can pop that in with a single line of code. Bang. Done."

Add another line of code for presence detection, another for optimized messaging, and another to provide in-app recommendations for other apps your company provides. Those are single lines of code that allow mobile app devs to spend their time building better games and apps, not banging their heads doing user authentication and messaging to Facebook and similar social network APIs. As I said in my last blog, "let us do that."
...

Here it is! Our Social Zone Platform

It's official! Our Social Zone platform, which helps make mobile apps more social, has publicly launched.

To put it simply, the Social Zone helps connect a user to their preferred social network in order to play mobile games with their friends. Mobile app makers have a hard time getting noticed on platforms with hundreds of thousands of rival apps. By making the mobile apps more social, we will help the apps spread more easily across the user's extended social networks.

Viximo has already made it easier to spread games through social networks on the web, and now with the mobile-focused Social Zone, we're enabling developers to add social hooks to their games played on Android and iOS (iPhone, iPad, iPod Touch) devices.

Our goal with the Social Zone is to amplify the virality of mobile apps being developed; enable players to find, collaborate and compete with friends in real-time; and discover relevant games based on what their friends are playing. With Social Zone, developers can accelerate user acquisition, increase engagement, and drive monetization.

We’re taking years of experience in social games and making it "drop dead simple" for mobile app developers to apply it to the mobile frontier. Learn more about Social Zone and our offerings, including our Social Supergraph, by visiting us at mobile.viximo.com.

No plans this weekend? Give our platform a spin at the 2012 AngelHack events in San Francisco and Boston.  Our goal as a sponsor is to support on-site developers to leverage our mobile SDK to add social hooks to their apps. We will be speaking at both events, 10:30am in San Francisco and 11:25am in Boston. Come by and hack with us, and maybe even win a pretty awesome prize. ...

Mobile Hackathon Success!

Saturday we held our first Boston Mobile Hackathon (pat on the back). We had developers showing up with all kinds of experience, exchanging ideas, swapping knowledge, and making connections through an extraordinary networking environment. It was great fun all while being super productive.

The goal of the event was to showcase some brand new mobile dev tools that are not yet publicly available and bring mobile apps to the next level by adding social networks and cloud service hooks. It was encouraging to see the dedication of all the hackers. We will definitely be hosting another Hackathon in the near future.

Participants submitted video demos of their apps or games to the judges and a winner has been determined based on the best use of the "Vixivey" mobile technologies. That winner is Dave Owens (pictured below: middle) from TapWalk who built a massively multi-user real-time continuous game of the geek classic "Rock, Paper, Scissors, Lizard, Spock!" Congrats, Dave!

A big thanks to Kinvey for co-organizing the event, to WorkBar for providing us the perfect location, and to b.good & Hot Tomatoes for keeping our hackers energized by filling bellies with the best grub in town.

Hack on!

You Really Should Let Us Handle That

You're reading this copy because it spilled out of my head, and I popped it into a doc and pressed a few buttons that served it up for you on whatever device or medium you have handy. Your car may be reading it to you as you drive, you may be scanning it on your phone on the metro, or at your desk as you sip your soup. What's cool is that you no longer wonder or care what happened between my writing it and your uptake. Someone, or something, else is responsible for making sure the bits were posted, a network moved them, your device could find them and give them to you, and someone actually got paid to enable that process.

At our upcoming Hackathon in Boston, we'll be sharing a first glimpse of our Android SDK for mobile app developers who want to add social hooks to their apps and games.  Sure, some Android games can connect to your personal network of friends on Facebook now, but how does that happen?  What the players don't see or likely care about is how much time the app developers need to spend keeping their apps updated to take advantage of the back-end services that the social networks provide.  That's where Viximo shines.

By leveraging this newest SDK, mobile app developers can write their social network integrations once, and we'll take care of the heavy lifting and updates when they're ready to open their games up to the world of users on multiple social networks.  Heck, we'll even take care of native messaging, real time presence detection, recommendations and tracking requirements for those platforms too.  We've been doing it on the web for years, and now we're going mobile.  Learn more at the Hackathon and watch for detailed resources posted to the site coming soon.

  ...

Know Thy Audience

The job of the Vice President of the United States is one of the most vaguely defined in our political system, but it consists of two absolute necessities:

         1. Be alive in case the president dies

         2. Help the president get elected (or re-elected)

Last week Vice President Joe Biden failed miserably in the latter responsibility when he tried to rally support for Barack Obama by telling attendees of a San Francisco rally that "the Giants are going to the Super Bowl!"


As we know now, the Giants will in fact be going to the Super Bowl, but to the chagrin of San Francisco fans everywhere, and as anyone who's ever watched an NFL game knows, the Giants play in New York (New Jersey, to be precise). Old Amtrak Joe most likely confused this team with their baseball equivalent (who do in fact play in San Fran), but nonetheless the gaffe is sure to stay with him as part of his legacy as Veep.

Not just a comical punchline destined for SNL's cold open, Biden's mistake illustrates a crucial core competency to successfully bring a product to market: Know thy Audience. I don't need a focus group to tell me that people will be more receptive to the product I am offering them if they think I can understand their needs.

History is littered with examples of companies large and small that dove into new markets without a full understanding of their audience:

·   Disney almost bankrupted their new park in Paris because they didn't serve wine, a staple of the European meal.

·   Coca-Cola had to recall their 2 liter bottles from Spain because local refrigerators couldn't fit them.

·   An American airline promoted that they offered "rendezvous lounges" in Brazilian airports without realizing that in Portuguese, "rendezvous" refers to a place to have sex.

Mistakes like this can undo months of planning, be incredibly expensive and, worst of all, indelibly harm a brand's reputation among a demographic. At Viximo we are constantly expanding into new markets, so it is crucial that we take the time to understand the habits of our new audience. Not all games, branding, or marketing tactics will work equally well across our portfolio. That's why we take the time to work with our publishing partners to learn about their users, examine successful games already in the area, and research behavioral and social tendencies of the region, to decide which of our various games and promotions will work best to maximize our audience's enjoyment. Doing this kind of legwork before launching in a new market ensures that we are never encouraging our audience to do something like root for the other team in a playoff game. ...

Automated Partner Monitoring

Viximo thrives on reliable services: internally and externally. Downtime, even if caused by a 3rd-party, means fewer customers and real costs - not just for us, but for all of our partners. It's for that very reason we've invested a little blood, sweat, and tears into automated monitoring for the partner systems that we rely on.

Back in the old days...

It's hard to believe it was a little over a year ago when we first automated monitoring of the 3rd-party games that run on our platform. At the time, we were still fairly inefficient at detecting outages. If a partner happened to detect an outage it was typically long after it had already started, leaving many users with a cryptic error page. If users created support tickets about the outage, the lack of a dedicated customer support engineer meant we were often backlogged several days. If no employees at Viximo or its partners played the game on any given day, it was possible no one would realize it was down.

The nature of our 3rd-party game integrations makes detecting outages a bit more difficult. Like many apps on Facebook, games are integrated via an iframe on the page. Simple failures such as non-200 status codes are easy to detect, but games can fail in much more subtle ways:

  • Failures isolated to certain geographic zones (US, Germany, Spain, etc.)
  • Issues encountered only within Flash
  • JavaScript errors or failure to load certain files on the page
  • Intermittent failures due to capacity issues

Without direct insight into the actual application and the servers it's running on, these problems are only magnified.

Getting the job done right...

The tools we now use are insufficient by themselves, but together they can detect the majority of the outages that our partners experience by tackling the problem from various angles.

Monitis is a hosted monitoring solution that we use primarily for its uptime service. It offers geographic locations that map well to Viximo's social network integrations and provides a suite of advanced notifications / callback hooks.

Airbrake collects errors generated within our application. It also allows us to track JavaScript errors generated in the browser, particularly within partner pages.

Zendesk is a customer support management application we use to track data about support issues. Zendesk has a great API and management interface that makes it easy to analyze and categorize the various issues per game.

Nagios is a system monitoring application that is used, in this case, to monitor application and user behavior through Updawg. Viximo also uses Nagios for other internal systems monitoring.

Using the above tools, we've defined a series of triggers and notification channels that can quickly alert us to issues so that they can be resolved promptly. The first set of triggers below will automatically mark a game as down and place it in maintenance mode. To prevent blips and false alarms from taking apps down too often, these triggers must fail a certain number of times consecutively from a subset of our 5 geographic locations around the world.

  • HTTP status code - Any 2xx status code is considered a success; all other codes are considered a failure. This trigger catches most outages, since the web server will typically return a 500 status code when down.
  • Content matching - Any response that does not include a particular set of content specific to that game is considered a failure. This catches instances where the game is returning a 2xx HTTP status code even though it failed to process.
  • Timeout - If the url takes more than 10 seconds to process, the game is marked as down. Disabling apps as a result of a timeout can help ensure that the game-playing experience is tolerable for active players. This way other players can get funneled to other games while the performance gets investigated.
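
The three automated triggers and the "consecutive failures from a subset of locations" rule can be sketched roughly as follows. This is only an illustration in Python, not the actual Viximo or Monitis implementation: the names `ProbeResult`, `probe_failure`, and `OutageTracker` are hypothetical, and the default thresholds (3 consecutive failures, 2 locations) are assumptions, since the real values are not stated here.

```python
from collections import defaultdict
from dataclasses import dataclass

TIMEOUT_SECONDS = 10.0  # the 10-second limit from the timeout trigger


@dataclass
class ProbeResult:
    status: int     # HTTP status code returned by the game URL
    body: str       # response body
    elapsed: float  # seconds the request took


def probe_failure(result, required_content):
    """Return the name of the first failing trigger, or None on success."""
    if result.elapsed > TIMEOUT_SECONDS:
        return "timeout"
    if not 200 <= result.status < 300:
        return "status"
    # Content matching: every game-specific snippet must be present.
    if any(snippet not in result.body for snippet in required_content):
        return "content"
    return None


class OutageTracker:
    """Tracks consecutive failures per location (thresholds are assumed)."""

    def __init__(self, consecutive_needed=3, locations_needed=2):
        self.consecutive_needed = consecutive_needed
        self.locations_needed = locations_needed
        self.streaks = defaultdict(int)

    def record(self, location, failed):
        # A success resets the streak, so isolated blips never accumulate.
        self.streaks[location] = self.streaks[location] + 1 if failed else 0

    def is_down(self):
        failing = sum(1 for n in self.streaks.values()
                      if n >= self.consecutive_needed)
        return failing >= self.locations_needed
```

Resetting the streak on any success is what keeps a single slow response from one location from putting a game into maintenance mode.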

The remaining triggers use the available notification channels to alert folks at Viximo when they're failing. This gives us the ability to manually test the game and validate that it is in fact down prior to actually marking it as such in our system.

  • User activity - Visits, transactions, and revenue are compared to historical averages. If any value deviates too far from the average, an alert is generated.
  • Customer support rate - If the rate of customer support issues for a game deviates too far from the historical average, an alert is generated. While this data is available for automated monitoring, this is still a manual process that our customer support engineer performs.
  • Error rate - If the total count for a particular type of error exceeds a predetermined threshold, an alert is generated. Again, while this data is available for automated monitoring, this is still a manual process that our engineers must perform.
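
As a rough illustration of "deviates too far from the average", here is a minimal Python sketch that compares current values against a historical mean. The function names and the 50% tolerance are assumptions made for the example; the post does not say how the real comparison is defined or tuned.

```python
from statistics import mean


def deviates(history, current, tolerance=0.5):
    """True when `current` differs from the historical mean by more than
    `tolerance`, expressed as a fraction of that mean (assumed rule)."""
    average = mean(history)
    if average == 0:
        return current != 0
    return abs(current - average) / average > tolerance


def activity_alerts(history_by_metric, current_by_metric, tolerance=0.5):
    """Return the names of metrics whose current value is out of range."""
    return [name for name, current in current_by_metric.items()
            if deviates(history_by_metric[name], current, tolerance)]


# Example with made-up numbers for a single game.
history = {
    "visits": [1000, 1100, 900],
    "transactions": [50, 60, 40],
    "revenue": [200.0, 220.0, 180.0],
}
current = {"visits": 300, "transactions": 55, "revenue": 190.0}
alerts = activity_alerts(history, current)  # only "visits" is out of range
```

The same comparison would apply to the customer support rate; the error rate trigger is simpler still, since it only checks a count against a fixed threshold.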

Communication and recovery...

Once one of the above triggers is activated, a variety of communication channels are available to alert a targeted set of people at Viximo. The type of channel used depends on the severity of the trigger. They include:

  • SMS - This is used for the most reliable triggers, such as those that automatically mark a game as down. Typically the integration manager is notified of these events.
  • E-mail - All triggers that are activated will send an e-mail to a mailing list at Viximo that includes folks from the integration, engineering, and product teams. This ensures that those with knowledge of the game can address any issues should they come up.
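
The routing between these two channels could look something like the sketch below. The trigger names and the `channels_for` function are hypothetical, chosen only to mirror the triggers described earlier.

```python
# Triggers that automatically mark a game as down (names are illustrative).
AUTO_DOWN_TRIGGERS = {"status", "content", "timeout"}


def channels_for(trigger):
    """Pick notification channels for an activated trigger."""
    channels = ["email"]  # every activated trigger e-mails the mailing list
    if trigger in AUTO_DOWN_TRIGGERS:
        channels.append("sms")  # SMS is reserved for the most reliable triggers
    return channels
```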

In addition to the above two channels, there are also additional communication channels for users on the Viximo network:

  • Maintenance page - When a game is marked as down within the Viximo system, a maintenance page is automatically displayed in place of the game. This includes a friendly error message and points users to other games that can be played.
  • Announcement - Administrators have the ability to create announcement popups that will be displayed to any user attempting to play a Viximo game. This also allows us to quickly alert users when there are problems in a game that we're investigating.

Once a game outage has been validated, the partner is typically contacted via e-mail and phone so that they are aware of the issue. An integration engineer is made available in cases where logs and error data are needed to help explain what's occurring.

Moving forward...

Like most software development, monitoring is an ongoing process that can be constantly tweaked and improved to detect and resolve outages more quickly and accurately. There are still some triggers that we could automate, such as the detection of increases in customer support requests and application errors. We could also begin to take advantage of Monitis's full-page website monitoring tools, which allow every link on the game's page to be validated instead of just the main page itself.

While we strive to make the content available to our users as reliable as possible, there are always going to be unexpected outages. The best thing we can do is make that experience as painless as possible for both our users and our partners. Having automated monitoring in place for our partners has put us on the right path towards that goal. ...