Technology

Mobile Hackathon Success!

Saturday we held our first Boston Mobile Hackathon (pat on the back). We had developers showing up with all kinds of experience, exchanging ideas, swapping knowledge, and making connections through an extraordinary networking environment. It was great fun all while being super productive.

The goal of the event was to showcase some brand new mobile dev tools that are not yet publicly available and bring mobile apps to the next level by adding social networks and cloud service hooks. It was encouraging to see the dedication of all the hackers. We will definitely be hosting another Hackathon in the near future.

Participants submitted video demos of their apps or games to the judges and a winner has been determined based on the best use of the "Vixivey" mobile technologies. That winner is Dave Owens (pictured below: middle) from TapWalk who built a massively multi-user real-time continuous game of the geek classic "Rock, Paper, Scissors, Lizard, Spock!" Congrats, Dave!

A big thanks to Kinvey for co-organizing the event, to WorkBar for providing us the perfect location, and to b.good & Hot Tomatoes for keeping our hackers energized by filling bellies with the best grub in town.

Hack on!

You Really Should Let Us Handle That

You're reading this copy because it spilled out of my head, I popped it into a doc, and I pressed a few buttons that served it up for you on whatever device or medium you have handy.  Your car may be reading it to you as you drive, you may be scanning it on your phone on the metro, or at your desk as you sip your soup.  What's cool is that you no longer wonder or care what happened between my writing it and your uptake.  Someone, or something, else is responsible for making sure the bits were posted, a network moved them, your device could find them and give them to you, and someone actually got paid to enable that process.

At our upcoming Hackathon in Boston, we'll be sharing a first glimpse of our Android SDK for mobile app developers who want to add social hooks to their apps and games.  Sure, some Android games can connect to your personal network of friends on Facebook now, but how does that happen?  What the players don't see or likely care about is how much time the app developers need to spend keeping their apps updated to take advantage of the back-end services that the social networks provide.  That's where Viximo shines.

By leveraging this newest SDK, mobile app developers can write their social network integrations once, and we'll take care of the heavy lifting and updates when they're ready to open their games up to the world of users on multiple social networks.  Heck, we'll even take care of native messaging, real time presence detection, recommendations and tracking requirements for those platforms too.  We've been doing it on the web for years, and now we're going mobile.  Learn more at the Hackathon and watch for detailed resources posted to the site coming soon.

  ...

Automated Partner Monitoring

Viximo thrives on reliable services: internally and externally. Downtime, even if caused by a 3rd-party, means fewer customers and real costs - not just for us, but for all of our partners. It's for that very reason we've invested a little blood, sweat, and tears into automated monitoring for the partner systems that we rely on.

Back in the old days...

It's hard to believe it was a little over a year ago when we first automated monitoring of the 3rd-party games that run on our platform. At the time, we were still fairly inefficient at detecting outages. If a partner happened to detect an outage it was typically long after it had already started, leaving many users with a cryptic error page. If users created support tickets about the outage, the lack of a dedicated customer support engineer meant we were often backlogged several days. If no employees at Viximo or its partners played the game on any given day, it was possible no one would realize it was down.

The nature of our 3rd-party game integrations makes detecting outages a bit more difficult. Like many apps on Facebook, games are integrated via an iframe on the page. Simple failures such as non-200 status codes are easy to detect, but games can fail in much more subtle ways:

  • Failures isolated to certain geographic zones (US, Germany, Spain, etc.)
  • Issues encountered only within Flash
  • JavaScript errors or failure to load certain files on the page
  • Intermittent failures due to capacity issues

Without direct insight into the actual application and the servers it's running on, these problems are only magnified.

Getting the job done right...

The tools we now use are insufficient by themselves, but together they can detect the majority of the outages that our partners experience by tackling the problem from various angles.

Monitis is a hosted monitoring solution that we use primarily for its uptime service. It offers geographic locations that map well to Viximo's social network integrations and provides a suite of advanced notifications / callback hooks.

Airbrake collects errors generated within our application. This also allows us to track JavaScript errors generated in the browser, particularly within partner pages.

Zendesk is a customer support management application we use to track data about support issues. Zendesk has a great API and management interface that makes it easy to analyze and categorize the various issues per game.

Nagios is a system monitoring application that is used, in this case, to monitor application and user behavior through Updawg. Viximo also uses Nagios for other internal systems monitoring.

Using the above tools, we've defined a series of triggers and notification channels that can quickly alert us to issues so that they can be resolved promptly. The first set of triggers below will automatically mark a game as down and place it in maintenance mode. To prevent blips and false alarms from taking apps down too often, these triggers must fail a certain number of times consecutively from a subset of our 5 geographic locations around the world.

  • HTTP status code - Any 2xx status code is considered success; all other codes are considered a failure. This trigger catches most outages, since the web server will typically return a 500 status code when down.
  • Content matching - Any response that does not include a particular set of content specific to that game is considered a failure. This catches instances where the game is returning a 2xx HTTP status code even though it failed to process.
  • Timeout - If the url takes more than 10 seconds to process, the game is marked as down. Disabling apps as a result of a timeout can help ensure that the game-playing experience is tolerable for active players. This way other players can get funneled to other games while the performance gets investigated.
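A minimal sketch of how these three triggers and the "consecutive failures from a subset of locations" rule could be evaluated. This is an illustration, not the actual Monitis configuration: the names `CheckResult`, `check_failed`, and `should_mark_down` are hypothetical, and the thresholds of 3 consecutive failures across 2 locations are assumptions, since the post does not give the real values. Only the 2xx rule, the content match, and the 10 second timeout come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TIMEOUT_SECONDS = 10.0  # from the post: more than 10 seconds counts as down


@dataclass
class CheckResult:
    """Outcome of one probe of a game URL from one location (hypothetical type)."""
    status_code: Optional[int]  # None if no response arrived at all
    body: str
    elapsed_seconds: float


def check_failed(result: CheckResult, required_content: List[str]) -> bool:
    """Return True if a single probe trips any of the three triggers."""
    if result.elapsed_seconds > TIMEOUT_SECONDS:
        return True  # timeout trigger
    if result.status_code is None or not 200 <= result.status_code < 300:
        return True  # HTTP status trigger: only 2xx counts as success
    # Content-matching trigger: every expected snippet must be present.
    return not all(snippet in result.body for snippet in required_content)


def should_mark_down(history: Dict[str, List[bool]],
                     consecutive: int = 3,     # assumed threshold
                     min_locations: int = 2    # assumed threshold
                     ) -> bool:
    """history maps a location name to its failure flags, oldest first.

    The game is marked down only when at least `min_locations` locations
    have each failed their last `consecutive` probes in a row.
    """
    failing_locations = 0
    for flags in history.values():
        recent = flags[-consecutive:]
        if len(recent) == consecutive and all(recent):
            failing_locations += 1
    return failing_locations >= min_locations
```

Requiring several consecutive failures from more than one location is what filters out single blips and region-local network trouble before a game is put into maintenance mode.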

The remaining triggers use the available notification channels to alert folks at Viximo when they're failing. This gives us the ability to manually test the game and validate that it is in fact down prior to actually marking it as such in our system.

  • User activity - Visits, transactions, and revenue are compared to historical averages. If any value deviates too far from the average, an alert is generated.
  • Customer support rate - If the rate of customer support issues for a game deviates too far from the historical average, an alert is generated. While this data is available for automated monitoring, this is still a manual process that our customer support engineer performs.
  • Error rate - If the total count for a particular type of error exceeds a certain predetermined threshold, an alert is generated. Again, while this data is available for automated monitoring, this is still a manual process that our engineers must perform.
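One way the "deviates too far from the historical average" test could be expressed is as a distance from the mean measured in standard deviations. This is a sketch under assumptions: the post does not say how deviation is measured, so the standard-deviation approach, the function name `deviates`, and the default of 3 sigmas are all illustrative choices.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviates(history: Sequence[float], current: float,
             max_sigmas: float = 3.0) -> bool:
    """True if `current` is more than `max_sigmas` standard deviations
    away from the mean of `history` (assumed definition of "too far")."""
    if len(history) < 2:
        return False  # not enough data to judge
    avg = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != avg
    return abs(current - avg) / spread > max_sigmas
```

The same function could be applied to daily visits, transactions, revenue, or support ticket counts; a sudden drop in visits trips it just as a sudden spike in tickets does, because only the absolute distance from the average matters.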

Communication and recovery...

Once one of the above triggers is activated, a variety of communication channels are available to alert a targeted set of people at Viximo. The type of channel used depends on the severity of the trigger. They include:

  • SMS - This is used for the most reliable triggers, such as those that automatically mark a game as down. Typically the integration manager is notified of these events.
  • E-mail - All triggers that are activated will send an e-mail to a mailing list at Viximo that includes folks from the integration, engineering, and product teams. This ensures that those with knowledge of the game can address any issues should they come up.

In addition to the above two channels, there are also additional communication channels for users on the Viximo network:

  • Maintenance page - When a game is marked as down within the Viximo system, a maintenance page is automatically displayed in place of the game. This includes a friendly error message and points users to other games that can be played.
  • Announcement - Administrators have the ability to create announcement popups that will be displayed to any user attempting to play a Viximo game. This also allows us to quickly alert users when there are problems in a game that we're investigating.

Once a game outage has been validated, the partner is typically contacted via e-mail and phone so that they are aware of the issue. An integration engineer is made available in cases where logs and error data are needed to help explain what's occurring.

Moving forward...

Like most software development, monitoring is an ongoing process that can be constantly tweaked and improved to more accurately and quickly detect and resolve outages. There are still some triggers that we could automate, such as the detection of increases in customer support requests and application errors. As well, we could begin to take advantage of Monitis's full-page website monitoring tools that allow every link on the game's page to be validated instead of just the main page itself.

While we strive to make the content available to our users as reliable as possible, there are always going to be unexpected outages. The best thing we can do is to make that experience as painless as possible for both our users and our partners. Having automated monitoring in place for our partners has put us on the right path towards that goal. ...

cloud server != cloud server

One of the more interesting facets of my work at Viximo is tracking EC2 cluster performance. Below you’ll see a graph of data collected and displayed through New Relic.

The graph shows throughput and average response time for our cluster over a three hour period. The vertical bar in the center represents a feature hotfix that most certainly would not affect performance. So why would response time drop so dramatically from 33ms to 27ms, a (33-27) / 33 = 18% difference?

Spinning up instances in EC2 is cheap, and restarting Passenger results in starvation while Rails initializes and is then forked into workers. So instead of restarting Apache, we spin up a new batch of servers, wait until they’re settled, then re-point the load balancer. The hotfix therefore went out on a fresh set of instances, and the graph is illustrating the differences in cloud hardware/instances.

The AWS zone we're in appears to have three different classes of hardware for small/medium instances: fast, slow, and screwy, as illustrated above. Slow as in 40% slower than the fast instance. Screwy as in what’s up with that reported CPU? (Known Ubuntu kernel bug.) At first we thought the instance performance differences could be explained by something obvious like multi-tenancy. However, after lengthy capacity testing we found that the data in /proc/cpuinfo mattered significantly more than other factors such as multi-tenancy. The variation between a fast and slow server could be 25ms vs 40ms average response times. Inside the fast hardware class, multi-tenancy and other factors explain the observed range of 23-28ms average response times.
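For readers who want to check which class of hardware a freshly launched instance landed on, here is a small sketch that pulls the CPU model out of /proc/cpuinfo-style text. The `model name` field is standard on x86 Linux. The function names and the model-to-class mapping are hypothetical: the post does not say which CPU models were fast or slow, so the mapping has to be filled in from your own capacity tests.

```python
from typing import Dict, Optional


def cpu_model(cpuinfo_text: str) -> Optional[str]:
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def hardware_class(model: Optional[str], classes: Dict[str, str]) -> str:
    """Map a CPU model string to a label such as 'fast' or 'slow'.

    `classes` maps a substring of the model name to a label; it is
    hypothetical and must come from your own measurements.
    """
    if model:
        for fragment, label in classes.items():
            if fragment in model:
                return label
    return "unknown"
```

On an instance, this would be fed the contents of the real file, for example `cpu_model(open("/proc/cpuinfo").read())`, right after boot and before the server is added to the load balancer.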

Back to the initial graph, the average response time for the 6 instances is (25.2 + 24.3 + 41.4 + 25.1 + 26.5 + 29.9) / 6 = 28.7. Now change two of the fast instances into slow instances (e.g. 24.3 => 41.1), (25.2 + 41.1 + 41.4 + 40.5 + 26.5 + 29.9) / 6 = 34.1. Boom, we’ve got a (34.1 - 28.7) / 34.1 = 16% difference.
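The same arithmetic as a few lines of Python, using the per-instance averages quoted above. The variable names are only for illustration.

```python
# Per-instance average response times in ms, as quoted in the post.
mostly_fast = [25.2, 24.3, 41.4, 25.1, 26.5, 29.9]  # one slow instance
more_slow = [25.2, 41.1, 41.4, 40.5, 26.5, 29.9]    # two fast ones swapped for slow


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


after = average(mostly_fast)    # cluster with mostly fast hardware
before = average(more_slow)     # cluster with three slow instances
difference = (before - after) / before

print(f"{after:.1f} ms vs {before:.1f} ms, a {difference:.0%} difference")
```

Swapping just two of six instances between hardware classes is enough to move the cluster-wide average by roughly the amount seen in the graph, with no code change involved.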

So what does this mean for Viximo’s day to day operations? Most of the time the differences are ignored. Occasionally we find it worth the effort to weed out slow servers for long running rarely modified services. Sometimes it does matter such as a Ruby app tier where cpu is the limiting factor and the cpu differences can either degrade performance or conversely cost us money. For those we must capacity test each class of hardware then factor in observed ratios for cluster sizing.
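A sketch of what "factor in observed ratios for cluster sizing" could look like in practice. Everything here is hypothetical: the function names, the class labels, and the per-instance throughput figures in the usage note are made up for illustration, and real values would come from capacity testing each hardware class.

```python
import math
from typing import Dict


def cluster_capacity(instances: Dict[str, int],
                     rps_per_instance: Dict[str, float]) -> float:
    """Total requests/second the cluster can serve, summed per hardware class."""
    return sum(count * rps_per_instance[cls] for cls, count in instances.items())


def extra_instances_needed(target_rps: float,
                           instances: Dict[str, int],
                           rps_per_instance: Dict[str, float],
                           new_class: str) -> int:
    """How many more instances of `new_class` are needed to reach target_rps."""
    shortfall = target_rps - cluster_capacity(instances, rps_per_instance)
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / rps_per_instance[new_class])
```

With illustrative figures of 100 requests/second for a fast instance and 60 for a slow one, a cluster of 4 fast and 2 slow instances serves 520. Reaching 700 would take 2 more fast instances or 3 more slow ones, which is where the hardware lottery starts to cost money.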
...

How We Build

I'm excited to announce that in the weeks and months to come, we're going to begin sharing more about our technology, engineering, and operations.  We're eager to share how we've designed and built our service and some of the many lessons we've learned along the way.  But first, to help orient those discussions, I want to give you a glimpse inside just how we build our products.

Application Engineering

We call our core engineering team Application Engineering.  This team is primarily focused on development of any user experience components, our integration container, and all the RESTful services supporting those.  Essentially, anything outward facing that our partners or users may see is developed by this team.

Requirements for the App Engineering team are managed by our product management staff which includes both product managers and partner account managers.

Because these features often have deadlines, or at least timelines that need to be communicated publicly, this team uses Scrum as its primary process.  Sprints are typically 2-3 weeks in length with a set of target stories and a planned release date.  We find that Scrum also allows for enough rigor to implement a solid testing period before the release goes live.  The burndown chart is shared throughout the company and the team holds daily standups including stakeholders to track progress.

Of course, throughout the sprint we regularly perform update releases, sometimes several in a single day.  So we're not strictly bound by the sprint schedule, but it provides a nice cadence to feature development and helps manage expectations with our product team and our partners.

Infrastructure Engineering

Over time we discovered a different category of work that we call Infrastructure Engineering.  These tasks often have more to do with systems than features, things that affect performance, scalability, deployability, or monitorability.  They are not overtly outward facing.  This team includes what most think of as devops, but also covers architecture and application work.

Requirements for the Infrastructure Engineering team are primarily managed by our technology team which takes input from engineering and operations.

Because this work doesn't typically have rigid deadlines but still needs to be managed and prioritized, the team uses Kanban as its primary process.  Our Kanban board is shared with everyone and the team meets weekly to talk about approaches, review impediments, and raise new issues.

Work from the Infrastructure Engineering team rolls into production all the time, but some architectural changes are obviously coupled to App Engineering sprint releases to keep things in sync.

Extreme Flexibility

The trick to all of this is that we're still a completely flat organization and these teams are extremely fluid.  Typically the App Engineering team has about 8 engineers and the Infrastructure Engineering team has about 4 engineers, but this changes all the time.  As a new rush of application oriented work comes through, we'll shift resources towards App Engineering.  As a large amount of infrastructure work is required we'll shift some staff in that direction.

This also means that engineers have the opportunity to drift across the product stack, getting great full stack exposure and changing up their work on a regular basis.  Several of our engineers very intentionally alternate back and forth between application and infrastructure work on a regular basis.

This organization has allowed us to deliver dozens and dozens of releases on-time and on-plan, while continuously evolving our architecture to meet the growing needs of our partners and customers.

Hopefully this will help orient some of what we'll share in the months to come, and if you're interested in joining our team, we're always looking for more smart folks.  Please email us at jobs@viximo.com if you think you can help. ...