Technology

The Challenge of OpenSocial

Let's Go Open

When you’re living in the world of Facebook, life is relatively uncomplicated. With one set of APIs to develop against, it’s not hard to understand why many fall into the trap of building their apps as though the only site they will ever have to integrate against is Facebook. But, what does one do to move beyond Facebook? Does the promise of OpenSocial, as a unifying off-Facebook API, truly deliver?

Developing a successful cross-network application comes down to three core questions:

1) Can one successfully monetize against the participating demographics?
2) Can one successfully capture the attention of new and engaged users via viral channels?
3) Can users easily connect with their friends in the game?

Fundamentally, these three questions shouldn't be that hard to answer; however, it becomes increasingly difficult when the solution to each question varies widely from site to site. To solve these troublesome issues, OpenSocial was formed as a counter-balance to Facebook's dominance. Rather than building ten custom integrations for ten social networks, developers could build one integration against OpenSocial and launch on any number of OpenSocial-compliant networks.

As a developer who spends his days integrating off of Facebook, I would love to tell you that the promise was delivered. I would love to tell you that integrating on the top ten OpenSocial sites is more like integrating against a single common API. It's not.

It's About Money

A great game can monetize almost anywhere; that is something that we at Viximo have seen time and time again. The principles that drive an excellent game on Facebook can carry over onto any number of other networks worldwide. All that being said, much of the hard work comes long after the game development has finished and network integration has begun.

On a social network that does not have a site-wide economy (think Facebook Credits), the complexity can be immense. Finding the proper payment providers for a particular locale is not as simple as picking Visa, Mastercard, or American Express. In some countries, users prefer a more managed experience through a PayPal-esque provider, where they feel protected through a respected middleman. In other countries, a user may prefer to use their mobile or landline phones for payment. To top it all off, there may be scores of competing payment providers, each with differing levels of credibility that can directly affect your bottom line.

Even on a site that does have a site-wide economy, your troubles aren't over. In all of my integrations, I have yet to see a network that has chosen to mirror the Facebook APIs for managing their economies. It can often take weeks to iron out the complexities of how their economy works with their foreign currency, what fees, if any, need to be taken into account, and how their payment flow integrates both client and server side.

Some sites, such as Hi5, have attempted to standardize the model for payment processing on OpenSocial networks; however, most sites choose to allow game developers to manage their own economies. While this can have its advantages, it can also make it very difficult to get launched on these new networks where one lacks the expertise to know the proper payment options, price-points, and payment experience. This can put even the most successful viral launch in jeopardy, as a poor initial payment experience may permanently deter future purchasers.

Let's Go Viral

After going live on a new social network, the ability to acquire users cheaply is essential to maintaining strong margins. Like Facebook, most social networks provide some mechanisms for users to send gifts to their friends, share updates about their adventures in your game, and invite their friends to join in the fun. Determining how the social network intends for you to perform these actions, though, can be remarkably challenging.

Every site tends to have different limitations on throttling (the number of times, per day, a user can perform a certain action), different limitations on the length of messages, varying implementations for images and other "eye catching" assets, and different degrees of support for parameter passing. Each of these issues can be tedious and time-consuming to assess, implement, and test. In addition, tracking the ever-changing nature of the APIs can leave one's head spinning. Some social networks choose to keep up with the ever-evolving OpenSocial spec, while others have chosen to do an initial implementation of the specification followed by expansion with custom APIs that are specific to their site.

As Facebook continues to evolve and improve their social-gaming features, so shall sites around the world as they continue to do their best to keep up with best-of-breed practices and features. Could OpenSocial be the answer to creating a common off-Facebook API that reduces stress and headaches for developers who want to launch their games around the world? Perhaps. Is OpenSocial the answer today, or anytime in the near future? Absolutely not.

So, What Am I to Do?

Because of all these issues, Viximo is a perfect solution for a developer looking to diversify their social network portfolio. Rather than writing ten custom integrations and adding the necessary staff to maintain this work going forward, Viximo takes care of this for you. From user acquisition and monetization, to integration optimization, Viximo abstracts away all the complexity of moving off of Facebook and lets you, the developer, focus on making an awesome game. That’s the magic of Viximo. ...

Here it is! Our Social Zone Platform

It's official! Our Social Zone platform, which helps make mobile apps more social, has publicly launched.                  

To put it simply, the Social Zone helps connect a user to their preferred social network in order to play mobile games with their friends. Mobile app makers have a hard time getting noticed on platforms with hundreds of thousands of rival apps. By making the mobile apps more social, we will help the apps spread more easily across the user's extended social networks.

Viximo has already made it easier to spread games through social networks on the web, and now with the mobile-focused Social Zone, we're enabling developers to add social hooks to their games played on Android and iOS (iPhone, iPad, iPod Touch) devices.

Our goal with the Social Zone is to amplify the virality of mobile apps being developed; enable players to find, collaborate and compete with friends in real-time; and discover relevant games based on what their friends are playing. With Social Zone, developers can accelerate user acquisition, increase engagement, and drive monetization.

We’re taking years of experience in social games and making it "drop dead simple" for mobile app developers to apply it to the mobile frontier. Learn more about Social Zone and our offerings, including our Social Supergraph, by visiting us at mobile.viximo.com.

No plans this weekend? Give our platform a spin at the 2012 AngelHack events in San Francisco and Boston. As a sponsor, our goal is to help on-site developers use our mobile SDK to add social hooks to their apps. We will be speaking at both events, at 10:30am in San Francisco and 11:25am in Boston. Come by and hack with us, and maybe even win a pretty awesome prize. ...

Mobile Hackathon Success!

Saturday we held our first Boston Mobile Hackathon (pat on the back). We had developers showing up with all kinds of experience, exchanging ideas, swapping knowledge, and making connections through an extraordinary networking environment. It was great fun all while being super productive.

The goal of the event was to showcase some brand new mobile dev tools that are not yet publicly available and bring mobile apps to the next level by adding social networks and cloud service hooks. It was encouraging to see the dedication of all the hackers. We will definitely be hosting another Hackathon in the near future.

Participants submitted video demos of their apps or games to the judges and a winner has been determined based on the best use of the "Vixivey" mobile technologies. That winner is Dave Owens (pictured below: middle) from TapWalk who built a massively multi-user real-time continuous game of the geek classic "Rock, Paper, Scissors, Lizard, Spock!" Congrats, Dave!

A big thanks to Kinvey for co-organizing the event, to WorkBar for providing us the perfect location, and to b.good & Hot Tomatoes for keeping our hackers energized by filling bellies with the best grub in town.

Hack on!

You Really Should Let Us Handle That

You're reading this copy because it spilled out of my head, I popped it into a doc, and I pressed a few buttons that served it up for you on whatever device or medium you have handy.  Your car may be reading it to you as you drive, you may be scanning it on your phone on the metro, or you may be reading it at your desk as you sip your soup.  What's cool is that you no longer wonder or care what happened between my writing it and your uptake.  Someone, or something, else is responsible for making sure the bits were posted, a network moved them, your device could find them and give them to you, and someone actually got paid to enable that process.

At our upcoming Hackathon in Boston, we'll be sharing a first glimpse of our Android SDK for mobile app developers who want to add social hooks to their apps and games.  Sure, some Android games can connect to your personal network of friends on Facebook now, but how does that happen?  What the players don't see or likely care about is how much time the app developers need to spend keeping their apps updated to take advantage of the back-end services that the social networks provide.  That's where Viximo shines.

By leveraging this newest SDK, mobile app developers can write their social network integrations once, and we'll take care of the heavy lifting and updates when they're ready to open their games up to the world of users on multiple social networks.  Heck, we'll even take care of native messaging, real time presence detection, recommendations and tracking requirements for those platforms too.  We've been doing it on the web for years, and now we're going mobile.  Learn more at the Hackathon and watch for detailed resources posted to the site coming soon.

  ...

Automated Partner Monitoring

Viximo thrives on reliable services: internally and externally. Downtime, even if caused by a 3rd-party, means fewer customers and real costs - not just for us, but for all of our partners. It's for that very reason we've invested a little blood, sweat, and tears into automated monitoring for the partner systems that we rely on.

Back in the old days...

It's hard to believe it was a little over a year ago when we first automated monitoring of the 3rd-party games that run on our platform. At the time, we were still fairly inefficient at detecting outages. If a partner happened to detect an outage it was typically long after it had already started, leaving many users with a cryptic error page. If users created support tickets about the outage, the lack of a dedicated customer support engineer meant we were often backlogged several days. If no employees at Viximo or its partners played the game on any given day, it was possible no one would realize it was down.

The nature of our 3rd-party game integrations makes detecting outages a bit more difficult. Like many apps on Facebook, games are integrated via an iframe on the page. Simple failures such as non-200 status codes are easy to detect, but games can fail in much more subtle ways:

  • Failures isolated to certain geographic zones (US, Germany, Spain, etc.)
  • Issues encountered only within Flash
  • JavaScript errors or failure to load certain files on the page
  • Intermittent failures due to capacity issues

Without direct insight into the actual application and the servers it's running on, these problems are only magnified.

Getting the job done right...

The tools we now use are insufficient by themselves, but together they can detect the majority of the outages that our partners experience by tackling the problem from various angles.

Monitis is a hosted monitoring solution that we use primarily for its uptime service. It offers geographic locations that map well to Viximo's social network integrations and provides a suite of advanced notifications / callback hooks.

Airbrake collects errors generated within our application. This also allows us to track JavaScript errors generated in the browser, particularly within partner pages.

Zendesk is a customer support management application we use to track data about support issues. Zendesk has a great API and management interface that makes it easy to analyze and categorize the various issues per game.

Nagios is a system monitoring application that is used, in this case, to monitor application and user behavior through Updawg. Viximo also uses Nagios for other internal systems monitoring.

Using the above tools, we've defined a series of triggers and notification channels that can quickly alert us to issues so that they can be resolved promptly. The first set of triggers below will automatically mark a game as down and place it in maintenance mode. To prevent blips and false alarms from taking apps down too often, these triggers must fail a certain number of times consecutively from a subset of our 5 geographic locations around the world.

  • HTTP status code - Any 2xx status code is considered success; all other codes are considered a failure. This trigger catches most outages, since the web server will typically return a 500 status code when down.
  • Content matching - Any response that does not include a particular set of content specific to that game is considered a failure. This catches instances where the game is returning a 2xx HTTP status code even though it failed to process.
  • Timeout - If the URL takes more than 10 seconds to process, the game is marked as down. Disabling apps as a result of a timeout can help ensure that the game-playing experience is tolerable for active players. This way other players can get funneled to other games while the performance gets investigated.
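
As a rough illustration of how the three automated triggers and the "consecutive failures from a subset of locations" rule fit together, here is a minimal Python sketch. It is not our production code: the `ProbeResult` shape, the function names, and the default thresholds (`consecutive=3`, `min_locations=2`) are all assumptions made for the example. Only the 2xx rule, the content check, and the 10 second timeout come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TIMEOUT_SECONDS = 10.0  # from the post: more than 10 seconds counts as down


@dataclass
class ProbeResult:
    """One check of a game's URL from one location (hypothetical shape)."""
    status_code: Optional[int]  # None if no response was received at all
    body: str
    elapsed_seconds: float


def probe_failed(result: ProbeResult, required_content: List[str]) -> bool:
    """Apply the three automated triggers to a single probe."""
    if result.elapsed_seconds > TIMEOUT_SECONDS:
        return True  # timeout trigger
    if result.status_code is None or not 200 <= result.status_code < 300:
        return True  # HTTP status trigger: only 2xx counts as success
    # Content-matching trigger: every expected marker must be present.
    return not all(marker in result.body for marker in required_content)


def should_mark_down(
    history_by_location: Dict[str, List[ProbeResult]],
    required_content: List[str],
    consecutive: int = 3,    # assumed threshold, not stated in the post
    min_locations: int = 2,  # assumed threshold, not stated in the post
) -> bool:
    """True when enough locations have failed their last N probes in a row.

    history_by_location maps a location name to its probes, oldest first.
    """
    failing_locations = 0
    for results in history_by_location.values():
        recent = results[-consecutive:]
        if len(recent) == consecutive and all(
            probe_failed(r, required_content) for r in recent
        ):
            failing_locations += 1
    return failing_locations >= min_locations
```

Requiring several consecutive failures per location filters out single blips, and requiring more than one location filters out problems that are local to a single monitoring site.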

The remaining triggers use the available notification channels to alert folks at Viximo when they're failing. This gives us the ability to manually test the game and validate that it is in fact down prior to actually marking it as such in our system.

  • User activity - Visits, transactions, and revenue are compared to historical averages. If any value deviates too far from the average, an alert is generated.
  • Customer support rate - If the rate of customer support issues for a game deviates too far from the historical average, an alert is generated. While this data is available for automated monitoring, this is still a manual process that our customer support engineer performs.
  • Error rate - If the total count for a particular type of error exceeds a certain predetermined threshold, an alert is generated. Again, while this data is available for automated monitoring, this is still a manual process that our engineers must perform.
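
The "deviates too far from the historical average" checks above could be expressed along these lines. This is only a sketch: measuring distance in standard deviations and the `max_sigmas=3.0` default are assumptions for illustration, since the post does not say how "too far" is defined.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviates(history: Sequence[float], current: float, max_sigmas: float = 3.0) -> bool:
    """Return True if `current` is unusually far from the historical average.

    "Far" is measured in population standard deviations of `history`;
    both that choice and the default threshold are assumptions.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    average = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return current != average  # flat history: any change stands out
    return abs(current - average) / spread > max_sigmas
```

For example, with daily visit counts of `[100, 110, 90, 105, 95]`, a day with 102 visits would not raise an alert, while a day with 20 visits would.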

Communication and recovery...

Once one of the above triggers is activated, a variety of communication channels are available to alert a targeted set of people at Viximo. The type of channel used depends on the severity of the trigger. They include:

  • SMS - This is used for the most reliable triggers, such as those that automatically mark a game as down. Typically the integration manager is notified of these events.
  • E-mail - All triggers that are activated will send an e-mail to a mailing list at Viximo that includes folks from the integration, engineering, and product teams. This ensures that those with knowledge of the game can address any issues should they come up.

In addition to the above two channels, there are also additional communication channels for users on the Viximo network:

  • Maintenance page - When a game is marked as down within the Viximo system, a maintenance page is automatically displayed in place of the game. This includes a friendly error message and points users to other games that can be played.
  • Announcement - Administrators have the ability to create announcement popups that will be displayed to any user attempting to play a Viximo game. This also allows us to quickly alert users when there are problems in a game that we're investigating.
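
The routing rules for the internal channels are simple enough to sketch in a few lines of Python. The `Trigger` type and the channel names are hypothetical. The only rules taken from the text are that every activated trigger sends e-mail and that SMS is reserved for the triggers that automatically mark a game as down.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trigger:
    """A monitoring trigger (hypothetical shape for illustration)."""
    name: str
    auto_marks_down: bool  # True for the status, content, and timeout triggers


def internal_channels(trigger: Trigger) -> List[str]:
    """Pick the internal notification channels for an activated trigger."""
    channels = ["email"]  # every activated trigger e-mails the mailing list
    if trigger.auto_marks_down:
        channels.insert(0, "sms")  # most reliable triggers also send SMS
    return channels
```

So a failed HTTP status check would go out over both SMS and e-mail, while a dip in user activity would only reach the mailing list.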

Once a game outage has been validated, the partner is typically contacted via e-mail and phone so that they are aware of the issue. An integration engineer is made available in cases where logs and error data are needed to help explain what's occurring.

Moving forward...

Like most software development, monitoring is an ongoing process that can be constantly tweaked and improved to more accurately and quickly detect and resolve outages. There are still some triggers that we could automate, such as the detection of increases in customer support requests and application errors. As well, we could begin to take advantage of Monitis's full-page website monitoring tools that allow every link on the game's page to be validated instead of just the main page itself.

While we strive to make the content available to our users as reliable as possible, there are always going to be unexpected outages. The best thing we can do is to make that experience as painless as possible for both our users and our partners. Having automated monitoring in place for our partners has put us on the right path towards that goal. ...

cloud server != cloud server

One of the more interesting facets of my work at Viximo is tracking EC2 cluster performance. Below you’ll see a graph of data collected and displayed through New Relic.

The graph shows throughput and average response time for our cluster over a three-hour period. The vertical bar in the center represents a feature hotfix that most certainly would not affect performance. So why would response time drop so dramatically from 33ms to 27ms, a (33-27) / 33 = 18% difference?

Spinning up instances in EC2 is cheap, and restarting Passenger results in starvation while Rails initializes and is then forked into workers. Instead of restarting Apache, we spin up a new batch of servers, wait until they’re settled, then re-point the load balancer. The graph illustrates the differences in cloud hardware/instances.

The AWS zone we're in appears to have three different classes of hardware for small/medium instances: fast, slow, and screwy, as illustrated above. Slow as in 40% slower than the fast instance. Screwy as in what’s up with that reported CPU? (A known Ubuntu kernel bug.) At first we thought the instance performance differences could be explained by something obvious like multi-tenancy; however, after lengthy capacity testing we found that the data in /proc/cpuinfo mattered significantly more than other factors such as multi-tenancy. The variation between a fast and slow server could be 25ms vs 40ms average response times. Inside the fast hardware class, multi-tenancy and other factors explain the observed range of 23-28ms average response times.
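
If you want to bucket your own instances this way, one simple approach is to read the CPU model out of /proc/cpuinfo and look it up in a table built from your own capacity tests. The sketch below assumes the usual `model name : ...` line found on x86 Linux. The function names and the class table are hypothetical, and no real CPU models are implied.

```python
from typing import Dict, Optional


def cpu_model(cpuinfo_text: str) -> Optional[str]:
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None  # some platforms do not report a model name


def hardware_class(model: Optional[str], classes: Dict[str, str]) -> str:
    """Map a CPU model to a class label taken from your own capacity tests."""
    if model is None:
        return "unknown"
    return classes.get(model, "unknown")


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(cpu_model(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

The `classes` table is something you would fill in yourself after capacity testing each CPU model you encounter.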

Back to the initial graph: the average response time for the 6 instances is (25.2 + 24.3 + 41.4 + 25.1 + 26.5 + 29.9) / 6 = 28.7. Now change two of the fast instances into slow instances (e.g. 24.3 => 41.1): (25.2 + 41.1 + 41.4 + 40.5 + 26.5 + 29.9) / 6 = 34.1. Boom, we’ve got a (34.1 - 28.7) / 34.1 = 16% difference.
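
The arithmetic above is easy to check with a few lines of Python. The per-instance response times are the ones quoted in the paragraph, and the variable names are just for illustration.

```python
def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# Average response times (ms) per instance, as quoted above.
before = [25.2, 24.3, 41.4, 25.1, 26.5, 29.9]  # one slow instance
after = [25.2, 41.1, 41.4, 40.5, 26.5, 29.9]   # three slow instances

avg_before = average(before)
avg_after = average(after)
relative_change = (avg_after - avg_before) / avg_after

print(round(avg_before, 1), round(avg_after, 1), round(relative_change * 100))
# prints: 28.7 34.1 16
```

Swapping just two of six instances from the fast class to the slow class is enough to move the cluster-wide average by roughly the amount seen in the graph.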

So what does this mean for Viximo’s day-to-day operations? Most of the time the differences are ignored. Occasionally we find it worth the effort to weed out slow servers for long-running, rarely modified services. Sometimes it does matter, such as in a Ruby app tier where CPU is the limiting factor and the CPU differences can either degrade performance or, conversely, cost us money. For those we must capacity test each class of hardware, then factor in observed ratios for cluster sizing.
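
One way to read "factor in observed ratios for cluster sizing" is as an expected-capacity calculation: weight each hardware class's tested capacity by how often you actually get that class, then size the cluster from the blended figure. The sketch below takes that interpretation. The function name and all numbers in the usage note are made up for illustration and are not our measurements.

```python
import math
from typing import Dict


def instances_needed(
    target_rps: float,
    capacity_by_class: Dict[str, float],  # tested requests/sec per instance
    share_by_class: Dict[str, float],     # observed fraction of each class
) -> int:
    """Estimate how many instances to launch to reach a target throughput."""
    if not math.isclose(sum(share_by_class.values()), 1.0):
        raise ValueError("class shares must sum to 1")
    # Expected capacity of one instance when the class is luck of the draw.
    expected_capacity = sum(
        capacity_by_class[name] * share for name, share in share_by_class.items()
    )
    return math.ceil(target_rps / expected_capacity)
```

With hypothetical capacities of 40 requests/sec for fast instances and 25 for slow ones, an even split between the two classes, and a target of 1000 requests/sec, this suggests launching 31 instances. A cluster made up entirely of fast instances would need only 25.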
...

How We Build

I'm excited to announce that in the weeks and months to come, we're going to begin sharing more about our technology, engineering, and operations.  We're eager to share how we've designed and built our service and some of the many lessons we've learned along the way.  But first, to help orient those discussions, I want to give you a glimpse inside just how we build our products.

Application Engineering

We call our core engineering team Application Engineering.  This team is primarily focused on development of any user experience components, our integration container, and all the RESTful services supporting those.  Essentially, anything outward facing that our partners or users may see is developed by this team.

Requirements for the App Engineering team are managed by our product management staff which includes both product managers and partner account managers.

Because these features often have deadlines, or at least timelines that need to be communicated publicly, this team uses Scrum as its primary process.  Sprints are typically 2-3 weeks in length with a set of target stories and a planned release date.  We find that Scrum also allows for enough rigor to implement a solid testing period before the release goes live.  The burndown chart is shared throughout the company and the team holds daily standups including stakeholders to track progress.

Of course, throughout the sprint we regularly perform update releases, sometimes several in a single day.  So we're not strictly bound by the sprint schedule, but it provides a nice cadence to feature development and helps manage expectations with our product team and our partners.

Infrastructure Engineering

Over time we discovered a different category of work that we call Infrastructure Engineering.  These tasks often have more to do with systems than features, things that affect performance, scalability, deployability, or monitorability.  They are not overtly outward facing.  This team includes what most think of as devops, but also includes architecture and application work as well.

Requirements for the Infrastructure Engineering team are primarily managed by our technology team which takes input from engineering and operations.

Because these features don't typically have rigid deadlines, but it is still important to manage and prioritize this work, the team uses Kanban as its primary process.  Our Kanban board is shared with everyone and the team meets weekly to talk about approaches, review impediments, and raise new issues.

Work from the Infrastructure Engineering team rolls into production all the time, but some architectural changes are obviously coupled to App Engineering sprint releases to keep things in sync.

Extreme Flexibility

The trick to all of this is that we're still a completely flat organization and these teams are extremely fluid.  Typically the App Engineering team has about 8 engineers and the Infrastructure Engineering team has about 4 engineers, but this changes all the time.  As a new rush of application oriented work comes through, we'll shift resources towards App Engineering.  As a large amount of infrastructure work is required we'll shift some staff in that direction.

This also means that engineers have the opportunity to drift across the product stack, getting great full stack exposure and changing up their work on a regular basis.  Several of our engineers very intentionally alternate back and forth between application and infrastructure work on a regular basis.

This organization has allowed us to deliver dozens and dozens of releases on-time and on-plan, while continuously evolving our architecture to meet the growing needs of our partners and customers.

Hopefully this will help orient some of what we'll share in the months to come, and if you're interested in joining our team, we're always looking for more smart folks.  Please email us at jobs@viximo.com if you think you can help. ...