Monday, August 20, 2012

Build a Reliable and Scalable Twitter Streaming Worker Role in Windows Azure

In our early prototype, we used a single worker role instance that connects to the Twitter public streams endpoint, parses the tweets, and persists them to a SQL Azure database.  There were two issues with this approach:

1. Reliability:
Windows Azure requires at least two instances per role to qualify for the 99.95% uptime SLA.  Yet Twitter allows only one standing connection per account to the public streams endpoints: connecting to a public stream more than once with the same account credentials causes the oldest connection to be disconnected, and creating multiple accounts to skirt that limitation borders on violating Twitter's policy.  Because we had only one worker role instance, we lost 15-30 minutes of streaming data every time Windows Azure upgraded or redeployed role instances.  This was unacceptable because users expect and rely on low-latency data.

2. Scalability:
Reading and parsing tweets is an order of magnitude faster than saving them to SQL Azure.  Under load, the SQL Azure writes could not keep up with the incoming tweets.

The Old Architecture

To address both issues, we made the following architectural changes:

First, decouple the worker role into a Streamer and an Importer.  The Streamer reads tweets from the Twitter public streams and puts them in an Azure queue; the Importer reads tweets off the Azure queue and parses them before importing them into the database.  Now we can scale out streaming and importing independently, based on their respective loads.
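The following is a rough sketch of that producer/consumer split, written against the current azure-storage-queue Python SDK rather than the 2012 .NET SDK we actually used; the connection string, queue name, and save_tweet_to_sql helper are hypothetical placeholders.

    # Rough sketch of the Streamer/Importer split (assumed Python SDK; original roles were .NET).
    import json
    from azure.storage.queue import QueueClient

    CONNECTION_STRING = "<storage-connection-string>"   # placeholder
    QUEUE_NAME = "tweets"                                # placeholder

    def save_tweet_to_sql(tweet):
        pass  # placeholder for the (slow) SQL Azure insert

    def streamer(stream):
        """Read raw tweets from the standing Twitter connection and enqueue them."""
        queue = QueueClient.from_connection_string(CONNECTION_STRING, QUEUE_NAME)
        for raw_tweet in stream:              # raw JSON lines from the public stream
            queue.send_message(raw_tweet)     # enqueue without parsing, so the reader stays fast

    def importer():
        """Drain the queue, parse each tweet, and persist it to the database."""
        queue = QueueClient.from_connection_string(CONNECTION_STRING, QUEUE_NAME)
        while True:
            for msg in queue.receive_messages(messages_per_page=32, visibility_timeout=60):
                save_tweet_to_sql(json.loads(msg.content))
                queue.delete_message(msg)     # delete only after a successful import

Because the queue buffers bursts, several Importer instances can drain it in parallel while a single Streamer keeps the standing connection to Twitter.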

Second, we instantiate two Streamer role instances and use heartbeats to coordinate which instance owns the streaming connection to Twitter.

To be more specific (a sketch of this coordination loop follows the list):
1) the role instance that currently owns the Twitter streaming session writes a heartbeat (its instance ID and a timestamp) to an Azure blob at a fixed interval.
2) every Streamer role instance checks that heartbeat.  If it detects a missing heartbeat, it takes over the streaming session, writes out its own heartbeat, and spins off a thread that backfills any potentially missed tweets by calling the Twitter REST APIs.  Using the REST APIs in conjunction with the Streaming API for backfill is one of the best practices recommended by Twitter.
3) if the original owner of the streaming session ever wakes up after missing its heartbeats, it finds a fresh heartbeat written by the new owner, so it does not take over again and instead disposes of its own streaming resources.
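The loop below sketches that protocol, again assuming the current azure-storage-blob Python SDK instead of the .NET SDK we used; the interval and timeout values, the blob and container names, and the start_streaming, stop_streaming, and backfill_via_rest_api helpers are hypothetical placeholders, and the heartbeat blob is assumed to already exist.

    # Rough sketch of the heartbeat coordination loop (assumed Python SDK and names).
    import json
    import threading
    import time
    from datetime import datetime, timezone
    from azure.storage.blob import BlobClient

    HEARTBEAT_INTERVAL = 30   # seconds between heartbeat writes (assumed)
    HEARTBEAT_TIMEOUT = 90    # treat the owner as dead after this many seconds (assumed)

    blob = BlobClient.from_connection_string("<storage-connection-string>",
                                             container_name="control",
                                             blob_name="streamer-heartbeat")

    def start_streaming():
        pass  # placeholder: open the Twitter streaming connection

    def stop_streaming():
        pass  # placeholder: dispose the streaming resources

    def backfill_via_rest_api(gap_seconds):
        pass  # placeholder: fetch tweets missed during the gap via the REST APIs

    def write_heartbeat(instance_id):
        payload = json.dumps({"owner": instance_id,
                              "ts": datetime.now(timezone.utc).isoformat()})
        blob.upload_blob(payload, overwrite=True)

    def read_heartbeat():
        data = json.loads(blob.download_blob().readall())
        age = (datetime.now(timezone.utc)
               - datetime.fromisoformat(data["ts"])).total_seconds()
        return data["owner"], age

    def run(instance_id):
        owns_stream = False
        while True:
            owner, age = read_heartbeat()
            if owns_stream and owner == instance_id:
                write_heartbeat(instance_id)                    # step 1: refresh our heartbeat
            elif owns_stream:
                owns_stream = False                             # step 3: a newer owner exists,
                stop_streaming()                                # so release our resources
            elif age > HEARTBEAT_TIMEOUT:
                owns_stream = True                              # step 2: heartbeat missing, take over
                write_heartbeat(instance_id)
                start_streaming()
                threading.Thread(target=backfill_via_rest_api,  # backfill the gap via REST
                                 args=(age,), daemon=True).start()
            time.sleep(HEARTBEAT_INTERVAL)

In this sketch the timeout is a few multiples of the write interval, so a single delayed heartbeat write does not trigger a spurious takeover.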

The New Architecture
The new architecture was deployed a couple of months ago and has been running smoothly so far.
