Monday, February 4, 2013

Scale-out with Lucene and Azure Table Storage

Initially, Socedo used SQL Azure as its main storage.  But after onboarding a couple dozen customers, we realized SQL Azure no longer met our needs:
- Our data was growing at 5-8 GB a day, and we were fast approaching the 150 GB maximum size of a single Azure database.
- Potentially we could use the recently released sharding feature, but in my opinion it isn't ready for prime time yet: developers still have to write a lot of plumbing code, from auto-partitioning and balancing, to fan-out queries, to middle-tier logic that figures out which partition to query.
- SQL Azure still doesn't support full-text indexing, a feature customers have been requesting for years. What a pity.
- Finally, cost is the killer: SQL Azure costs $1.76 per GB (at the 100 GB database size).

After some research, the Socedo team spent two weeks building a new backend on top of Azure Table Storage and Lucene.  Lucene powers many sites and applications, including the new Twitter search.  The new backend has been running smoothly for a month now, and we're very pleased with the results so far.
- We lowered the storage cost by 25 times, from $1.76 per GB down to $0.07 per GB (locally redundant).
- The maximum data size per table is 100 TB (667 times the 150 GB SQL Azure database size limit!).
- We can easily scale out to onboard new customers by adding new partitions to table storage.
- Lucene lets us score leads dynamically at query time very efficiently (see the sketch after this list).
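
To give a flavor of the query-time scoring, here is a minimal Lucene sketch written against the Lucene 4.x-era Java API. The field names, sample document, and boost values are illustrative assumptions, not our actual lead schema or scoring formula.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LeadSearchSketch {
    public static void main(String[] args) throws Exception {
        // Index a hypothetical "lead" document in an in-memory directory.
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_40, analyzer));
        Document lead = new Document();
        lead.add(new TextField("handle", "@cloud_builder", Field.Store.YES));
        lead.add(new TextField("bio", "startup founder evaluating azure storage options", Field.Store.YES));
        writer.addDocument(lead);
        writer.close();

        // Boosting "bio" matches over "handle" matches happens at query time, so the
        // ranking can change per query/customer without re-indexing anything.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        QueryParser parser = new QueryParser(Version.LUCENE_40, "bio", analyzer);
        Query query = parser.parse("bio:azure^2.0 OR handle:azure");
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("handle") + "  score=" + hit.score);
        }
    }
}
```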

Thursday, October 4, 2012

Backup Options for Your SQL Azure Database

SQL Azure guarantees “Monthly Availability” of 99.9% during a calendar month, but it currently does not back up your database automatically.  Without a proper database backup plan, your business is one click away from disaster.

There are a few options for backing up your database (a small sketch of option 1 follows the list):

1. Copy one Azure database to another http://msdn.microsoft.com/en-us/library/windowsazure/ff951624.aspx
2. Use the BCP command-line utility to back up and restore your SQL Azure data to/from your local disk http://blogs.msdn.com/b/sqlazure/archive/2010/05/21/10014019.aspx
3. SQL Data Sync http://msdn.microsoft.com/en-us/library/windowsazure/hh456371.aspx
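
Option 1 boils down to a single T-SQL statement run against the master database of the destination server. Below is a minimal sketch that issues it over JDBC; the server name, credentials, and database names are placeholders, and this is just one way to send the statement (the management portal or any SQL client works too).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SqlAzureCopyBackup {
    public static void main(String[] args) throws Exception {
        // Placeholder server/credentials -- connect to the master database of the destination server.
        String url = "jdbc:sqlserver://yourserver.database.windows.net:1433;"
                   + "database=master;user=youradmin@yourserver;password=YourPassword;encrypt=true";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Kicks off a server-side, asynchronous copy of the source database.
            stmt.execute("CREATE DATABASE MyDb_Backup_20121004 AS COPY OF MyDb");
            // The copy continues in the background; its progress can be monitored
            // via the sys.dm_database_copies view on the destination server.
        }
    }
}
```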

Friday, September 7, 2012

Display Interactive Twitter Timelines on Your Website

Twitter just launched a new tool called "embedded timelines" that can syndicate any public Twitter timeline to your website with one line of code.

A couple of cool things about this widget are:
  • No coding is needed -- all you need to do is configure the widget, then copy and paste the generated HTML code
  • You can interact with the tweets (show photos/media, reply, retweet, favorite, show more tweets) right within the widget
  • You can display another Twitter user's public timeline instead of your own if you choose to
I played with it and was able to get it integrated with this Blogger site rather quickly. Here are the steps to light this widget up:
  1. Go to https://twitter.com/settings/widgets after signing in to your Twitter account
  2. Click "Create new" button to create a new widget
  3. Configure your widget (Height, Theme, Link color, etc.), then click the "Create widget" button (see pic below)
  4. Copy the HTML code at the bottom right of the screen 
  5. Log on to your Blogger site, click Layout menu item on the left
  6. Click Add a Gadget, select Basics gadget category, scroll down to select HTML/JavaScript
  7. Type your Title, paste the HTML code from step #4, and hit Save
  8. After clicking Save arrangement, your timeline is live on your Blogger site! (A live demo is available at the bottom of this page's sidebar.)
Twitter Embedded Timeline Widget Configuration

If you need to integrate your timeline with other, non-Blogger websites, steps 1-4 are the same.  Just replace steps 5-8 with pasting the generated HTML into your own webpage.

Monday, August 20, 2012

Build a Reliable and Scalable Twitter Streaming Worker Role in Windows Azure

In our early prototype, we used a single worker role instance that connects to the Twitter public streams endpoint, parses the tweets, and persists them to a SQL Azure database.  There were two issues with this approach:

1. Reliability:
Windows Azure requires at least two instances of each role to achieve the 99.95% uptime SLA.  Yet Twitter public streams only allow one standing connection to the public endpoints: connecting to a public stream more than once with the same account credentials causes the oldest connection to be disconnected, and creating multiple accounts to skirt that limitation is a borderline violation of Twitter policy.  Because we had only one worker role instance, we lost 15-30 minutes of streaming data every time Windows Azure upgraded or redeployed the role instance.  This was unacceptable, as users expect and rely on low-latency data.

2. Scalability:
Reading and parsing tweets is an order of magnitude faster than saving them to SQL Azure, so under load the database writes couldn't keep up with the incoming tweets.

The Old Architecture

To address both issues, we made the following architectural changes:

First, we decoupled the worker role into a Streamer and an Importer.  The Streamer reads tweets from the Twitter public streams and puts them in an Azure queue; the Importer reads tweets off the Azure queue and parses them before importing them into the database.  Now we can scale out streaming and importing independently based on their own loads.
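
Below is a simplified sketch of that producer/consumer split. A local BlockingQueue stands in for the Azure queue purely to show the decoupling; the class names, sample payloads, and the one-Streamer/three-Importer setup are assumptions for illustration only.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamerImporterSketch {
    public static void main(String[] args) {
        BlockingQueue<String> tweetQueue = new LinkedBlockingQueue<>();

        // Streamer: reads raw tweets from the streaming connection and enqueues them untouched,
        // so it never blocks on the (much slower) database work.
        Thread streamer = new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                tweetQueue.offer("{\"id\":" + i + ",\"text\":\"sample tweet\"}");
            }
        });

        // Importers: several instances drain the queue, parse, and persist independently,
        // so the slow persistence step can be scaled out without touching the Streamer.
        for (int worker = 0; worker < 3; worker++) {
            Thread importer = new Thread(() -> {
                try {
                    while (true) {
                        String rawTweet = tweetQueue.take();
                        // Parsing and saving to the database would happen here.
                        System.out.println(Thread.currentThread().getName() + " imported " + rawTweet);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            importer.setDaemon(true);
            importer.start();
        }
        streamer.start();
    }
}
```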

Second, we instantiate two Streamer role instances and use heartbeats to coordinate which instance should own the streaming connection to Twitter.

To be more specific (a minimal sketch of this takeover logic follows the list):
1) The role instance that currently owns the Twitter streaming session writes a heartbeat (its instance ID and a timestamp) to an Azure blob at a fixed interval.
2) Each Streamer role instance checks that heartbeat.  If a missing or stale heartbeat is detected, the checking instance takes over the streaming session, writes out its own heartbeat, and spins off a thread that backfills any potentially missed tweets by calling the Twitter REST APIs.  Using the REST APIs in conjunction with the Streaming API for backfilling is one of the best practices recommended by Twitter.
3) If the original owner of the streaming session ever wakes up after missing its heartbeats, it sees that a fresh heartbeat already exists (written by the new owner) and disposes of its own streaming resources.
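
Here is a minimal, plain-Java sketch of that takeover logic. The HeartbeatStore interface is a hypothetical stand-in for the Azure blob that actually holds the heartbeat, and the staleness threshold is an illustrative assumption, not our production value.

```java
import java.time.Duration;
import java.time.Instant;

public class StreamerHeartbeatSketch {

    // Hypothetical abstraction over the heartbeat blob (instance ID + timestamp).
    interface HeartbeatStore {
        Heartbeat read();          // returns null if no heartbeat has been written yet
        void write(Heartbeat hb);
    }

    static class Heartbeat {
        final String instanceId;
        final Instant timestamp;
        Heartbeat(String instanceId, Instant timestamp) {
            this.instanceId = instanceId;
            this.timestamp = timestamp;
        }
    }

    static final Duration STALE_AFTER = Duration.ofMinutes(2);   // assumed staleness threshold

    private final String myInstanceId;
    private final HeartbeatStore store;
    private boolean owningStream = false;

    StreamerHeartbeatSketch(String myInstanceId, HeartbeatStore store) {
        this.myInstanceId = myInstanceId;
        this.store = store;
    }

    // Called on a fixed timer by every Streamer instance.
    void onTimer() {
        Heartbeat latest = store.read();
        boolean stale = latest == null
                || Duration.between(latest.timestamp, Instant.now()).compareTo(STALE_AFTER) > 0;

        if (owningStream) {
            if (latest != null && !latest.instanceId.equals(myInstanceId) && !stale) {
                // Another instance took over while we were asleep: release our resources.
                owningStream = false;
                stopStreaming();
            } else {
                // Still the owner: keep the heartbeat fresh.
                store.write(new Heartbeat(myInstanceId, Instant.now()));
            }
        } else if (stale) {
            // Missing/stale heartbeat: take over the streaming session.
            owningStream = true;
            store.write(new Heartbeat(myInstanceId, Instant.now()));
            startStreaming();
            backfillViaRestApis();   // fill the gap with the Twitter REST APIs
        }
    }

    void startStreaming()      { /* open the Twitter streaming connection */ }
    void stopStreaming()       { /* dispose of the streaming connection */ }
    void backfillViaRestApis() { /* fetch tweets missed during the gap */ }
}
```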

The New Architecture
The new architecture was deployed a couple of months ago and has been running smoothly so far.

Tuesday, August 14, 2012

Wireframing with Balsamiq

After playing with and comparing a few wireframing tools, we settled on Balsamiq.  We've been using it for the past few months and are quite happy with it so far.  Below is a screenshot of one of the sample mockups.

What we like about Balsamiq are:

  • It produces deliberately unpolished, sketch-style mockups that still look professional
  • Ability to link pages and click through them
  • Extensive set of UI controls (out of the box, from the Balsamiq community, or uploaded on your own)
  • Reusable common elements shared across mockups via symbols (a.k.a. templates or master pages)

We use the Balsamiq desktop version during intensive wireframing sessions and the web version for sharing and collaboration.  One thing Balsamiq could do better is integration between the desktop and web versions: currently you have to export all mockups as separate BMML files and upload them one by one to the web, which is a tedious and time-consuming process. It would be nice if we could pull mockups directly from the web into the desktop app and publish them right back when done.

Disclaimer: my company and I are not associated or affiliated with Balsamiq in any way other than being a paying customer.