Topsy, founded in 2006, is a social media analytics and search company based in San Francisco, California. The company takes data from sources like Twitter, Google Plus, and LiveJournal and analyzes it for influence, sentiment, and location. Topsy delivers insights to marketing, news, entertainment, and financial organizations; its customers include the National Football League (NFL), The New York Times, Yahoo, Warner Brothers, AOL, Geico, and Nielsen.
Topsy has created a full-scale index of the public social web, indexing, analyzing, and ranking content from a multitude of sources. Customers rely on Topsy to inform them about breaking news, deliver insights about changing brand perceptions, make sense of market trends, and advise them on optimizing their content and media strategies.
To deliver its insights, the company analyzes and stores content gathered from social media sources, including Twitter. Tweets may be small—the maximum length of a tweet is 140 characters—but once you start mining them for influence, sentiment and location, and expanding the URLs to catalog them appropriately, their size swells. Topsy currently manages a database of 125 TB of data, and every day, it grows by another 450 million tweets.
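As a rough illustration of why the enriched data is so much larger than the tweets themselves, the following sketch estimates an upper bound on the raw tweet text arriving each day. It uses only the two figures quoted above (450 million tweets per day, 140 characters maximum). The one-byte-per-character encoding and the function names are assumptions made for this example, not details from Topsy.

```python
# Figures quoted in the text above.
TWEETS_PER_DAY = 450_000_000
MAX_TWEET_CHARS = 140

# Assumption for illustration only: one byte per character.
# Real tweets may use multi-byte encodings and are often shorter than the maximum.
BYTES_PER_CHAR = 1


def raw_text_bytes_per_day(tweets=TWEETS_PER_DAY,
                           chars=MAX_TWEET_CHARS,
                           bytes_per_char=BYTES_PER_CHAR):
    """Upper bound on the raw tweet text received per day, in bytes."""
    return tweets * chars * bytes_per_char


def to_gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


daily = raw_text_bytes_per_day()
print(f"Upper bound on raw tweet text per day: {to_gigabytes(daily):.0f} GB")
```

Under these assumptions the raw text is at most about 63 GB per day. Whatever Topsy stores beyond that comes from the added influence, sentiment, and location analysis and from the expanded URLs.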
Storing historical data from online sources enables the company to reprocess data as necessary—when a disruptive technology prompts the need to reprocess, for example, or when Topsy adjusts its algorithms to more precisely track trends. As the amount of data from social media sources has grown, so has the company’s need for flexible computation and cost-effective storage.
Why Amazon Web Services
Since its inception, the company has used Amazon Web Services (AWS), hosting its corporate website, DNS, and email on the AWS Cloud. When it came time to increase Topsy’s computational power and storage capacity, the company found that the AWS Cloud could provide the on-demand resources and long-term storage it needed. “When we need a lot of resources for a specific period of time, that’s our AWS sweet spot,” says David Berk, Vice President of Operations.
Topsy uses Amazon Elastic Compute Cloud (Amazon EC2) for increased computing power on demand, leveraging Spot Instances to secure lower prices. The company continually revises its algorithms to assess current and past trends, testing the algorithms against the full Topsy index to ensure accurate, actionable results.
The company also uses Amazon Simple Storage Service (Amazon S3) and Amazon Glacier for long-term storage of its 90 TB of data. Topsy employs a hybrid strategy, maintaining an on-premises infrastructure to index its data and using AWS for storage and reprocessing.
Figure 1: Topsy Architecture Diagram
The company uses Apache Hadoop-based technology as its primary platform for analysis. The index is held in memory in Topsy’s on-premises infrastructure.
Berk cites the flexibility and cost of the AWS Cloud as a differentiator for the company. “Using AWS on demand gives us the computational power to uncover valuable new insights from our data,” Berk says. “Using Spot Instances gives us more instances for the same money. We can instantiate instances on the fly when we need them, and terminate them when we don't.”
AWS lets the company experiment with its algorithms, giving Topsy a competitive advantage. “We can make a change to our algorithms in our test bed and quickly compare it to what’s running in production,” Berk says. “Amazon EC2 and Spot Instances allow us to very intelligently refine our calculations and test them quickly. That’s a capability we wouldn’t have if we had to run it on our own on-premises system.”
The ability to store massive data sets is another differentiator for Topsy. “Without Amazon Glacier, we wouldn’t be able to store so much historical data,” Berk says. “Our customers come to us for insights about how their brands are doing, and we can tell them not just how they’re doing today, but how their initiatives stack up against what they were doing last year, or in 2007. AWS is an essential component of our strategy.”