About Diffbot

Diffbot is a San Francisco Bay Area startup that provides developers with APIs and other tools to understand and extract data from any web page. Diffbot’s APIs can extract the title, author, date, text, images, videos, captions, categories, entities, and other metadata from an article page to enhance readability on mobile applications. In addition, Diffbot provides a programmatic crawler that can be combined with its page analysis APIs to extract and index databases of information from entire websites in real time.
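As an illustration of what consuming such structured data might look like, the following is a minimal sketch that assumes the extracted article arrives as a JSON document. The sample payload, its field names, and the summarize_article helper are hypothetical and are not taken from Diffbot's documentation.

```python
import json

# Hypothetical payload: the field names below are illustrative assumptions,
# not Diffbot's documented response schema.
SAMPLE_RESPONSE = """
{
  "title": "Example headline",
  "author": "A. Writer",
  "date": "2013-01-01",
  "text": "First paragraph. Second paragraph.",
  "images": [{"url": "https://example.com/a.jpg", "caption": "A caption"}]
}
"""


def summarize_article(raw_json):
    """Pull a few display-ready fields out of an extracted-article payload."""
    article = json.loads(raw_json)
    return {
        "title": article.get("title", ""),
        "author": article.get("author", ""),
        "date": article.get("date", ""),
        # Word count is a rough proxy for reading time on a mobile screen.
        "word_count": len(article.get("text", "").split()),
        # Skip image entries that carry no URL.
        "image_urls": [img["url"] for img in article.get("images", []) if "url" in img],
    }


if __name__ == "__main__":
    print(summarize_article(SAMPLE_RESPONSE))
```

A mobile client could render the returned dictionary directly, without parsing any of the original page's HTML.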

Diffbot enables software companies of all sizes—whether it’s a large company wanting to mine information from an entire website or a small, product-focused team with limited resources—to access nearly any page on the web as a source of structured data with a simple API call. Instapaper, Digg, AOL, Salesforce, CBS Interactive, and The New York Times use Diffbot’s APIs to power their content engines and analyze competitors. Large firms such as Salesforce’s Radian6 use Diffbot to monitor social media conversations while startups such as FindTheBest use Diffbot to check product pricing information on the web.

Diffbot, located in Palo Alto, CA, was founded by Mike Tung, then a graduate student in Artificial Intelligence at Stanford University, and was the first company to take part in StartX, Stanford’s on-campus accelerator.

The Challenge

Diffbot's technology applies computer vision and natural language processing algorithms to web pages, executing all of the styling, scripting, and layout needed to produce visual information. The processes are CPU-intensive and users tend to submit content in bursts from news streams, social media channels, and other sources. As a result, Diffbot has to be able to scale to handle frequent, real-time spikes in demand.

The company runs its own data center and had been using custom software to handle deployment and scaling. "When we first started out as a small company, running the operations of our data center consumed an enormous amount of my time and attention," says founder and CEO Mike Tung. "In the startup stage, focus is critical: anything that distracts you from delivering your company's core and unique value can be fatal to your venture's success. As we started to ramp up API call volumes, it was clear that we needed a better strategy for scaling our computing resources. Diffbot handles hundreds of millions of API calls per month, but as a startup, it was not capital efficient to build out a large-scale on-premises infrastructure."

Why Amazon Web Services

Diffbot considered a variety of solutions, but chose Amazon Web Services (AWS) because of the scalability of the platform and the ability to use Amazon EC2 Spot Instances as a cost-effective way to purchase compute capacity. Diffbot designed a solution that integrates Amazon Elastic Compute Cloud (Amazon EC2) instances with its existing on-premises resources. Diffbot uses the compute-optimized c1.xlarge Amazon EC2 instance type for its most compute-intensive machine learning workloads. The high core count of this instance type means that multi-threaded code can share static objects in memory more efficiently. The higher clock speeds mean that latency can be reduced.

By switching from Berkeley Internet Name Domain (BIND) DNS servers to Amazon Route 53, a globally distributed DNS service, Diffbot can take advantage of the geographical distribution and the higher hit rate of a shared cache, removing a single point of failure and lowering the average round-trip latency. Diffbot uses Amazon Machine Images (AMIs) to define images of worker roles, which greatly simplifies deployment and rollback, and uses Amazon Simple Storage Service (Amazon S3) to store the AMIs.

Diffbot APIs analyze a web page and return a JavaScript Object Notation (JSON) object in real time. The on-demand nature of some of its APIs means that traffic can spike throughout the day as new web pages are created across the web. Diffbot monitors resources with Amazon CloudWatch and uses Auto Scaling with custom predictive logic to scale up its analysis fleet during periods of high demand. This allows Diffbot to maintain high performance regardless of the amount of traffic it receives.
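The case study does not describe Diffbot's predictive logic, so the following is only a minimal sketch of one way such logic could work: extrapolate a linear trend from recent request-rate samples and size the fleet for the forecast rather than the current load. The PredictiveScaler class, its parameters, and the sample numbers are all hypothetical.

```python
import math
from collections import deque


class PredictiveScaler:
    """Toy capacity planner (an assumption, not Diffbot's actual method)."""

    def __init__(self, per_instance_capacity, window=3, headroom=1.2, min_instances=1):
        self.per_instance_capacity = per_instance_capacity  # requests/min one instance can serve
        self.headroom = headroom                            # safety multiplier on the forecast
        self.min_instances = min_instances
        self.samples = deque(maxlen=window)                 # most recent request-rate samples

    def observe(self, requests_per_minute):
        """Record one metric sample, e.g. a value read from a monitoring service."""
        self.samples.append(requests_per_minute)

    def forecast(self):
        """Extrapolate the next interval's load from the average recent change."""
        if not self.samples:
            return 0.0
        latest = self.samples[-1]
        if len(self.samples) < 2:
            return float(latest)
        trend = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        return max(0.0, latest + trend)

    def desired_instances(self):
        """Instances needed to serve the forecast load with headroom."""
        needed = self.forecast() * self.headroom / self.per_instance_capacity
        return max(self.min_instances, math.ceil(needed))


if __name__ == "__main__":
    scaler = PredictiveScaler(per_instance_capacity=100)
    for rate in (100, 200, 300):  # a burst that is still growing
        scaler.observe(rate)
    print(scaler.desired_instances())
```

The value returned by desired_instances could then be applied as the desired capacity of an Auto Scaling group. Sizing for the forecast instead of the latest sample is what lets capacity arrive before a burst peaks rather than after it.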

The Benefits

Diffbot processes hundreds of millions of web pages per month, and using Amazon EC2 Spot Instances lets the company flexibly prioritize and shift computing resources, depending on the level of requests. “Using Amazon EC2 Spot Instances helps Diffbot realize a 70 percent cost savings while the flexible prioritization increases reliability,” says Tung.
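To make the arithmetic behind a savings figure like this concrete, the sketch below compares the cost of the same number of instance-hours at two hourly rates. The rates and hours are hypothetical values chosen only so that the result works out to 70 percent; they are not actual AWS prices or Diffbot usage figures.

```python
def fleet_cost(hourly_rate, instance_hours):
    """Total cost of running instance_hours at a given hourly rate."""
    return hourly_rate * instance_hours


def savings_rate(on_demand_cost, spot_cost):
    """Fraction of the on-demand cost saved by paying the Spot price instead."""
    if on_demand_cost <= 0:
        raise ValueError("on_demand_cost must be positive")
    return (on_demand_cost - spot_cost) / on_demand_cost


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    hours = 1000
    on_demand = fleet_cost(0.50, hours)  # assumed on-demand rate per hour
    spot = fleet_cost(0.15, hours)       # assumed average Spot rate per hour
    print(f"savings: {savings_rate(on_demand, spot):.0%}")
```

Because both costs scale with the same number of instance-hours, the savings rate depends only on the ratio of the two hourly prices.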

By running on the AWS Cloud, Diffbot is able to focus resources on developing cutting-edge machine learning algorithms, rather than worrying about hardware failure. Tung estimates that Diffbot can scale its infrastructure as needed in five minutes. “Utilizing AWS allows Diffbot to run on the same kind of world-class infrastructure that big companies use to operate their businesses. The resulting level of reliability, performance, and scale gained as a result would have been impossible to achieve by building out our own servers.”