Razorfish, a digital advertising and marketing firm, segments users and customers based on the collection and analysis of non-personally identifiable data from browsing sessions. Doing so requires applying data mining methods across historical click streams to identify effective segmentation and categorization algorithms and techniques. These click streams are generated when a visitor navigates a web site or catalog, leaving behind patterns that can indicate a user’s interests. Algorithms are then implemented on systems that can batch execute at the appropriate scale against current data sets ranging in size from dozens of Gigabytes to Terabytes. The algorithms are also customized on a client-by-client basis to observe online/offline sales and customer loyalty data. Results of the analysis are loaded into ad-serving and cross-selling systems that in turn deliver the segmentation results in real time.
A common issue Razorfish has found with customer segmentation is the need to process gigantic click stream data sets. These large data sets are often the result of holiday shopping traffic on a retail website, or sudden dramatic growth on the data network of a media or social networking site. Building in-house infrastructure to analyze these click stream datasets requires investment in expensive “headroom” to handle peak demand. Without the expensive computing resources, Razorfish risks losing clients that require Razorfish to have sufficient resources at hand during critical moments. In addition, applications that can’t scale to handle increasingly large datasets can cause delays in identifying and applying algorithms that could drive additional revenue. As the sample data set grows (i.e. more users, more pages, more clicks), fewer applications are available that can handle the load and provide a timely response. Meanwhile, as the number of clients that utilize targeted advertising grows, access to on-demand compute and storage resources becomes a requirement. It was thus imperative for Razorfish to implement customer segmentation algorithms in a way that could be applied and executed independently of the scale of the incoming data and supporting infrastructure.
Prior to implementing Amazon Web Services (AWS), Razorfish relied on a traditional hosting environment that utilized high-cost SAN equipment for storage, a proprietary distributed log processing cluster of 30 servers, and several high-end SQL servers. In preparation for the 2009 holiday season, demand for targeted advertising increased. To support this need, Razorfish faced a potential cost of over $500,000 in additional hardware expenses, a procurement time frame of about two months, and the need for an additional senior operations/database administrator. Furthermore, due to downstream dependencies, they needed their daily processing cycle to complete within 18 hours. However, given the increased data volume, Razorfish expected their processing cycle to extend past two days for each run even after the potential investment in human and computing resources.
Why Amazon Web Services
To deal with the combination of huge datasets and custom segmentation targeting activities, coupled with price sensitive clients, Razorfish decided to move away from their rigid data infrastructure status quo. This migration helped Razorfish process vast amounts of data to handle the need for rapid scaling at both the application and infrastructure levels. Razorfish selected Ad Serving integration, AWS, Amazon Elastic MapReduce (a hosted Apache Hadoop service), Cascading, and a variety of chosen applications to power their targeted advertising system based on these benefits:
Efficient: Elastic infrastructure from AWS allows capacity to be provisioned as needed based on load, reducing cost and the risk of processing delays. Amazon Elastic MapReduce and Cascading lets Razorfish focus on application development without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.
Ease of integration: Amazon Elastic MapReduce with Cascading allows data processing in the cloud without any changes to the underlying algorithms.
Flexible: Hadoop with Cascading is flexible enough to allow “agile” implementation and unit testing of sophisticated algorithms.
Adaptable: Cascading simplifies the integration of Hadoop with external ad systems.
Scalable: AWS infrastructure helps Razorfish reliably store and process huge (Petabytes) data sets.
The AWS elastic infrastructure platform allows Razorfish to manage wide variability in load by provisioning and removing capacity as needed. Mark Taylor, Program Director at Razorfish, said, “With our implementation of Amazon Elastic MapReduce and Cascading, there was no upfront investment in hardware, no hardware procurement delay, and no additional operations staff was hired. We completed development and testing of our first client project in six weeks. Our process is completely automated. Total cost of the infrastructure averages around $13,000 per month. Because of the richness of the algorithm and the flexibility of the platform to support it at scale, our first client campaign experienced a 500% increase in their return on ad spend from a similar campaign a year before.”