About HG Data
HG Data is a global leader in competitive intelligence on installed technologies for the world’s largest technology companies, fast-growing start-ups, and innovative OEM partners. HG Data indexes billions of unstructured documents across the Internet, the archived Web, and offline resources – including B2B social media, case studies, press releases, blog posts, government documents, content libraries, technical support forums, website source code, job postings, and resumes – to produce a detailed, accurate census of the B2B technology installations in use at companies globally.
HG Data has built one of the world’s largest global B2B databases of installed technologies, with more accuracy and greater detail than previously possible. HG Data allows its clients to target companies by installed technology for market analysis, competitive displacement, predictive modeling, marketing campaigns, client retention initiatives, and sales playbooks. Founded in 2010, HG Data has offices in Santa Barbara, California and Sunnyvale, California.
HG Data’s founders started the company after selling their existing company, NOZA, a philanthropy database service, to a fundraising software firm. At the time, the company was using a co-location service. “We were moving toward cloud services even before we started HG Data,” says Craig Harris, Co-founder and CEO. “While the computers at the co-location service were fast, it was a capital investment to buy a new machine, which could take up to two weeks to install. If there was a problem, we had to call technical support at the facility. Someone there would have to investigate the issue and call us back. It could take a while.” Once the sale was completed, the founders of HG Data had to surrender the existing infrastructure and needed a solution to replace it.
Why Amazon Web Services
HG Data had to make a decision quickly. Victor Moreira, the company’s Chief Technology Officer (CTO), had already used Amazon Web Services (AWS). “I took advantage of the AWS Free Tier to work on a personal project, so I was already familiar with AWS services and features. We looked at other cloud service providers, but my experience helped us decide. We’ve been on AWS since day one.”
Processing billions of documents for business intelligence
The core of HG Data’s business lies in gathering raw documents, which it processes and delivers as a feed or flat file to customers. The data platform uses proprietary natural language algorithms to process the documents. The algorithms have intelligence built in to identify appropriate language and syntax. For example, if a document is a job description for a global sales position that requires experience with a customer relationship management (CRM) system, the platform’s algorithms can distinguish between “a global salesforce using a CRM” and “the Salesforce CRM.”
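HG Data’s actual algorithms are proprietary, but the salesforce-versus-Salesforce distinction above can be illustrated with a toy heuristic based on capitalization. This sketch is purely hypothetical and far simpler than real entity disambiguation:

```python
def classify_mention(sentence):
    """Toy stand-in for HG Data's proprietary disambiguation:
    treat a capitalized 'Salesforce' as the product name and a
    lowercase 'salesforce' as the generic noun for a sales team."""
    for token in sentence.split():
        word = token.strip(".,;:!?\"'")
        if word.lower() == "salesforce":
            return "product" if word == "Salesforce" else "generic"
    return "no-mention"

print(classify_mention("a global salesforce using a CRM"))      # generic
print(classify_mention("experience with the Salesforce CRM"))   # product
```

A production system would also need context (surrounding words, known product catalogs, sentence-level parsing) rather than casing alone, which is exactly why the source describes the algorithms as having syntax-aware intelligence built in.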
The company collects documents from private sources and receives the data in batch loads. “We typically buy around one billion documents on a quarterly basis,” Moreira says. “We collect around 100,000 documents daily through web crawls. This translates to about 10 TB of raw data each month, which we store in Amazon Simple Storage Service (Amazon S3) buckets.” The original size of the batch load increases as it moves through the processing pipeline because HG Data keeps copies for mirroring and duplication. At the end of the process, HG Data deletes the extra copies, reducing the size back to 10 TB.
The company uses Amazon Elastic MapReduce (Amazon EMR) to de-dupe the documents and put them in a uniform JSON format for analysis. “Before Amazon EMR, we couldn’t de-dupe one billion documents in a linear fashion,” Moreira says. “So we took advantage of Hadoop and its scale-out paradigm to work on the documents in parallel, and we used Amazon EMR to make running Hadoop clusters easy, and now we can de-dupe 10+ billion documents.” Using Amazon EMR also allows HG Data’s developers to focus on code instead of managing Hadoop clusters. “When I first started working with Hadoop, I spent all my time getting clusters ready and nodes spun up,” Moreira says. “I haven’t touched a Hadoop cluster since I started using Amazon EMR. Now we can spend our time designing the algorithms.”
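The de-dupe step described above boils down to keying each document by a content hash and keeping one copy per key; on Hadoop, the hashing is the map phase and the keep-one step is the reduce phase. A minimal single-machine sketch (not HG Data’s actual code) of the same idea:

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates by SHA-256 content hash. On Hadoop/EMR the
    same logic scales out: map emits (hash, doc) pairs, and each reducer
    keeps the first document it sees for a given hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(len(dedupe(["press release", "job posting", "press release"])))  # 2
```

Hashing is what makes the parallel version possible: because identical documents always hash to the same key, Hadoop’s shuffle routes all copies to the same reducer, so no single node ever needs to compare all ten billion documents against each other.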
After processing, the data is sent to another Amazon S3 bucket and then to a MongoDB NoSQL database. Elasticsearch, an open-source search and analytics engine, indexes data from the database for full-text search capability. After the natural language and machine learning algorithms have run, the final data is stored on Amazon RDS for MySQL, which is the delivery mechanism to the customer. Most of the architecture runs in the US West (Oregon) Region. Figure 1 shows HG Data’s architecture on AWS.
Figure 1. HG Data Architecture on AWS
Using AWS APIs to improve operational efficiency and save money
HG Data initially designed the data platform using the .NET Framework and C# on an extra-large Amazon Elastic Compute Cloud (Amazon EC2) instance. It wasn’t as fast as the company needed, so the team redesigned it using a more modular approach with several purpose-specific tools, including GlassFish, MongoDB, Elasticsearch, and Ember.js. Arnold David Gowans, Lead System Architect, says, “There’s no way we would be where we are today without AWS APIs. We had a team of three engineers who were mostly programmers and problem solvers, not system administrators. Using AWS APIs alleviates most of the administrative overhead and greatly increases the performance of the product. I can focus on the problem that I’m trying to solve instead of system administration of the tool.” The team uses APIs to launch instances automatically, run the web crawler for a designated period of time, collect the data, and then shut down the instance.
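The launch-crawl-shutdown cycle described above can be sketched against the boto3 EC2 client shape. This is an illustrative sketch, not HG Data’s code: the AMI ID and instance type are placeholders, and a stub client is included so the flow can be exercised without AWS credentials.

```python
import time

def run_crawl_cycle(ec2, ami_id, crawl_seconds, instance_type="m3.large"):
    """Launch an instance, let the crawler run for a designated window,
    then terminate the instance. `ec2` is any client exposing
    run_instances/terminate_instances in the boto3 shape, e.g.
    boto3.client("ec2", region_name="us-west-2")."""
    resp = ec2.run_instances(ImageId=ami_id, InstanceType=instance_type,
                             MinCount=1, MaxCount=1)
    instance_id = resp["Instances"][0]["InstanceId"]
    try:
        # In practice the crawler would start via the AMI or user data;
        # here we simply wait out the crawl window.
        time.sleep(crawl_seconds)
    finally:
        ec2.terminate_instances(InstanceIds=[instance_id])
    return instance_id

class _StubEC2:
    """Dry-run stand-in for boto3's EC2 client (no AWS account needed)."""
    def run_instances(self, **kwargs):
        return {"Instances": [{"InstanceId": "i-0abc"}]}
    def terminate_instances(self, InstanceIds):
        self.terminated = InstanceIds

stub = _StubEC2()
launched = run_crawl_cycle(stub, "ami-placeholder", crawl_seconds=0)
```

The try/finally guarantees the instance is terminated even if the crawl window is interrupted, which is what keeps an automated fleet from leaking billable instances.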
Most of the processes run at fixed times. The engineers created APIs that work with the Amazon EC2 APIs to spin up instances dynamically based on CPU and memory needs. Gowans then wrote an API to examine Amazon EC2 Spot Instance price history, and determined that using 20 Amazon EC2 Spot Instances to run Amazon EMR would save the company money. The API calculates the ideal bid price based on the job requirements and length, saving up to 70 percent off the On-Demand price. HG Data runs other parts of the environment, including its internal and external websites, on Amazon EC2 Reserved Instances.
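A bid calculation like the one Gowans built might look roughly like the following. This is a simplified, hypothetical version: the real API weighed job requirements and length, while this sketch just bids the recent price peak plus an assumed headroom margin, capped at the On-Demand price.

```python
def suggest_spot_bid(price_history, on_demand_price, headroom=1.2):
    """Suggest a Spot bid from recent price history: recent peak plus a
    safety margin (headroom=1.2 is an assumed value, not HG Data's),
    never exceeding the On-Demand price."""
    bid = min(max(price_history) * headroom, on_demand_price)
    return round(bid, 4)

def spot_savings(bid, on_demand_price):
    """Worst-case fraction saved versus On-Demand if the job runs at the
    bid price (actual Spot charges follow the market price, usually lower)."""
    return round(1 - bid / on_demand_price, 2)

bid = suggest_spot_bid([0.030, 0.035, 0.040], on_demand_price=0.266)
print(bid, spot_savings(bid, 0.266))
```

Bidding above the recent peak reduces the chance of mid-job interruption, while the On-Demand cap ensures the Spot job can never cost more than simply running On-Demand, which is what makes this kind of automated bidding a pure savings play for fixed-length batch work like EMR runs.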
“Using AWS, we have faster time to market with lower capital investment because we don’t have to invest in hardware,” says Moreira. “If we had to purchase the hardware now, it would be around $100,000. Since our data is growing, we would need to invest more every year. So, I could see us spending $100,000 yearly for the next 3 years just in CAPEX for our core process. Amortized yearly, and taking into consideration the AWS monthly cost, we’re saving about 50% in hardware costs.”
HG Data measures success on speed to market: how quickly the company can acquire data, process it, and make it deliverable in the right file format for customers. “We’re using APIs to launch a machine or crawl for data with a single click. It takes a couple of days at most instead of three months with our old system.”
“AWS enabled us to become a legitimate big data operation,” says Harris. “Ten minutes after a large tech company publicly announced an acquisition, we got a call from a customer who wanted to send a targeted email message in the next 12 hours. We didn’t have the raw data, so we had to crawl hundreds of thousands of sources of data, process it, and ship the information to the customer. We wouldn’t have made the 12-hour deadline without AWS. With our old system, it would have taken two to three weeks. Now we can run as fast as we can write the code and add it to our machine learning platform. AWS enables us to not let speed of processing and delivery be a gating factor to upscaling our business. That’s priceless to us.”