Case Study
Hortonworks to EMR: A migration enabling Big Data analytics at pace


The Media and Entertainment (M&E) domain is more crowded than ever. Wider access to high-speed internet has seen OTT streaming services emerge as strong contenders to traditional media studios. In this war for eyeballs, actionable business insights derived from big data are the most crucial weapon of all. Big data processing technologies like Hadoop therefore play an indispensable part in the technology stacks of M&E enterprises like our client, an entertainment conglomerate founded in 1923 and headquartered in Burbank, California, USA.

A customer insights application powered by Hortonworks was built for the client’s Data Science team. For over a year, the application was managed by Agilisium, while the client’s Data Engineering team supported and served requests to develop new queries for analysis.

This application was starting to show latency. Agilisium's ongoing relationship with the client and its data analytics expertise enabled it to proactively recommend one significant change that cut the client's data-to-insight time from days (on Hortonworks) to hours.

The Challenge

The primary change recommended by Agilisium's experts was that the client migrate from their existing Hortonworks data platform to a cloud service or product that better suited their current needs. Hortonworks was failing to meet the client's needs for three main reasons:

  • Average utilization hovered around 30%, even with both the Data Science and Data Engineering teams running data processing jobs on Hortonworks, but rose to 80% to 90% during month-end activities. Because the platform had to be sized for that peak, the client invariably paid peak-utilization fees.
  • There was a high volume of service requests from both the Data Science and Data Engineering teams to the DevOps team. These requests were operational in nature, and DevOps turnaround time (TAT) ran from a few days at minimum to as long as a few weeks.
  • This led to increased overhead costs, as two full-time resources were required to spin up clusters, monitor jobs and ensure that all teams followed DevOps processes while pushing jobs onto Hortonworks.
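
To make the first point concrete, here is a rough back-of-the-envelope sketch in Python. The cluster size, the hourly rate and the split of 27 normal days to 3 month-end days are illustrative assumptions, not figures from the engagement; only the 30% and 90% utilization levels echo the numbers above. It also assumes the same per-node rate for both models, which real pricing would not guarantee.

```python
# Hypothetical inputs: none of these come from the client's actual bills.
PEAK_NODES = 20          # nodes needed to survive the month-end peak (assumed)
RATE_PER_NODE_HOUR = 1   # cost units per node-hour, same for both models (assumed)
NORMAL_DAYS = 27         # days at roughly 30% utilization (assumed split)
MONTH_END_DAYS = 3       # days at roughly 90% utilization (assumed split)


def nodes_needed(utilization_pct: int, peak_nodes: int) -> int:
    """Nodes required for a utilization level, rounded up (integer math)."""
    return -(-utilization_pct * peak_nodes // 100)


# Hourly utilization profile for one 30-day month, in percent.
profile = [30] * (NORMAL_DAYS * 24) + [90] * (MONTH_END_DAYS * 24)

# Fixed model: the full peak-sized cluster runs every hour of the month.
fixed_node_hours = PEAK_NODES * len(profile)

# Elastic model: only the nodes needed in each hour are paid for.
elastic_node_hours = sum(nodes_needed(pct, PEAK_NODES) for pct in profile)

fixed_cost = fixed_node_hours * RATE_PER_NODE_HOUR
elastic_cost = elastic_node_hours * RATE_PER_NODE_HOUR
saving_pct = 100 * (fixed_cost - elastic_cost) / fixed_cost

print(f"fixed:   {fixed_cost} cost units")
print(f"elastic: {elastic_cost} cost units")
print(f"saving:  {saving_pct:.0f}%")
```

Under these assumptions the peak-sized cluster is billed for 14,400 node-hours, while a cluster that follows demand is billed for 5,184, which is the gap that an elastic, pay-as-you-go platform is meant to close.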

Consequently, it was a challenge for the client to realize their ROI from Hortonworks. In addition, Agilisium predicted that the client's data processing needs would only increase in number and complexity over the coming quarters. When the findings were presented, the client appreciated Agilisium's proactive recommendations and asked Agilisium to propose a best-fit solution.

Our Solution

Agilisium's team of experts tackled the challenge systematically. First, they built comparative POCs for the client's use case on three hand-picked technologies (Databricks, AWS EMR and Qubole), all in just under 12 weeks. Second, the POCs were presented to the client, who chose EMR for its elastic auto-scaling compute power, pay-as-you-go pricing, lack of licensing fees, and easy-to-use management console, which also addressed the issue of high overhead.

Third, the migration looked simple on paper, since Hortonworks and EMR are built on the same two open-source technologies (Hadoop and Spark), but in reality the two platforms differ significantly. Each data processing job on Hortonworks therefore had to be carefully refactored for EMR. The client had stipulated that the migration be executed by their in-house resources, so the client's team worked closely with Agilisium and leveraged its expertise in both products to the hilt, which eased the migration.

Finally, while the client's team handled the migration of data processing jobs, Agilisium worked on ensuring that EMR's connection to the rest of the architecture was stable and that the platform was user friendly. This involved:

  • Simplifying the complex DevOps & CI/CD processes developed originally for Hortonworks to match EMR’s features.
  • Moving data storage from HDFS to S3, dramatically bringing down costs.
  • Using tools like Jenkins, Ansible, Terraform and Airflow to automate the jobs flowing through the EMR platform, instead of complicating the architecture with point-to-point integration.
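
As an illustration of what refactoring a job for EMR and S3 can involve, the sketch below rewrites an HDFS location into an S3 URI and wraps a Spark job in the step structure that the EMR API accepts. It is a minimal sketch, not the client's actual code: the helper names `hdfs_to_s3` and `spark_step`, the bucket, the prefix and all paths are hypothetical.

```python
from urllib.parse import urlparse


def hdfs_to_s3(path: str, bucket: str, prefix: str = "") -> str:
    """Map an hdfs:// URI (or a bare absolute HDFS path) to an s3:// URI.

    The namenode host and port are dropped; only the path is kept.
    """
    parsed = urlparse(path)
    if parsed.scheme not in ("", "hdfs"):
        raise ValueError(f"not an HDFS location: {path}")
    parts = [prefix.strip("/"), parsed.path.strip("/")]
    key = "/".join(p for p in parts if p)
    return f"s3://{bucket}/{key}"


def spark_step(name: str, script_uri: str, args: list) -> dict:
    """Build an EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *args],
        },
    }


# Hypothetical job: all names and locations below are placeholders.
source = hdfs_to_s3("hdfs://namenode:8020/data/events", "example-bucket", "warehouse")
target = hdfs_to_s3("/data/insights", "example-bucket", "warehouse")
step = spark_step(
    "customer-insights",
    "s3://example-bucket/jobs/insights.py",
    ["--input", source, "--output", target],
)
print(step["HadoopJarStep"]["Args"])
```

A scheduler such as Airflow could then submit dictionaries like `step` to a cluster, which keeps job definitions in one place rather than in point-to-point scripts.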

At the end of a 12-week migration effort, all of the client's data processing jobs had been migrated from Hortonworks to AWS EMR, giving the client big data processing capabilities at pace.

To fulfill SocialHi5's need for a client self-service portal that was also easy to maintain, Agilisium's 5-member expert team built a custom web application with a heavy focus on the visualization of campaign outcomes. In parallel, they developed a DevOps process to maintain, scale and operate this portal.

Web Application Architecture

A variety of AWS services and some open-source technologies were used to build and run the web application. The web layer was built on a PHP framework, included a login and authentication system, and used AWS QuickSight to render its outcome dashboards.

The app layer was built in Python, and the backend services ran as Docker containers on Elastic Container Service (ECS), with Auto Scaling and an Application Load Balancer (ALB) to ensure high availability of the portal. The database ran in a private subnet and used RDS MySQL as the database service.

DevOps Process:

As mentioned earlier, SocialHi5 required that the solution be easy to maintain, scale, and operate. To that end, Agilisium's DevOps engineers developed a two-part DevOps process focusing on:

  • CI/CD for web application development
  • Infrastructure Provisioning for maintenance.

Continuous Integration/Continuous Deployment (CI/CD) Process

All application (Web & App Tier) maintenance was orchestrated via AWS CodePipeline. The AWS CodeCommit, CodeBuild, and CodeDeploy services were invoked to automate the enhancement and maintenance of the self-service portal.

CI/CD Process Flow: Web Tier


Infrastructure provisioning

All infrastructure was hosted on a dedicated SocialHi5 Virtual Private Cloud (VPC) to add an extra layer of confidentiality. AWS CloudFormation templates were used to spin up and maintain the AWS services used by the self-service portal.
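
For readers unfamiliar with CloudFormation, the Python sketch below generates a minimal template containing a VPC and one private subnet of the kind the database could sit in. It is an illustrative assumption rather than the template used for the portal: the logical names `PortalVpc` and `PrivateSubnet` and the CIDR ranges are hypothetical.

```python
import json

# Minimal CloudFormation template expressed as a Python dictionary.
# Logical IDs and CIDR blocks are placeholders chosen for illustration.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: dedicated VPC with one private subnet",
    "Resources": {
        "PortalVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {
                "CidrBlock": "10.0.0.0/16",
                "EnableDnsSupport": True,
                "EnableDnsHostnames": True,
            },
        },
        "PrivateSubnet": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                # Ref ties the subnet to the VPC declared above.
                "VpcId": {"Ref": "PortalVpc"},
                "CidrBlock": "10.0.1.0/24",
                # No public IPs are assigned to instances in this subnet.
                "MapPublicIpOnLaunch": False,
            },
        },
    },
}

# Serialize to JSON, one of the two formats CloudFormation accepts.
template_json = json.dumps(template, indent=2)
print(template_json)
```

Keeping the template in version control means the same network layout can be recreated or changed through review rather than by hand in the console.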

Serverless Web application hosting: EC2, ECS, RDS, S3, SSM, VPC, NAT Gateway, ALB with an Auto Scaling group, Lambda, Certificate Manager, and Route 53 were some of the services used to get the portal live.

Security: Web Application Firewall (WAF) was used with cross-site scripting, geo match, and SQL injection rules, in conjunction with the Amazon Inspector service, to protect against common cyber threats.

Monitoring and Logging: CloudWatch, OpsWorks, Config & Inspector services were also invoked to cover configuration management, logging, and monitoring of the application and infrastructure.

Results and Benefits
  • The Total Cost of Ownership (TCO) decreased significantly thanks to the use of open-source technology and EMR's lack of licensing fees and pay-as-you-go pricing.
  • On average, DevOps team TAT fell by 70% to 80%, going from days to hours. As a result, Data Scientists could run continuous analysis on business data to obtain customer insights.
  • Time-to-market for application deployment fell by about 90% without loss in performance.