Case Study
Holistic Redshift Assessment reduces Cost and uplifts Performance for M&E Giant


The customer is a Media and Entertainment giant based in the USA. Their Big Data and Analytics workloads were already running on AWS. However, the organization was looking to further optimize their system to improve performance and lower costs. Therefore, Agilisium’s experts were tasked with doing a holistic review of their workload and associated services, to recommend optimization strategies.

The Challenge

The customer’s current setup had a stable 25+ TB workload running on Redshift DS2 instances and had been live for over eight months. However, as the company grew, these workloads evolved, and the costs escalated, and performance slowed. Though the system was fairly performant, the business decided on commissioning Agilisium’s comprehensive assessment offering with a specific focus on Redshift.

Our Solution

The offering was Agilisium’s Redshift Review – a one-of-a-kind in-house assessment and intensive workshop developed to evaluate Redshift based on a host of AWS Well-Architected Framework (WAF) factors. Within two days, a report identifying 16 optimization opportunities across the 5 WAF pillars was presented to the client. The report ranked the recommendations under the based on the positive impact they would have on the system categorized under the 5 WAF pillars. Here are some of the key findings:

  • Cost: The client was running their instances on on-demand nodes. Growing data volumes were a considerable factor in this decision.
    • Leverage Data placement strategy where aged data is offloaded to S3 and leverage Redshift Spectrum to query
    • Use the newly introduced RA3 instances – separates compute and storage (compute needs are down 30-35%) – independently scale storage without touching compute.
    • Start with running a small part of their workload, say five nodes, on reserved instances to kickstart their savings.
  • Performance: Their system had growing storage requirements; however, their workloads did not need higher compute capacity.
    • Use time-series tables to split TB scale monolith tables into multiple year format and use views to combine all tables for reporting purposes.
    • Right-size/avoid larger width columns to prevent data spillover to disk and manage memory usage.
    • Leverage concurrency scaling to manage peak reporting workload and take advantage of the 1-hour free scaling time as credit offered by AWS for every 24 hour of cluster use.
    • Limit use of interleaved sort keys to unavoidable scenarios; as concurrency scaling will not work with interleaved sort keys.
    • Use the newly introduced RA3 instances – separates compute and storage (compute needs are down 30-35%) – independently scale storage without touching compute.
  • Reliability cross region backups were not set up and temporary and staging tables were marked for back up.
    • Enable cross region back up for disaster recovery and disable back up of temporary tables to improve back up times.
    • Set table backup flags to 0 for temp and staging tables to reduce time to backup and restore.
  • Operational excellence: The system had no logs and alerts enabled and did not utilize any automation functionalities that Redshift offered.
    • Ensure Redshift’s Audit log feature is turned on to track query performance and connection attempts that would help in troubleshooting.
    • Enable Cloudwatch logs and alerts for when resource usage crossed predefined limits.
    • Set up auto management of workload to effectively utilize storage and gather statistics.

To fulfill SocialHi’5 need for a client self-service portal that was also easy to maintain, Agilisium’s 5-member expert team built a custom web application with a heavy focus on the visualization of campaign outcomes. They also developed in parallel a DevOps process to maintain, scale and operate this portal.

Web Application Architecture

A variety of AWS services and some open source technologies were used to build and run the web application. The web layer used the PHP framework, included a login and authentication system, and used AWS QuickSight to render its outcome dashboards.

The app layer was built on Python, and the backend services were run on Elastic Container Service (ECS) dockers with Auto Scaling and Auto Load Balancing (ALB) to ensure high availability of the portal. The database was run in a private subnet and used RDS MySQL as the database service.

DevOps Process:

As mentioned earlier, SocialHi5 necessitated that the solution offered was easy to maintain, scale, and operate. To that end, Agilisium’s DevOps engineers developed a 2-part DevOps process focusing on

  • CI/CD for web application development
  • Infrastructure Provisioning for maintenance.

Continuous Integration/Continuous Deployment (CI/CD Process)

All application (Web & App Tier) maintenance was articulated via AWS’s Code Pipeline. AWS’s Code Commit, Code Deploy, and Code Build services were invoked to automate the enhancement and maintenance of the self-service portal.

CI/CD Process Flow: Web Tier

CI/CD Process Flow: Web Tier

Infrastructure provisioning

All infrastructure was hosted on an exclusive SocialHi5 Virtual Private Cloud (VPC), to add an extra layer of confidentiality. AWS CloudFormation templates were used to spin up and maintain a host of AWS services utilized for the self-service portal.

Serverless Web application hosting: EC2, ECS, RDS, S3, SSM, VPC, NAT Gateway, ALB with Autoscaling Group, LAMBDA, Certificate Manager, Route53 were some of the services used to get the portal live.

Security: Web Application Firewall (WAF) was used with Cross-site scripting, Geo match, and SQL injection rules to protect from common cyber threats in conjunction with the AWS inspector service.

Monitoring and Logging: CloudWatch, OpsWorks, Config & Inspector services were also invoked to cover configuration management, logging, and monitoring of the application and infrastructure.

Results and Benefits
  • The implementation of recommended optimization strategies resulted in 40% lower cost and 30% faster queries.
  • With minimal tweaks of the system the client could run a much more secure and resilient workload.