
Overview
The customer is a US-based media and entertainment giant whose Big Data and Analytics workloads were already running on AWS. The organization wanted to further optimize the system to improve performance and lower costs, so Agilisium's experts were tasked with conducting a holistic review of the workload and its associated services and recommending optimization strategies.
The Challenge
The customer's setup, a stable 25+ TB workload running on Redshift DS2 instances, had been live for over eight months. As the company grew, the workloads evolved, costs escalated, and performance slowed. Though the system was still reasonably performant, the business commissioned Agilisium's comprehensive assessment offering with a specific focus on Redshift.
Our Solution
The offering was Agilisium's Redshift Review – a one-of-a-kind in-house assessment and intensive workshop developed to evaluate Redshift against a host of AWS Well-Architected Framework (WAF) factors. Within two days, a report identifying 16 optimization opportunities across the five WAF pillars was presented to the client. The report ranked the recommendations, categorized under the five WAF pillars, by the positive impact they would have on the system. Here are some of the key findings:
- Cost: The client was running their instances on on-demand nodes, a decision driven largely by growing data volumes.
Recommendations:
- Adopt a data placement strategy in which aged data is offloaded to S3 and queried with Redshift Spectrum.
- Move to the newly introduced RA3 instances, which separate compute and storage (compute needs were down 30–35%) and allow storage to scale independently without touching compute.
- Start by running a small part of the workload, say five nodes, on reserved instances to kickstart savings.
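The reserved-instance recommendation comes down to simple arithmetic. The sketch below shows the back-of-the-envelope comparison; the hourly rates are illustrative assumptions, not actual AWS prices.

```python
# Back-of-the-envelope comparison of on-demand vs. reserved-node pricing.
# The hourly rates used below are illustrative assumptions, NOT real AWS rates.

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(nodes: int, hourly_rate: float) -> float:
    """Monthly cost for a cluster of `nodes` billed at `hourly_rate` per node-hour."""
    return nodes * hourly_rate * HOURS_PER_MONTH

def reserved_savings(nodes: int, on_demand_rate: float, reserved_rate: float) -> float:
    """Fractional saving from moving `nodes` from on-demand to reserved pricing."""
    od = monthly_cost(nodes, on_demand_rate)
    ri = monthly_cost(nodes, reserved_rate)
    return (od - ri) / od

# Example: move five nodes to reserved instances, assuming a ~40% discount.
saving = reserved_savings(nodes=5, on_demand_rate=0.85, reserved_rate=0.51)
print(f"Estimated saving: {saving:.0%}")  # prints: Estimated saving: 40%
```

Because reserved pricing is a commitment, starting with a small slice of the fleet (the five nodes above) lets the savings begin while the rest of the sizing work is still in flight.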
- Performance: The system had growing storage requirements; however, the workloads did not need higher compute capacity.
Recommendations:
- Use time-series tables to split TB-scale monolithic tables into per-year tables, and use views to combine them for reporting purposes.
- Right-size columns and avoid overly wide ones to prevent data spilling over to disk and to manage memory usage.
- Leverage concurrency scaling to absorb peak reporting workloads, taking advantage of the one hour of free concurrency-scaling credit AWS offers for every 24 hours of cluster use.
- Limit interleaved sort keys to unavoidable scenarios, as concurrency scaling does not work with interleaved sort keys.
- Move to the newly introduced RA3 instances, which separate compute and storage and allow storage to scale independently without touching compute.
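The time-series pattern above keeps reporting queries unchanged: each year lands in its own table, and a view unions them back into one logical table. A minimal sketch of generating that view's SQL follows; the table and view names (`events_2018`, `events_all`) are hypothetical examples, not the client's actual schema.

```python
# Sketch: generate a UNION ALL view over per-year time-series tables so
# reports can keep querying a single logical table. All object names here
# are hypothetical illustrations.

def union_view_sql(base: str, years: range, view: str) -> str:
    """Build CREATE VIEW SQL that unions yearly tables named base_<year>."""
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {base}_{y}" for y in years)
    return f"CREATE OR REPLACE VIEW {view} AS\n{selects};"

print(union_view_sql("events", range(2018, 2021), "events_all"))
```

When a new year begins, only a new table and a regenerated view are needed; old years can later be dropped or offloaded to S3 without rewriting report queries.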
- Reliability: Cross-region backups were not set up, and temporary and staging tables were marked for backup.
Recommendations:
- Enable cross-region backup for disaster recovery.
- Set the backup flag to off for temporary and staging tables to reduce backup and restore times.
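In Redshift DDL, excluding a table from snapshots is done with the BACKUP NO clause on CREATE TABLE. A minimal sketch, with hypothetical table and column names:

```python
# Sketch: DDL for a staging table excluded from cluster snapshots via
# Redshift's BACKUP NO clause, which keeps transient tables out of
# automated and manual snapshots and so shrinks backup/restore times.
# The table and column names are illustrative, not the client's schema.

def staging_table_ddl(name: str, columns: str) -> str:
    """Return CREATE TABLE DDL with BACKUP NO for a transient staging table."""
    return f"CREATE TABLE {name} ({columns}) BACKUP NO;"

print(staging_table_ddl("stg_orders", "order_id BIGINT, loaded_at TIMESTAMP"))
```

Temporary tables created with CREATE TEMPORARY TABLE are already excluded from snapshots; BACKUP NO is the lever for persistent staging tables that hold only re-loadable data.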
- Operational excellence: The system had no logs or alerts enabled and did not use any of the automation features Redshift offers.
Recommendations:
- Turn on Redshift's audit logging to track query performance and connection attempts, which helps with troubleshooting.
- Enable CloudWatch logs and alerts for when resource usage crosses predefined limits.
- Set up automatic workload and table management so the cluster utilizes storage effectively and keeps table statistics current.
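The alerting recommendation boils down to threshold rules on the cluster's published CloudWatch metrics (such as PercentageDiskSpaceUsed and CPUUtilization). The sketch below shows the kind of rule an alarm would encode; the limit values are assumptions for illustration, and real alerting would of course use CloudWatch alarms rather than application code.

```python
# Sketch of the threshold rule a CloudWatch alarm would encode for a
# Redshift cluster. PercentageDiskSpaceUsed and CPUUtilization are real
# Redshift CloudWatch metric names; the limits below are assumed values
# chosen purely for illustration.

DISK_USED_LIMIT = 85.0  # percent, assumed threshold
CPU_LIMIT = 90.0        # percent, assumed threshold

def breached(metrics: dict) -> list:
    """Return the names of metrics that crossed their predefined limits."""
    limits = {"PercentageDiskSpaceUsed": DISK_USED_LIMIT,
              "CPUUtilization": CPU_LIMIT}
    return [name for name, limit in limits.items()
            if metrics.get(name, 0.0) > limit]

print(breached({"PercentageDiskSpaceUsed": 91.2, "CPUUtilization": 40.0}))
# prints: ['PercentageDiskSpaceUsed']
```

Wiring such thresholds into CloudWatch alarms gives the operations team notice of disk or CPU pressure before users see slow queries.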
The Results
Implementing the recommended optimization strategies resulted in 40% lower costs and 30% faster queries. With minimal tweaks to the system, the client now runs a much more secure and resilient workload.