
Overview
Data warehouses are a significant hub in the data-to-insight journey, and frequently they are the last stop before downstream BI tools gather data for insights. However, as business needs change rapidly, data warehouse workloads undergo continual, reactive tweaks, and performance eventually takes a hit. Routine optimization is therefore an important step in maintaining a performant data warehouse. One of our enterprise clients, a global pharmaceutical giant whose business-critical Redshift workload was experiencing performance issues, approached Agilisium for a performance tuning exercise.
The Challenge
The performance of the client’s 900 GB Redshift workload had been below expected levels, and they commissioned Agilisium to assess the workload and recommend improvements. Using our homegrown assessment toolkit, Agilisium’s Redshift Inspector, we conducted a comprehensive assessment of the workload against all 5 pillars of the AWS Well-Architected Framework (WAF) in under 3 days.
On completion of the assessment, we shared a Findings and Recommendations report with our observations categorized by criticality, each paired with relevant recommendations.
Our Solution
Given below are the key recommendations from the Findings and Recommendations report, aligned with the 5 pillars of the AWS WAF:
- Security: Although the VPC subnet was enabled for IP restriction, the Redshift cluster was publicly accessible and no S3 VPC endpoint was enabled.
Recommendation:
- Almost always, a Redshift cluster should be set up in a private subnet. Even when deliberately configured in a public subnet, cluster exposure should be limited through security groups and ingress rules
- Move the cluster off the default endpoint port (5439) for an additional layer of defense
- Enable an S3 VPC endpoint to ensure controlled access to S3 buckets, objects, and API functions in the same region as the VPC (see the sketch after this list)
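As an illustration of the first and third points, the following is a minimal boto3 sketch; the cluster identifier, region, VPC ID, and route table ID are hypothetical placeholders, not the client's actual resources.

```python
import boto3

# Hypothetical region and resource IDs; substitute your own.
redshift = boto3.client("redshift", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Take the cluster off the public internet.
redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    PubliclyAccessible=False,
)

# Gateway VPC endpoint so S3 traffic stays on the AWS network and
# access can be narrowed further with an endpoint policy.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # hypothetical route table
)
```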
- Cost: The client was running a 12-node DC2.large cluster on demand.
Recommendation:
- Our experts recommended moving the workload to Reserved Instances, which cost between 50-70% less than equivalent on-demand capacity (see the sketch after this list)
- In addition, implementing the performance recommendations below would allow the cluster to be downsized by one node, which, along with other indirect savings, was estimated to reduce running costs by 10%
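The reserved-node purchase can be scripted; this is a minimal sketch assuming dc2.large nodes and a 1-year term, with the offering choice and node count illustrative rather than the client's actual purchase.

```python
import boto3

redshift = boto3.client("redshift")

# Find a reserved-node offering matching the running node type.
# Assumes dc2.large and a 1-year (31,536,000-second) term for illustration.
offerings = redshift.describe_reserved_node_offerings()["ReservedNodeOfferings"]
offering = next(
    o for o in offerings
    if o["NodeType"] == "dc2.large" and o["Duration"] == 31536000
)

# Reserve capacity for the downsized 11-node cluster (hypothetical count).
redshift.purchase_reserved_node_offering(
    ReservedNodeOfferingId=offering["ReservedNodeOfferingId"],
    NodeCount=11,
)
```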
- Performance: The assessment found tables with high skew values, 537 large-width columns, and 32 tables that were more than 80% unsorted. In addition, concurrency scaling was not enabled.
Recommendation:
- Reduce data skew values to under 4 by revisiting SORT and DIST keys for the affected tables
- When data reaches TB scale, leverage time-series tables split by year, with a materialized view to combine data from all underlying tables
- Avoid large-width columns, which cause data to spill to disk, so that memory is used effectively
- Minimize broadcasting of data by colocating the columns used in joins through the right SORT and DIST keys
- Ensure AUTO SORT is enabled, and enable concurrency scaling to absorb bursts of concurrent queries (see the sketch after this list)
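To make the skew, sort-key, and time-series points concrete, here is a minimal sketch using the Redshift Data API; the cluster, database, user, table, and column names are hypothetical, and the key choices are illustrative rather than the report's actual DDL.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> str:
    """Submit a statement asynchronously via the Redshift Data API."""
    resp = rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",  # hypothetical
        Database="analytics",                   # hypothetical
        DbUser="admin",                         # hypothetical
        Sql=sql,
    )
    return resp["Id"]

# Surface tables with row skew above 4 or more than 80% unsorted rows.
run_sql("""
    SELECT "table", skew_rows, unsorted
    FROM svv_table_info
    WHERE skew_rows > 4 OR unsorted > 80
    ORDER BY skew_rows DESC;
""")

# Example fixes: a higher-cardinality distribution key that matches the join
# column, and a sort key on the dominant filter column (names hypothetical).
run_sql("ALTER TABLE sales ALTER DISTKEY customer_id;")
run_sql("ALTER TABLE sales ALTER SORTKEY (sale_date);")

# Time-series pattern: yearly tables combined through a materialized view
# (UNION ALL definitions refresh fully rather than incrementally).
run_sql("""
    CREATE MATERIALIZED VIEW sales_all AS
    SELECT sale_id, sale_date, amount FROM sales_2023
    UNION ALL
    SELECT sale_id, sale_date, amount FROM sales_2024;
""")
```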
- Reliability: 9 users had superuser access, and 190 users had read-only access.
Recommendation:
- Limit SUPERUSER access to two users and organize the remaining users into groups for enhanced reliability and resiliency (see the sketch after this list)
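A minimal sketch of that privilege cleanup via the Redshift Data API follows; the user and group names are hypothetical. Note that in Redshift, the CREATEUSER attribute is what confers superuser status.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> None:
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",  # hypothetical
        Database="analytics",                   # hypothetical
        DbUser="admin",                         # hypothetical
        Sql=sql,
    )

# Demote a surplus superuser (CREATEUSER is Redshift's superuser attribute).
run_sql("ALTER USER etl_service NOCREATEUSER;")  # hypothetical user

# Group read-only users so privileges are managed in one place.
run_sql("CREATE GROUP readonly_analysts;")
run_sql("ALTER GROUP readonly_analysts ADD USER analyst_01;")  # repeat per user
run_sql("GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP readonly_analysts;")
```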
- Operational excellence: DevOps processes for cluster management were not implemented, and out-of-the-box Redshift auto-management features were not being leveraged.
Recommendation:
- Enable audit logging to track usage and make troubleshooting easier (see the sketch after this list)
- Automate cluster management through CloudFormation or an equivalent infrastructure-as-code tool
- Set up automatic workload management features to sort data, gather statistics, and reclaim deleted space
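A minimal boto3 sketch of the audit-logging step; the bucket name and prefix are hypothetical, and the bucket must carry an S3 policy that permits Redshift log delivery. Auto vacuum, auto analyze, and auto table sort are enabled by default on current Redshift clusters, so for those the main task is verifying they have not been switched off.

```python
import boto3

redshift = boto3.client("redshift")

# Ship connection and user-activity logs to S3 for auditing and troubleshooting.
redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",  # hypothetical
    BucketName="redshift-audit-logs",       # hypothetical; needs a log-delivery bucket policy
    S3KeyPrefix="audit/",
)
```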
The Results
- 50% faster Redshift queries
- 10% lower Redshift running cost
- A highly resilient, secure Redshift workload aligned with AWS best practices