A US-based international pharma giant had previously worked with Agilisium to set up their Enterprise Data Lake for global application use. However, the client’s business requirements changed over time, and the existing system no longer met business expectations. In this blog, we will look at how Agilisium’s experts delivered a massive overhaul of a globe-spanning architecture in multiple 12-week sprints, and why Databricks played a crucial part in the new architecture.
EDL 1.0 – Architecture setup and issues
The top issue with the existing system was the unstable integration between Snaplogic & Hadoop, which led to a lack of agility, an absence of orchestration, and data failures. Consequently, query latency was high and data loads frequently failed, leaving the business unable to meet SLAs. Secondly, the system was spread across geographies and did not have a common framework—the teams that onboarded data were scattered across the EU, JAPAC, and other regions. The bottom line was that the Data & Analytics process was not effective end-to-end.
To evaluate best-fit technologies, Agilisium performed a POC to determine whether AWS or Databricks would be the better replacement for Snaplogic under the new business requirements.
With advanced features such as a unified data analytics platform built on Apache Spark, its Delta Lake, and parallel data processing, Databricks met all business requirements for building a unified, robust, and scalable system.
EDL 2.0 – Building a single source of truth
Agilisium implemented the Tech Platform Transformation as multiple Minimum Viable Products (MVPs) to migrate & rearchitect all Snaplogic and some Informatica pipelines to Databricks.
First, the fragmented HDFS/Hive enterprise data warehouse (EDW), holding 1–2 TB of data, was migrated globally to Amazon Redshift. Amazon Redshift Spectrum was utilized to reduce storage costs by querying data in place on Amazon S3.
The Snaplogic pipelines were migrated onto Databricks. At the time of implementation, Databricks was a new technology in the market, and understanding all its features & processes was a challenge. The 20-strong team overcame this by collectively earning their Databricks certifications. After the migration to Databricks, the number of pipelines was reduced by 70% — instantly cutting costs and increasing reusability.
Through configuration-driven data ingestion, product, sales, marketing, and third-party competitor data from across the various geographies were onboarded onto the Databricks Delta Lake in increments. Databricks Delta tables were introduced for faster query execution, performance & reliability. This data lake eventually functioned as the global single source of truth for all downstream BI tools, in this case Tableau. Databricks cluster autoscaling was also enabled to match data processing demand.
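Configuration-driven ingestion of this kind can be sketched in plain Python. Everything below — the config keys, S3 paths, table names, and the `build_ingestion_plan` helper — is hypothetical, illustrating the pattern rather than the team's actual framework:

```python
# Minimal sketch of configuration-driven ingestion: each source is described
# declaratively, and one generic routine turns the configs into an ingestion
# plan. All names (keys, paths, tables) are illustrative.

SOURCES = [
    {"name": "sales_eu",   "region": "EU",
     "path": "s3://lake/raw/sales_eu/",   "target": "delta.sales",
     "keys": ["order_id"]},
    {"name": "mktg_japac", "region": "JAPAC",
     "path": "s3://lake/raw/mktg_japac/", "target": "delta.marketing",
     "keys": ["campaign_id"]},
]

def build_ingestion_plan(sources, region=None):
    """Turn declarative source configs into concrete ingestion steps.

    Filtering by region supports the incremental, region-by-region
    onboarding described above; adding a source is a config change,
    not a new pipeline.
    """
    plan = []
    for src in sources:
        if region and src["region"] != region:
            continue
        plan.append({
            "read": src["path"],
            "merge_into": src["target"],
            "on": " AND ".join(f"t.{k} = s.{k}" for k in src["keys"]),
        })
    return plan

plan = build_ingestion_plan(SOURCES, region="EU")
```

In a real Databricks job, each plan entry would typically drive a `MERGE INTO` against the corresponding Delta table; the point of the pattern is that one generic job serves every source.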
A much-needed feature offered by Databricks was the ability to programmatically automate server spin-up, and the team fully utilized it.
Additionally, end-to-end orchestration was introduced via Airflow, resulting in better job failure handling & automatic re-triggers when a job fails. Truly automated Git-based CI/CD was implemented, so a single developer and a tester could now maintain the new architecture.
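In Airflow, re-triggering on failure maps to the `retries` and `retry_delay` task arguments. The underlying behaviour can be sketched in plain Python; the function and job names here are illustrative, not part of the client's codebase:

```python
import time

def run_with_retries(task, retries=3, retry_delay=0.0):
    """Re-run a failing task up to `retries` extra times, mirroring
    Airflow's retries/retry_delay task arguments."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            attempt += 1
            if attempt > retries:
                raise  # exhausted retries: surface the failure
            time.sleep(retry_delay)

# Hypothetical flaky load job: fails twice, then succeeds on attempt 3.
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient load failure")
    return "loaded"

result = run_with_retries(flaky_load, retries=3)
```

The benefit over the old setup is that transient failures heal themselves, and only jobs that exhaust their retries ever need human attention.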
The estimated project duration was two years, and data from each region was added to the new central architecture incrementally, based on client-set region priority. Thanks to accurate planning, resource identification, frequent feedback, and validation, the 11-member team delivered the project in 5 MVPs (each three months long), more than six months ahead of schedule!
The value-adds
Apart from the features implemented, the team ensured that the new system had robust disaster recovery mechanisms. Built-in failsafes, such as automatically spinning up a new server or an entire cluster when jobs fail, were implemented for Databricks and Airflow. A data quality framework was introduced to validate data against business rules. The framework effectively eliminated the need for manual intervention when a job failed, as the process was fully automated.
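A rule-based data quality framework of this kind can be sketched in a few lines of Python. The rule names, fields, and `validate` helper below are hypothetical, chosen only to show the shape of business-rule validation:

```python
# Sketch of a rule-based data quality framework: each business rule is a
# named predicate over a record; a batch is reported on before it is
# promoted downstream. Rule names and fields are illustrative.

RULES = {
    "sales_amount_non_negative": lambda rec: rec.get("amount", 0) >= 0,
    "region_present":            lambda rec: bool(rec.get("region")),
}

def validate(records, rules):
    """Return a list of (record_index, rule_name) violations."""
    violations = []
    for i, rec in enumerate(records):
        for name, check in rules.items():
            if not check(rec):
                violations.append((i, name))
    return violations

batch = [
    {"amount": 120.0, "region": "EU"},
    {"amount": -5.0,  "region": ""},   # violates both rules
]
issues = validate(batch, RULES)
```

Because rules are plain data, new business checks can be added without touching pipeline code, and an empty violation list becomes the automated gate that decides whether a load proceeds.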
The client was delighted and had this to say about the team’s efforts, “Thank you team for your effort, teamwork, and dedication to make MVP1 a successful delivery. The result is impressive…. From MVP2, we start to transform the platform from Snaplogic to Databricks. Agilisium is the only partner we use for this MVP2 technology transformation.”
- 4x (400%) faster data processing due to incremental loads and parallel execution.
- 95% of jobs executed within SLA.
- 90% reduction in incidents. Leveraging Databricks Delta tables led to faster query execution, better performance & reliability.
- 50% reduction in service requests. Implementing true CI/CD led to faster deployments & fewer code migration dependencies/errors.
- Eliminating intermediate storage layers and improving processing efficiency cut storage and compute costs.
The lack of a single source of truth and quality data, along with ad hoc manual reporting processes, undermined top management’s visibility into integrated insights on sales, sales rep interactions, marketing reach, brand performance, market share, and territory management. Understandably, the client wanted to align information that had hitherto sat in silos to gain a 360-degree view of product movement, optimize sales planning, and gain a competitive edge.