Case Study
ETL Modernization on AWS Platform

About the Client

The client is an American mass media and entertainment conglomerate that focuses on the development, production, and marketing of entertainment, news, and information for a global audience.

Challenge

  • With the abundant inflow of data, the client kept adding columns to their process, so the code had to be updated every day. Changing the raw code that often is a double-edged sword: every edit increases the probability of introducing mistakes.
  • Configuring the cluster was time-consuming because the client couldn’t measure the effectiveness of a given configuration.
  • Monitoring offered no real control. Once a job started, the client depended on emails to know which brand was running or which job had completed, and had almost no visibility into which task was running under the hood.

Solution Highlights

Initially, we had to migrate from PySpark code to SnapLogic eXtremePlex.

We developed the application in Python and took advantage of the Apache Spark API, which handles the translation needed for the Python code to communicate with other systems as and when required. The finished Python code is submitted to the EMR cluster.
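As an illustration of what submitting Python code to an EMR cluster can look like, here is a minimal sketch that describes a PySpark script as an EMR step and submits it with boto3. It is not the client's actual code: the function names `build_spark_step` and `submit_step`, the step name, the S3 path, and the job arguments are all hypothetical, and `submit_step` assumes boto3 is installed and AWS credentials are configured.

```python
"""Sketch: describe a PySpark job as an EMR step (all names are hypothetical)."""


def build_spark_step(name, script_uri, script_args=()):
    """Return an EMR step definition that runs spark-submit on the cluster."""
    return {
        "Name": name,
        # CONTINUE keeps the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets EMR run commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                script_uri,
                *script_args,
            ],
        },
    }


def submit_step(cluster_id, step):
    """Submit the step to a running cluster; needs boto3 and AWS credentials."""
    import boto3  # imported here so build_spark_step works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_spark_step(
    "daily-etl",                            # hypothetical step name
    "s3://example-bucket/jobs/etl_job.py",  # hypothetical script location
    ["--run-date", "2024-01-01"],           # hypothetical job arguments
)
```

Keeping the step definition separate from the submission call means the definition can be inspected or logged before anything is sent to AWS.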

With SnapLogic eXtreme embedded, the client gained the ability to embed the Spark code in their SnapLogic GroundPlex. Whenever they need the Spark code, they can use EMR. When they don’t, they can process the data on a normal AWS EC2 machine at lower cost. This delivers more value to the client.

First, the monitoring facility delivered great value to the client. Second, developing code with SQL is quick and easy. Third, because the solution is SnapLogic-based, it doesn’t require any new programming language: a working knowledge of how to connect Snaps according to the required logic is enough, which makes development quick, reliable, and easy. Finally, even without much experience in EMR configuration, it is possible to execute the PySpark code through eXtreme effectively and with great performance.

Industry
  • Mass Media
  • Entertainment
Expertise
  • Data Engineering
  • Big Data
Technologies Used
  • Python
  • AWS
  • EMR
  • SQL
  • SnapLogic eXtreme
  • GroundPlex

Software Architecture

Business Benefits
  • The client achieved superior performance, simplicity, and proper versioning of the code.
  • If the process fails at any step, it can be resumed from that step without contacting the developer who wrote the code. Any monitoring personnel can understand why the code or job failed and identify the underlying problem. They can investigate it, go through the documentation, follow the prescribed steps to correct the data, and start the job again without worrying about data duplication, reiteration of the process, and so on. This architecture avoids those problems.
  • Provided a dashboard that:
    • Gives the client an edge on the monitoring side.
    • Shows how each pipeline is running, the total time elapsed in running it, when it was executed, how much data is left to process, and how much more it can process.
    • Presents detailed information on each of these.
    • Provides performance data for the last 3 months.
  • Since it is cloud-based, the client doesn’t have to install any third-party software. The cloud-based tool can be incorporated into their own machines, so they can submit jobs through their network, get the results there, and leverage the ETL tool within that network.
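
The resume-from-the-failed-step benefit listed above can be illustrated with a small checkpointing sketch. This is a generic Python illustration under stated assumptions, not how SnapLogic implements it: the function `run_pipeline`, the JSON checkpoint file, and the three demo steps are all hypothetical.

```python
import json
import os
import tempfile


def run_pipeline(steps, checkpoint_path):
    """Run (name, func) steps in order, skipping steps already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)

    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run; skipping avoids duplicate work
        func()  # an exception here leaves the checkpoint at the last good step
        done.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)
    return executed


# Demo with hypothetical steps: "transform" fails once, then succeeds.
attempts = {"transform": 0}


def extract():
    pass


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("simulated failure")


def load():
    pass


steps = [("extract", extract), ("transform", transform), ("load", load)]
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_pipeline(steps, checkpoint)
except RuntimeError:
    pass  # in practice an operator would investigate and fix the cause here

# The second run skips "extract" and picks up at the step that failed.
resumed = run_pipeline(steps, checkpoint)
```

Because each completed step is recorded before the next one starts, a rerun never repeats finished work, which is what guards against duplicated data in this sketch.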