AWS Quick Start + SNAPAHEAD: The 2-Step Big Data Journey

Patience is a virtue, but not for organizations dealing with Big Data

Given the variety of tools and platforms available for Big Data adoption, organizations must experiment quickly to learn what works for them and what doesn’t. Compounding this complexity is the variety of data, which needs to be streamlined and transformed into meaningful insights. Wouldn’t it be useful to have a preconfigured environment that can be set up in no time, so that organizations can experiment with sample data and quickly see results?

What are the common difficulties in the big data journey?

Businesses are under constant pressure to explore their data and quickly arrive at insights that can aid business growth.

In setting up a Big Data ecosystem, organizations are primarily faced with two challenges.

  • Firstly, expensive legacy data processing systems that do not serve the needs of today’s business. These systems not only delay experimentation, but also cost the organization a fortune in terms of licensing fees, patches and upgrades.
  • Secondly, the expertise needed to experiment with new technologies is not available on demand. System integration partners may release their expensive talent only when an organization awards them a large share of its IT budget. This is all the more true when it comes to setting up a cloud platform, storage services, data lakes and data warehouses, along with data engineering and integration frameworks.

Given these issues, organizations are turning to automation for faster experimentation and quicker results. But this is easier said than done.

Having implemented solutions for many clients across multiple verticals, Agilisium is well versed in the typical challenges clients face and has engineered the following accelerators to address them.

  • AWS Quick Start – an accelerator for setting up a Data Lake development environment on AWS Cloud using SnapLogic.
  • A Legacy Migration Accelerator to SnapLogic – SNAPAHEAD.

AWS Quick Start: Get your AWS-SnapLogic environment up and running within the hour

Agilisium’s Quick Start solution was developed to reduce the effort required to set up a data lake environment from weeks to minutes.

What are the challenges in setting up the environment for a Data-lake?

The common challenge areas in setting up an instant data lake are:

  • Requiring valid licenses
  • Familiarity with the user interface
  • Developing required code
  • Assembling compute resources
  • Understanding architectural best practices
  • Compliance issues
  • Handling varied data quantities
  • Resource Distribution

Simply put, the exercise demands the expertise of a Data Consultant, a Data Engineer, a Cloud Architect and a Business Analyst, all in collaboration. Given that developing this kind of capability is highly challenging, the process can take several weeks.

How does the Quick Start address these issues?

The Quick Start builds the environment on AWS in less than half an hour!
The solution automates the design, setup and configuration of hardware and software required for building a data lake.

Data is ingested from various sources, transformed, and delivered for consumption. Amazon S3 serves as the data lake layer, Amazon Redshift forms the data warehouse layer, SnapLogic facilitates data integration and engineering, and optionally, Amazon QuickSight aids data visualization. The solution deploys Amazon infrastructure components in alignment with Well-Architected Framework guidelines.

AWS-SnapLogic Quick Start automates design, setup and configuration of cloud infrastructure services for building a data lake

How does the Quick Start work?

When the Quick Start is executed, a virtual private cloud (VPC) is set up on the AWS platform. The VPC spans two Availability Zones and uses a two-tier architecture of public and private subnets. The NAT gateway and the bastion host are configured in the public subnets, while the Groundplex servers and the Redshift cluster are deployed in the private subnets to protect customer data. A Groundplex server is an EC2 instance that communicates with the SnapLogic control plane via the NAT gateway and executes the data pipelines.

The bastion host provides access to private resources for administrative tasks. The bastion host and the Groundplex servers are each tied to their own Auto Scaling group, allowing for easy scalability and auto healing. For enhanced data security, monitoring services such as CloudWatch and CloudTrail can also be enabled to enforce stringent governance measures.
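The two-tier layout described above can be summarized as a small data model. This is a minimal sketch rather than the Quick Start's actual template: the CIDR ranges, subnet names, zone labels and both helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from ipaddress import ip_network


@dataclass
class Subnet:
    name: str
    cidr: str
    zone: str      # availability zone label (hypothetical)
    public: bool
    resources: list = field(default_factory=list)


# Hypothetical address plan: one public and one private subnet per zone.
VPC_CIDR = "10.0.0.0/16"
SUBNETS = [
    Subnet("public-a", "10.0.0.0/24", "az-1", True, ["nat-gateway", "bastion-host"]),
    Subnet("public-b", "10.0.1.0/24", "az-2", True, ["nat-gateway"]),
    Subnet("private-a", "10.0.10.0/24", "az-1", False, ["groundplex", "redshift"]),
    Subnet("private-b", "10.0.11.0/24", "az-2", False, ["groundplex"]),
]

# Resources that hold or process customer data must stay off public subnets.
PRIVATE_ONLY = {"groundplex", "redshift"}


def misplaced_resources(subnets):
    """Return (subnet, resource) pairs that break the private-only rule."""
    return [
        (s.name, r)
        for s in subnets
        if s.public
        for r in s.resources
        if r in PRIVATE_ONLY
    ]


def subnets_fit_vpc(subnets, vpc_cidr=VPC_CIDR):
    """Check that every subnet range lies inside the VPC range."""
    vpc = ip_network(vpc_cidr)
    return all(ip_network(s.cidr).subnet_of(vpc) for s in subnets)


if __name__ == "__main__":
    print("misplaced:", misplaced_resources(SUBNETS))
    print("ranges ok:", subnets_fit_vpc(SUBNETS))
```

A check of this kind is useful mainly as a review aid: it makes the rule "Groundplex and Redshift stay private" explicit instead of leaving it implied by a diagram.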

The screen below gives a glimpse into the SNAPAHEAD UI

How can it be customized?

The AWS Quick Start is highly configurable. One no longer needs to worry about changing cloud capacity, number of EC2 instances, type of instance, or any other such configuration issues. The solution is provisioned and managed using AWS CloudFormation templates, which allow the user to specify required infrastructure resources while setting up the environment.
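As an illustration of how such template parameters might be supplied, the sketch below merges user overrides into a set of defaults and emits the list-of-dicts shape that the CloudFormation CreateStack API accepts for its `Parameters` argument. The parameter names and default values are hypothetical; the real template defines its own.

```python
# Hypothetical parameter names and defaults, for illustration only.
DEFAULTS = {
    "GroundplexInstanceType": "m5.xlarge",
    "GroundplexNodeCount": "2",
    "RedshiftNodeType": "dc2.large",
    "RedshiftNodeCount": "2",
}


def build_parameters(overrides=None):
    """Merge overrides into DEFAULTS and return CloudFormation-style parameters.

    CloudFormation passes parameter values as strings, so every value is
    converted with str(). Unknown keys are rejected early rather than being
    silently ignored.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    merged = {**DEFAULTS, **{k: str(v) for k, v in overrides.items()}}
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in sorted(merged.items())
    ]


if __name__ == "__main__":
    for entry in build_parameters({"GroundplexNodeCount": 4}):
        print(entry)
```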

How can the Quick Start be deployed?

Deploying the AWS-SnapLogic solution takes a few simple steps.

  • The user first needs to sign up for AWS and SnapLogic accounts.
  • Once done, the user is automatically provided with a configuration file for setting up the Groundplex server. The complete stack can be deployed on an already existing VPC, or a new one could be built by configuration.
  • Fill in the parameters on the deployment page and launch the Quick Start; the development environment will be up and ready to use in minutes!
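The same launch can in principle be scripted instead of being driven from the deployment page. The sketch below assumes a client object that behaves like `boto3.client("cloudformation")` and calls its `create_stack` method; the stack name, template URL and parameter names are placeholders, and `DryRunClient` is a hypothetical stand-in so the example runs without AWS credentials.

```python
def launch_quick_start(cfn_client, stack_name, template_url, parameters):
    """Create the stack and return its ID.

    cfn_client is expected to behave like boto3.client("cloudformation").
    CAPABILITY_IAM is passed on the assumption that the template creates
    IAM resources; drop it if that is not the case.
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in parameters.items()
        ],
        Capabilities=["CAPABILITY_IAM"],
    )
    return response["StackId"]


class DryRunClient:
    """Hypothetical stand-in that records the call instead of making it."""

    def __init__(self):
        self.calls = []

    def create_stack(self, **kwargs):
        self.calls.append(kwargs)
        return {"StackId": f"dry-run/{kwargs['StackName']}"}


if __name__ == "__main__":
    client = DryRunClient()
    stack_id = launch_quick_start(
        client,
        stack_name="snaplogic-dev",                        # placeholder
        template_url="https://example.com/template.yaml",  # placeholder
        parameters={"KeyPairName": "dev-key", "GroundplexNodeCount": 2},
    )
    print(stack_id)
```

Passing the client in as an argument keeps the function testable; swapping `DryRunClient()` for a real CloudFormation client is the only change needed to perform an actual deployment.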

The AWS Quick Start page provides detailed documentation on how to get started with Quick Start and how the solution is deployed.

Stack creation

Configuration of UI parameters

The outcome: Groundplex server running on the AWS web console

SNAPAHEAD™: Migrate to SnapLogic with up to 30% lower cost and 25% less effort

For organizations with a legacy on-prem ETL setup, establishing a cloud-based solution involves refactoring the existing system, migrating the business logic and integrating the latest tools into their data setup. Agilisium designed SNAPAHEAD to ease this migration process.

Introducing SNAPAHEAD

SNAPAHEAD automatically converts legacy ETL code into SnapLogic-compliant pipelines with a single click.

SNAPAHEAD is an accelerator that automates the conversion of legacy ETL code to SnapLogic-compliant pipelines. At its heart is intelligent code that accurately maps XML to JSON, optimizes logic, and performs a lift-and-shift of legacy code to Snaps.

How does SNAPAHEAD work?

Migrating legacy code to SnapLogic-compliant pipelines is not straightforward. The process involves reading the XML file generated by the legacy setup; understanding the data sources, the transformations performed on them, the targets, and the individual operations and functionalities; and eventually mapping each of them to the corresponding Snaps in SnapLogic, which make up the final pipeline.
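The first of these steps, reading the exported XML, might look roughly like the sketch below. The XML layout (`mapping`, `source`, `transformation`, `target` and `connector` elements) is invented for illustration and does not correspond to any particular legacy tool's export format.

```python
import xml.etree.ElementTree as ET

# Hypothetical export format, invented for this example.
LEGACY_XML = """
<mapping name="orders_load">
  <source name="orders_src" type="Table"/>
  <transformation name="only_open" type="Filter"/>
  <transformation name="totals" type="Aggregator"/>
  <target name="orders_dw" type="Table"/>
  <connector from="orders_src" to="only_open"/>
  <connector from="only_open" to="totals"/>
  <connector from="totals" to="orders_dw"/>
</mapping>
"""


def read_mapping(xml_text):
    """Extract steps (sources, transformations, targets) and their links."""
    root = ET.fromstring(xml_text.strip())
    steps = [
        {"name": el.get("name"), "kind": el.tag, "type": el.get("type")}
        for el in root
        if el.tag in ("source", "transformation", "target")
    ]
    links = [(c.get("from"), c.get("to")) for c in root.findall("connector")]
    return {"name": root.get("name"), "steps": steps, "links": links}


if __name__ == "__main__":
    mapping = read_mapping(LEGACY_XML)
    for step in mapping["steps"]:
        print(step)
    print(mapping["links"])
```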

SNAPAHEAD uses Amazon S3 storage buckets to store files with input legacy code and output SnapLogic pipelines. AWS Lambda, an event-driven and serverless computing platform, forms the backbone of the accelerator. AWS Lambda takes care of computing processes, running code in response to events and automatically managing resources according to requirement.

The architecture behind SNAPAHEAD can be categorized into four broad sections.

Snap Map

This section contains a dictionary-like structure that maps ETL logic to Snaps in SnapLogic. It pairs each piece of legacy functionality with its logically equivalent Snap.
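A dictionary of this kind could be as simple as the sketch below. The legacy type names and the Snap names paired with them are illustrative assumptions, not SNAPAHEAD's actual dictionary; anything without an entry is set aside for manual refactoring.

```python
# Hypothetical legacy-type -> Snap pairs, for illustration only.
SNAP_MAP = {
    "Filter": "Filter",
    "Expression": "Mapper",
    "Joiner": "Join",
    "Aggregator": "Aggregate",
    "Sorter": "Sort",
}


def to_snaps(legacy_types):
    """Split legacy step types into converted pairs and unmapped leftovers."""
    converted, unmapped = [], []
    for legacy_type in legacy_types:
        snap = SNAP_MAP.get(legacy_type)
        if snap is None:
            unmapped.append(legacy_type)   # needs manual refactoring
        else:
            converted.append((legacy_type, snap))
    return converted, unmapped


if __name__ == "__main__":
    converted, unmapped = to_snaps(["Filter", "Expression", "CustomProc"])
    print("converted:", converted)
    print("unmapped:", unmapped)
```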

Link Map

The structures brought together by the Snap Map are connected by the Link Map. It decides the order of the subsequent stages of the pipeline, thereby controlling the data flow.
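One way to derive stage order from a set of links is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and is an assumption about how such ordering could be done, not a description of SNAPAHEAD's internals; the step names are hypothetical.

```python
from graphlib import TopologicalSorter


def order_pipeline(links):
    """Return step names so that every upstream step precedes its downstream.

    links is an iterable of (upstream, downstream) pairs. A cycle in the
    links raises graphlib.CycleError.
    """
    sorter = TopologicalSorter()
    for upstream, downstream in links:
        sorter.add(downstream, upstream)   # downstream depends on upstream
    return list(sorter.static_order())


if __name__ == "__main__":
    # Links given out of order on purpose.
    links = [("filter", "aggregate"), ("read", "filter"), ("aggregate", "write")]
    print(order_pipeline(links))
```

For a pipeline with branches, more than one valid order exists; the sort guarantees only that no Snap is placed before a Snap that feeds it.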

Property Map

Every piece of legacy ETL code comes with its own set of configurations, features and fields, such as table location or matrix properties. The Property Map identifies such parameters and defines them in the equivalent Snaps.
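A property translation step of this sort can be sketched as a per-type renaming table. All property names on both sides are hypothetical and chosen only to show the mechanism; properties with no known counterpart are returned separately so they can be reviewed by hand.

```python
# Hypothetical legacy-property -> Snap-setting names, keyed by legacy type.
PROPERTY_MAP = {
    "Filter": {"condition": "filter_expression"},
    "Sorter": {"sort_keys": "sort_paths", "direction": "sort_order"},
}


def map_properties(legacy_type, legacy_props):
    """Rename known properties; return (snap_settings, leftover_properties)."""
    rules = PROPERTY_MAP.get(legacy_type, {})
    settings, leftover = {}, {}
    for key, value in legacy_props.items():
        if key in rules:
            settings[rules[key]] = value
        else:
            leftover[key] = value   # no equivalent known; review manually
    return settings, leftover


if __name__ == "__main__":
    settings, leftover = map_properties(
        "Filter", {"condition": "status = 'OPEN'", "tracing": "verbose"}
    )
    print("settings:", settings)
    print("leftover:", leftover)
```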

Render Map

The Snaps that are delivered to SnapLogic can be viewed as a chain of functions that forms the pipeline. The UI includes a well-defined arrangement for how this pipeline is placed on the screen. The Render Map automates this layout for pipelines generated by the tool.
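Automated placement can be as simple as assigning each Snap a cell on a grid in pipeline order. The sketch below is a hypothetical layout rule; the grid coordinates and the `per_row` wrap are assumptions and do not reflect how SnapLogic stores canvas positions.

```python
def layout(ordered_snaps, per_row=4):
    """Assign each Snap a grid cell, left to right, wrapping after per_row."""
    if per_row < 1:
        raise ValueError("per_row must be at least 1")
    return {
        name: {"x": index % per_row, "y": index // per_row}
        for index, name in enumerate(ordered_snaps)
    }


if __name__ == "__main__":
    positions = layout(["read", "filter", "aggregate", "sort", "write"])
    for name, cell in positions.items():
        print(name, cell)
```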

The screen below shows the SNAPAHEAD architecture

What are the limitations of SNAPAHEAD, and how can they be worked around?

While this lift-and-shift process can easily convert up to 25% of the simple-to-medium use cases, the remaining pipelines must be refactored to suit the new platform. Deciding how each pipeline should be handled, however, is easier said than done.

Agilisium, with expertise in legacy tools and SnapLogic, combined with years of hands-on experience, can help in identifying the right mix for every case over an initial discovery phase of 2 to 3 weeks, before embarking on the actual implementation.

The implementation phase has its own set of challenges. Although mapping ETL logic to a Snap is well defined, there can be cases where an equivalent Snap is not available or is more complex than existing ones. In such cases, combinations of Snaps are keyed into the Snap Map and are then picked up automatically by the tool. The Snap Map dictionary structure is also built to scale, so that it can accommodate further possibilities and scenarios.

To link these Snaps correctly, sequencing information is stored while the XML file is being read. The Link Map later refers to it to place the Snaps in the right order, eventually rendering an equivalent solution in SnapLogic.

How to set up SNAPAHEAD?

From the user’s end, deploying SNAPAHEAD is straightforward. The accelerator is encapsulated in a UI that provides access to the S3 bucket. The user just needs to upload the XML file containing the ETL code to S3.

An event-based trigger in AWS Lambda automatically starts the accelerator, which takes the file and performs the pipeline conversion. The output Snaps are written back to the S3 bucket as JSON files, which can be uploaded to SnapLogic to view and work further on the engineering flow.
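The trigger flow can be sketched as a Lambda-style handler that reads the bucket and object key from an S3 notification event. The event fields used (`Records`, `s3.bucket.name`, `s3.object.key`) follow the standard S3 notification shape; the `legacy/` and `pipelines/` prefixes and the naming rule are assumptions, and the actual conversion step is left as a comment.

```python
from urllib.parse import unquote_plus


def output_key_for(input_key):
    """Hypothetical naming rule: legacy/foo.xml -> pipelines/foo.json."""
    base = input_key.rsplit("/", 1)[-1]
    stem = base[:-4] if base.lower().endswith(".xml") else base
    return f"pipelines/{stem}.json"


def handler(event, context=None):
    """Plan one conversion job per uploaded XML file in the S3 event."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.lower().endswith(".xml"):
            continue   # ignore non-XML uploads, including the JSON output
        jobs.append({"bucket": bucket, "input": key, "output": output_key_for(key)})
        # A real handler would download the XML, convert it, and upload
        # the resulting JSON to the output key here.
    return {"converted": jobs}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "demo-bucket"},
                    "object": {"key": "legacy/orders+load.xml"}}}
        ]
    }
    print(handler(sample_event))
```

Because the output lands in the same bucket that fires the trigger, filtering on the `.xml` suffix (or restricting the trigger to an input prefix) matters: without it, each generated JSON file would invoke the function again.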

The video below gives a quick demo of SNAPAHEAD

The screen below shows a sample pipeline implemented on SnapLogic using SNAPAHEAD

SNAPAHEAD together with Quick Start, a powerful accelerator package!

The AWS Quick Start and SNAPAHEAD, when combined, make a comprehensive jumpstart solution. The Quick Start is a one-stop destination for organizations just starting with the data journey. For those with an existing on-prem setup, the required environment can first be set up using the AWS Quick Start before using SNAPAHEAD to perform platform migration.

To help users familiarize themselves with the new platform, the Quick Start also provides a sample use case for exploring different scenarios. The use case takes a dataset through a SnapLogic pipeline and showcases how data pipelines can be easily built using SnapLogic to accelerate data lake initiatives.

Webinar: “Build a Data Lake on AWS in minutes” discusses how powerful the AWS-SnapLogic combination can be and provides a quick demo of the Quick Start solution.

AWS Blog: “Data Analysis with SnapLogic” is a hands-on blog on testing out the sample SnapLogic pipeline.

A significant reduction in time, cost and resources

Using the Quick Start to set up the environment gives the user a 2 to 3-week head start. Added to this, automating migration using SNAPAHEAD reduces implementation effort by 25%. This reduction can easily result in up to 30% cost savings, which is highly significant. The result is a platform that is flexible, scalable and customizable, setting the perfect stage for all types of future engagements.

To know more about these solutions, please get in touch with Kaushik Ravi or Aswin Sethuraman.
