Co-authored by Deepalakshmi
The age of digital disruption differs from its predecessors in that most new technologies tend to complement and augment what already exists rather than sweep it away. For example, the advent of social media has empowered conventional marketing and advertising strategies, and together they make a formidable marketing toolkit. Likewise, test automation complements agile development processes while greatly reducing the cost of test execution.
This blog talks about optimizing the use of enterprise data lakes and data warehouses for accelerated business agility.
Unpacking the buzz around big data
Big Data constitutes data sets that are not only high in volume and velocity (often real-time) but also highly varied, mixing structured, semi-structured, and unstructured elements. Real-life use cases span the public sector, healthcare, transportation, and banking.
The digital revolution has propelled organizations to deal with staggering amounts of data analytics in real time. For example, patients checking into a hospital have their names, contact details, and medical history stored in the hospital’s records indefinitely, for any number of uses thereafter. Doctors increasingly provide remote care by monitoring patients’ health via wearables. The information thus stored and transacted amounts to unprecedented levels of unstructured data.
Traditional data warehouses have proven incapable of handling and tapping the potential of this flood of unstructured, real-time data, paving the way for the birth of Enterprise Data Lakes (EDLs).
A data warehouse stores only structured, modeled data and requires all data to be pre-processed before it is stored (schema-on-write). A data lake stores data of all kinds in its native form and defers processing until a data item is picked for analytics (schema-on-read).
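The contrast between the two loading styles can be sketched in a few lines of Python. This is a hypothetical illustration, not a real warehouse or lake API: the schema, field names, and helper functions are invented for the example.

```python
import json

# Schema-on-write (warehouse): data must match a fixed schema at ingestion.
WAREHOUSE_SCHEMA = {"patient_id": int, "name": str, "admitted": str}

def load_into_warehouse(record: dict) -> dict:
    """Validate and coerce the record before storage; unknown fields are dropped."""
    return {col: typ(record[col]) for col, typ in WAREHOUSE_SCHEMA.items()}

# Schema-on-read (lake): store the raw payload untouched; interpret it later.
def load_into_lake(raw_payload: str) -> str:
    return raw_payload  # no validation, no transformation at ingestion

def read_from_lake(raw_payload: str) -> dict:
    return json.loads(raw_payload)  # schema applied only at analysis time

raw = '{"patient_id": "42", "name": "A. Patient", "admitted": "2021-03-01", "wearable": {"hr": 72}}'

warehouse_row = load_into_warehouse(json.loads(raw))  # types coerced, "wearable" dropped
lake_object = load_into_lake(raw)                     # stored verbatim, "wearable" kept
```

Note how the warehouse silently loses the nested wearable reading because it was not modeled in advance, while the lake keeps it available for future analytics.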
A few more definitions of Data Lakes from the Internet:
From Wiktionary: A Data Lake is a massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing “Big Data”.
From Techtarget: A data lake is a large object-based storage repository that holds data in its native format until it is needed.
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – Pentaho CTO James Dixon
Building on the above, Agilisium defines an EDL as a reference architecture for the enterprise BI solution, using Hadoop-based Big Data as the foundation.
Big Data cannot replace small data
Big Data poses challenges in storage and ingestion, while historic, small, and sensitive data require careful, cost-effective processing. Hence, even with the growing need to analyze agile and ad-hoc data, traditional data mining and advanced analytics remain as significant as ever.
Complementing traditional data warehouses with enterprise data lakes helps you optimize both storage and analytics and gain a competitive edge in business intelligence.
The following are the key factors driving an augmented data lake project:
- System analysis & recommendation for EDL or EDW
- Planning storage capacity
- Consulting for re-engineering/green-field approach
- Analytics roadmap
Organizations are also turning to Data Lake as a Service offerings from expert providers, powered by Cloudera, Hadoop, and Spark. Data Lake as a Service bundles all the capabilities needed for big data processing in the cloud behind a simple suite of web interfaces.
Expert consulting and a proper strategy will help you decide whether you need a fully owned EDL of hundreds of terabytes or would be better off drawing services from a cloud-hosted data lake.
Best practices in complementing EDW with a Data Lake
- Leveraging Data Lake as a Service
Complementing your existing on-premise data warehouse with a data lake in the cloud can be highly beneficial at a lower cost. Many companies opt for this hybrid approach, which largely resolves the speed and flexibility issues of data while keeping costs affordable.
- Automating the application of metadata
With all kinds of structured and unstructured data being stored in real time, metadata is the single key to deriving fast and accurate value from data residing in data lakes. Therefore, it is critical to automate the application of metadata during ingestion. The following three kinds of metadata can effectively define your data lake data:
Technical: Defines the form and structure of each data set. For example, text, JSON, Avro.
Operational: Defines the lineage, profile, and quality of the data. For example, size, source, and destination of the data, and number of records/fields.
Business: The business significance of the data, or usefulness. For example, names, descriptions, tags, and encryption rules.
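The three kinds of metadata above can be attached automatically at ingestion time. The following is a minimal sketch; the function name, field layout, and tag values are assumptions for illustration, not a standard catalog API.

```python
import hashlib
import json
from datetime import datetime, timezone

def apply_metadata(payload: bytes, source: str, fmt: str, tags: list) -> dict:
    """Build technical, operational, and business metadata for one ingested object."""
    return {
        # Technical: the form and structure of the data set
        "technical": {"format": fmt, "size_bytes": len(payload)},
        # Operational: lineage, profile, and quality of the data
        "operational": {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "checksum": hashlib.sha256(payload).hexdigest(),
        },
        # Business: significance and usefulness of the data
        "business": {"tags": tags},
    }

record = json.dumps({"patient_id": 42, "hr": 72}).encode()
meta = apply_metadata(record, source="wearable-feed", fmt="JSON",
                      tags=["patient-vitals"])
```

In practice the metadata document would be written to a catalog alongside the object, so that analysts can discover and trust lake data without opening every file.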
- Upgrading existing Business intelligence tools
Your existing BI tools, built to work with low-latency data from your EDW, will not be sufficient if you want to derive fast and complete value from your data lake storage. Integration with components such as Apache Kafka and REST APIs is necessary to stream data for analytics purposes.
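The consumer loop at the heart of such an integration can be sketched as follows. To keep the example self-contained, a plain Python generator stands in for a Kafka topic or REST event feed; the message shape and alert threshold are invented for illustration.

```python
import json
from typing import Iterator

def event_stream() -> Iterator[str]:
    """Stand-in for a Kafka topic or REST feed; yields JSON-encoded messages."""
    for hr in (71, 74, 130, 69):
        yield json.dumps({"patient_id": 42, "hr": hr})

def consume(stream: Iterator[str], hr_alert: int = 120) -> list:
    """Consumer loop: parse each message as it arrives and flag anomalies
    for the downstream BI/alerting layer."""
    alerts = []
    for message in stream:
        event = json.loads(message)
        if event["hr"] > hr_alert:
            alerts.append(event)
    return alerts

alerts = consume(event_stream())  # one alert: the 130 bpm reading
```

With a real broker, the generator would be replaced by a Kafka consumer subscribed to a topic, but the parse-then-act structure of the loop stays the same.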
If you are using Amazon S3 as your data lake, Amazon Redshift Spectrum lets you run queries directly against the data in object storage. See the Amazon Redshift Spectrum documentation to learn more.
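As a rough sketch, querying S3 data through Redshift Spectrum involves registering an external schema and then querying it like any other table. The schema, database, table, and role names below are placeholders, and the table is assumed to already exist in the AWS Glue Data Catalog.

```sql
-- Hypothetical names throughout; assumes an IAM role with access to the S3 data.
CREATE EXTERNAL SCHEMA lake_schema
FROM DATA CATALOG DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';

-- The external table is queried in place; no data is loaded into Redshift.
SELECT patient_id, COUNT(*) AS visit_count
FROM lake_schema.visits
GROUP BY patient_id;
```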