Over the last two decades, the BI environment has evolved from reporting-focused Enterprise Data Warehouses and Data Marts to analytics-focused Data Lake environments, with many applications running in the cloud. Earlier, the standard enterprise system landscape consisted of enterprise business applications, product management systems, legacy CRM systems, and the like; companies exchanged business information with their client partners, retailers, and distributors over FTP. Reporting was done by running reports directly against databases such as Oracle Financials, SAP, and CRM systems.
To fulfil reporting requirements, technical specialists would write and run queries directly against these systems. The key problem was that writing and running such queries interfered with the core function of the application. This pain point, along with disconnected systems and query-based reporting, drove the evolution of BI: companies built Data Warehouses, Data Marts, and ODSs, using leading industry tools such as MicroStrategy and Business Objects to address their reporting needs.
Among today’s organizations, there is huge disruption led by three forces:
- Digital: Business models changed completely; for example, companies that sold DVDs were severely affected once virtual formats became available online. This shift drove a phenomenal increase in data volumes.
- Social media: The sudden rise of social media made companies focus on consumers' choices and preferences, which in turn produced huge volumes of customer data to manage. Marketing teams turned to analytics to draw insights from this data, and the combination of digital and social media disruption put a strong focus on analytics for forecasting and reporting sales, inventory, and marketing trends.
- Cloud: While digital and social media were disrupting the market, the cloud emerged from its dormant state and drove large organizations to move complex and secure workloads to it. With the explosion of data and the huge demand for analytics, managing BI systems on internal storage and hardware was no longer feasible, which called for secure, cloud-based systems that made data more accessible under higher security standards. Of late, large enterprises have been moving their large, complex, security-sensitive workloads to the cloud.
How can the cloud help?
One classic way the cloud comes to the rescue is by offloading the analytic workload from on-premise systems to the cloud. The elasticity of the cloud absorbs the highly variable analytics workload and relieves the pressure on core BI systems. This makes the cloud a natural fit for BI and Big Data: analytics demand is extremely variable, and the cloud is built for exactly such on-demand requirements.
Difference between BI & Analytics
BI revolves around standard reports (daily, weekly, and monthly) with predictable load, whereas analytics generates highly variable computing demand. Because that demand is so variable, analytics needs a flexible, elastic, workload-friendly landscape such as the cloud.
The Cloud enabler: AWS Redshift
Organizations can now move their elastic workloads to AWS Redshift. It is an effective enabler for organizations that cannot move their applications to the cloud directly: with both reporting and analytics running in the cloud, they can ramp compute capacity up or down based on the demand for standard BI and analytics.
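As a minimal sketch of that elasticity, the snippet below uses the AWS SDK for Python (boto3) to resize a Redshift cluster ahead of a heavy analytics window and scale it back afterwards. The cluster identifier and node counts are hypothetical placeholders, not a prescribed configuration.

```python
# Minimal sketch: scale a Redshift cluster up for a heavy analytics
# window, then back down. Assumes boto3 is installed and AWS
# credentials are configured; identifiers below are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

def resize(cluster_id: str, node_count: int) -> None:
    """Request an elastic resize to the given number of nodes."""
    redshift.resize_cluster(
        ClusterIdentifier=cluster_id,
        NumberOfNodes=node_count,
    )

# Scale out before the month-end reporting peak, scale in afterwards.
resize("analytics-cluster", 8)   # ramp up for peak demand
# ... run the heavy BI/analytics workload ...
resize("analytics-cluster", 2)   # ramp back down to save cost
```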
The magic reference architecture: key points worth considering while moving data from on-premise to the cloud
- A replication method keeps the analytics database on the cloud in sync with on-premise sources, using a replication tool such as Attunity
- Low-volume workloads can be handled through ETL to keep the cloud analytics database refreshed
- Deploying data streams is the preferred method for moving data out of archives (see the streaming sketch after this list)
- Deploying Hadoop and MapReduce for high-velocity data loads into Redshift can help organizations meet latency targets (see the bulk-load sketch after this list)
- Recognition of the need for a Data Lake architecture
- What is a Data Lake? A repository for large quantities and varieties of data (structured and unstructured), supporting both SQL and NoSQL data storage
- Data Lake on the cloud (AWS S3): deploying a data pipeline to streamline the load process using Hadoop enables seamless transfer of data
- As the Data Lake gets used, more data streams are added
- Deploying Hadoop Hive to provide NoSQL analytics gives data scientists access to any data once it lands in the Data Lake, which in turn reduces their dependency on ETL teams
- Qubole can be layered on top of Hive queries, making NoSQL analytics more robust and stable for data scientists
- An on-demand Redshift cluster can now be added to meet additional short-term analytics requirements
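The architecture above names data streams for archive movement without specifying a service. As one hedged illustration, the sketch below pushes archived records onto an Amazon Kinesis stream with boto3; the stream name and record shape are hypothetical, and in practice a delivery pipeline (for example Kinesis Data Firehose) would land the records in S3 or Redshift.

```python
# Hedged sketch: push archived records onto a Kinesis stream.
# Assumes boto3 and AWS credentials; "archive-stream" and the
# record fields are hypothetical placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def stream_record(record: dict) -> None:
    """Send one archived record to the stream, keyed by customer id."""
    kinesis.put_record(
        StreamName="archive-stream",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["customer_id"]),
    )

stream_record({"customer_id": 42, "order_total": 99.50})
```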
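For the high-velocity bulk loads into Redshift, a common pattern is to stage files in the S3 Data Lake and issue a COPY command so the cluster ingests them in parallel. The sketch below assumes boto3 and psycopg2 are available; the bucket, table, IAM role, and connection details are hypothetical placeholders.

```python
# Sketch: stage a file in S3, then bulk-load it into Redshift with
# COPY. Assumes boto3 and psycopg2 are installed; the bucket, table,
# IAM role, and connection details are hypothetical placeholders.
import boto3
import psycopg2

# 1. Stage the extract in the S3 data lake.
s3 = boto3.client("s3")
s3.upload_file("daily_sales.csv", "my-data-lake", "staging/daily_sales.csv")

# 2. Issue a COPY so Redshift loads the staged file in parallel.
conn = psycopg2.connect(
    host="analytics-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY sales_staging
        FROM 's3://my-data-lake/staging/daily_sales.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        CSV IGNOREHEADER 1;
    """)
conn.close()
```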
Why build Hadoop, Big Data & Data Lake?
- A Data Lake brings all enterprise data into scope for data scientists and business analytics users
- Enables easy access to archives
- Faster and cheaper NoSQL analytics (see the Hive sketch after this list)
- No dataset is too big to process or analyze in a timely manner
- Eliminates ETL-related data latency
- Accelerates Data Warehouse/Data Mart performance and makes it more stable and robust
- Completely elastic in capacity and cost when on the cloud
- Various subscription models available
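As a hedged sketch of that direct access, the snippet below shows a data scientist querying raw data in the lake through Hive using the PyHive client, bypassing the ETL queue entirely; the host, table, and column names are hypothetical placeholders.

```python
# Hedged sketch: query the Data Lake directly through Hive instead of
# waiting on an ETL team. Assumes the PyHive package and a reachable
# HiveServer2; host and table names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive.example.internal", port=10000,
                       username="analyst")
cursor = conn.cursor()

# Aggregate raw clickstream events straight from the lake.
cursor.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM clickstream_raw
    GROUP BY event_date
    ORDER BY event_date
""")
for event_date, events in cursor.fetchall():
    print(event_date, events)

conn.close()
```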
The advantages of Big Data on the cloud are evident and can be a driving factor for many enterprises to move their complex, security-sensitive workloads to the cloud. Using AWS Redshift, organizations can now move workloads that previously could not go to the cloud directly because of application complexities and incompatibilities. While the reference architecture above can serve almost all complex, large-scale BI and analytics needs, there are grey areas and exceptional cases where it may not come in handy. In such situations, a few other tools would need to be integrated alongside it, as the architecture is still evolving, as are the tools and technologies around it.
Here’s my full talk on Big Data Analytics for Enterprises with Large Data