Build your Data Estate with Azure Databricks

Build your Data Estate with Azure Databricks, Part I: Motivating Big Data

"In God we trust, all others must bring data." ~ W. Edwards Deming

Since the dawn of humanity, certain ideas, inventions, and concepts have irreversibly changed the course of human progress: from the wheel, to the first Industrial Revolution, to the silicon era (which enabled VLSI and information systems), to the era of Big Data and AI.

Also Read: DataBricks Part 2 - Big Data Lambda Architecture and Batch Processing

However, to appreciate the power of Big Data and the importance of Databricks, it is essential to understand where we come from as far as information systems are concerned. We started with extract files pulled from diverse sources that resided in silos. While these extract files served specific reporting purposes, they often led to conflicting information because nothing was standardized (a problem later solved by data warehousing). Legacy core banking systems are the classic example.


In an attempt to solve this lack of coherence, distributed systems were used to fetch reports on the fly, directly from the source systems over the network. However, the technology of the day could not keep up with demanding business requirements, so these systems were sidelined (to be resurrected later!), which led to the genesis of data warehousing.


Data warehousing emerged as an amalgamation of principles from the two paradigms above: data is extracted (E) from source systems, transformed in flight (T), and loaded (L) into destination systems. Together, the three steps form the ETL methodology, which is widely used in Business Intelligence to this day.
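To make the three steps concrete, here is a minimal sketch of an ETL run using only the Python standard library. Everything in it is assumed for illustration: the `SOURCE_CSV` extract, the `accounts` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical extract file, standing in for a source-system export.
SOURCE_CSV = """account_id,branch,balance
101, mumbai ,2500.50
102,DELHI,1200
103,pune,not_available
"""


def extract(text):
    """E: read raw rows from the source extract."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """T: standardize values before they reach the destination."""
    clean = []
    for row in rows:
        try:
            balance = float(row["balance"])
        except ValueError:
            continue  # drop rows whose balance is not numeric
        branch = row["branch"].strip().title()
        clean.append((int(row["account_id"]), branch, balance))
    return clean


def load(rows, conn):
    """L: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(account_id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run one batch: extract, then transform, then load."""
    load(transform(extract(SOURCE_CSV)), conn)
    return conn.execute("SELECT * FROM accounts ORDER BY account_id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs one batch. Note that the transformation happens before the data reaches the destination, so the destination only ever sees standardized rows.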


ETL jobs run in batches (typically 1-4 times a day). However, as businesses matured, information exploded in volume, variety, and velocity. Traditional ETL tools could not handle such a drastic change in the data landscape: ETL tools and traditional databases could process only structured data, in batches, and only up to a certain volume. This gave birth to a new paradigm called 'Big Data'.

When Big Data was introduced, it brought with it a plethora of tools and technologies, the most famous ecosystem being Hadoop, along with a shift in methodology from ETL to ELT. As discussed earlier, traditional ETL could not keep up with frequently changing data. For example, in SSIS a small change in destination metadata forced package changes and redeployments, making scale-out and code reuse cumbersome. Memory and compute were also limiting factors in a traditional DBMS. These problems were addressed by distributed file systems like HDFS and powerful compute engines like MapReduce and Spark.
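The shift from ETL to ELT can be sketched in a few lines: the raw data is landed first, untouched, and the transformation runs afterwards inside the destination. This is only an illustration under stated assumptions. The `RAW_ROWS` records and the `staging_accounts` and `accounts` tables are hypothetical, and SQLite stands in for what would typically be a distributed store with an engine such as Spark on top.

```python
import sqlite3

# Hypothetical raw records; every field arrives as text, untouched.
RAW_ROWS = [
    ("101", " mumbai ", "2500.50"),
    ("102", "DELHI", "1200"),
]


def load_raw(conn, rows):
    """E + L: land the data as-is in a staging table."""
    conn.execute(
        "CREATE TABLE staging_accounts (account_id TEXT, branch TEXT, balance TEXT)"
    )
    conn.executemany("INSERT INTO staging_accounts VALUES (?, ?, ?)", rows)


def transform_in_destination(conn):
    """T: reshape the data inside the destination, using its own SQL engine."""
    conn.execute(
        """
        CREATE TABLE accounts AS
        SELECT CAST(account_id AS INTEGER) AS account_id,
               UPPER(TRIM(branch))         AS branch,
               CAST(balance AS REAL)       AS balance
        FROM staging_accounts
        """
    )
    return conn.execute("SELECT * FROM accounts ORDER BY account_id").fetchall()
```

Because the staging table accepts the data as it arrives, a change in the desired output shape only means rewriting the SQL in `transform_in_destination`, not the loading step.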

Although these ecosystems have evolved significantly, managing these frameworks and leveraging them for analytics remains a challenge, since there is a massive shortage of people who can handle every aspect of Big Data. A classic example of this dilemma is the separate toolsets used by data scientists and data engineers, which often opens a communication gap between the two teams and leads to higher costs and longer delivery times.

This is where a unified analytics platform like Databricks comes into the picture… to be continued!
