Big Data Lambda Architecture and Batch Processing

DataBricks Part 2 – Big Data Lambda Architecture and Batch Processing

Let us talk about the Big Data Lambda Architecture. In this article, we are going to walk you through a sample scenario and explain the process.

Also Read: Build your Data Estate with Azure Databricks-Part I

The greek symbol lambda(λ) signifies divergence or bifurcation into two paths. Since volume, variety, and velocity increased in the data landscape, there emerged two tracks in Data Processing, i.e., the hot path and the cold path or Real-time processing and Batch Processing. The below image illustrates the high-level overview of this concept:

Source-Wikipedia

However as technologies evolved, many alternatives to realize the lambda architecture cropped up; including Microsoft ecosystem. The below image gives an integrated view of the azure big data landscape:

Big Data Lambda Architecture

If one observes the Microsoft Big Data landscape DataBricks appears at multiple places. From batch processing for traditional ETL processes to Real-time Analytics to Machine Learning, Databricks can be leveraged for any of the tasks mentioned above. The below image represents the recommended Microsoft Big Data lambda architecture. Please note that specific components might vary here.

In this section, we will touch base on the Batch Processing aspect of Databricks. However, in this article, we will use Azure SQL Database as sync since Azure SQL DW has Polybase option available for ETL/ELT.

From a birds eye’s view, data is being extracted from Data Lake store into Databricks, transformed on the fly and then stored into Azure SQL Database. However, there are a couple of nuances that need attention:

1. Azure Databricks needs access to the Data Lake store to extract the data. Active Directory app registration comes to our rescue here.

2. To write to Azure SQL Database, we need authorization. However, we cannot expose sensitive credential information in the         Notebook. Hence, we need to define secret scope using a key-vault(applicable in data lake access control as well)

Creating App Registration:

This Microsoft doc elucidates on creating app registrations. Once done with app creation, open a notebook in your Databricks workspace. We assume that you are familiar with Databricks workspace creation, spinning up a cluster and leveraging notebooks. If not please acquaint yourself using the documentation.

Open a Scala notebook. In the Notebook, write the code in the following format(See this GitHub link for the entire code):

//spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
//spark.conf.set("dfs.adls.oauth2.client.id", "<Application ID>")
//spark.conf.set("dfs.adls.oauth2.credential", "<Key>")
//spark.conf.set("dfs.adls.oauth2.refresh.url", "<https://login.microsoftonline.com/<Directory ID>/oauth2/token>")

Creating a secret scope:

Since we are using Azure SQL database as our sink, which is a PaaS offering, sensitive authentication information comes into the picture. This information cannot be exposed in the notebook and hence, we need to create a key-vault backed secret scope. I have created a secret scope called as ‘AvroScope’ as opposed to ‘key-vault-secret’ mentioned in the doc.

Once the scope is created, use secrets in Azure Key-vault to Generate/Import the secret fields. The fields username and password are the ones we will be using.

These secret credentials can be redacted using the following code:

val jdbcUsername = dbutils.secrets.get(scope = "AvroScope", key = "username")
val jdbcPassword = dbutils.secrets.get(scope = "AvroScope", key = "password")

After redacting the credentials, we build the connection string of the sink database, i.e., Azure SQL Database using the following code:

val jdbcHostname = "newsignature.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "NewSignature"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

Now, as the source and sink are ready, we can move ahead with the ETL process. Data is extracted from the Azure data lake using SqlContext.read. Format API. The Data Lake folder path can be found in folder properties of data explorer. Go to the folder which consists of data and copy the full path:

Paste the copied path along with the file name in the load function of the below code:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("adl://srcdl.azuredatalakestore.net/FinancialData/cs-training_new.csv")

Using the show function of Data frame API, we can visualize the data in tabular format, since SQLContext.read. Format reads the data into a Data frame. Initial Data Analysis reveals that there is a debt ratio in the data has outliers, while the monthly income field consists of missing values. We perform data cleansing here using the filter function:

df.show()

val df1=df.filter("DebtRatio<1")
val df2=df1.filter("MonthlyIncome <> 'NA'")

After data cleansing, we wish to add a new column with the name ‘IncomeConsumption’ which is a ratio of Monthly Income and Number of dependents(minimum being 1). The ‘withColumn’ spark SQL function comes to our aid here:

import org.apache.spark.sql.functions._
val finaldf =df2.withColumn("IncomeConsumption", df2("MonthlyIncome")/(df2("NumberOfDependents")+1))

Having performed the cleansing and transformations, we further go ahead and save the data to the sink, i.e., our Azure SQL database using jdbcUrl created in connection string formation elucidated above. Note that the mode is specified as ‘Overwrite,’ which is basic SCD-1:

finaldf.write.mode("Overwrite").jdbc(jdbcUrl, "creditdata", connectionProperties)

Final Data:

Conclusion

I hope this article was helpful in getting you started with DataBricks on Azure with different Datasets. Drop us a comment or suggestion below and we will get back to you. Thank you for visiting us!

Disclaimer: The Questions and Answers provided on https://www.gigxp.com are for general information purposes only. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose.