Even though it’s flashy and glamorous being a Data Engineer these days, it comes with its own set of challenges and opportunities. Approximately 80 percent time is spent on cleaning and munging the raw data which comes in increasing variety and complexity in order to facilitate complex analytical needs; one of them being log analytics. Be it event logs or transaction logs; they help us uncover hidden patterns and enable better decision making. For instance, event logs like Active Directory logs contain vital information, that could help us track user behavior and any suspicious activity.
STRING_SPLIT function in SQL server:
Let us take an example of an AD Log for a logoff activity. It contains the following log inside a column:
Now, we want to process the column ‘Message’ to get a semantic view like this:
Fortunately, log messages are structured, and a separator systematically separates their entities. In the case above, every entity in a log message is separated by a newline character or char(10) in SQL. Hence, we need to split the message column based on the separator char(10). Here comes the STRING_SPLIT function, which was introduced in SQL Server 2017. Here is the step by step guide to the required transformation :
1. Load raw data in staging:
With Azure Data Factory, load data residing Azure Data Lake gen2 to Azure SQL DB staging table called as ADLogs. The table structure of ADLogs is as follows:
/****** Object: Table [dbo].[ADLog] Script Date: 4/15/2019 3:45:42 PM ******/ SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE TABLE [dbo].[ADLog]( [LogNumber] [int] IDENTITY(1,1) NOT NULL, [Keywords] [varchar](100) NULL, [DateandTime] [varchar](100) NULL, [Source] [varchar](500) NULL, [EventID] [int] NULL, [TaskCategory] [varchar](50) NULL, [User] [varchar](50) NULL, [Computer] [varchar](200) NULL, [Message] [varchar](4000) NULL ) ON [PRIMARY] GO
I have added an identity column called LogNumber to this table.
You can use this template to create a copy activity in Azure Data Factory, or you can create your copy activity in ADF. Once the data is loaded in SQL Database, it looks something like this.
2. Use STRING_SPLIT() to split the message:
Now to split the message into different cells, we have to use the built-in function introduced in SQL Server 2017. In order to understand the importance of this split function, it is imperative to take a look at the traditional way of doing it, i.e., by creating a user-defined function. Here is the UDF function definition:
create function split_String ( @input nvarchar(max), @character char(1) ) returns @output table ( item nvarchar(1000) ) as begin declare @startindex int, @endindex int set @startindex = 1 if substring(@input, len(@input) - 1, len(@input)) <> @character begin set @input = @input + @character end while charindex(@character, @input) > 0 begin set @endindex = charindex(@character, @input) insert into @output(item) select substring(@input, @startindex, @endindex - 1) set @input = substring(@input, @endindex + 1, len(@input)) end return end go Code Reference
Later to invoke this split_String function, we use cross apply which is similar to an Inner Join. This becomes necessary since we need to retain other columns of the row in which the message is split. Let us take one row and apply the split function.
Here is the code snippet:
SELECT TOP 1 * FROM ADLog WHERE TaskCategory = 'Logoff' SELECT T.*, S.Item FROM DBO.ADLOG T CROSS APPLY split_String(T.[MESSAGE], CHAR(10)) S WHERE LogNumber = '4' ORDER BY LOGNUMBER
The result looks like this:
However, with STRING_SPLIT built-in function, we can skip the function definition and use it directly. The function takes two inputs, i.e., the ‘string’ and the ‘separator,’ while it returns a table valued variable named ‘value.’
Here is how you use it with cross apply:
SELECT TOP 1 * FROM ADLog WHERE TaskCategory = 'Logoff' SELECT T.*, S.VALUE FROM DBO.ADLOG T CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S WHERE LogNumber = '4' ORDER BY LOGNUMBER
The result looks like this:
This STRING_SPLIT is a necessary function for getting the data in the format displayed above. Furthermore, we need to perform more transformations to bring the data in the form displayed in Image 2.
3. Data cleansing and transformation:
We load the above result in a temp table with the following code:
SELECT T.*, S.VALUE INTO #TEMP_LOGS FROM DBO.ADLOG T CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S WHERE LOGNUMBER= 4 ORDER BY LOGNUMBER
Next, we cleanse all the undesired data in the message column, i.e., the blank rows and rows not containing the character ‘:’
DELETE FROM #TEMP_LOGS WHERE VALUE = ” OR VALUE NOT LIKE ‘%:%’
Next, we use the CharIndex function to split the ‘VALUE’ column based on the character ‘:’ using the following code:
SELECT *, LTRIM(RTRIM(LEFT(VALUE, CHARINDEX(':', VALUE) - 1))) AS TITLE, LTRIM(RTRIM(REVERSE(LEFT(REVERSE(VALUE), CHARINDEX(':', REVERSE(VALUE)) - 1)))) AS [DATA] INTO #FINALDATA FROM #TEMP_LOGS WHERE LOGNUMBER =4 SELECT TITLE, DATA FROM #FINALDATA
Finally, we perform an ‘upsert,’ i.e., insert and update to get the desired view with the following code:
INSERT INTO LogoffData SELECT DATEANDTIME,COMPUTER,NULL, NULL FROM #FINALDATA GROUP BY DATEANDTIME,COMPUTER,LOGNUMBER UPDATE LogoffData SET SECURITYID= B.[DATA] FROM #FINALDATA B WHERE B.TITLE LIKE '%SECURITY ID%'AND B.LOGNUMBER =4 UPDATE LogoffData SET [ACCOUNTDOMAIN]= B.[DATA] FROM #FINALDATA B WHERE B.TITLE LIKE '%ACCOUNT DOMAIN%'AND B.LOGNUMBER =4 SELECT * FROM LogoffData
Please note the transformations after the application of STRING_SPLIT can vary based on the desired semantic view and the data. However, SPLIT_STRING is the key step to get the semi-structured cell into a structured table valued result.
Conclusion
We hope that this post was helpful in effectively doing a “string split” using T-SQL on SQL Server. If you have any questions or concerns, feel free to use the comment box below. Thanks for your visit & happy learning!