How to Effectively split a string using STRING

Even though it’s flashy and glamorous being a Data Engineer these days, it comes with its own set of challenges and opportunities. Approximately 80 percent time is spent on cleaning and munging the raw data which comes in increasing variety and complexity in order to facilitate complex analytical needs; one of them being log analytics. Be it event logs or transaction logs; they help us uncover hidden patterns and enable better decision making. For instance, event logs like Active Directory logs contain vital information, that could help us track user behavior and any suspicious activity.

STRING_SPLIT function in SQL server:

Let us take an example of an AD Log for a logoff activity. It contains the following log inside a column:

Now, we want to process the column ‘Message’ to get a semantic view like this:

Image 2

Fortunately, log messages are structured, and a separator systematically separates their entities. In the case above, every entity in a log message is separated by a newline character or char(10) in SQL. Hence, we need to split the message column based on the separator char(10). Here comes the STRING_SPLIT function, which was introduced in SQL Server 2017. Here is the step by step guide to the required transformation :

1. Load raw data in staging:

With Azure Data Factory, load data residing Azure Data Lake gen2 to Azure SQL DB staging table called as ADLogs. The table structure of ADLogs is as follows:

/****** Object: Table [dbo].[ADLog] Script Date: 4/15/2019 3:45:42 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[ADLog](
[LogNumber] [int] IDENTITY(1,1) NOT NULL,
[Keywords] [varchar](100) NULL,
[DateandTime] [varchar](100) NULL,
[Source] [varchar](500) NULL,
[EventID] [int] NULL,
[TaskCategory] [varchar](50) NULL,
[User] [varchar](50) NULL,
[Computer] [varchar](200) NULL,
[Message] [varchar](4000) NULL
) ON [PRIMARY]
GO

I have added an identity column called LogNumber to this table.

You can use this template to create a copy activity in Azure Data Factory, or you can create your copy activity in ADF. Once the data is loaded in SQL Database, it looks something like this.

2. Use STRING_SPLIT() to split the message:

Now to split the message into different cells, we have to use the built-in function introduced in SQL Server 2017. In order to understand the importance of this split function, it is imperative to take a look at the traditional way of doing it, i.e., by creating a user-defined function. Here is the UDF function definition:

create function split_String
( 
@input nvarchar(max),
@character char(1)
)
returns @output table (
item nvarchar(1000)
)
as
begin
declare @startindex int, 
@endindex int

set @startindex = 1
if substring(@input, len(@input) - 1, len(@input)) <> @character
begin
set @input = @input + @character
end

while charindex(@character, @input) > 0
begin
set @endindex = charindex(@character, @input)

insert into @output(item)
select substring(@input, @startindex, @endindex - 1)

set @input = substring(@input, @endindex + 1, len(@input))
end

return
end
go

Code Reference

Later to invoke this split_String function, we use cross apply which is similar to an Inner Join. This becomes necessary since we need to retain other columns of the row in which the message is split. Let us take one row and apply the split function.

Here is the code snippet:

SELECT TOP 1 * FROM  ADLog WHERE TaskCategory = 'Logoff'
SELECT T.*, S.Item FROM DBO.ADLOG T CROSS APPLY split_String(T.[MESSAGE], CHAR(10)) S  WHERE LogNumber = '4'  ORDER BY LOGNUMBER

The result looks like this:

However, with STRING_SPLIT built-in function, we can skip the function definition and use it directly. The function takes two inputs, i.e., the ‘string’ and the ‘separator,’ while it returns a table valued variable named ‘value.’

Here is how you use it with cross apply:

SELECT TOP 1 * FROM  ADLog WHERE TaskCategory = 'Logoff'
SELECT T.*, S.VALUE FROM DBO.ADLOG T CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S  WHERE LogNumber = '4' ORDER BY LOGNUMBER

The result looks like this:

This STRING_SPLIT is a necessary function for getting the data in the format displayed above. Furthermore, we need to perform more transformations to bring the data in the form displayed in Image 2.

3. Data cleansing and transformation:

We load the above result in a temp table with the following code:

SELECT T.*, S.VALUE INTO #TEMP_LOGS FROM DBO.ADLOG T CROSS APPLY STRING_SPLIT(T.[MESSAGE], CHAR(10)) S WHERE LOGNUMBER= 4 ORDER BY LOGNUMBER

Next, we cleanse all the undesired data in the message column, i.e., the blank rows and rows not containing the character ‘:’

DELETE FROM #TEMP_LOGS WHERE VALUE = ” OR VALUE NOT LIKE ‘%:%’

Next, we use the CharIndex function to split the ‘VALUE’ column based on the character ‘:’ using the following code:

SELECT *, LTRIM(RTRIM(LEFT(VALUE, CHARINDEX(':', VALUE) - 1))) AS TITLE, 
LTRIM(RTRIM(REVERSE(LEFT(REVERSE(VALUE), CHARINDEX(':', REVERSE(VALUE)) - 1)))) AS [DATA] INTO #FINALDATA
FROM #TEMP_LOGS WHERE LOGNUMBER =4

SELECT TITLE, DATA FROM #FINALDATA

Finally, we perform an ‘upsert,’ i.e., insert and update to get the desired view with the following code:

INSERT INTO LogoffData
SELECT DATEANDTIME,COMPUTER,NULL, NULL FROM #FINALDATA GROUP BY DATEANDTIME,COMPUTER,LOGNUMBER

UPDATE LogoffData 
SET SECURITYID= B.[DATA]
FROM #FINALDATA B WHERE B.TITLE LIKE '%SECURITY ID%'AND B.LOGNUMBER =4

UPDATE LogoffData 
SET [ACCOUNTDOMAIN]= B.[DATA]
FROM #FINALDATA B WHERE B.TITLE LIKE '%ACCOUNT DOMAIN%'AND B.LOGNUMBER =4

SELECT * FROM LogoffData

Please note the transformations after the application of STRING_SPLIT can vary based on the desired semantic view and the data. However, SPLIT_STRING is the key step to get the semi-structured cell into a structured table valued result.

Conclusion

We hope that this post was helpful in effectively doing a “string split” using T-SQL on SQL Server. If you have any questions or concerns, feel free to use the comment box below. Thanks for your visit & happy learning!

Disclaimer: The Questions and Answers provided on https://www.gigxp.com are for general information purposes only. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose.