Chad’s Blog

Decoding datetime2 columnstore segment range values

2024-08-07T14:00:00+00:00

Disclaimer: I am by no means an expert on columnstore indexes. This was just a fun distraction I ran into and felt like talking about it. I’m always open to constructive criticism on these posts.

This is an extension on my previous blog post where I dealt with an issue involving temporal tables utilizing a clustered columnstore index and a data retention policy. I noticed that old rows were still in my history table, even though the data retention cleanup process had just run.

My guess is because SQL Server keeps multiple rowgroups open at a time while distributing new data into them before compressing the rowgroups. There’s likely a small overlap between rowgroups from day to day. So one rowgroup may contain 5 days worth of data, even though my process inserts 14M rows per day. This means the cleanup job may appear to be behind when it’s not.

As part of looking into this issue, I started skimming through the columnstore index system DMVs to see what information I could glean.

I noticed in sys.column_store_segments the min_data_id and max_data_id columns store very large bigint values in the segments for datetime2 columns. After doing a bit more googling and tinkering, I found for bit/tinyint/smallint/int/bigint it stores the min/max of the actual values rather than dictionary lookup values. So I assume it’s likely doing the same for date/time/datetime/datetime2 and storing some sort of bigint representation of the actual value.

This post is going to focus on datetime2(7) datatypes mainly because that’s what I was dealing with. Though I’m sure it wouldn’t be much work to figure out the other types.

I should also note…there may be existing blog posts covering this, I couldn’t find any. There’s also a very good chance this is covered in one of Niko Neugebauer’s many columnstore index blogs. But in the end, I really wanted to see if I could figure this out on my own because I was having fun with it.

The problem

I have a temporal table that contains a few billion rows and I have a data retention policy of 180 days. The period end column in my table is named ValidTo and the history table uses a clustered columnstore index, which means the data cleanup job works by dropping whole rowgroups.

Here’s what sys.column_store_segments looks like for that column:

SELECT ColumnName = c.[name]
   , TypeName = TYPE_NAME(c.system_type_id), c.scale
   , s.segment_id, s.min_data_id, s.max_data_id
FROM sys.column_store_segments s
    JOIN sys.partitions p ON p.[partition_id] = s.[partition_id]
    JOIN sys.columns c ON c.[object_id] = p.[object_id] AND c.column_id = s.column_id
WHERE p.[object_id] = OBJECT_ID('dbo.MyTable_History')
    AND c.[name] = 'ValidTo'
ORDER BY s.segment_id;

| ColumnName | TypeName  | scale | segment_id | min_data_id        | max_data_id        | 
|------------|-----------|-------|------------|--------------------|--------------------| 
| ValidTo    | datetime2 | 7     | 901        | 812451496414559815 | 812453025851490574 | 
| ValidTo    | datetime2 | 7     | 902        | 812453024026222779 | 812453025718816479 | 
| ValidTo    | datetime2 | 7     | 907        | 812449298004095678 | 812453476378687270 | 
| ValidTo    | datetime2 | 7     | 908        | 812452596987479114 | 812453476127092027 | 
| ValidTo    | datetime2 | 7     | 909        | 812453025927907048 | 812453475318555080 | 
| ValidTo    | datetime2 | 7     | 910        | 812453476389782465 | 812453477968585804 | 
| ValidTo    | datetime2 | 7     | 911        | 812453476378999816 | 812453692263928518 |

So the question is…what the heck do those values represent for a datetime2 column?

First things first, let’s get this out of the way…this doesn’t work:

DECLARE @bigint_value bigint = 812453476378999816;
SELECT CONVERT(datetime2, CONVERT(binary(8), @bigint_value))

'
Msg 241, Level 16, State 1, Line 155
Conversion failed when converting date and/or time from character string.
'

So much for the easy route.

Maybe it’s number of ticks?

I should mention…At this point, I had no idea how SQL Server stored datetime2 values internally. Had I known, that probably would have saved me a lot of time.

My first thought was that this might be something like Unix timestamps where it’s the number of seconds/milliseconds/whatever since 1970-01-01 UTC. So that’s where I went first. I spent a good amount of time trying to take 812449298004095678 (the min min_data_id) and convert it into a date that I assumed was 2024-02-03 10:08:23.1109310 (the MIN(ValidTo) in the actual table).

I tried all sorts of things and came up with nothing…For example, trying to convert 812449298004095678 to the number of ticks (0.0000001 second or 100 nanoseconds) since 0001-01-01 00:00:00.000, which kept producing values that were WAY too high. You can test this out in PowerShell:

([datetime]'0001-01-01').AddTicks(812449298004095678)
# Returns: Thursday, July 20, 2575 8:03:20 PM

Let’s create some more reliable data

After that failed attempt, I thought maybe I could create a new table with a clustered columnstore index and populate the columns with only a single value. This way each column segment would only represent a single known value, giving me a sort of mapping between known values and their converted value that we’re trying to decode.

The new table schema:

DROP TABLE IF EXISTS dbo.TestCCI;
CREATE TABLE dbo.TestCCI (
    dt0001      datetime2 NOT NULL, -- datetime2 min
    dt0001_1tk  datetime2 NOT NULL, -- min + 1 tick (100ns)
    dt0001_1us  datetime2 NOT NULL, -- min + 1 microsecond
    dt0001_1ms  datetime2 NOT NULL, -- min + 1 millisecond
    dt0001_1sec datetime2 NOT NULL, -- min + 1 second
    dt0001_1hr  datetime2 NOT NULL, -- min + 1 hour
    dt0001_12hr datetime2 NOT NULL, -- min + 12 hour
    dt0001_1d   datetime2 NOT NULL, -- min + 1 day
    dt0001_2d   datetime2 NOT NULL, -- min + 2 day
    dt1753      datetime2 NOT NULL, -- hardcoded date - 1753-01-01
    dt1900      datetime2 NOT NULL, -- hardcoded date - 1900-01-01
    dtMAX       datetime2 NOT NULL, -- datetime2 max

    INDEX CCI_TestCCI CLUSTERED COLUMNSTORE,
);

Populate data script. This script ensures that at least a couple compressed rowgroups are created by inserting at least 1,048,576 * 2 rows.

DECLARE @dt datetime2(7) = '0001-01-01';

WITH c1 AS (SELECT x.x FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) x(x))   -- 12
    , c2(x) AS (SELECT 1 FROM c1 x CROSS JOIN c1 y)                                         -- 12 * 12
    , c3(x) AS (SELECT 1 FROM c2 x CROSS JOIN c2 y CROSS JOIN c2 z)                         -- 144 * 144 * 144
INSERT INTO dbo.TestCCI (
    dt0001, dt0001_1tk, dt0001_1us, dt0001_1ms, dt0001_1sec, dt0001_1hr, dt0001_12hr,
    dt0001_1d, dt0001_2d, dt1753, dt1900, dtMAX
)
SELECT @dt
    , DATEADD(NANOSECOND, 100, @dt) -- 1 tick = 100 nanoseconds
    , DATEADD(MICROSECOND, 1, @dt)
    , DATEADD(MILLISECOND, 1, @dt)
    , DATEADD(SECOND, 1, @dt), DATEADD(HOUR, 1, @dt), DATEADD(HOUR, 12, @dt)
    , DATEADD(DAY, 1, @dt), DATEADD(DAY, 2, @dt)
    , '1753-01-01', '1900-01-01', '9999-12-31 23:59:59.9999999'
FROM c3;

Here’s what that data looks like in sys.column_store_segments

SELECT ColumnName = c.[name], MinValue = MIN(s.min_data_id)
FROM sys.column_store_segments s
    JOIN sys.partitions p ON p.[partition_id] = s.[partition_id]
    JOIN sys.columns c ON c.[object_id] = p.[object_id] AND c.column_id = s.column_id
WHERE p.[object_id] = OBJECT_ID('dbo.TestCCI')
GROUP BY c.column_id, c.[name]
ORDER BY c.column_id;

| ColumnName  | MinValue            | 
|-------------|---------------------| 
| dt0001      | 0                   | 
| dt0001_1tk  | 1                   | 
| dt0001_1us  | 10                  | 
| dt0001_1ms  | 10000               | 
| dt0001_1sec | 10000000            | 
| dt0001_1hr  | 36000000000         | 
| dt0001_12hr | 432000000000        | 
| dt0001_1d   | 1099511627776       | 
| dt0001_2d   | 2199023255552       | 
| dt1753      | 703582988172001280  | 
| dt1900      | 762615767467294720  | 
| dtMAX       | 4015481100312363007 |

Now we’re picking up on a pattern. It seems like my original guess was right to some extent. It is a representation of the number of ticks…Until you roll over to the next day. That part was confusing me because it’s pretty obvious that 36000000000 (1 hour) * 24 is not equal to 1099511627776 (1 day).

The next step was to start examining this in binary to see if there is any pattern there. Since we know all of these values represent a datetime2(7) value, then we know its 8 bytes. So lets convert it all to their binary and datetime2 representations.

| datetime2 value             | binary - date component    | binary - time component                      |
| ----------------------------|----------------------------|----------------------------------------------|
| 0001-01-01 00:00:00.0000000 | 00000000 00000000 00000000 | 00000000 00000000 00000000 00000000 00000000 |
| 0001-01-01 00:00:00.0000001 | 00000000 00000000 00000000 | 00000000 00000000 00000000 00000000 00000001 |
| 0001-01-01 00:00:00.0000010 | 00000000 00000000 00000000 | 00000000 00000000 00000000 00000000 00001010 |
| 0001-01-01 00:00:00.0010000 | 00000000 00000000 00000000 | 00000000 00000000 00000000 00100111 00010000 |
| 0001-01-01 00:00:01.0000000 | 00000000 00000000 00000000 | 00000000 00000000 10011000 10010110 10000000 |
| 0001-01-01 01:00:00.0000000 | 00000000 00000000 00000000 | 00001000 01100001 11000100 01101000 00000000 |
| 0001-01-01 12:00:00.0000000 | 00000000 00000000 00000000 | 01100100 10010101 00110100 11100000 00000000 |
| 0001-01-02 00:00:00.0000000 | 00000000 00000000 00000001 | 00000000 00000000 00000000 00000000 00000000 |
| 0001-01-03 00:00:00.0000000 | 00000000 00000000 00000010 | 00000000 00000000 00000000 00000000 00000000 |
| 1753-01-01 00:00:00.0000000 | 00001001 11000011 10100001 | 00000000 00000000 00000000 00000000 00000000 |
| 1900-01-01 00:00:00.0000000 | 00001010 10010101 01011011 | 00000000 00000000 00000000 00000000 00000000 |
| 9999-12-31 23:59:59.9999999 | 00110111 10111001 11011010 | 11001001 00101010 01101001 10111111 11111111 |

Once I converted the data to this view…I immediately recognized the pattern and I already show it above. It appears the date component is stored in the first 3 bytes as the number of days since 0001-01-01 and the time component uses the last 5 bytes as the number of ticks since 00:00:00.0000000.

Some of you might know this already…but this is very similar to how SQL Server stores datetime2 values internally. Unfortunately, I did not know that and I had to learn that the long way.

How do we convert it back to datetime2?

Well we already know we can’t directly convert it.

My first thought was maybe I can grab the first 3 bytes, and DATEADD(day, {value}, '0001-01-01'), then do the same for the last 5 bytes…The problem is, 5 bytes goes beyond the limits of what DATEADD can handle, which is limited to int (4 bytes). Unfortunately, there is no DATEADD_BIG() function like there is a DATEDIFF_BIG().

I could handle this with some sort of binary math, or while loop to break that larger number up. But instead, I wanted to focus on how to build a binary representation of a datetime2 value that can be directly converted

The problem is, I had no idea how datetime2 is actually stored in binary, but there’s an easy way to find out.

DECLARE @dt2now datetime2 = SYSUTCDATETIME();
SELECT CONVERT(binary(8), @dt2now);

'
Msg 8152, Level 16, State 17, Line 158
String or binary data would be truncated.
'

Uhhh….wat? Why would a value that is 8 bytes be truncated when converted to an 8 byte binary?

I’ll save you the headache this gave me…Read this blog post that I eventually found:

https://bornsql.ca/blog/datetime2-8-bytes-binary-9-bytes/

TL;DR - When converting a datetime2 value to a binary datatype, SQL Server doesn’t want to lose precision, so it includes the precision with the converted value. Including the precision adds an extra byte to the value, so we need to use binary(9) instead. This also means we need to make sure our conversion logic handles this.

Let’s try that again…

/* The value '0001-01-01 15:16:15.5813889' will create a binary value with all 0's
   for the date component and the time component will start and end with a 1.
   This will make it easy to identify which bits represent the date and which
   represent the time in the converted output so that we can compare it with the
   binary of the values we're getting from sys.column_store_segments.
*/
DECLARE @dt2now datetime2 = '0001-01-01 15:16:15.5813889';
SELECT CONVERT(binary(9), @dt2now);

-- RETURNS: 0x070100000080000000

This breaks down like so:

      Precision  Time          Date
0x    0x07       0100000080    000000

Well that’s weird…because if we use that same timestamp but create a binary value using the method used in the bigint value, we get this…

      Date       Time
0x    000000     8000000001 (which is 549755813889 as a bigint)

It took me a second to realize what happened after mentally going back to my old college assembly classes…The first one is stored in little-endian, whereas our bigint is storing it in big-endian…I won’t go into detail explaining what that is or how it works, but the basic idea is that the binary data is stored in a different “direction”, luckily that’s a pretty simple fix.

The solution

We’re finally here…We now have all the information we need to convert the original bigint values back to their original datetime2 form. We know that we need to convert our big-endian value to little-endian while also adding the missing precision information back in.

One fun thing to keep in mind here is that whether it’s a number, string data, date/time, etc, it’s all stored in bytes and those bytes can be converted into strings (nvarchar) and treated as such, including things like concatenation. Since I’m working on a SQL Server 2017 instance, I don’t have access to the newer left/right shift binary functions. So I’m going to work around it by using concatenation to handle bit shifting.

DECLARE @src_bigint_value    bigint,
        @src_binary_value    binary(8),
        @precision           binary(1) = 0x07,
        @output_binary       binary(9);

SET @src_bigint_value = 549755813889; -- '0001-01-01 15:16:15.5813889'

-- First we'll convert it to an 8-byte binary
SET @src_binary_value = CONVERT(binary(8), @src_bigint_value)
-- Then We concat the precision value (+ acts as a binary left shift)
SET @output_binary = @src_binary_value + @precision
/* That gets us: 0x000000800000000107 */

-- Now let's handle the little-endian conversion to big-endian
-- We'll do this by cheating a bit and treating it like a string
SET @output_binary = CONVERT(binary(9), REVERSE(@output_binary))
/* That gets us: 0x070100000080000000 */

-- All we need to do now is convert it to datetime2...
SELECT CONVERT(datetime2, @output_binary)
-- RETURNS: 0001-01-01 15:16:15.5813889

SUCCESS!!! 🥳

And that’s it! We now have a formula we can reduce down into a one liner and use it to decode the values stored in sys.column_store_segments for datetime2 values.

The final test

I put together the following query to run against sys.column_store_segments. It looks only at segments for our table dbo.MyTable_History and the ValidTo column, which is a datetime2. This is the column which helps tell SQL Server which rowgroups are safe to drop based on the data retention policy settings.

DECLARE @dt2_precision binary(1) = 0x07;

SELECT n.SchemaName, n.ObjectName, n.ColumnName, s.segment_id
    , s.min_data_id, s.max_data_id
    , x.min_data_val, x.max_data_val, y.min_data_val_age, y.max_data_val_age
FROM sys.column_store_segments s
    JOIN sys.partitions p ON p.[partition_id] = s.[partition_id]
    JOIN sys.columns c ON c.[object_id] = p.[object_id] AND c.column_id = s.column_id
    CROSS APPLY (SELECT SchemaName = OBJECT_SCHEMA_NAME(p.[object_id]), ObjectName = OBJECT_NAME(p.[object_id]), ColumnName = c.[name]) n
    CROSS APPLY ( -- Convert bigint values to datetime2
        SELECT min_data_val = CONVERT(datetime2, CONVERT(binary(9), REVERSE(CONVERT(binary(8), s.min_data_id) + @dt2_precision)))
            ,  max_data_val = CONVERT(datetime2, CONVERT(binary(9), REVERSE(CONVERT(binary(8), s.max_data_id) + @dt2_precision)))
    ) x
    CROSS APPLY ( -- Calculate age of datetime2 values
        SELECT min_data_val_age = DATEDIFF(SECOND, x.min_data_val, SYSUTCDATETIME()) / 86400.0
            ,  max_data_val_age = DATEDIFF(SECOND, x.max_data_val, SYSUTCDATETIME()) / 86400.0
    ) y
WHERE 1=1
    AND p.[object_id] = OBJECT_ID('dbo.MyTable_History')  -- table with columnstore index
    AND p.index_id = 1                                    -- clustered columnstore index
    AND c.[name] = 'ValidTo'                              -- target column
    AND c.system_type_id = TYPE_ID('datetime2')
ORDER BY n.SchemaName, n.ObjectName, n.ColumnName, s.segment_id

The result of the query looks like this (minus a few columns since I’m running it for 1 table)

| segment_id | min_data_id        | max_data_id        | min_data_val                | max_data_val                | min_data_val_age | max_data_val_age | 
|------------|--------------------|--------------------|-----------------------------|-----------------------------|------------------|------------------| 
| 907        | 812449298004095678 | 812453476378687270 | 2024-02-03 10:08:23.1109310 | 2024-02-07 04:02:15.9189798 | 183.7130092      | 179.9672685      | 
| 908        | 812452596987479114 | 812453476127092027 | 2024-02-06 10:09:07.9609418 | 2024-02-07 04:01:50.7594555 | 180.7125000      | 179.9675578      | 
| 909        | 812453025927907048 | 812453475318555080 | 2024-02-06 22:04:02.0037352 | 2024-02-07 04:00:29.9057608 | 180.2160300      | 179.9684953      | 
| 910        | 812453476389782465 | 812453477968585804 | 2024-02-07 04:02:17.0284993 | 2024-02-07 04:04:54.9088332 | 179.9672453      | 179.9654282      | 
| 911        | 812453476378999816 | 812453692263928518 | 2024-02-07 04:02:15.9502344 | 2024-02-07 10:02:04.4431046 | 179.9672685      | 179.7173958      | 
| 912        | 812453476378687270 | 812453694459519806 | 2024-02-07 04:02:15.9189798 | 2024-02-07 10:05:44.0022334 | 179.9672685      | 179.7148495      | 
| 913        | 812453025926031789 | 812453695400109701 | 2024-02-06 22:04:01.8162093 | 2024-02-07 10:07:18.0612229 | 180.2160416      | 179.7137615      | 
| 914        | 812452592568429350 | 812453696032378631 | 2024-02-06 10:01:46.0559654 | 2024-02-07 10:08:21.2881159 | 180.7176041      | 179.7130324      | 
| 918        | 812453023938866652 | 812453696236467422 | 2024-02-06 22:00:43.0996956 | 2024-02-07 10:08:41.6969950 | 180.2183333      | 179.7128009      | 
| 919        | 812453476297895476 | 812453695679676954 | 2024-02-07 04:02:07.8398004 | 2024-02-07 10:07:46.0179482 | 179.9673611      | 179.7134375      |

The data retention policy for this table is set to 180 days, which means rowgroups containing only data where ValidTo >= 180 days ago is safe to drop. Looking at the output of the query above, we can see why SQL Server did not drop some of these rowgroups…all of them have a max ValidTo of ~179 days old, which is not >= 180. Ths is allowing data older than 180 days to live in the table.

Why aren’t old rows dropping from my temporal history table?

2024-08-05T13:00:00+00:00

Oh wait…yes they are.

Just a small disclaimer: this post is not intended to be a technical deep dive into how SQL Server handles temporal table data retention policies behind the scenes. The intent is to just tell a fun story and maybe, hopefully, help out a future internet traveler that has also run into this issue and give them a bit of relief/clarity as to what’s happening.

TL;DR / Spoiler: I couldn’t figure out why my temporal history table kept reporting it had old rows, despite having a data retention policy set up. Turns out it was user error. Everything was working exactly as it should.

This isn’t a recipe, click here if you want to skip the story: Stats and findings

If you’re not sure what I’m talking about, read these two pages:

To be honest, if you just read those two pages very carefully, then this blog post is pretty much useless. Unfortunately, I apparently did not read those pages very carefully, and instead was stumped by this problem for quite a while.

The problem

I recently built a system for collecting index usage statistics utilizing temporal tables, clustered columnstore indexes (CCIs) and a temporal table data retention policy. The basic idea behind the system is that it collects various stats about indexes and updates this stats table. However, because it’s a temporal table, all changes are logged to the underlying history table.

My history table is built using a clustered columnstore index and had a data retention policy set up for the temporal table, like so:

WITH (
    SYSTEM_VERSIONING = ON (
        HISTORY_TABLE = dbo.MyTable_History,
        DATA_CONSISTENCY_CHECK = ON,
        HISTORY_RETENTION_PERIOD = 6 MONTHS
    )
);

Well, the 6 month mark finally hit so I was keeping an eye on that history table to see how quickly SQL Server would delete those rows. In my mind, I was expecting it to be nearly instant, especially since SQL Server handles it at the rowgroup level with CCIs.

To my surprise…Nothing was happening…every day I would check in on this table to see where we were at, and every day there was no change and every day more and more rows were being added to the table (at a rate of about 14M rows per day). A table that was already at 2.4 billion rows.

This is the check query I was running…

SELECT MIN(ValidTo)
    , DATEDIFF(HOUR, MIN(ValidTo), SYSUTCDATETIME()) / 24.0
FROM dbo.MyTable_History;

If you see this and already see the problem…I’m happy for you, because I certainly did not.

I tried to think through why nothing was being deleted. I thought maybe there’s some weird issue going on here with > vs >=…For example, maybe behind the scenes something like this is happening:

DECLARE @today    date = '2024-08-02',
        @datadate date = '2024-02-01'
SELECT 1
WHERE DATEDIFF(MONTH, @datadate, @today) > 6

Which basically means it’s a month behind, which seems like a pretty weird decision/bug for SQL Server to have. It’s more likely that I’m wrong than for me to have run into a SQL Server bug this obvious. That said, I was still concerned, so I changed the retention policy on the table to 180 DAYS instead of 6 MONTHS, hoping that if this is due to some sort of DATEDIFF weirdness that would fix it.

I should also note that the documentation clearly states they use DATEADD, and you can even see this in the execution plan when querying a temporal table using the temporal table syntax. But I wanted to test the theory anyway.

Nothing changed.

A few weeks had gone by because I was distracted with more important work. I ran my check query and it was still showing old data existed that was 206 days old.

Fortunately, querying-wise all is good because SQL Server will automatically apply a date filter based on the retention policy so that even if data is still hanging around in the history table, it won’t be included in query results. However, that doesn’t solve my data storage issue.

Aha moment

It turns out…I should try squinting harder when I read, or maybe it’s time to admit I need glasses.

[…] aged rows can be deleted by the cleanup task, at any point in time and in arbitrary order.

Source: Querying tables with retention policy

Which means, this whole time I’ve been looking at the wrong thing. I’ve been checking for the oldest row, but not how many old rows had been removed.

So I started using this check query instead, which shows by day how many rows are ready to be pruned.

DECLARE @dt datetime2 = SYSUTCDATETIME();
DECLARE @exp datetime2 = DATEADD(DAY, -180, @dt);

SELECT ValidToDate  = CONVERT(date, ValidTo)
    , [RowCount]    = FORMAT(COUNT(*),'N0') 
    , IsExpired     = IIF(CONVERT(date, ValidTo) < @exp, 1, 0)
    , DaysOld       = DATEDIFF(DAY, CONVERT(date, ValidTo), @dt)
    , RowCountRT    = FORMAT(SUM(COUNT_BIG(*)) OVER (ORDER BY CONVERT(date, ValidTo)), 'N0')
FROM dbo.MyTable_History
WHERE ValidTo < DATEADD(DAY, 5, @exp) -- Just so we can see some non-pruned days
GROUP BY CONVERT(date, ValidTo)
ORDER BY CONVERT(date, ValidTo)

This query, combined with the fact that the data ingest rate is fairly consistent, I could see some rows were being deleted…Here’s what it looks like at the time I’m writing this:

| ValidToDate | RowCount    | IsExpired | DaysOld | RowCountRT    | 
|-------------|-------------|-----------|---------|---------------| 
| 2024-01-30  |    212,558  | 1         | 185     |    212,558    | 
| 2024-01-31  |    206,691  | 1         | 184     |    419,249    | 
| 2024-02-01  |    138,146  | 1         | 183     |    557,395    | 
| 2024-02-02  |    138,428  | 1         | 182     |    695,823    | 
| 2024-02-03  |    782,870  | 1         | 181     |  1,478,693    | 
| 2024-02-04  |  6,985,658  | 1         | 180     |  8,464,351    | 
| 2024-02-05  | 13,724,560  | 0         | 179     | 22,188,911    | 
| 2024-02-06  | 13,739,960  | 0         | 178     | 35,928,871    | 
| 2024-02-07  | 13,747,964  | 0         | 177     | 49,676,835    | 
| 2024-02-08  | 13,748,268  | 0         | 176     | 63,425,103    |

You can see it’s still showing about 5 days “behind”, BUT, the daily row count is well below the typical, which means rows are being deleted, just not in perfect order. Which aligns with the documentation for data retention policies on history tables using clustered columnstore indexes.

I could have stopped here, but I wanted to get more data…for example, how quickly is it deleting data? Is it keeping up with inserts? How often does it clean up?

Stats and findings

I wanted to get more info, so I built a small process to log stats to a table on a regular basis. Things like row count, columnstore rowgroup count, etc.

Table schema:

CREATE TABLE dbo.MyTable_History_RowCount (
    InsertDate          datetime2   NOT NULL DEFAULT GETDATE(), -- yes, GETDATE, normally I'd use SYSUTCDATETIME or SYSDATETIMEOFFSET, but for a quick one off thing I'm going to drop, this was fine.
    OldRowCount         bigint      NOT NULL,
    NewRowCount         bigint      NOT NULL,
    DateThreshold       datetime2   NOT NULL,
    RG_Compressed       int         NOT NULL, -- Compressed RowGroup count
    RG_Open             int         NOT NULL, -- Open RowGroup count
    SQLServerStartTime  datetime2   NOT NULL,
);

Logger proc:

CREATE OR ALTER PROCEDURE dbo.usp_LogTemporalTableCounts
AS
BEGIN;
    SET NOCOUNT ON;

    DECLARE @OldRowCount bigint, @NewRowCount bigint, @DateThreshold datetime2, @RGC_Compressed int, @RGC_Open int, @SQLServerStartTime datetime2;

    SET @DateThreshold = '2024-08-02'; -- Picked a random date to act as the split point.

    SELECT @OldRowCount        = COUNT_BIG(*) FROM dbo.MyTable_History WHERE ValidTo <= @DateThreshold;
    SELECT @NewRowCount        = COUNT_BIG(*) FROM dbo.MyTable_History WHERE ValidTo >  @DateThreshold;
    SELECT @RGC_Compressed     = COUNT(*) FROM sys.column_store_row_groups WHERE [object_id] = OBJECT_ID('dbo.MyTable_History') AND [state] = 3;
    SELECT @RGC_Open           = COUNT(*) FROM sys.column_store_row_groups WHERE [object_id] = OBJECT_ID('dbo.MyTable_History') AND [state] = 1;
    SELECT @SQLServerStartTime = sqlserver_start_time FROM sys.dm_os_sys_info;

    INSERT INTO dbo.MyTable_History_RowCount (OldRowCount, NewRowCount, DateThreshold, RG_Compressed, RG_Open, SQLServerStartTime)
    SELECT @OldRowCount, @NewRowCount, @DateThreshold, @RGC_Compressed, @RGC_Open, @SQLServerStartTime;

    -- Clear out unchanged history, but retain first and last row for each change
    DELETE x
    FROM (
        SELECT rn1 = ROW_NUMBER() OVER (PARTITION BY OldRowCount, NewRowCount ORDER BY InsertDate)
            ,  rn2 = ROW_NUMBER() OVER (PARTITION BY OldRowCount, NewRowCount ORDER BY InsertDate DESC)
        FROM dbo.MyTable_History_RowCount
    ) x
    WHERE x.rn1 <> 1 AND x.rn2 <> 1;
END;
GO

The basic idea here is…Grab the rowcount above and below a specific point in time. Since the table is insert only, this will tell us exactly how many rows are inserted, vs cleaned up by the retention policy cleanup job.

I ran the above proc every 5 minutes for a few days and then I ran this analysis query to see what it looked like:

SELECT x.InsertDate, x.DateThreshold, x.SQLServerStartTime
    , OldRowCount = FORMAT(x.OldRowCount, 'N0')
    , NewRowCount = FORMAT(x.NewRowCount, 'N0')
    , x.RG_Compressed, x.RG_Open
    , N'█' [██]
    , OldRowDiff        = FORMAT(NULLIF(x.OldRowDiff       , 0), 'N0')
    , NewRowDiff        = FORMAT(NULLIF(x.NewRowDiff       , 0), 'N0')
    , RG_CompressedDiff = FORMAT(NULLIF(x.RG_CompressedDiff, 0), 'N0')
    , RG_OpenDiff       = FORMAT(NULLIF(x.RG_OpenDiff      , 0), 'N0')
    , N'█' [██]
    , RowCountChangeRT  = FORMAT(SUM(x.OldRowDiff + x.NewRowDiff) OVER (ORDER BY x.InsertDate), 'N0')
FROM (
    SELECT *
        , OldRowDiff        = OldRowCount   - LAG(OldRowCount)   OVER (ORDER BY InsertDate)
        , NewRowDiff        = NewRowCount   - LAG(NewRowCount)   OVER (ORDER BY InsertDate)
        , RG_CompressedDiff = RG_Compressed - LAG(RG_Compressed) OVER (ORDER BY InsertDate)
        , RG_OpenDiff       = RG_Open       - LAG(RG_Open)       OVER (ORDER BY InsertDate)
    FROM dbo.MyTable_History_RowCount
) x
ORDER BY InsertDate DESC;

The above analysis query allows you to see how many old rows were removed, new rows added, compressed and open rowgroups created/dropped, and a running total of row counts over time.

Here’s a sample export:

| InsertDate              | DateThreshold | SQLServerStartTime      | OldRowCount   | NewRowCount | RG_Compressed | RG_Open | ██ | OldRowDiff | NewRowDiff | RG_CompressedDiff | RG_OpenDiff | ██ | RowCountChangeRT | 
|-------------------------|---------------|-------------------------|---------------|-------------|---------------|---------|----|------------|------------|-------------------|-------------|----|------------------| 
| 2024-08-03 21:15:07.516 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,340,074,012 | 28,770,672  | 2258          | 3       | █  | NULL       | NULL       | NULL              | NULL        | █  | 3,301,606        | 
| 2024-08-03 20:30:08.216 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,340,074,012 | 28,770,672  | 2258          | 3       | █  | -1,048,576 | NULL       | -1                | NULL        | █  | 3,301,606        | 
| 2024-08-03 20:25:08.130 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,341,122,588 | 28,770,672  | 2259          | 3       | █  | NULL       | NULL       | 1                 | NULL        | █  | 4,350,182        | 
| 2024-08-03 17:10:06.670 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,341,122,588 | 28,770,672  | 2258          | 3       | █  | NULL       | 543,479    | 4                 | NULL        | █  | 4,350,182        | 
| 2024-08-03 17:05:06.553 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,341,122,588 | 28,227,193  | 2254          | 3       | █  | NULL       | 3,052,855  | NULL              | -4          | █  | 3,806,703        | 
| 2024-08-03 17:00:07.810 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,341,122,588 | 25,174,338  | 2254          | 7       | █  | NULL       | NULL       | NULL              | NULL        | █  | 753,848          | 
| 2024-08-03 12:30:08.010 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,341,122,588 | 25,174,338  | 2254          | 7       | █  | -6,291,456 | NULL       | -6                | NULL        | █  | 753,848          | 
| 2024-08-03 12:25:06.376 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 25,174,338  | 2260          | 7       | █  | NULL       | NULL       | NULL              | NULL        | █  | 7,045,304        | 
| 2024-08-03 11:10:06.360 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 25,174,338  | 2260          | 7       | █  | NULL       | 574,644    | 1                 | NULL        | █  | 7,045,304        | 
| 2024-08-03 11:05:06.320 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 24,599,694  | 2259          | 7       | █  | NULL       | 3,021,690  | 1                 | 2           | █  | 6,470,660        | 
| 2024-08-03 11:00:08.080 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 21,578,004  | 2258          | 5       | █  | NULL       | NULL       | 2                 | NULL        | █  | 3,448,970        | 
| 2024-08-03 05:10:07.336 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 21,578,004  | 2256          | 5       | █  | NULL       | 1,984,706  | 1                 | 2           | █  | 3,448,970        | 
| 2024-08-03 05:05:09.593 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 19,593,298  | 2255          | 3       | █  | NULL       | 1,611,628  | 2                 | -2          | █  | 1,464,264        | 
| 2024-08-03 05:00:08.253 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 17,981,670  | 2253          | 5       | █  | NULL       | NULL       | NULL              | NULL        | █  | -147,364         | 
| 2024-08-03 04:30:10.010 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,347,414,044 | 17,981,670  | 2253          | 5       | █  | -4,194,304 | NULL       | -4                | NULL        | █  | -147,364         | 
| 2024-08-03 04:25:06.500 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,351,608,348 | 17,981,670  | 2257          | 5       | █  | NULL       | NULL       | 1                 | NULL        | █  | 4,046,940        | 
| 2024-08-02 23:10:06.266 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,351,608,348 | 17,981,670  | 2256          | 5       | █  | NULL       | 676,028    | 1                 | -1          | █  | 4,046,940        | 
| 2024-08-02 23:05:07.350 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,351,608,348 | 17,305,642  | 2255          | 6       | █  | NULL       | 2,920,306  | 1                 | -1          | █  | 3,370,912        | 
| 2024-08-02 23:00:09.950 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,351,608,348 | 14,385,336  | 2254          | 7       | █  | NULL       | NULL       | NULL              | NULL        | █  | 450,606          | 
| 2024-08-02 20:30:12.170 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,351,608,348 | 14,385,336  | 2254          | 7       | █  | -3,145,728 | NULL       | -3                | NULL        | █  | 450,606          | 
| 2024-08-02 20:25:07.330 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,354,754,076 | 14,385,336  | 2257          | 7       | █  | NULL       | NULL       | NULL              | NULL        | █  | 3,596,334        | 
| 2024-08-02 17:10:05.263 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,354,754,076 | 14,385,336  | 2257          | 7       | █  | NULL       | 870,749    | 3                 | -1          | █  | 3,596,334        | 
| 2024-08-02 17:05:05.943 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,354,754,076 | 13,514,587  | 2254          | 8       | █  | NULL       | 2,725,585  | 1                 | 4           | █  | 2,725,585        | 
| 2024-08-02 17:00:06.480 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,354,754,076 | 10,789,002  | 2253          | 4       | █  | NULL       | NULL       | NULL              | NULL        | █  | 0                | 
| 2024-08-02 16:45:08.340 | 2024-08-02    | 2024-07-26 04:21:16.230 | 2,354,754,076 | 10,789,002  | 2253          | 4       | █  | NULL       | NULL       | NULL              | NULL        | █  | NULL             |

From what I can see, the cleanup process is keeping up perfectly fine over time. The rate at which rows are deleted (technically rowgroups) is keeping up with the rate at which rows are added.

The background job runs every 8 hours based on when SQL Server was started. For example, I noticed when the instance is restarted at around 4:30am, the background cleanup job runs at around 12:30pm, 8:30pm, 4:30am.

The total number of rowgroups dropped seems to be inconsistent, but this is likely due to how the rowgroups are filled at the time the data is inserted. All that matters to me is that it’s working and it’s keeping up.

My assumption is that because multiple rowgroups are kept open at a time, some of those could be open for days. As new data is inserted, it’s distributed into those rowgroups. So if there’s 5 open rowgroups, and it takes about 5 days for them to fill up and compress…then it would make sense that the oldest data in the history table is typically around 5 days.

As far as why the table was backed up by 26 days when this whole thing started? My guess is that was a remnant of development. When I first started building the process, I was only inserting a few thousand rows at a time, instead of a few million like I do now. Which means there was likely more open rowgroups for the data to be distributed into. When the cleanup routine tried to run…it couldn’t find any rowgroups containing ONLY expired rows. Then at some point, my process started inserting millions of rows per day, which triggers the rowgroups to get compressed much quicker, closing that window.

Next blog post

Will be a sort of extension on this one…

I searched and searched around online hoping I could find some system view or undocumented function that would let me inspect the contents of an individual columnstore segment, similar to using DBCC PAGE to view the contents of an individual page, but unfortunately I couldn’t find anything. I was able to inspect individual columnstore index pages, but inspecting a single page doesn’t really help me unless I know which segment it’s coming from and I was having trouble figuring out that relationship.

I thought it would be cool if I could inspect the actual contents of the columnstore rowgroup and see why that particular rowgroup hasn’t been dropped.

Well…after about 5 hours of pulling my hair out…I discovered that sys.column_store_segments contains a min_data_id and a max_data_id value, but for columns of type datetime2 it’s just the raw value, rather than a pointer to some dictionary value or something…

So my next blog post will be about how I figured that out and my solution for it. I didn’t want this post to be even longer than it already is 😂

Everything’s a CASE statement!

2024-07-30T18:00:00+00:00

Just a quick public service announcement:

Yes…I know it’s a “CASE expression” and not a “CASE statement”.

I’ve now received quite a few responses to this post saying nearly the same thing…“Not to be pedantic, but it’s a case expression, not a case statement”.

Yes…Thank you, you are correct, it is an “expression” not a “statement”.

When I originally posted it, I considered fixing it. But in the end, it really doesn’t matter, it doesn’t change the information in the post. And as far as SEO goes…it appears most people don’t search for “sql case expression” anyway. So, this PSA is my compromise.

Exhibit A:

Back to the post

Everything’s a CASE statement!

Well…not really, but a handful of functions in T-SQL are simply just syntactic sugar for plain ol’ CASE statements and I thought it would be fun to talk about them for a bit because I remember being completely surprised when I learned this. I’ve also run into a couple weird scenarios directly because of this.

For those who don’t know what the term “syntactic sugar” means…It’s just a nerdy way to say that the language feature you’re using is simply a shortcut for another typically longer and more complicated way of writing that same code and it’s not unique to SQL.

Back to the post
COALESCE
- What about ISNULL?
IIF
NULLIF
CHOOSE
How do you see this for yourself?

Going from (what I assume to be) most popular to least popular…

COALESCE

Behind the scenes when you’re using COALESCE what exactly do you think is happening? If you’re use to working with something like C#, you might think it’s some sort of generic method with overloads like…

T COALESCE(T p1, T p2);
T COALESCE(T p1, T p2, T p3);
T COALESCE(T[] p);

And then behind the scenes when your plan is compiled, it’s just picking some overload of an internal function…right? Nope. In reality, COALESCE(x.ColA, x.ColB, x.ColC), is translated into this:

CASE
    WHEN [x].[ColA] IS NOT NULL
    THEN [x].[ColA]
    ELSE
        CASE
            WHEN [x].[ColB] IS NOT NULL
            THEN [x].[ColB]
            ELSE [x].[ColC]
        END
END

What about ISNULL?

You might be wondering to yourself…“is ISNULL is the same way?”

Nope…

In the execution plan, it’s still just isnull([x].[ColA],[x].[ColB])…well, unless x.ColA is NOT NULL, in which case it’s smart enough to just ask for x.ColA since the ISNULL is unnecessary.

Unfortunately, COALESCE does not seem to have this optimization, even when the first column supplied is NOT NULL, it still converts to a CASE statement…I would hope y’all aren’t using ISNULL/COALESCE when the first column is NOT NULL anyway 😉.

So now that you know this about COALESCE and ISNULL…that might help explain why they handle data types differently. Where ISNULL always returns the datatype of the first expression (the check expression), whereas COALESCE returns the datatype of the highest type precedence among all the expressions, which is the same behavior as CASE.

IIF

These next two are pretty short and sweet as their CASE translations are straightforward.

When you use IIF(x.ColA > 10, x.ColB, x.ColC) it translates to:

CASE
    WHEN [x].[ColA] > (10)
    THEN [x].[ColB]
    ELSE [x].[ColC]
END

NULLIF

When you use NULLIF(x.ColA, 0), it translates to:

CASE
    WHEN [x].[ColA] = (0)
    THEN NULL
    ELSE [x].[ColA]
END

You might notice that the check expression is copied twice in this CASE statement. This opens up a problem when you use non-deterministic functions. It’s probably pretty rare to run into this situation with NULLIF, but here’s an example:

SELECT NULLIF(SIGN(CHECKSUM(NEWID())), 1);

The expression SIGN(CHECKSUM(NEWID())) will randomly pick either 1 or -1. So the expected behavior is that when the expression evaluates to 1, the NULLIF will catch that and return NULL. So in theory, it should NEVER return 1…but, if you run it, it does. And it’s because the check expression is copied multiple times, which means the randomization is also run multiple times.

Here’s what it looks like…

CASE
    WHEN SIGN(CHECKSUM(NEWID())) = (1) -- Returns -1 so it evaluates to false
    THEN NULL
    ELSE SIGN(CHECKSUM(NEWID())) -- Re-runs this expression, which returns 1
END

So there are cases where it will return 1 when your expectation is it shouldn’t.

CHOOSE

This final function is probably the least used, but it’s also one of my favorites. Most of the time I use it, it’s for a fun reason. CHOOSE also has the same issue you run into with NULLIF due to how it generates the CASE statement.

A sample usage of CHOOSE is CHOOSE(x.ColA,'Foo','Bar','Baz').

For those who aren’t familiar with using CHOOSE, basically this is saying…if x.ColA is 1 then return “Foo”, if x.ColA is 2 then return “Bar”, etc.

If I were to ask you how this gets translated into a CASE statement…you might think it looks like this:

CASE x.ColA
    WHEN 1 THEN 'Foo'
    WHEN 2 THEN 'Bar'
    WHEN 3 THEN 'Baz'
    ELSE NULL
END

And if that were that case (heh, pun intended)…I think that would be ideal…Unfortunately, that’s not what happens. Instead, this is what it looks like in the execution plan:

CASE
    WHEN [x].[ColA] = (1)
    THEN 'Foo'
    ELSE
        CASE
            WHEN [x].[ColA] = (2)
            THEN 'Bar'
            ELSE
                CASE
                    WHEN [x].[ColA] = (3)
                    THEN 'Baz'
                    ELSE NULL
                END
        END
END

😢

The issue here is that our check expression is copied multiple times rather than being used once. Which means, if your check expression is not deterministic within the query, you could run into some weird issues just like we did with NULLIF.

For example, I’ve used CHOOSE in the past to act as a sort of “round-robin” picker. For example, maybe I have some sort of EventTypeID and I want to pick one at random for generating a test script. So I’ll write something like this:

DECLARE @RandEventTypeID int;
SELECT @RandEventTypeID = CHOOSE(ABS(CHECKSUM(NEWID())%5)+1, 1, 2, 5, 7, 21)
SELECT @RandEventTypeID

ABS(CHECKSUM(NEWID())%5)+1 will pick a random number from 1 to 5. So the expected behavior of the script above would be to return one of those EventTypeID values at random…But that’s not what happens. Try running it yourself, and you’ll see it occasionally returns NULL.

Here’s why:

CASE
    WHEN (abs(checksum(newid())%(5))+(1))=(1) THEN (1)
    ELSE
        CASE
            WHEN (abs(checksum(newid())%(5))+(1))=(2) THEN (2)
            ELSE
                CASE
                    WHEN (abs(checksum(newid())%(5))+(1))=(3) THEN (5)
                    ELSE
                        CASE
                            WHEN (abs(checksum(newid())%(5))+(1))=(4) THEN (7)
                            ELSE
                                CASE
                                    WHEN (abs(checksum(newid())%(5))+(1))=(5) THEN (21)
                                    ELSE NULL
                                END
                        END
                END
        END
END

Just like with NULLIF, that check expression was copied over and over, which means each time it is evaluated, it generates a new random value.

So how do we fix/avoid this? Don’t put your random expression directly into CHOOSE (or NULLIF), you need to create an alias for it or use a variable, like so:

DECLARE @RandEventTypeID int,
        @RandSeed int = ABS(CHECKSUM(NEWID())%5)+1; -- Computed first, one time
SELECT @RandEventTypeID = CHOOSE(@RandSeed, 1, 2, 5, 7, 21)
SELECT @RandEventTypeID

-- OR if you need to do it for multiple rows...
SELECT CHOOSE(x.RandSeed, 1, 2, 5, 7, 21)
FROM (VALUES (1), (2)) t(foo)
    CROSS APPLY (SELECT RandSeed = ABS(CHECKSUM(NEWID())%5)+1) x; -- Computed first as a "Compute Scalar" in the plan, then passed into CHOOSE

In both of those cases, instead of copying the random expression, the random expression is computed first and then later passed into CHOOSE as a constant value.

How do you see this for yourself?

Rather than pasting a bunch of screenshots in for every example, I’m just going to do it once here.

If you want to see this for yourself, there are two thing you need to do.

Ensure you’re testing on a query with a FROM clause, otherwise SQL Server won’t generate an execution plan. I’m sure there are exceptions to that, but at least in regard to building the small test cases for this post, I had to make sure each query had a FROM clause, even if it was something small like FROM (SELECT x = 1) x.
Enable “Include Actual Execution Plan”

Run your test query:

CREATE TABLE #tmp (ColA int NULL);
INSERT INTO #tmp VALUES (1)

SELECT COALESCE(x.ColA, 10)
FROM #tmp x

Then take a look at the execution plan, and view the properties for the operator (there should only be one or two if it’s one of these test queries).

This is the most consistent way to see it. I’ve found that depending on the query, you might also be able to see it in that query text preview under “Query 1:”, as well as in the operator stats pop-up, like this:

Fun with Unicode characters in SQL Queries

2024-07-09T14:00:00+00:00

I never thought this would make for a good blog post, but here we are. Every single time I share a query that uses Unicode characters, someone always asks me what it is and why I’m using it. So now I have this blog post I can send to anyone who asks about it 😄.

I don’t want to get too far into the weeds explaining encodings, code points, etc. Mostly because you can just Google it, but also because it’s very confusing. Despite all the hours I’ve spent trying to learn about it, I still don’t get a lot of it. There’s also a lot of nuance regarding encodings when it comes to SQL Server, different collations, and different SQL versions. However, I did come across this blog post that seems to break it down well.

For the purposes of this post, all you really need to know is Unicode is what allows applications to support non-english text (データベース), special symbols (•, ™, °, ©,π), diacritics (smörgåsbord, jalapeño, résumé), and much more. Unicode makes all this possible.

Unicode is HUGE, and there are a ton of characters that most people don’t even know exist. Sometimes I find myself scrolling through Unicode lookup sites just to see if I can find any cool/fun/useful characters I could use…A totally normal Saturday afternoon activity…👀

Out of all the random Unicode characters I’ve found…the one I use on a daily basis is █…that’s it, just a plain boring block. For the most part, this blog post will be about how this one boring character can help make your SQL queries a little easier to look at.

How do you type these!?

Before anyone asks “how do you type these”…To be honest, I don’t, because I use SQL Prompt snippets where I’ve copy pasted my most used Unicode characters. I guess if you really want to type them yourself every time, you can use keyboard shortcuts. Here’s a website I found that has a list of common Unicode symbols and their “alt codes”. Where you hold alt and type the code. In the case of █, you would type alt+219.

‼ A very important note…if you are using ANY Unicode characters in a string literal in SQL Server, you have to make sure you prefix the string with N. Otherwise the Unicode characters won’t render, and you’ll just end up with blanks, question marks, etc. For example:

SELECT N'This is a Unicode string in SQL Server! 🦄'
SELECT 'This is NOT a Unicode string in SQL Server! 😭' -- Except when using UTF-8 collations in 2019+...read the blog post linked above

Adding a column set separator

Have you ever written a query that joins a whole bunch of tables with a SELECT * at the top? …Of course you have, you’re a SQL developer. The problem is now you’re staring at a massive dataset 100 columns wide.

For example…

SELECT *
FROM sys.indexes i
    JOIN sys.objects o ON o.[object_id] = i.[object_id]
    JOIN sys.stats s ON s.[object_id] = i.[object_id] AND s.stats_id = i.index_id
    JOIN sys.partitions p ON p.[object_id] = i.[object_id] AND p.index_id = i.index_id

Just scrolling through all those columns, how do you know which columns are in which table? You could probably figure it out pretty quick if you know the data and have a good idea what the first column in each table is, but I’ve found that to be annoying. If you were working in Excel, would you add any special border formatting to make it a little easier to read? Because I would.

In the past, I would do something like this…

SELECT 'sys.indexes ->'  , i.*
    , 'sys.objects ->'   , o.*
    , 'sys.stats ->'     , s.*
    , 'sys.partitions ->', p.*
FROM ...

" as a way to visually separate related columns" />

Don’t try to convince me that after staring at result grids all day that you’re going to easily and quickly spot that out of 100 columns.

Now, my typical pattern is to do something like this:

SELECT N'█ sys.indexes -> █'   , i.*
    ,  N'█ sys.objects -> █'   , o.*
    ,  N'█ sys.stats -> █'     , s.*
    ,  N'█ sys.partitions -> █', p.*
FROM ...

█" to visually separate related columns" />

I find that to be significantly easier to spot…Though, most times I really only do this…

SELECT N'█' [█], i.*
    ,  N'█' [█], o.*
    ,  N'█' [█], s.*
    ,  N'█' [█], p.*
FROM ...

Does it make the SELECT portion of the queries just a little bit ugly? Sure, but I’ve gotten used to it. And I feel the pros outweigh the cons.

Adding a visual row identifier

My second most common usage for █ is to easily spot specific rows I’m targeting while looking at a larger dataset. For example, I have a table of records with expiration dates, but I’m doing some data analysis, looking for patterns and I want to see the whole dataset, and not just those that are expired or vice versa.

Here’s a sample query/data generator:

SELECT TOP(100) x.ItemID, y.StartDate, z.ExpirationDate
    , Expired = IIF(z.ExpirationDate <= GETDATE(), N'██', '')
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) x(ItemID) -- If you're on SQL2022 try using GENERATE_SERIES(1,10) 😁
    CROSS APPLY (SELECT StartDate      = DATEADD(MILLISECOND,-FLOOR(RAND(CHECKSUM(NEWID()))*864000000), GETDATE())) y
    CROSS APPLY (SELECT ExpirationDate = DATEADD(MILLISECOND, FLOOR(RAND(CHECKSUM(NEWID()))*864000000), y.StartDate)) z

And here’s what that output might look like…

Obviously, you don’t HAVE to use Unicode here, you’d probably be just as well off using 1 or ## or whatever you want. I personally find that this makes it incredibly obvious and easy to spot.

Creating a bar chart

Now…this is more of a hack. By this point, if you’re creating bar charts with Unicode in SQL queries, you should probably be using some sort of reporting/GUI tool anyway. But it’s still fun.

I often find use in this because I can throw it into a simple utility script and then share that SQL script with others. It has the little bar graph graph in without them having to do anything special other than run it.

I won’t paste the whole script, but you can see where I’ve done this in a simple Drive Usage script here.

The result of which looks like this:

Except here you’ll notice I’m actually using two different characters. █ to represent used space, and ▒ (alt+177) to represent unused space.

Which boils down to these expressions:

DECLARE @barwidth int          = 50, -- Controls the overall width of the bar
        @pct      decimal(3,2) = 0.40; -- The percentage to render as a bar chart

-- Dark portion of the bar represents the percentage (ex. Percent used space)
SELECT REPLICATE(N'█', CONVERT(int,   FLOOR((    @pct) * @barwidth)))
     + REPLICATE(N'▒', CONVERT(int, CEILING((1 - @pct) * @barwidth)));

-- Light portion of the bar represents the percentage (ex. Percent free space)
SELECT REPLICATE(N'█', CONVERT(int,   FLOOR((1 - @pct) * @barwidth)))
     + REPLICATE(N'▒', CONVERT(int, CEILING((    @pct) * @barwidth)));

Use as a delimiter

I’m pretty sure I stole this idea from Adam Bertrand, but I can’t seem to find the post. The idea is to use a Unicode character that has a very unlikely chance of occurring in your data to use as a split point / delimiter.

The article I stole it from uses nchar(9999), which is just this ✏, a pencil, so that’s also what I happen to use now. You could pick from thousands of other characters as long as it’s not going to show up in your (hopefully clean) data.

For example, I’ll occasionally write something like this…

DECLARE @d nchar(1) = NCHAR(9999);

SELECT STRING_AGG(s.servicename, @d) WITHIN GROUP (ORDER BY s.servicename)
FROM sys.dm_server_services s

Which results in…

SQL Full-text Filter Daemon Launcher (MSSQLSERVER)✏SQL Server (MSSQLSERVER)✏SQL Server Agent (MSSQLSERVER)

This isn’t necessarily a great option in all cases, you could also use something more appropriate like JSON or XML. But depending on what I’m working on, sometimes it’s nice to have something a bit lighter weight.

Wrap it up…

This is really only scratching the surface of what Unicode has to offer and how you can use it in SQL. I recommend checking out various Unicode blocks (related sections of characters). Some good ones to check out would be block elements (what we’ve been using in this post), box drawing, arrows (there’s actually like 4 blocks just for arrows), playing cards…just to name a few. You can view the full list here.

I’ve also seen some pretty cool stuff for writing 3D text in SQL comments, using box drawing characters to visualize a parent-child hierarchy (kinda like when you run the windows tree command), etc.

Let me know what some of your favorite tricks are using Unicode characters.

What’s new in SQL Server 2022

2022-06-02T19:30:00+00:00

Note: It seems since this post was written, some functions have been added, or some syntax has changed. So until I have time to update this post with the latest information, keep that in mind.

I’ve been excited to play with the new features and language enhancements in SQL Server 2022 so I’ve been keeping an eye on the Microsoft Docker repository for the 2022 image. Well they finally added it! I immediately pulled the image and started playing with it.

I want to focus on the language enhancements as those are the easiest to demonstrate, and I feel that’s what you’ll be able to take advantage of the quickest after upgrading.

Here’s the official post from Microsoft.

Table of contents:

Docker Tag
GENERATE_SERIES()
GREATEST() and LEAST()
STRING_SPLIT()
DATE_BUCKET()
FIRST_VALUE() and LAST_VALUE()
WINDOW clause
JSON functions
Wrapping up

Docker Tag

I won’t go into the details of how to set up or use Docker, but you should definitely set aside some time to learn it. You can copy paste the command supplied by Microsoft on their Docker Hub page for SQL Server, but this is the one I prefer to use:

docker run -it `
    --name sqlserver `
    -e ACCEPT_EULA='Y' `
    -e MSSQL_SA_PASSWORD='yourStrong(!)Password' `
    -e MSSQL_AGENT_ENABLED='True' `
    -p 1433:1433 `
    mcr.microsoft.com/mssql/server:2022-latest;

This sets it up to always use the same name of “sqlserver” for the container, this keeps you from creating multiple SQL server containers. It keeps it in interactive mode so you can watch for system errors, and it starts it up with SQL Agent running. Also, this will automatically download and run the SQL Server image if you don’t already have it.

You won’t need to worry about loading up any specific databases for this blog post, but if that’s something you’d like to learn how to do, I’ve blogged about it here.

GENERATE_SERIES()

Microsoft Documentation

I want to cover this function first so we can use it to help us with building sample data for the rest of this post.

Generating a series of incrementing (or decrementing) values is extremely useful. If you’ve never used a “tally table” or a “numbers table” plenty of other SQL bloggers have covered it and I highly recommend looking up their posts.

A few uses for tally tables:

Can often be the solution that avoids reverting to what Jeff Moden likes to call “RBAR”…Row-By-Agonizing-Row. Tally tables can help you perform iterative / incremental tasks without having to build any looping mechanisms. In fact, one of the fastest solutions for splitting strings (prior to STRING_SPLIT()) is using a tally table. Up until recently (we’ll cover that later), that tally table string split function was still one of the best methods even despite STRING_SPLIT() being available.
Can help you with reporting, such as building a list of dates so that you don’t have gaps in your aggregated report that is grouped by day or month. If you group sales by month, but a particular month had no sales you can use the tally table to fill the gaps with “0” sales.
They’re great for helping you generate sample data as you’ll see throughout this post.

Prior to this new function, the best way I’ve seen to generate a tally table is using the CTE method, like so:

WITH c1 AS (SELECT x.x FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) x(x))  -- 10
    , c2(x) AS (SELECT 1 FROM c1 x CROSS JOIN c1 y)                                -- 10 * 10
    , c3(x) AS (SELECT 1 FROM c2 x CROSS JOIN c2 y CROSS JOIN c2 z)                -- 100 * 100 * 100
    , c4(rn) AS (SELECT 0 UNION SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM c3)  -- Add zero record, and row numbers
SELECT TOP(1000) x.rn
FROM c4 x
ORDER BY x.rn;

This will generate rows with values from 0 to 1,000,000. In this sample, it is using a TOP(1000) and an ORDER BY to only return the first 1,000 rows (0 - 999). It can be easily modified to generate more or less rows or ranges of rows and it’s extremely fast.

Another method I personally figured out while trying to work on a code golf problem was using XML:

DECLARE @x xml = REPLICATE(CONVERT(varchar(MAX),''), 999); --Table size
WITH c(rn) AS (
    SELECT 0 
    UNION ALL
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    FROM @x.nodes('n') x(n)
)
SELECT c.rn
FROM c;

This method is more for fun, and I typically wouldn’t use this in a production environment. I’m sure it’s plenty stable, I just prefer the CTE method more. This method also returns 1000 records (0 - 999).

Now, in comes the GENERATE_SERIES() function. You specify where it starts, where it ends and what to increment by (optional). Though, this is certainly not a direct drop in replacement for the options above and I’ll show you why.

SELECT [value]
FROM GENERATE_SERIES(START = 0, STOP = 999, STEP = 1);

This is pretty awesome, and it definitely beats typing all that other junk from the other options and it’s a lot more straight forward and intuitive to read.

I think it’s great that you can customize it to increment, decrement, change the range and even change the datatype by supplying decimal values. You can also set the “STEP” size (i.e. only return every N values). I could see this coming in handy for generating date tables. For example, generate a list of dates going back every 30 days or every 14 days.

-- List of dates going back every 30 days for 180 days
SELECT DateValue = CONVERT(date, DATEADD(DAY, [value], '2022-06-01'))
FROM GENERATE_SERIES(START = -30, STOP = -180, STEP = -30);

/* Result:
| DateValue  |
|------------|
| 2022-05-02 |
| 2022-04-02 |
| 2022-03-03 |
| 2022-02-01 |
| 2022-01-02 |
| 2021-12-03 |
*/

You could certainly do this with the CTE method, it just wouldn’t be as obvious as this.

However, I quickly discovered one major caveat…performance. This GENERATE_SERIES() function is an absolute pig 🐷. I don’t know why it’s so slow, maybe they’re still working out the kinks, or maybe it will improve in a future update.

Here’s how it stacks up on my local machine in docker.

Generating 1,000,001 rows from 0 to 1,000,000:

Method	CPU Time (ms)	Elapsed Time (ms)
CTE	1,361	500
XML	775	784
`GENERATE_SERIES()`	47,801	44,371

Unless there’s something wrong with my docker image…this doesn’t seem ready for prime time. I could see this being used in utility scripts, sample scripts (like this blog post), reporting procs, etc. Where you only need to generate a small set of records, but if you need to generate a large set of records often, it seems you’re best sticking with the CTE method for now.

I could possibly see this being useful when used inline where you need to generate a different number of records for each row (e.g. in an APPLY operator). However, even then, seeing how slow this is, you might be better off building your own TVF using the CTE method 🤷‍♂️. So while it may be shorter and much easier to use, I’m not sure if the performance trade-off is worth it. Hopefully it’s just my machine?

Now that we’ve got this one out of the way, we can use it to help us with generating sample data for the rest of this post.

GREATEST() and LEAST()

Microsoft Documentation:

These are not exactly new. They have made their way around the blogging community for a while after they were discovered as undocumented functions in Azure SQL Database, but they’re still worth demonstrating since they are part of the official 2022 release changes.

I’m sure all of you know how to use MIN() and MAX(). These are aggregate functions that run against a grouping, or a window. Their usage is fairly straight forward. If you want to find the highest or lowest value for a single column in a GROUP BY or a window function, you would use one of those.

But what if you want to get the highest or lowest value from multiple columns within a row? For example, maybe you have LastModifiedDate, LastAccessDate and LastErrorDate columns and you want the most recent date in order to determine the last interaction with that item?

Previously, you’d need to use a case statement or a table value constructor.

It would look something like this:

-- Generate sample data

DROP TABLE IF EXISTS #event;
CREATE TABLE #event (
    ID                  int     NOT NULL IDENTITY(1,1),
    LastModifiedDate    datetime    NULL,
    LastAccessDate      datetime    NULL,
    LastErrorDate       datetime    NULL,
);

INSERT INTO #event (LastModifiedDate, LastAccessDate, LastErrorDate)
SELECT DATEADD(SECOND, -(RAND(CHECKSUM(NEWID())) * 200000000), GETDATE())
    ,  DATEADD(SECOND, -(RAND(CHECKSUM(NEWID())) * 200000000), GETDATE())
    ,  DATEADD(SECOND, -(RAND(CHECKSUM(NEWID())) * 200000000), GETDATE())
FROM GENERATE_SERIES(START = 1, STOP = 5); -- See...nifty, right?

-- Old method using table value constructor
SELECT LastModifiedDate, LastAccessDate, LastErrorDate
    , y.[Greatest], y.[Least]
FROM #event
    CROSS APPLY (
        SELECT [Least] = MIN(x.val), [Greatest] = MAX(x.val)
        FROM (VALUES (LastModifiedDate), (LastAccessDate), (LastErrorDate)) x(val)
    ) y;

-- New method using LEAST/GREATEST functions
SELECT LastModifiedDate, LastAccessDate, LastErrorDate
    , [Greatest] = GREATEST(LastModifiedDate, LastAccessDate, LastErrorDate)
    , [Least]    = LEAST(LastModifiedDate, LastAccessDate, LastErrorDate)
FROM #event;

Result:

Of course this also comes with a caveat. These new functions are great if all you want to do is find the highest or lowest value…but if you want to use any other aggregate function, like AVG() or SUM()…unfortunately you’d still need to use the old method.

STRING_SPLIT()

Microsoft Documentation

This is also not a new function, however, after (I’m sure) many requests…it has been enhanced. Most people probably don’t know, or maybe just haven’t bothered to care, but up until now, you should never rely on the order that STRING_SPLIT() returns its results. They are not considered to be returned in any particular order, and that is still the case.

However, they have now added an additional “ordinal” column that you can turn on using an optional setting.

Before, you would often seen people use STRING_SPLIT() like this:

SELECT [value], ordinal
FROM (
    SELECT [value]
        , ordinal = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM STRING_SPLIT('one fish,two fish,red fish,blue fish', ',')
) x;

/* Result:
| value     | ordinal |
|-----------|---------|
| one fish  | 1       |
| two fish  | 2       |
| red fish  | 3       |
| blue fish | 4       |
*/

And while you more than likely will get the right numbers associated with the correct position of the item…you really shouldn’t do this because it’s undocumented behavior. At any time, Microsoft could change how this function works internally, and now all of a sudden that production code you wrote relying on its order is messing up.

But now you can enable an “ordinal” column to be included in the output. The value of the column indicates the order in which the item occurs in the string.

SELECT [value], ordinal
FROM STRING_SPLIT('one fish,two fish,red fish,blue fish', ',', 1);

/* Result:
| value     | ordinal |
|-----------|---------|
| one fish  | 1       |
| two fish  | 2       |
| red fish  | 3       |
| blue fish | 4       |
*/

DATE_BUCKET()

Microsoft Documentation

Now this is a cool new function that I’m looking forward to testing out. It’s able to give you the beginning of a date range based on the interval you provide it. For example “what’s the first day of the month for this date”.

Simple usage:

DECLARE @date datetime = GETDATE();
SELECT Interval, [Value]
FROM (VALUES
      ('Source' , @date)
    , ('SECOND' , DATE_BUCKET(SECOND , 1, @date))
    , ('MINUTE' , DATE_BUCKET(MINUTE , 1, @date))
    , ('HOUR'   , DATE_BUCKET(HOUR   , 1, @date))
    , ('DAY'    , DATE_BUCKET(DAY    , 1, @date))
    , ('WEEK'   , DATE_BUCKET(WEEK   , 1, @date))
    , ('MONTH'  , DATE_BUCKET(MONTH  , 1, @date))
    , ('QUARTER', DATE_BUCKET(QUARTER, 1, @date))
    , ('YEAR'   , DATE_BUCKET(YEAR   , 1, @date))
) x(Interval, [Value]);

/* Result:
| Interval | Value                   |
|----------|-------------------------|
| Source   | 2022-06-02 13:30:48.353 |
| SECOND   | 2022-06-02 13:30:48.000 |
| MINUTE   | 2022-06-02 13:30:00.000 |
| HOUR     | 2022-06-02 13:00:00.000 |
| DAY      | 2022-06-02 00:00:00.000 |
| WEEK     | 2022-05-30 00:00:00.000 |
| MONTH    | 2022-06-01 00:00:00.000 |
| QUARTER  | 2022-04-01 00:00:00.000 |
| YEAR     | 2022-01-01 00:00:00.000 |
*/

See how each interval is being rounded down to the nearest occurrence? This is super useful for things like grouping data by month. For example, “group sales by month using purchase date”. Prior to this you’d have to use methods like the following:

SELECT DATEPART(MONTH, PurchaseDate), DATEPART(YEAR, PurchaseDate)
FROM dbo.Sale
GROUP BY DATEPART(MONTH, PurchaseDate), DATEPART(YEAR, PurchaseDate);

--OR

SELECT MONTH(PurchaseDate), YEAR(PurchaseDate)
FROM dbo.Sale
GROUP BY MONTH(PurchaseDate), YEAR(PurchaseDate);

Those work, but they’re ugly, because now you have a column for month and a column for year. So then you might use something like:

SELECT DATEFROMPARTS(YEAR(PurchaseDate), MONTH(PurchaseDate), 1)
FROM dbo.Sale
GROUP BY DATEFROMPARTS(YEAR(PurchaseDate), MONTH(PurchaseDate), 1);

--OR

SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
FROM dbo.Sale
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0);

These methods work too…but they’re both a bit ugly, especially that second method. But that second method comes in handy when you need to use other intervals, like WEEK or QUARTER because then the DATEFROMPARTS() method doesn’t work.

So rather than using all those old methods, now you can use:

SELECT DATE_BUCKET(MONTH, 1, PurchaseDate)
FROM dbo.Sale
GROUP BY DATE_BUCKET(MONTH, 1, PurchaseDate);

Easy as that. Easier to read, easier to know what it’s doing.

It also allows you to specify a “bucket width”. To put it in plain terms, it allows you to round down to the nearest increment of time. For example, you could use it to round down to the nearest interval of 5 minutes. So 06:33:34 rounds down to 06:30:00. This is great for reporting. You can break data up into chunks, for example, maybe you want to break the day up into 8 hour shifts.

DROP TABLE IF EXISTS #log;
CREATE TABLE #log (
    InsertDate datetime NULL,
);

-- Generate 1000 events with random times spread out across a single day
INSERT INTO #log (InsertDate)
SELECT DATEADD(SECOND, -(RAND(CHECKSUM(NEWID())) * 86400), '2022-06-02')
FROM GENERATE_SERIES(START = 1, STOP = 1000); -- I told you this would be useful

SELECT TOP(5) InsertDate FROM #log;
/*
| InsertDate              |
|-------------------------|
| 2022-06-01 19:22:54.000 |
| 2022-06-01 08:01:13.000 |
| 2022-06-01 09:35:48.000 |
| 2022-06-01 22:28:38.000 |
| 2022-06-01 05:26:08.000 |
*/

In this example, I’ve generate 1,000 random events to simulate a log table. Prior to using DATE_BUCKET() how would you have broken this up into 8 hour chunks? Here’s how I would have done it:

SELECT DATEADD(HOUR, (DATEDIFF(HOUR, 0, InsertDate) / 8) * 8, 0)
    , Total = COUNT(*)
FROM #log
GROUP BY DATEADD(HOUR, (DATEDIFF(HOUR, 0, InsertDate) / 8) * 8, 0);

All this is doing is getting the number of hours since 1900-01-01 (0), then dividing by 8. Since I’m dividing an int by an int, it automatically floors the result. So 10 / 8 = 1, 15 / 8 = 1, 16 / 8 = 2, otherwise you would need to explicitly use FLOOR(). Then it is re-adding those hours back to 0 to get the datetime rounded to the nearest increment of 8 hours. Fortunately, increments of 2, 3, 4, 6, 8 and 12 all work nicely with this method.

However, DATE_BUCKET() makes this a lot easier:

SELECT Bucket = DATE_BUCKET(HOUR, 8, InsertDate)
    , Total = COUNT(*)
FROM #log
GROUP BY DATE_BUCKET(HOUR, 8, InsertDate);

/* Result:
| Bucket                  | Total |
|-------------------------|-------|
| 2022-06-01 00:00:00.000 | 378   |
| 2022-06-01 08:00:00.000 | 303   |
| 2022-06-01 16:00:00.000 | 319   |
*/

FIRST_VALUE() and LAST_VALUE()

Microsoft Documentation:

Similar to SPLIT_STRING(), neither of these are new, but they have been greatly enhanced. After years of waiting, we finally have the ability to control how NULL values are handled with the use of IGNORE NULLS and RESPECT NULLS.

In SQL Server, NULL values are always sorted to the “lowest” end. So if you sort ascending, NULL values will be at the top. Unfortunately, we don’t have a choice over that matter. In other RDBMSs such as Postgres, you can control this behavior (e.g. ORDER BY MyValue ASC NULLS LAST).

I’ve personally never run into this as a major problem, there’s always ways around it, such as ORDER BY IIF(MyValue IS NULL, 0, 1), MyValue, which will sort NULL values to the bottom first, then sort by MyValue.

In a similar way, you can run into issues with this when using FIRST_VALUE() or LAST_VALUE() and the data contains NULL values. It’s not exactly the same issue, but it goes along the same lines as having control over how NULL values are treated.

I was going to build a example for this, but then I ran across this article from Microsoft, which uses the exact example I was going to build, and it perfectly explains and demonstrates how you can use this new feature to fill in missing data using IGNORE NULLS:

https://docs.microsoft.com/en-us/azure/azure-sql-edge/imputing-missing-values

WINDOW clause

Microsoft Documentation

I’m honestly very surprised that this was included, but I’m glad it was. If you’re familiar with using window functions, then you are going to love this.

Let’s use a very simple example. I’m going to use GENERATE_SERIES() to get a list of values 1 - 10. Now I want to perform some window operations on those values, partitioning them by odd vs even. So for both odd and even numbers, I want to see a row number, a running total (sum), a running count, and a running average.

SELECT [value]
    , RowNum = ROW_NUMBER() OVER (PARTITION BY [value] % 2 ORDER BY [value])
    , RunSum = SUM([value]) OVER (PARTITION BY [value] % 2 ORDER BY [value])
    , RunCnt = COUNT(*)     OVER (PARTITION BY [value] % 2 ORDER BY [value])
    , RunAvg = AVG([value]) OVER (PARTITION BY [value] % 2 ORDER BY [value])
FROM GENERATE_SERIES(START = 1, STOP = 10)
ORDER BY [value];

/* Result:
| value | RowNum | RunSum | RunCnt | RunAvg |
|-------|--------|--------|--------|--------|
| 1     | 1      | 1      | 1      | 1      |
| 2     | 1      | 2      | 1      | 2      |
| 3     | 2      | 4      | 2      | 2      |
| 4     | 2      | 6      | 2      | 3      |
| 5     | 3      | 9      | 3      | 3      |
| 6     | 3      | 12     | 3      | 4      |
| 7     | 4      | 16     | 4      | 4      |
| 8     | 4      | 20     | 4      | 5      |
| 9     | 5      | 25     | 5      | 5      |
| 10    | 5      | 30     | 5      | 6      |
*/

The problem here is we’re repeating a lot of code…OVER (PARTITION BY [value] % 2 ORDER BY [value]) is repeated four times. That’s a bit wasteful, and open to error. All it takes is for that window to change and a developer accidentally forgets to update one of them.

That’s where the new WINDOW clause comes in. Instead, you can define your window with a name/alias and then reuse it. So it is only defined once.

SELECT [value]
    , RowNum = ROW_NUMBER() OVER win
    , RunSum = SUM([value]) OVER win
    , RunCnt = COUNT(*)     OVER win
    , RunAvg = AVG([value]) OVER win
FROM GENERATE_SERIES(START = 1, STOP = 10)
WINDOW win AS (PARTITION BY [value] % 2 ORDER BY [value])
ORDER BY [value];

I love how simple this is. Now our window is defined only once. Any future changes only need to alter a single line. I’m looking forward to using this one.

JSON functions

I saved this section for last on purpose because I have almost no experience working with JSON so I likely won’t have great real-world examples, but I can at least walk through the usage of these functions.

Microsoft has great examples in their documentation already, so this walk-through is more for me than you because it’s forcing me to learn how to use these functions.

Microsoft Documentation:

ISJSON()

The ISJSON() function is not new (thank you to the reddit user that pointed this out to me), but it was enhanced. There is now a json_type_constraint parameter.

Without the new parameter, this one is about as simple as it gets…It checks whether the value you pass is valid JSON or not.

SELECT ISJSON('{ "name":"Chad" }'); -- Returns 1 because it is valid JSON
SELECT ISJSON('{ name:"Chad" }');   -- Returns 0 because it is invalid JSON

However, the new parameter allows you to do a little more than just check whether the blob you pass to it is valid or not. Now you can check if its type is valid as well. Maybe you’re generating JSON and you want to test the individual parts rather than testing the entire blob at the end of your task.

Here are some test cases:

SELECT *
FROM (VALUES  ('string','"testing"'), ('empty string','""'), ('bad string','asdf')
            , ('scalar','1234')
            , ('boolean','true'), ('bad boolean', 'TRUE')
            , ('array','[1,2,{"foo":"bar"}]'), ('empty array', '[]')
            , ('object','{"name":"chad"}'), ('empty object','{}')
            , ('null literal','null')
            , ('blank value', '')
            , ('NULL value', NULL)
) x([type], [value])
    CROSS APPLY (
        -- Case statements to make visualization of results easier
        SELECT [VALUE]  = CASE ISJSON(x.[value], VALUE)  WHEN 1 THEN 'True' WHEN 0 THEN '' ELSE NULL END
            ,  [SCALAR] = CASE ISJSON(x.[value], SCALAR) WHEN 1 THEN 'True' WHEN 0 THEN '' ELSE NULL END
            ,  [ARRAY]  = CASE ISJSON(x.[value], ARRAY)  WHEN 1 THEN 'True' WHEN 0 THEN '' ELSE NULL END
            ,  [OBJECT] = CASE ISJSON(x.[value], OBJECT) WHEN 1 THEN 'True' WHEN 0 THEN '' ELSE NULL END
    ) y

/* Result:
| type         | value               | VALUE | SCALAR | ARRAY | OBJECT | 
|--------------|---------------------|-------|--------|-------|--------| 
| string       | "testing"           | True  | True   |       |        | 
| empty string | ""                  | True  | True   |       |        | 
| bad string   | asdf                |       |        |       |        | 
| scalar       | 1234                | True  | True   |       |        | 
| boolean      | true                | True  |        |       |        | 
| bad boolean  | TRUE                |       |        |       |        | 
| array        | [1,2,{"foo":"bar"}] | True  |        | True  |        | 
| empty array  | []                  | True  |        | True  |        | 
| object       | {"name":"chad"}     | True  |        |       | True   | 
| empty object | {}                  | True  |        |       | True   | 
| null literal | null                | True  |        |       |        | 
| blank value  |                     |       |        |       |        | 
| NULL value   | NULL                | NULL  | NULL   | NULL  | NULL   | 
*/

Based on these results you can see that VALUE is a generic check, determining whether the value is valid regardless of type. Whereas SCALAR, ARRAY and OBJECT are more granular and check for specific types.

JSON_PATH_EXISTS()

Checks to see whether the path you specify exists in the provided JSON blob.

DECLARE @jsonblob nvarchar(MAX) = N'
{
    "name":"Chad Baldwin",
    "addresses":[
        {"type":"billing", "street":"123 Main Street", "city":"New York", "state":"NY", "zip":"01234"},
        {"type":"shipping", "street":"2073 Beech Street", "city":"Pleasanton", "state":"CA", "zip":"94566"}
    ]
}';

SELECT ISJSON(@jsonblob); -- returns 1 because it is valid JSON
SELECT JSON_PATH_EXISTS(@jsonblob, '$.addresses[0].zip'); -- returns 1 because the path exists

Explanation of $.addresses[0].zip:

$ - represents the root of the blob
addresses[0] - returns the first object within the addresses array.
zip - looks for a property named zip within that object

JSON_OBJECT()

This is an interesting one, the syntax is a bit odd, but you’re basically passing the function key:value pairs, which it then uses to build a simple JSON string.

SELECT item = x.y, jsonstring = JSON_OBJECT('item':x.y)
FROM (VALUES ('one fish'),('two fish'),('red fish'),('blue fish')) x(y);

/* Result:
| item      | jsonstring           | 
|-----------|----------------------| 
| one fish  | {"item":"one fish"}  | 
| two fish  | {"item":"two fish"}  | 
| red fish  | {"item":"red fish"}  | 
| blue fish | {"item":"blue fish"} | 
*/

So it allows you to generate a JSON object for a set of values/columns on a per row basis.

JSON_ARRAY()

This is similar as JSON_OBJECT() in regard to generating JSON from data, except instead of creating an object with various properties, it’s creating an array of values or objects.

SELECT JSON_ARRAY('one fish','two fish','red fish','blue fish');

/* Result:
["one fish","two fish","red fish","blue fish"]
*/

From there you can combine JSON_OBJECT and JSON_ARRAY to generate nested JSON blobs from your data.

Wrapping up

This ended up being much longer than I had originally anticipated, but I’m glad I went through it as it helped me gain a much better understanding of all these changes, new functions, enhancements, and how to use them in real world situations.

Thanks for reading!

Handling log files in PowerShell

2022-04-04T14:00:00+00:00

Inspecting and monitoring log files.

Let’s talk about how to make something that’s already super exciting, even more fun, by using PowerShell. Why bother with fancy GUIs and polished tools when you can do it the fun way?

Yes, there’s lots of good options now when it comes to logging, like structured logs, AWS CloudWatch, Azure Monitor, ELK, etc. Tools that give you a lot of power when it comes to filtering, alerts, and monitoring. However, I still often find myself digging through good ol’ *.log files on a server.

There’s lots of “tail” style GUIs and CLI tools out there, but it’s still good to know how to do it using plain PowerShell, especially when you don’t want to deal with installing or downloading some app to a blank server.

This post ended up being MUCH longer than I had initially anticipated…first time I’ve had to add a table of contents to one of my posts.

Table of contents:

Inspecting a log file
Filtering output
Modifying output
- Add color by assignment
Dealing with multiple log files
Live monitoring with -Wait
Working with multiple files using ForEach-Object -Parallel
- Monitoring multiple files
- Add color randomly
Final thoughts

Throughout this post, I use a variety of PowerShell commands. For brevity, I prefer to use the default aliases provided by PowerShell. For one off scripts, that’s usually fine, but for a production script, you should try to use the full name of a command and not its alias.

If you’re unsure about what a particular alias is, such as gc, %, ?, oh, etc. You can use Get-Alias to look up what it means.

Inspecting a log file

Let’s get the basics out of the way…Using the Get-Content command (aliases: cat, gc). If you’re not familiar with this command, it’s pretty simple. It takes a file path and returns the contents of that file as messages to the console window. By default, it returns each line as a separate string. So if you have a text file with 100 lines, it will return 100 strings.

On its own, it’s probably not very useful for checking on a log file, but combine it with other commands like more, Where-Object and custom parsing functions and you can do some pretty cool stuff.

The simplest of examples would be:

gc '.\2022-04-03.log'

This will return the entire file to the console…not too useful if it’s 50,000 lines of log data. If all you want to do is manually step through the file, you can pipe the results to the Out-Host (aliases: oh) command using the -Paging option.

gc '.\2022-04-03.log' | oh -Paging

This gives you the ability to page (space) or step (enter) through each line of the log file. This is meant to be the PowerShell equivalent to using more.exe. Personally, I still prefer to use more.exe as it seems to run better, and it doesn’t output the instructions every time.

gc '.\2022-04-03.log' | more

The usage here is the same, space to page results, and enter to step to the next line.

You can test them out using these commands:

1..300 | oh -Paging
1..300 | more

Filtering output

Stepping through the logs is great…but if you have 10,000 lines to go through, that may be a waste of time. There’s a few options you have for limiting the output.

Using `-Tail` and `-TotalCount` to limit total output

Output the last 10 lines:

gc '.\2022-04-03.log' -Tail 10

Output the first 10 lines:

gc '.\2022-04-03.log' -TotalCount 10

Using `Where-Object` to filter results

If you’re dealing with a noisy log file, it may be useful to filter out certain log messages, or only include certain messages. For example, maybe you have multiple applications logging to the same file, and each log message includes the name of the app it comes from.

Using Where-Object (aliases: where, ?) you can use globs (-Like, -NotLike) or regex (-Match, -NotMatch) to include or exclude lines based on criteria you specify.

For example, lets say we have a log file that looks like this:

2022-04-03T15:14:55 [MyApp] [INFO] :: Downloading file
2022-04-03T15:14:57 [AnotherApp] [INFO] :: Cleaning up temporary files
2022-04-03T15:14:59 [OtherApp] [INFO] :: Loading data into database table

It will get annoying trying to sift through this log file if you don’t care about “AnotherApp” or “OtherApp”. Let’s filter those out using both inclusive and exclusive logic.

Inclusive: This will only return messages that match the regex pattern \[MyApp\]

gc '.\2022-04-03.log' | ? { $_ -Match '\[MyApp\]' }

Exclusive: This will exclude the other two apps we’re not interested in:

gc '.\2022-04-03.log' | ? { $_ -NotMatch '\[(AnotherApp|OtherApp)\]' }

If you’re not familiar with regex, this is saying to exclude any messages that matches either [AnotherApp] or [OtherApp].

If you want to avoid regex, you can use the -Like and -NotLike filters. It would work the same way as the two regex examples above, but instead you would use globs:

Inclusive:

gc '.\2022-04-03.log' | ? { $_ -Like '*[MyApp]*' }

Exclusive:

gc '.\2022-04-03.log' |
    ? { ($_ -NotLike '*[AnotherApp]*') -and ($_ -NotLike '*[OtherApp]*') }

As far as I know, globs don’t allow you to specify multiple criteria in a single pattern, so you need to use two separate filters

These filters also come in handy when you want to search for a specific keyword, such as “Exception” or “Error”, or maybe a specific error message that’s getting returned to a UI.

Using `Select-String` to filter results

Another option you have for filtering output is the Select-String (aliases: sls) command.

You can use it directly by supplying a file path, or piping to it.

The simple usage of Select-String is very similar to using -Match with Where-Object, but you don’t have as much of the overhead:

sls -Pattern '\[MyApp\]' -Path '.\2022-04-03.log'

gci '.\2022-04-03.log' | sls '\[MyApp\]'

By default, Select-String will highlight the search term and output the entire line prepended with the filename. It also stores some search metadata behind the scenes. This may not be necessary if all you care about is the output.

Use -NoEmphasis to disable highlighting and -Raw to disable highlighting and disable capturing all extra metadata. This way it acts more like an output filter and runs much quicker.

Modifying output

Another great option besides filtering is being able to modify the output by passing each line through a script.

Add color by assignment

In the course of writing this post, I discovered a fun new trick…coloring the messages based on their content.

If you’re looking at a plain log file, then the output is going to be black and white. What if you could identify certain keywords, and assign a color for that message?

Here’s an example:

gc '.\2022-04-03.log' |
    % {
        if ($_ -match 'ERROR') {
            Write-Host $_ -ForegroundColor Red
        } else {
            Write-Host $_
        }
    }

Any time the string “ERROR” occurs, it will write the entire line in Red. You could expand on this using even more complex logic. Maybe you want to assign a color based on which app is logging…so [MyApp] gets Blue, [AnotherApp] gets Green and [OtherApp] gets Magenta.

gc '.\2022-04-03.log' |
    % {
        if ($_ -match '\[MyApp\]') {
            Write-Host $_ -ForegroundColor Blue
        } elseif ($_ -match '\[AnotherApp\]') {
            Write-Host $_ -ForegroundColor Green
        } elseif ($_ -match '\[OtherApp\]') {
            Write-Host $_ -ForegroundColor Magenta
        } else {
            Write-Host $_
        }
    }

Ending up with this:

Dealing with multiple log files

This is more of a quick note, but everything that has been shown so far can be run against multiple files at the same time. This can be done using either the Get-Content command, or by using any other means of getting a list of files, such as using Get-ChildItem (aliases: gci, ls, dir).

Example:

gc -Path *.log

gci -Filter *.log | gc

Both of these commands will scan the current directory for all files matching *.log and pass them through to Get-Content.

One downside here is that the files are read one by one. Parameters like -Tail are applied on the per-file level. So if you say -Tail 5, it will return the last 5 lines from each file.

This can help if you need to scan a directory of log files for certain messages. Just keep in mind, this may not be very efficient. If you are scanning millions of log messages across dozens of files and you are applying a Where-Object filter, you may run into performance issues. At that point, you may want to consider something that’s a little better at scanning files, or possibly a dedicated logging tool or logging platform.

Live monitoring with `-Wait`

Now for the fun part…monitoring a log file in “realtime”; this is what we’ve been working up to.

gc '.\2022-04-03.log' -Wait

That’s it. This command will output everything that is in the file…once it reaches the end, it sits and waits. Any time new lines are added to the file, it will output them. This on its own is worth its weight in gold.

Everything we’ve talked about up to this point can also be applied to live monitoring of log files. That means paging, filtering, coloring, etc.

Now combine that with -Tail.

gc '.\2022-04-03.log' -Wait -Tail 0

Every time this command is run, it starts at the end and only listens for new lines added to the file. This is very useful when you are iteratively testing something…make some change, run your code, stop, repeat. You can clear your screen, re-run this command and only listen for new log entries generated by your code.

Working with multiple files using `ForEach-Object -Parallel`

I was recently working on a project where I needed to monitor 27 different log files all living in 27 different directories. I didn’t want to deal with installing some sort of tool, and it’s something I needed to get up and running on the fly.

I tried searching online to find a solution, and I found this Stack Overflow question. The answer on that question only works for Windows PowerShell, so I tried to come up with a PowerShell Core equivalent.

Monitoring multiple files

In my case, all of the log files I needed to monitor shared a common folder structure and file naming convention. So I was able to figure it out using this script:

gci -Path '*\log\*' -Recurse -Filter '*2022-04-03.log' |
    % -Parallel {
        $file = $_
        gc -Wait -Tail 0 -Path $file |
            % { Write-Host "$($file.Name): ${_}" }
    } -ThrottleLimit 30

Let’s break that down…

gci -Path '*\log\*' -Recurse -Filter '*2022-04-03.log'
- First it searches for all paths that match *\log\*, and then searches for files whose name matches *2022-04-03.log.
- This returned all 27 of the files I needed to monitor.
% -Parallel { ... } -ThrottleLimit 30
- This allows us to run actions in parallel, rather than all 27 files one by one sequentially.
- I know I need to monitor 27 files, so we can manually set -ThrottleLimit to 30.
$file = $_
- Set aside the file we’re working with into a named variable because we’ll lose scope of $_ later on in the script and we won’t be able to reference it.
gc -Wait -Tail 0 -Path $file
- Start watching the file and output new lines as they are added.
% { Write-Host "$($file.Name): ${_}" }
- Prepend every line returned by Get-Content with the name of the log file.

Which results in something like this:

MyApp_2022-04-03.log: Original log message 1
OtherApp_2022-04-03.log: Original log message 1
AnotherApp_2022-04-03.log: Original log message 1

Add color randomly

I’ve never used in practice, but I thought it would be fun to figure out and I could see how it could prove to be useful if you need to randomly assign colors to output. Since I was monitoring 27 actively used log files, I needed some way to visually separate one file from another. I applied our -ForegroundColor trick we used earlier.

Unfortunately, there’s only 16 colors defined in the [ConsoleColor] .NET enum, so the best I could do is pick from one of those.

gci -Path '*\log\*' -Recurse -Filter '*2022-04-03.log' |
    % -Parallel {
        $file = $_;
        $color = 1..15 | Get-Random;
        gc -Wait -Tail 0 -Path $file |
            % {
                Write-Host "$($file.Name): ${_}" -ForegroundColor $color;
            }
    } -ThrottleLimit 30

Here I’ve added code to generate a random number between 1 and 15 (inclusive), which may not make sense at first. This is just a shortcut way of picking a random enum value from the [ConsoleColor] .NET enum, where 1 = DarkBlue, 2 = DarkGreen, and so on. Since I use a black background, and 0 = black, I start the numbering at 1. PowerShell/.NET is smart enough to know when I pass in a number between 1 and 15, that it translates it to the associated color.

Another way you could do this that would be a bit more obvious would be:

Get-Random ([System.ConsoleColor].GetEnumNames() | ? { $_ -NE 'Black' })

Final thoughts

If you actually read this giant blog post and are bothering to read my final thoughts section…I applaud and thank you. Obviously there are a lot of tools out there that would probably make this easier. There’s also a lot of tools that may be better suited for scanning, searching and filtering large text files. My personal favorite is ripgrep, and I hope to write a post about it one day.

That said, I feel it’s good to learn how to do things the long way. You won’t always have access to your fancy GUIs and CLI tools, and you may have to deal with what you’ve got at hand.

I’d love to hear feed back on what you think, along with any tips and tricks you might have on this topic as well.

Thanks for reading!

Restore database from backup in Docker

2021-11-04T19:45:00+00:00

If you google “restore sql database in docker”, you’ll probably find 20 other blog posts covering this exact topic…But, for some reason, I still managed to look right past them when I was stuck, and it took me a good hour or so to figure how to get this to work. So I’m sharing it anyway.

This is more of a personal note for future Chad to come back to.

Everything below is basically a summarized version of the official docs, with small tweaks here and there: https://docs.microsoft.com/en-us/sql/linux/tutorial-restore-backup-in-sql-server-container

Yesterday, I was watching a Pluralsight course which provided a database .bak file to follow along with the examples. I generally like to use Docker when working with SQL Server locally…but as a somewhat novice user, I have found it to be a bit of a pain if you need to deal with restoring or attaching a database.

When I run into these scenarios, I usually spin up an AWS EC2 instance, install SQL server, and work with it that way. There’s probably a simpler way to do it using RDS or Azure, but I’m not familiar with those just yet. The other option is if I have a Linux machine at hand, I will use that with Docker and mapped volumes work great.

I do happen to have a Linux machine ready to use…but I was determined to figure out how to get this working on Windows.

I was hoping that since I’m running WSL v2, that using a mapped volume would simply work, but for some reason, I could not get the container to see the files in the directory I mapped. I tried using something like this -v /mnt/d/docker/volume:/var/opt/mssql/backup but no luck. Docker would create the backup directory, but no files we’re visible. To my best effort, my google-fu did not come up with any solutions.

I’ll try to keep this as short and sweet as I can.

Get the container running

This is the docker command I typically use to start an instance of SQL Server 2019 in Docker. Nothing fancy, it’s pretty much a copy paste from Docker Hub.

I personally like to use -it, which will mean the logs/output from the container are streamed to the console. I like being able to watch the output so I can spot when system errors pop up. It’s generally not necessary, so if you prefer to run it silently in the background, then swap -it with -d to run in detached mode.

docker run -it `
    --name sqlserver `
    -e ACCEPT_EULA='Y' `
    -e MSSQL_SA_PASSWORD='yourStrong(!)Password' `
    -e MSSQL_AGENT_ENABLED='True' `
    -p 1433:1433 `
    mcr.microsoft.com/mssql/server:2019-latest;

Once you run this, if you’re using -d you’ll probably want to check in on the container and make sure it’s running without error using docker ps -a.

Copy backup file into container

Now that the container is up and running, you need to copy the backup file.

If anyone knows how to get mapped volumes to work between Windows and this Linux SQL Server container…I would love your feedback/tips.

docker cp backup.bak sqlserver:/var/opt/mssql/data/

I’m choosing to copy this to the data directory because that’s the default backup directory, and eliminates an extra step. Other solutions tell you to create a new backup directory, but in this case, since it’s a sandbox, I don’t really care about these types of best practices.

Restore the database

This part will require a bit of manual tweaking on your part, but it’s not too bad.

Open SSMS and connect to the instance using the credentials you set in the docker run command.

To restore the backup, you’ll need to use the RESTORE DATABASE...WITH MOVE method. If you don’t use WITH MOVE, you’ll get an error, at least I do. To do that, you first need to know what the file names are inside of the .bak file, and then you need to construct the RESTORE using those file names.

So first run this to ensure you have access to the backup file, and it will list the files within the backup. No need to specify the full path to the file since we copied the backup file to the default directory.

RESTORE FILELISTONLY FROM DISK = 'backup.bak'

Then using the list of file names returned by the above command, construct the backup script similar to below. Here you do need to specify the full destination path, for some reason it’s unable to figure that out even when the default directories are explicitly set.

RESTORE DATABASE RestoredDB
FROM DISK = 'backup.bak'
WITH
    MOVE 'backup'     TO '/var/opt/mssql/data/backup.mdf',
    MOVE 'backup_log' TO '/var/opt/mssql/data/backup_log.ndf'

And that’s it, 3 steps…copy, list files, restore…assuming this all runs without error, you have now restored a database into a Linux Docker container running SQL Server on Linux.

Working with secure FTP in PowerShell

2021-11-01T14:00:00+00:00

Update

Since posting this I’ve had a few people respond with some great suggestions, such as:

Posh-SSH PowerShell Module
Transferetto PowerShell Module
Using the System.Net.FtpWebRequest .NET class for working with FTPS in PowerShell

I’ll definitely be checking these out to learn more about them and see how they compare to using the WinSCP module.

Thanks for all the suggestions and responses I’ve received to this post! This is how I improve my own skills by learning from you, and hopefully you learn a thing or two from me.

Back to the post

Disclaimer: While WinSCP does support FTPS, I will be focusing on SFTP in the examples since that’s what I had at hand to test with. If you don’t know the differences between FTP, SFTP or FTPS, there are plenty of resources online that cover it. The main thing to know is that SFTP/FTPS are secure alternatives to using plain FTP and the info I provide here, can easily be adjusted to work for FTPS.

For the impatient ones: TL;DR

When building ETL processes in other languages (i.e. C#), usually I like to build a “draft” version of the process in PowerShell first. The code is shorter, there’s less nuances to deal with and you can take advantage of some pretty great built in and community written modules. It’s a nice, quick way to knock out a proof of concept.

Currently I’m working on a data append ETL/integration. These are pretty common…you send someone a CSV file, they do some stuff with it, add on some new columns of data, and send it back to you.

For me, it usually looks something like this:

Run stored procedure in SQL
Export results to CSV file abiding by the 3rd party’s specs (i.e. headers, delimiter, quote qualifiers, line endings, header/trailer records)
Copy file to their server via SFTP
Wait for a response file to appear, could be minutes, could be days
Download the response file to disk
Parse and import file into a table in SQL
Archive file

Over the years I’ve written dozens of these, one thing that often hangs me up are the “copy to SFTP” and “copy from SFTP” steps. usually what happens is I build two scripts…an “export script”, which has a manual step of “open FileZilla and upload file”, and then an Import script with another manual step to download the file.

After some Google searching to see how to handle SFTP in PowerShell, I ran into this StackOverflow answer (the creator of WinSCP) which introduced me to some cool new alternatives for dealing with FTP, FTPS, SFTP, SCP, etc using PowerShell.

Using built in commands

Linux is nice because it has native support for SSH, SCP and SFTP.

Windows is a bit different, by default, it does not. However, as of Windows 10 build 1809, there is now an optional feature for OpenSSH support (client and server) that can be installed directly in the OS or via PowerShell. See the instructions here. Once the client is installed, it will add the ssh, scp and sftp commands.

Another option would be to use WSL, to run ssh, scp and sftp, though I would argue this is a bit overkill if that’s the only thing you plan to use it for. I highly recommend checking out WSL in general though, it’s really fun to play with.

Using WinSCP

While both of the methods mentioned above are great options and will get the job done, I learned a new method using WinSCP from that StackOverflow answer.

If you’re not familiar with WinSCP, it’s been around for quite a while and is a very popular file transfer client for Windows.

Despite all the years I’ve used this tool, I never knew it has a .NET assembly that allows you to work with SFTP, FTP, S3, SCP, etc…all using .NET languages and environments…C#, VB.NET, PowerShell, and more.

But what really got my interest is a WinSCP PowerShell module…It does not appear to be “official” but it’s trusted enough to be linked by the official WinSCP documentation.

The cool part about the Module is that it does not require the installation of WinSCP first, it uses its own copy of the WinSCP EXE and DLL files.

Without the module, you would need to load the DLL file as a new type into PowerShell using Add-Type, and then use it like you would in .NET, by using New-Object, calling class methods, and then disposing the objects when you’re done. This can be a bit of a pain, at that point, you might as well be using C#. This is where the module comes in. The module is a wrapper for all of that and simplifies the implementation and usage. It also returns everything as objects, so you can easily work with them in PowerShell.

TLDR

For those who hate reading and feel this is looking too much like a recipe write-up where I tell you my life story before giving you what you came here for, here’s the 🥩 and 🥔’s…

Various links:

Working with WinSCP via PowerShell
PowerShell Module

You can install the PowerShell module like normal:

# install module
Install-Module winscp

# import module into current session
Import-Module winscp

That’s it, you’re ready to go.

An overview of a few common commands:

New-WinSCPSessionOption # Info about the connection you plan to make - Hostname, credentials, protocol, port, etc
New-WinSCPSession # Takes a SessionOption object, represents the active connection to the host
Remove-WinSCPSession # Takes a Session object, disconnects / disposes the active connection
Get-WinSCPHostKeyFingerprint # Return the public key of a remote host

Test-WinSCPPath # Test whether a path exists
Get-WinSCPItem # Return info about a file or directory
Get-WinSCPChildItem # Return info about the children of a specific item (i.e. list of files within a directory)

Send-WinSCPItem # Upload file(s)
Receive-WinSCPItem # Download file(s)
Remove-WinSCPItem # Delete file(s)

That’s only a portion of the commands available. If you want more info, you’ll need to read the docs :)

Example

Here’s an example of how it could be used:

# Execute stored procedure usp_ExportData
# Export data as tab delimited, with double quote qualifiers to 'export.csv'
Invoke-DbaQuery -SqlInstance ServerA -Database DBFoo `
                -CommandType StoredProcedure -Query 'usp_ExportData' |
    Export-Csv -Path .\export.csv -Delimiter '|'

# Manually get credentials
# Could also use database, Amazon Secrets, Vault, SecretStore, config file, etc
$credential = Get-Credential

$options = @{
  Credential = $credential # This will provide the Username and Password
  Protocol = 'Sftp'
  HostName = 'sftp.someclient.com'
  GiveUpSecurityAndAcceptAnySshHostKey = $true
}

# Configure options for the session
$sessionOption = New-WinSCPSessionOption @options

# Open connection to server
$session = New-WinSCPSession -SessionOption $sessionOption

# Send export file to server via SFTP connection
Send-WinSCPItem -WinSCPSession $session -LocalPath .\export.csv

# Disconnect and dispose of connection
Remove-WinSCPSession -WinSCPSession $session

Note: GiveUpSecurityAndAcceptAnySshHostKey = $true is likely not something you want in a production process. Instead, you can get the public key of the remote host and supply it as a parameter to the SessionOption. If you don’t know what the public key of the remote host is, it comes with a nifty cmdlet that gets it for you Get-WinSCPHostKeyFingerprint -SessionOption $sessionOptions -Algorithm SHA-256.

This is a fairly crude example, no error handling, not checking to see if the file already exists on the remote host, not using any sort of config file to make it reusable, etc. But I would say this was a pretty simple and quick script to run a proc, export to CSV and send it via SFTP to a remote host.

Copy a large table between servers, a couple wrong ways, maybe one right way

2021-10-19T14:00:00+00:00

This was a task that popped up for me a few days ago…

You have a table with 50 million records and about 3GB in size. You need to copy it from ServerA to ServerB. You do not have permission to change server settings, set up replication, backup & restore, set up linked servers, etc. Only DML/DDL access.

…what do you do?

You may immediately have an answer…or you may have absolutely no clue. I was somewhere in the middle. I could think of a few ways…but none of them sounded ideal.

The majority of these solutions will be using dbatools cmdlets. If you’re not familiar with what that is…I highly recommend you check it out, learn it, install it, use it.

More info here: https://dbatools.io/

A few disclaimers

While reading this post, please keep in mind…this is not about “best practices”. The point is to show you the iterations of failure and success I went through to learn and figure this out.

These transfers were through my slow network connection. Running these transfers directly from server to server, or using a machine that lives on the same network will give much better performance.

This is why things such as “jump boxes” and servers dedicated to data transfer tasks can be very useful in cutting down these transfer times.

Attempt #1 - Export to CSV using PowerShell

My immediate thought when I encountered this problem was…I’ll export the table to CSV (as terrible as that sounds)…and then import that file to the other server.

Exporting data from SQL to CSV is something I do regularly for development, testing and reporting so I’m pretty comfortable with it. You can throw a script together pretty quickly using PowerShell and dbatools cmdlets.

$Query = 'SELECT * FROM dbo.SourceTable';
Invoke-DbaQuery -SqlInstance ServerA -Database SourceDB -Query $Query |
    Export-CSV D:\export.csv

Explanation: Run the query stored in $Query, and export the results to file as a CSV.

The failure

I kicked it off and let it run in the background. After an hour, I noticed my computer getting slower…and sloooower, so I checked in on it…

Yeah, that’s not good 🔥🚒

This wasn’t too surprising. I’ve run into memory issues with PowerShell in the past, usually when working with large CSV files. I’m not sure if it’s an issue with PowerShell or CSV related cmdlets.

I immediately killed the process. Checking the export file, it had only made it to about 2 million records, not even a dent in the 50 million we needed to export.

Attempt #2 - Export to CSV using PowerShell…but do it better

Now I need to handle this memory issue. I’ve run into these before with PowerShell. Usually if you batch your process better these problems go away. So this was my next iteration…

$c = 0; # counter
$b = 100000; # batch size
foreach ($num in 1..500) {
    write "Pulling records ${c} - $($c+$b)";
    $query = "
        SELECT *
        FROM dbo.SourceTable
        ORDER BY ID -- Sort by the clustered key
        OFFSET ${c} ROWS FETCH NEXT ${b} ROWS ONLY
    ";
    # write $query;
    Invoke-DbaQuery -SqlInstance ServerA -Database SourceDB -Query $query |
        Export-CSV E:\export.csv -UseQuotes AsNeeded -Append
    $c += $b;
}

This time, I broke the export up into batches of 100,000 records. I changed the query to sort the table by the clustered key, and added an OFFSET clause to grab the data in segments. FYI, the ranges output from the loop are not exact, it’s just meant to give a basic idea of where it’s at.

I’m doing a bit of math trickery here so I don’t have to figure out when the loop needs to stop.

Since the table has just under 50 million records, and I’m pulling in batches of 100k, that’s no more than 500 batches. So I’m using the range operator (x..y) to spit out a list of 500 values. Once the loop reaches the end of the range it will stop.

Less failure

After kicking this process off and letting it run for a bit, I did some math and projected that it would take about 90 minutes to finish, and that’s just to export the data, I still needed to import the data to the other server.

On the upside, it was only using 234MB of RAM. So I guess that’s better, but not good enough. So I killed the process to move on to the next attempt.

Attempt #3 - Using the right tool for the job

I reached out to the SQL Community Slack to see if anyone had some better ideas. Almost immediately I had a couple great suggestions.

Andy Levy recommended Copy-DbaDbTableData from dbatools.

Constantine Kokkinos suggested the bcp.exe SQL utility.

Both options sounded good, but since I have quite a bit of experience with PowerShell as well as working with the dbatools library, I gave that a shot first.

The final attempt

Copy-DbaDbTableData is made for this exact task. With a description of “Copies data between SQL Server tables”.

Their documentation page has a handful of examples which made it easy to use…

$params = @{
  # Source
  SqlInstance = 'ServerA'
  Database = 'SourceDB'
  Table = 'SourceTable'

  # Destination
  Destination = 'ServerB'
  DestinationDatabase = 'TargetDB'
  DestinationTable = 'TargetTable'
}

Copy-DbaDbTableData @params

This example uses a technique called parameter splatting. It allows you to set all of your parameters in a dictionary and then supply it to the function to help keep things nice and pretty.

The SUCCESS

Immediately I could tell it was significantly faster, on top of the fact that it was performing the export and the import at the same time.

Total runtime was 28 minutes. That’s right, 28 minutes to move all 50 million rows from one server to the other. Compared to my previous attempts…that’s lightning quick.

Honorable mentions and notes

bcp.exe utility

The bcp utility can be used to export table/view/query data to a data file, and can also be used to import the data file into a table or view. I think you can accomplish many of the same tasks using dbatools cmdlets, but I do think bcp has some advantages that make it uniquely useful for a number of tasks.

Can export table data to a data file with very low overhead (takes up less space than a CSV)
Supports storing the table structure in an XML “format” file. This maintains datatypes for when you need to import the data. Rather than importing everything as character data, you can import it as the original datatype
Maintains NULL values in the exported data rather than converting them to blank
Is incredibly fast and efficient

These features and capabilities come as both pros and cons depending on the usage.

Here’s a few great uses I could personally think of for bcp

If you have table data you need to restore to SQL often, say for a testing or demo database, but you don’t want/need to restore the entire DB every time. Store your table(s) as data files (and their XML format files) on disk. Then write a script that restores them using bcp.
If you need to copy a table from one server to another, but you do not have direct access to both servers from the same machine. In that case Copy-DbaDbTableData isn’t useful as it needs access to both machines. But with bcp, you can save the table to a data and format file, transfer them somewhere else, and then use bcp to import the data.
Technically, you can generate a CSV using bcp, but when I tried it, I ran into a handful of issues. Such as…you can’t add text qualification or headers, and the workarounds to add them may not be worth it. It also retains NULL values by storing them as a NUL character (0x0). If you’re planning on sending this file out to another system…you’d likely want to convert those NULL values to a blank value. But if none of these caveats affect you…then this may be a great option since it’s so fast at exporting the data to disk.

Other dbatools cmdlets

I don’t want to go into great detail on all of the ways dbatools can import and export data, but I thought I should at least mention the ones I know of, and give a very high level summary of what each is able to do:

Copy-DbaDbTableData
- Table/View/Query -> Table
- Use this cmdlet if you need to copy data from one table to another table, even if that table is in the same database, a different database or even different servers.
- Alias - Copy-DbaDbViewData - This cmdlet is just a wrapper for Copy-DbaDbTableData. The only difference is that it doesn’t have a parameter for -Table. So it’s probably best you just use Copy-DbaDbTableData.
Export-DbaDbTableData
- Table -> Script
- Use this cmdlet if you want to export the data of a table into a .sql script file. Each row is converted into an insert statement. Be careful with large tables due to the high overhead. If you need to store a large amount of data…consider a format with lower overhead, such as csv, or using bcp.exe to export to a raw data file.
- Does not support exporting views or queries
- Internally, it is a wrapper for Export-DbaScript.
Import-DbaCsv
- CSV -> Table
- Use this cmdlet if you want to import data from a CSV file. This cmdlet is very efficient at loading even extremely large CSV files.
Write-DbaDbTableData
- DataTable -> Table
- I would argue this is one of the most versatile cmdlets for importing data into SQL. This cmdlet can import any DataTable object from PowerShell into a table in SQL. This allows you to import things like JSON, CSV, XML etc. As long as you can convert the data into a DataTable.
Invoke-DbaQuery
- Query -> DataTable
- Use this cmdlet to export the results of a query to a DataTable object in PowerShell.
- Technically, the default return type is an array of DataRow objects. But you can configure it to use a number of different return types.
- The results of this can be written to CSV, JSON or fed back into Write-DbaDbTableData to write into another SQL table.
Table/View/Query -> CSV
- dbatools does not currently have a cmdlet dedicated for writing directly to CSV.
- To achieve this, you can use Invoke-DbaQuery ... | Export-CSV ..., but be careful of memory issues as experienced in attempt #1 above.

As you can see…there’s quite a few options to choose from.

Final thoughts

Hopefully you were able to learn something from this post. It may not be showing you the best way to do something, but I wanted to show that we don’t always know the best way to do something. Sometimes we have to go through trial and error, sometimes we have to reach out and ask for help.

The next time this task pops up, I’ll now have a few more tricks in my developer toolbelt to try and solve that problem.

T-SQL Tuesday #143 – Short code examples

2021-10-12T14:00:00+00:00

For the October T-SQL Tuesday invitation, John McCormack is inviting others to share some of their favorite short code examples. These could be SSMS/SQL Prompt snippets, one liners, keyboard query shortcuts or snippets you’ve committed to memory. It doesn’t have to be T-SQL, it could be Python, PowerShell, or anything else you use on a daily basis.

I’m excited that this will be my first time participating in a T-SQL Tuesday topic!

Most of my time is spent writing T-SQL, PowerShell and working in the PowerShell terminal, so that’s how I’ll split the post up.

I had to cut it short otherwise this post would be a mile long. If you’re interested in seeing more quick tricks, SQL Prompt snippets, etc, please leave a comment and let me know and I can do a Part 2 in the future.

T-SQL

Right justify values

While you usually shouldn’t be formatting output at the data layer, I find this to be particularly useful when building utility stored procedures that are used directly and not feeding a report or an application.
Note, be careful using FORMAT() it’s pretty inefficient, but if your code only returns a handful of records, then there’s likely no need to worry. Just don’t use it on queries that are returning large datasets.

SELECT NoAlign         = x.val
     , AlignRight      = RIGHT(CONCAT(SPACE(16), x.Val), 16)
     , AlignRightComma = RIGHT(CONCAT(SPACE(16), FORMAT(x.Val, 'N0')), 16)
FROM (
    VALUES (24114),(372559),(117940),(16),(0),(589892558),(46827)
) x(val);

Divide by zero safe percentage

This snippet gets used in almost every report I build. You need to return a percentage, but you need to handle NULL values. Most of the time, if the denominator is zero, then I want to return a percentage of 0%.
Depending on whether you want to return the percentage as a whole number or as a decimal you can change the multiplier on the denominator. 0.01 will give you a whole number percentage whereas 1.00 will give you the percentage as a proper decimal.
Once you write your percentage, you can convert / cast the result to the preferred data type, just be mindful of arithmetic overflows.

DECLARE @Numerator   decimal(10,4) = 123456.1234,
        @Denominator decimal(10,4) = 689758.5678;
-- Return percentage as a whole
SELECT PctAsWhole   = CONVERT(decimal(6,3), COALESCE(@Numerator / NULLIF(@Denominator * 0.01, 0), 0.00))
-- Return percentage as a decimal
     , PctAsDecimal = CONVERT(decimal(6,5), COALESCE(@Numerator / NULLIF(@Denominator * 1.00, 0), 0.00));

Generating random numbers within a range

These are great for generating test data
The values returned by the snippets below could be fed into a DATEADD() function for generating random dates within a range

-- Generate random number within an inclusive range
DECLARE @RangeStart int = 100,
        @RangeEnd int = 104;
SELECT FLOOR(RAND(CHECKSUM(NEWID()))*(@RangeEnd-@RangeStart+1))+@RangeStart;

-- Generate random number within a range of size n starting at point x
DECLARE @RangeStart int = 100,
        @RangeSize int = 5;
SELECT ABS(CHECKSUM(NEWID())%@RangeSize)+@RangeStart;

-- Randomly pick either 1 or -1
SELECT SIGN(CHECKSUM(NEWID()));

-- Randomly pick a number between -N and N (inclusive)
DECLARE @RangeSize int = 1;
SELECT CHECKSUM(NEWID())%(@RangeSize+1);

Tally table

Tally / Numbers tables can be used for all sorts of things. For example, avoiding cursors, loops, etc by performing those tasks by row. Jeff Moden has a great article using a tally table to build a string split function. He’s also written about how tally tables can be used to replace loops.
Another great use of them is generating sample data. Utilizing the snippets above for generating random numbers, you can easily generate random data for testing, including date ranges.
Plenty of other bloggers have written about them, including Itzik Ben-Gan who I believe is the first person I learned this from.

-- Using CTEs
-- Careful, this returns 1,000,001 rows if not limited
WITH c1 AS (SELECT x.x FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) x(x))  -- 10
    , c2(x) AS (SELECT 1 FROM c1 x CROSS JOIN c1 y)                                -- 10 * 10
    , c3(x) AS (SELECT 1 FROM c2 x CROSS JOIN c2 y CROSS JOIN c2 z)                -- 100 * 100 * 100
    , c4(rn) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM c3)  -- Add zero record, and row numbers
SELECT TOP(1000) x.rn
FROM c4 x;

-- Using XML
-- This is a method I came up with while trying to work on a SQL code golf problem
-- FYI, this has not been tested for efficiency against the CTE method
DECLARE @x xml = REPLICATE(CONVERT(varchar(MAX),''), 1000); --Table size
WITH c(rn) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM @x.nodes('n') x(n))
SELECT c.rn
FROM c;

Session settings for controlling plans and stats

These ones I have to google almost every time because I forget what each one does.

-- SET SHOWPLAN_TEXT      ON -- SET SHOWPLAN_TEXT      OFF -- Returns ESTIMATED execution plan info as a result set
-- SET SHOWPLAN_ALL       ON -- SET SHOWPLAN_ALL       OFF -- Returns ESTIMATED info, similar to SHOWPLAN_TEXT, but returns much more detailed info as additional columns
-- SET SHOWPLAN_XML       ON -- SET SHOWPLAN_XML       OFF -- Returns ESTIMATED execution plans as XML
-- SET STATISTICS XML     ON -- SET STATISTICS XML     OFF -- Returns ACTUAL execution plan as XML
-- SET STATISTICS PROFILE ON -- SET STATISTICS PROFILE OFF -- Returns ACTUAL execution plan as a result set
-- SET STATISTICS TIME    ON -- SET STATISTICS TIME    OFF -- Returns ACTUAL parse, compile, execution times for each statement in messages tab
-- SET STATISTICS IO      ON -- SET STATISTICS IO      OFF -- Returns ACTUAL disk activity metrics in messages tab

-- My most used snippet being:
SET STATISTICS IO, TIME ON;

PowerShell / CLI

Rename a column when using Select-Object

A fairly simple one, but I remember how long I went before learning this when I first started out in PowerShell, so including it here is a must.

gci | select @{N='FileName'; E={$_.Name}}

Convert all files to UTF-8

If you’ve ever dealt with storing your SQL files in git, you may have realized that UTF-8 isn’t always the encoding of choice. Unfortunately, git doesn’t like files encoded with things like UTF-16LE BOM.
This is a snippet I like to use to convert whole directories recursively to UTF-8. This makes it easier for viewing diffs with git.

gci -File -Recurse | ? Extension -In @('.sql') | % { $body = $_ | gc -Raw; $body | Set-Content -Encoding utf8 -NoNewline; }

Clean out recursively empty directories

This is probably one of my most used PowerShell snippets when it comes to file cleanup.
The snippet will find all directories that contain no files recursively. So if all you have is a chain of empty directories, it will remove it.
Just be careful using this if whatever application is using the folder structure doesn’t crash because it’s unable to find the directory and does not create missing ones itself.

# The '1' means "AllDirectories", aka, recurse
gci -Directory -Recurse | ? {-Not $_.GetFiles("*",1)} | rm -Recurse;

First/Last N characters of a string

Something I learned very quickly about PowerShell is that it’s not nearly as forgiving as T-SQL when you are using the LEFT(), RIGHT() and SUBSTRING() functions. If you provide a length that is larger than the string length, PowerShell throws an error. This is a common practice in T-SQL, to provide a very large number to get “the rest” of the string from a certain point. But you can’t do that in PowerShell.

$string = 'foobar';
$n = 20

# First N:
$string.substring(0, [System.Math]::Min($n, $string.Length))
# Shorter alternative
$string[0..($n-1)] -join ''

# Last N:
$string.Substring($string.Length - [System.Math]::Min($n, $string.Length))

Monitor a log file with filtering

I often need to monitor an active log file to wait for a particular event to occur. Sometimes it can be difficult to catch if the file is extremely active.
There are tools out there specifically for parsing and monitoring log files, but sometimes you don’t have them available, such as on an auto created EC2 instance
This command is useful for monitoring a log file and also passing in a list of filters to exclude or include
The two filters use regex strings to perform the matching

gc .\.log -Wait | ? { $_ -match '(error|warning)' } | ? { $_ -notmatch '(debug)' }

Pseudo sudo

Since windows doesn’t yet support an alternative to the linux sudo command, I use this in my $PROFILE to make it easy to quickly spawn a shell window with admin privileges.

function sudo {
    Start-Process pwsh -v RunAs;
};

Set window title

This is about as simple as it gets, but it’s not as straight forward as I expected it to be in PowerShell. If you ever have something that auto-opens PowerShell windows, you might want to set the window title to something easily identifiable.
Keep in mind, this title can quickly get overwritten by whatever you are running under it.

$host.ui.RawUI.WindowTitle = 'changed title';

Run SQL Server from Docker

This is a straight copy-paste from the SQL Server docker hub page. If you use Docker Desktop, this is an awesome way to quickly spin up a local instance of SQL Server, ready for all sorts of testing. I use this before every blog post to give me a clean working environment to test in, and then I can quickly and easily clean it up when I’m done by removing the container.

docker run -e 'ACCEPT_EULA=Y' -e "MSSQL_SA_PASSWORD=${pass}" -p 1433:1433 -d mcr.microsoft.com/mssql/server:2019-latest;

Use ripgrep to search files

I actually want to do a full length blog post just on this tool and other similar CLI workflows, but for now, I’m just putting this out there as a quick tip.
This is more of a public service announcement than anything…Please install, learn and use ripgrep. It is one of the most useful tools I have added to my CLI toolbelt. You will not regret it, especially if you are a Windows user because the built in alternatives such as FIND in cmd and Select-String in PowerShell are sooo sloooooowwwww compared to this, and don’t even compare when it comes to features.

rg -i some_text_or_regex_to_search

Notepad++ alias

Do yourself a favor and add an alias for notepad++. This is assuming Notepad++ is in your PATHS variable, if it isn’t for you, that’s outside the scope of this tip. But for my default installation, it is. Typing notepad++.exe can be a bit annoying, so I like to alias it to npp for when I need to pipe a list of files to open.

Set-Alias -Name npp -Value 'notepad++.exe';

Bonus

Shruggie: ¯\_(ツ)_/¯

That’s all I’ve got for you today…despite how long this post is, I really had to cut it short and leave a lot out. If you found any of these useful or if you have any snippets you’d like to suggest yourself, please leave a comment and let me know.

Chad’s Blog

Decoding datetime2 columnstore segment range values

The problem

Maybe it’s number of ticks?

Let’s create some more reliable data

How do we convert it back to datetime2?

The solution

The final test

Why aren’t old rows dropping from my temporal history table?

The problem

Aha moment

Stats and findings

Next blog post

Everything’s a CASE statement!

Back to the post

COALESCE

What about ISNULL?

IIF

NULLIF

CHOOSE

How do you see this for yourself?

Fun with Unicode characters in SQL Queries

How do you type these!?

Adding a column set separator

Adding a visual row identifier

Creating a bar chart

Use as a delimiter

Wrap it up…

What’s new in SQL Server 2022

Docker Tag

GENERATE_SERIES()

GREATEST() and LEAST()

STRING_SPLIT()

DATE_BUCKET()

FIRST_VALUE() and LAST_VALUE()

WINDOW clause

JSON functions

ISJSON()

JSON_PATH_EXISTS()

JSON_OBJECT()

JSON_ARRAY()

Wrapping up

Handling log files in PowerShell

Inspecting a log file

Filtering output

Using -Tail and -TotalCount to limit total output

Using Where-Object to filter results

Using Select-String to filter results

Modifying output

Add color by assignment

Dealing with multiple log files

Live monitoring with -Wait

Working with multiple files using ForEach-Object -Parallel

Monitoring multiple files

Add color randomly

Final thoughts

Restore database from backup in Docker

Get the container running

Copy backup file into container

Restore the database

Working with secure FTP in PowerShell

Update

Back to the post

Using built in commands

Using WinSCP

TLDR

Example

Copy a large table between servers, a couple wrong ways, maybe one right way

A few disclaimers

Attempt #1 - Export to CSV using PowerShell

The failure

Attempt #2 - Export to CSV using PowerShell…but do it better

Less failure

Attempt #3 - Using the right tool for the job

The final attempt

The SUCCESS

Honorable mentions and notes

bcp.exe utility

Other dbatools cmdlets

Final thoughts

Using `-Tail` and `-TotalCount` to limit total output

Using `Where-Object` to filter results

Using `Select-String` to filter results

Live monitoring with `-Wait`

Working with multiple files using `ForEach-Object -Parallel`