Lesson Learned #1 Column Store compressed rowgroups (Clustered vs NonClustered)


I worked on a case recently and would like to share what I learned.

When loading data into a columnstore index with bulk loading, there are some differences between a clustered columnstore index and a nonclustered columnstore index.

Looking at the documentation, it says that batches of at least 102,400 rows are compressed directly, while smaller batches are sent to a delta rowgroup:

– If the batch size is >= 102,400, the rows are loaded directly into the compressed rowgroups. It is recommended that you choose a batch size >= 102,400 for efficient bulk import, because you avoid moving data rows into delta rowgroups before the rows are eventually moved to compressed rowgroups by a background thread, the tuple mover (TM).

– If the batch size is < 102,400, or if the remaining rows are < 102,400, the rows are loaded into delta rowgroups.
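For example, with BULK INSERT the commit size can be controlled through the BATCHSIZE option. A minimal sketch (the table, file, and data source names here are placeholders, not part of the test below):

BULK INSERT dbo.MyColumnstoreTable
FROM 'MyLargeFile.txt'
WITH (DATA_SOURCE = 'MyAzureBlobStorage',
      FORMAT = 'CSV',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '0x0a',
      BATCHSIZE = 102400  -- each committed batch reaches the compressed rowgroup threshold
      );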

I wrote a test script and ran it on an Azure SQL Database (Premium tier) to verify this. Find the script and results below.

DROP TABLE IF EXISTS COLUMNSTOREBULK
GO
CREATE TABLE  COLUMNSTOREBULK (
         test varchar(50)
)
 
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_COLUMNSTORE ON COLUMNSTOREBULK (test)
GO
-----------------------------------------------------------------------
--Generate file
--SELECT A.NAME FROM sys.columns a cross apply sys.columns b
--C:\SERGIO\TEMP\File_1052676_rows.txt
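-- A hedged sketch of one way to build the file: the cross join above returns
-- 1,052,676 rows when sys.columns has 1,026 rows (1,026 x 1,026 = 1,052,676),
-- and the result can be exported with bcp from a command prompt
-- (server, database, and login parameters are placeholders):
-- bcp "SELECT a.name FROM sys.columns a CROSS JOIN sys.columns b" queryout C:\SERGIO\TEMP\File_1052676_rows.txt -c -S <server> -d <database> -U <user> -P <password>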
 
-----------------------------------------------------------------------
IF EXISTS ( SELECT 1 FROM sys.external_data_sources WHERE Name = 'MyAzureBlobStorage' )
         DROP EXTERNAL DATA SOURCE MyAzureBlobStorage
 
IF EXISTS ( SELECT 1 FROM sys.database_scoped_credentials WHERE Name = 'MyAzureBlobStorageCredential' )
         DROP DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
GO
 
CREATE DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'xxxxxxxxxxxx';
GO
 
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH (  TYPE = BLOB_STORAGE, 
                 LOCATION = 'https://fonsecanetstorage.blob.core.windows.net/csv', 
                 CREDENTIAL= MyAzureBlobStorageCredential
);
GO
----------------------------------------------------------------------- 
BULK INSERT COLUMNSTOREBULK
FROM 'File_1052676_rows.txt'
WITH (DATA_SOURCE = 'MyAzureBlobStorage',
      FORMAT = 'CSV',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '0x0a',
      ROWS_PER_BATCH = 500
      )
 
GO
 
BULK INSERT COLUMNSTOREBULK
FROM 'File_500_rows.txt'
WITH (DATA_SOURCE = 'MyAzureBlobStorage',
      FORMAT = 'CSV',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '0x0a',
      ROWS_PER_BATCH = 500
      )
GO
 
-----------------------------------------------------------------------
 
SELECT OBJECT_NAME(ps.object_id) AS [TableName],  
         i.[name] AS [IndexName], ps.index_id, ps.partition_number, created_time,
         ps.delta_store_hobt_id, ps.state_desc, ps.total_rows, ps.size_in_bytes,
         ps.trim_reason_desc, ps.generation, ps.transition_to_compressed_state_desc,
         ps.has_vertipaq_optimization, ps.deleted_rows,
         100 * ISNULL(ps.deleted_rows, 0) / NULLIF(ps.total_rows, 0) AS [Fragmentation] -- NULLIF guards against empty rowgroups
FROM sys.dm_db_column_store_row_group_physical_stats AS ps WITH (NOLOCK)
INNER JOIN sys.indexes AS i WITH (NOLOCK)
ON ps.object_id = i.object_id 
AND ps.index_id = i.index_id
ORDER BY i.name, ps.row_group_id;

REF: https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#f-importing-data-from-a-file-in-azure-blob-storage

I ran:

  • the BULK INSERT of the 1,052,676-row file 3 times, and
  • the BULK INSERT of the 500-row file 3 times

We can see that it created:

  • 1 compressed rowgroup with 1,048,576 rows (the maximum number of rows in a compressed rowgroup), and
  • after that, a second rowgroup with just 4,101 rows, also compressed.
  • When inserting just 500 rows as a bulk insert, it also creates a compressed rowgroup.
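To double-check this, the rowgroup states per table can be summarized with a query like the following (a minimal sketch against the same DMV used above):

SELECT ps.state_desc, COUNT(*) AS rowgroup_count, SUM(ps.total_rows) AS total_rows
FROM sys.dm_db_column_store_row_group_physical_stats AS ps
WHERE ps.object_id = OBJECT_ID('COLUMNSTOREBULK')
GROUP BY ps.state_desc;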

In the end you get a lot of small compressed rowgroups. Speaking with the product group, they confirmed that this behavior is by design for NONCLUSTERED columnstore indexes, and that the documentation above applies to clustered columnstore indexes.

If we change the script to use a clustered columnstore index, we can see that the behavior changes:

DROP TABLE IF EXISTS CLUSTERED_COLUMNSTOREBULK
GO
CREATE TABLE  CLUSTERED_COLUMNSTOREBULK (
         test varchar(50)
)
CREATE CLUSTERED COLUMNSTORE INDEX CIX_CLUSTERED_COLUMNSTOREBULK ON CLUSTERED_COLUMNSTOREBULK
GO

This time I executed (re-running the same BULK INSERT statements against the new table, as shown below):

  • the BULK INSERT of the 1,052,676-row file only once, and
  • the BULK INSERT of the 500-row file only once
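For example, the large-file load against the new table looks like this (same files and external data source as before):

BULK INSERT CLUSTERED_COLUMNSTOREBULK
FROM 'File_1052676_rows.txt'
WITH (DATA_SOURCE = 'MyAzureBlobStorage',
      FORMAT = 'CSV',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '0x0a',
      ROWS_PER_BATCH = 500
      );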

In this scenario we got:

  • 1 compressed rowgroup with 1,048,576 rows
  • 1 delta rowgroup with 4,601 rows [the 4,101 rows left over from the large file + the 500 rows from the small bulk insert]

So only a clustered columnstore index respects what the documentation says.

Conclusion

  • For nonclustered columnstore indexes, bulk insert (through bcp or the insert bulk statement) always creates compressed rowgroups, even if the number of rows inserted is very low. This behavior is by design, and the product has behaved this way ever since the feature was introduced in SQL Server 2016.
  • If you are using a NONCLUSTERED columnstore index, you may notice a high number of small rowgroups in sys.dm_db_column_store_row_group_physical_stats.
    • You can merge these small compressed rowgroups by scheduling an ALTER INDEX REORGANIZE command (see the sketch after this list).
  • In cases where there is an excessive number of such rowgroups, as in this customer’s database, our recommendation is:
    • If it’s practical, the application should insert data in larger batches.
    • If that’s not practical, it’s preferable for the application to use the regular insert API instead of the bulk one. That way, the rows are inserted into deltastore rowgroups, which get compressed when they reach a size of 1M rows.
    • It may also be worthwhile to consider clustered columnstore indexes. For a typical data warehousing application without a lot of deletes or updates, clustered columnstore indexes normally perform better and get better compression.
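As referenced in the list above, here is a minimal sketch of the REORGANIZE approach, using the index names from the test scripts. The COMPRESS_ALL_ROW_GROUPS option additionally forces delta rowgroups into compressed ones instead of waiting for the tuple mover:

-- Merge small compressed rowgroups in the nonclustered columnstore index
ALTER INDEX IX_COLUMNSTORE ON COLUMNSTOREBULK REORGANIZE;

-- Also compress any delta rowgroups in the clustered columnstore index
ALTER INDEX CIX_CLUSTERED_COLUMNSTOREBULK ON CLUSTERED_COLUMNSTOREBULK
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);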
