Compression
SQreamDB uses a variety of compression and encoding methods to optimize query performance and to save disk space.
Encoding
Encoding is an automatic operation that converts data into compact, common formats. For example, data stored in columnar format uses specific binary encodings, in contrast with data stored in a CSV file, which stores all data as text.
Encoding enhances performance and reduces data size by using specific data formats and encoding methods. SQream encodes data in a number of ways according to the data type. For example, a date is stored as an integer counting from March 1st, 1 CE, which is significantly more efficient than encoding the date as a string, and offers a wider range than storing it relative to the Unix epoch.
Lossless Compression
Compression transforms data into a smaller format without sacrificing accuracy; this is known as lossless compression.
After encoding a set of column values, SQream packs and compresses the data, and later decompresses it to make it accessible to users. Depending on the compression scheme used, these operations can be performed on the CPU or the GPU. Some users find that GPU compression provides better performance.
Automatic Compression
By default, SQream automatically compresses every column (see Specifying Compression Strategies below for overriding default compression). This feature is called automatic adaptive compression strategy.
When loading data, SQreamDB automatically decides on the compression schemes for specific chunks of data by trying several compression schemes and selecting the one that performs best. SQreamDB balances more aggressive compression against the CPU/GPU time required to compress and decompress the data.
Compression Methods
The following table shows the supported compression methods:
| Compression Method | Supported Data Types | Description | Location |
|---|---|---|---|
| flat | All types | No compression (forced) | NA |
| default | All types | Automatic scheme selection | NA |
| dict | All types | Dictionary compression with RLE. For each chunk, SQreamDB creates a dictionary of distinct values and stores only their indexes. Works best for integers and texts shorter than 120 characters with fewer than 10% unique values, such as ENUMs, keys, stock tickers, and dimensions. If the data is also sorted, this compression performs even better. | GPU |
| p4d | Numeric, date, and datetime types | Patched frame-of-reference + delta, based on the delta between consecutive values. Works best for monotonically increasing or decreasing numbers and timestamps. | GPU |
| lz4 | Text types | Lempel-Ziv general-purpose compression, used for texts | CPU |
| zlib | Text types | General-purpose compression, used for texts | CPU |
| rle | Numeric, date, and datetime types | Run-length encoding. Replaces each run of consecutive identical values with a single (value, count) pair. Best for low-cardinality columns that are used to sort data. | GPU |
| sequence | Integer types | Optimized RLE + delta type for built-in identity columns | GPU |
| nv_<algorithm> | All types | NVIDIA nvCOMP compression, specified manually (see NVIDIA nvCOMP Compression below) | GPU |
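To illustrate, a column can be tied to one of these methods through the CHECK clause. The following is a minimal, hypothetical sketch (table and column names are illustrative; the full syntax is covered under Specifying Compression Strategies below):

CREATE TABLE trades (
    ticker TEXT(10) CHECK('CS "dict"'),  -- few distinct values, short text
    trade_ts DATETIME CHECK('CS "p4d"'), -- monotonically increasing timestamps
    price DOUBLE                         -- left to automatic selection
);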
NVIDIA nvCOMP Compression
NVIDIA nvCOMP is a high-speed data compression and decompression library specifically optimized for NVIDIA GPUs.
Its main purpose is to accelerate data-intensive applications—like AI training, High-Performance Computing, data science, and analytics—by significantly reducing data transfer bottlenecks. Since these applications often involve moving massive amounts of data, nvCOMP enables the compression and decompression to happen efficiently directly on the GPU, which is much faster than relying solely on the CPU.
Supported Algorithms:
| Algorithm Name | Description |
|---|---|
| Snappy | Known for its balance of speed and reasonable compression. A general-purpose, byte-level compressor well-suited to a wide range of datasets. |
| LZ4 | An extremely fast compression and decompression algorithm. A no-entropy, byte-level compressor ideal for maximizing query performance. Works well on most data types, particularly TEXT and ARRAY types. |
| ZSTD | Provides a much better compression ratio than LZ4 at the cost of some performance. A good choice for users who want to prioritize storage efficiency. Like LZ4, it is a general-purpose compressor for a wide range of data. |
| GDeflate | A GPU-optimized version of the DEFLATE format, designed to extract parallelism from the bitstream and achieve high throughput, especially during decompression. Compression levels (permitted values): 0 – highest-throughput, entropy-only compression (use for symmetric compression/decompression performance); 1 – high-throughput, low compression ratio (default); 2 – medium-throughput, medium compression ratio, beats Zlib level 1; 3 – placeholder for future compression levels, currently maps to medium compression; 4 – lower-throughput, higher compression ratio, beats Zlib level 6; 5 – lowest-throughput, highest compression ratio. |
| Deflate | Combines LZ77 (replacing repeating data strings) with Huffman coding (assigning shorter codes to frequent symbols), making it a widely used and efficient algorithm (e.g., Gzip, ZIP). Compression levels (permitted values): 1 – high-throughput, low compression ratio (default); 2 – medium-throughput, medium compression ratio, beats Zlib level 1; 3 – placeholder for future compression levels, currently maps to medium compression; 4 – lower-throughput, higher compression ratio, beats Zlib level 6; 5 – lowest-throughput, highest compression ratio. |
| Cascading | Pipelines multiple compression stages (e.g., RLE, delta encoding, and bit-packing; see the tuning parameters below) to improve both ratio and speed. Great for data with mixed characteristics. |
| Bitcomp | A bit-level compression algorithm for numerical data. Optimizes storage by using the minimum number of bits required for each value rather than a fixed size. For example, 32-bit integers whose values fit in 8 bits are stored in 8 bits, saving space. |
| ANS | ANS (Asymmetric Numeral Systems) is a fast, modern entropy encoder. It is more parallel-friendly and often more efficient than older methods like Huffman coding. nvCOMP runs ANS on the GPU to achieve high-ratio, high-speed compression, perfect for high-performance computing. |
Note
SQreamDB does not automatically select nvCOMP compression schemes in the current version; they must be configured manually.
Supported Data Types:
All of these algorithms are generic and operate on raw bytes, so they can be applied to any data type, including fixed-length types (INTEGER, BIGINT, FLOAT, DOUBLE) and variable-length types (TEXT and ARRAY).
nvCOMP Parameter Tuning:
The following parameters can be adjusted to tune performance and compression ratio for nvCOMP.
| Parameter Name | Default Value | Description |
|---|---|---|
| nvcompChunkSize | 65,536 | Defines the size of internal data chunks processed by nvCOMP. Smaller values result in more chunks, which can increase parallel execution and potentially speed up processing, though very small chunks may introduce overhead. |
| numRLEsCascadedCompress | 2 | Specifies the number of Run-Length Encodings (RLE) to perform as part of the cascaded compression process. |
| numDeltasCascadedCompress | 1 | Specifies the number of Delta Encodings to perform as part of the cascaded compression process. |
| useBPCascadedCompress | 1 | A boolean flag (1 for true, 0 for false) indicating whether to apply bit-packing to the final layers of the cascaded compression pipeline. |
| nvcompBitcompAlgoOption | 0 | Selects the algorithm used for Bitcomp compression. |
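How these flags are applied depends on your deployment. Assuming they behave like other SQreamDB tuning flags that can be set per session (an assumption; consult your configuration documentation for the exact flag scope), a sketch might look like this:

-- Hypothetical: session-level tuning of cascaded compression
SET nvcompChunkSize = 32768;     -- smaller chunks, more parallelism
SET numRLEsCascadedCompress = 1; -- fewer RLE passes in the cascade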
Syntax:
Compression is specified manually using the CHECK clause, which supports nvCOMP and its algorithms, e.g. CHECK('CS "nv_cascading"').
The available nvCOMP options are: nv_cascading, nv_lz4, nv_snappy, nv_gdeflate, nv_deflate, nv_ans, nv_bitcomp, and nv_zstandard (ZSTD).
When using nv_gdeflate or nv_deflate, append a numeric compression level as specified in the table above. For example: nv_deflate_5.
CREATE TABLE <table_name> (
<column_name_1> <data_type_1> [CHECK('CS "nv_<ALGORITHM>"')],
<column_name_2> <data_type_2> [CHECK('CS "nv_<ALGORITHM>"')],
...
);
Usage examples:
CREATE TABLE sales_data (
transaction_id BIGINT CHECK('CS "nv_snappy"'), -- Fast, general-purpose compression for the ID
product_description TEXT CHECK('CS "nv_zstandard"'), -- High-ratio compression for text
quantity INTEGER CHECK('CS "nv_lz4"') -- Max performance for an integer column
);
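The level suffix for nv_gdeflate and nv_deflate follows the same pattern. A hypothetical sketch (table and column names are illustrative):

CREATE TABLE event_log (
    event_id BIGINT CHECK('CS "nv_gdeflate_2"'), -- medium throughput, medium ratio
    payload TEXT CHECK('CS "nv_deflate_5"')      -- lowest throughput, highest ratio
);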
Specifying Compression Strategies
When you create a table without defining any compression specifications, SQream defaults to automatic adaptive compression ("default"). You can override this by specifying a different compression strategy when creating the table.
This section describes the following compression strategies: explicitly specifying automatic compression, forcing no compression, and forcing a specific compression.
Explicitly Specifying Automatic Compression
Automatic compression can also be specified explicitly. The following two table definitions are equivalent:
CREATE TABLE t (
x INT,
y TEXT(50)
);
In this version, the default compression is specified explicitly:
CREATE TABLE t (
x INT CHECK('CS "default"'),
y TEXT(50) CHECK('CS "default"')
);
Forcing No Compression
Forcing no compression is also known as “flat”, and can be used when you want to disable compression entirely for some columns. This may be useful for reducing CPU or GPU resource utilization at the expense of increased I/O.
The following is an example of removing compression:
CREATE TABLE t (
x INT NOT NULL CHECK('CS "flat"'), -- This column won't be compressed
y TEXT(50) -- This column will still be compressed automatically
);
Forcing Compression
In other cases, you may want to force SQream to use a specific compression scheme based on your knowledge of the data, as shown in the following example:
CREATE TABLE t (
   id BIGINT NOT NULL CHECK('CS "sequence"'),
   y TEXT(110) CHECK('CS "lz4"'), -- General purpose text compression
   z TEXT(80) CHECK('CS "dict"') -- Low cardinality column
);
However, if SQream finds that the specified compression method cannot effectively compress the data, it falls back to the default compression.
Examining Compression Effectiveness
Queries made on the internal metadata catalog can expose how effective the compression is, as well as what compression schemes were selected.
This section describes how to query the catalog, along with an example from the ontime table.
Querying the Catalog
The following is a sample query that can be used to query the catalog:
SELECT c.column_name AS "Column",
cc.compression_type AS "Actual compression",
AVG(cc.compressed_size) "Compressed",
AVG(cc.uncompressed_size) "Uncompressed",
AVG(cc.uncompressed_size::FLOAT/ cc.compressed_size) -1 AS "Compression effectiveness",
MIN(c.compression_strategy) AS "Compression strategy"
FROM sqream_catalog.chunk_columns cc
INNER JOIN sqream_catalog.columns c
ON cc.table_id = c.table_id
AND cc.database_name = c.database_name
AND cc.column_id = c.column_id
WHERE c.table_name = 'some_table' -- This is the table name which we want to inspect
GROUP BY 1,
2;
Example Subset from “Ontime” Table
The following is an example (subset) from the ontime table:
stats=> SELECT c.column_name AS "Column",
. cc.compression_type AS "Actual compression",
. AVG(cc.compressed_size) "Compressed",
. AVG(cc.uncompressed_size) "Uncompressed",
. AVG(cc.uncompressed_size::FLOAT/ cc.compressed_size) -1 AS "Compression effectiveness",
. MIN(c.compression_strategy) AS "Compression strategy"
. FROM sqream_catalog.chunk_columns cc
. INNER JOIN sqream_catalog.columns c
. ON cc.table_id = c.table_id
. AND cc.database_name = c.database_name
. AND cc.column_id = c.column_id
.
. WHERE c.table_name = 'ontime'
.
. GROUP BY 1,
. 2;
Column | Actual compression | Compressed | Uncompressed | Compression effectiveness | Compression strategy
--------------------------+--------------------+------------+--------------+---------------------------+---------------------
actualelapsedtime@null | dict | 129177 | 1032957 | 7 | default
actualelapsedtime@val | dict | 1379797 | 4131831 | 2 | default
airlineid | dict | 578150 | 2065915 | 2.7 | default
airtime@null | dict | 130011 | 1039625 | 7 | default
airtime@null | rle | 93404 | 1019833 | 116575.61 | default
airtime@val | dict | 1142045 | 4131831 | 7.57 | default
arrdel15@null | dict | 129177 | 1032957 | 7 | default
arrdel15@val | dict | 129183 | 4131831 | 30.98 | default
arrdelay@null | dict | 129177 | 1032957 | 7 | default
arrdelay@val | dict | 1389660 | 4131831 | 2 | default
arrdelayminutes@null | dict | 129177 | 1032957 | 7 | default
arrdelayminutes@val | dict | 1356034 | 4131831 | 2.08 | default
arrivaldelaygroups@null | dict | 129177 | 1032957 | 7 | default
arrivaldelaygroups@val | p4d | 516539 | 2065915 | 3 | default
arrtime@null | dict | 129177 | 1032957 | 7 | default
arrtime@val | p4d | 1652799 | 2065915 | 0.25 | default
arrtimeblk | dict | 688870 | 9296621 | 12.49 | default
cancellationcode@null | dict | 129516 | 1035666 | 7 | default
cancellationcode@null | rle | 54392 | 1031646 | 131944.62 | default
cancellationcode@val | dict | 263149 | 1032957 | 4.12 | default
cancelled | dict | 129183 | 4131831 | 30.98 | default
carrier | dict | 578150 | 2065915 | 2.7 | default
carrierdelay@null | dict | 129516 | 1035666 | 7 | default
carrierdelay@null | flat | 1041250 | 1041250 | 0 | default
carrierdelay@null | rle | 4869 | 1026493 | 202740.2 | default
carrierdelay@val | dict | 834559 | 4131831 | 14.57 | default
crsarrtime | p4d | 1652799 | 2065915 | 0.25 | default
crsdeptime | p4d | 1652799 | 2065915 | 0.25 | default
crselapsedtime@null | dict | 130449 | 1043140 | 7 | default
crselapsedtime@null | rle | 3200 | 1013388 | 118975.75 | default
crselapsedtime@val | dict | 1182286 | 4131831 | 2.5 | default
dayofmonth | dict | 688730 | 1032957 | 0.5 | default
dayofweek | dict | 393577 | 1032957 | 1.62 | default
departuredelaygroups@null | dict | 129177 | 1032957 | 7 | default
departuredelaygroups@val | p4d | 516539 | 2065915 | 3 | default
depdel15@null | dict | 129177 | 1032957 | 7 | default
depdel15@val | dict | 129183 | 4131831 | 30.98 | default
depdelay@null | dict | 129177 | 1032957 | 7 | default
depdelay@val | dict | 1384453 | 4131831 | 2.01 | default
depdelayminutes@null | dict | 129177 | 1032957 | 7 | default
depdelayminutes@val | dict | 1362893 | 4131831 | 2.06 | default
deptime@null | dict | 129177 | 1032957 | 7 | default
deptime@val | p4d | 1652799 | 2065915 | 0.25 | default
deptimeblk | dict | 688870 | 9296621 | 12.49 | default
month | dict | 247852 | 1035246 | 3.38 | default
month | rle | 5 | 607346 | 121468.2 | default
origin | dict | 1119457 | 3098873 | 1.78 | default
quarter | rle | 8 | 1032957 | 136498.61 | default
securitydelay@null | dict | 129516 | 1035666 | 7 | default
securitydelay@null | flat | 1041250 | 1041250 | 0 | default
securitydelay@null | rle | 4869 | 1026493 | 202740.2 | default
securitydelay@val | dict | 581893 | 4131831 | 15.39 | default
tailnum@null | dict | 129516 | 1035666 | 7 | default
tailnum@null | rle | 38643 | 1031646 | 121128.68 | default
tailnum@val | dict | 1659918 | 12395495 | 22.46 | default
taxiin@null | dict | 130011 | 1039625 | 7 | default
taxiin@null | rle | 93404 | 1019833 | 116575.61 | default
taxiin@val | dict | 839917 | 4131831 | 8.49 | default
taxiout@null | dict | 130011 | 1039625 | 7 | default
taxiout@null | rle | 84327 | 1019833 | 116575.86 | default
taxiout@val | dict | 891539 | 4131831 | 8.28 | default
totaladdgtime@null | dict | 129516 | 1035666 | 7 | default
totaladdgtime@null | rle | 3308 | 1031646 | 191894.18 | default
totaladdgtime@val | dict | 465839 | 4131831 | 20.51 | default
uniquecarrier | dict | 578221 | 7230705 | 11.96 | default
year | rle | 6 | 2065915 | 317216.08 | default
Notes on Reading the “Ontime” Table
The following are some useful notes on reading the “Ontime” table shown above:
- Higher numbers in the Compression effectiveness column represent better compression. 0 represents a column that has not been compressed.
- Column names are an internal representation. Names with @null and @val suffixes represent a nullable column's null (boolean) and value parts respectively, but are treated as one logical column.
- The query lists all actual compressions for a column, so a column may appear several times if the compression changed mid-way through loading (as with the carrierdelay column).
- When your compression strategy is default, the system automatically selects the best compression, which may include no compression at all (flat).
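Building on the same catalog tables and columns used above, an aggregate variant of the query can summarize overall effectiveness for a table. A sketch (the table name is illustrative):

SELECT SUM(cc.compressed_size) AS "Total compressed",
       SUM(cc.uncompressed_size) AS "Total uncompressed",
       SUM(cc.uncompressed_size)::FLOAT / SUM(cc.compressed_size) AS "Overall ratio"
FROM sqream_catalog.chunk_columns cc
INNER JOIN sqream_catalog.columns c
        ON cc.table_id = c.table_id
       AND cc.database_name = c.database_name
       AND cc.column_id = c.column_id
WHERE c.table_name = 'ontime';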
Best Practices
This section describes the best compression practices:
Letting SQream Determine the Best Compression Strategy
In general, SQream determines the best compression strategy for most cases. If you decide to override SQream’s selected compression strategies, we recommend benchmarking your query and load performance in addition to your storage size.
Maximizing the Advantage of Each Compression Scheme
Some compression schemes perform better when data is organized in a specific way. For example, to take advantage of RLE, sorting a column may result in better performance and reduced disk-space and I/O usage. Sorting a column partially may also be beneficial. As a rule of thumb, aim for run-lengths of more than 10 consecutive values.
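For example, loading data pre-sorted on a low-cardinality column lengthens its runs. A hypothetical sketch (table and column names are illustrative):

CREATE TABLE events_sorted (
   status TINYINT NOT NULL CHECK('CS "rle"'), -- low cardinality, long runs when sorted
   ts DATETIME NOT NULL
);

INSERT INTO events_sorted
SELECT status, ts
FROM events
ORDER BY status; -- sorting produces runs well past the 10-value rule of thumb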
Choosing Data Types that Fit Your Data
Choosing the narrowest data type that fits your data improves query performance while reducing disk space usage. Moreover, smaller data types may compress better than larger types.
For example, SQream recommends using the smallest numeric data type that will accommodate your data. Using BIGINT for data that fits in INT or SMALLINT can use more disk space and memory for query execution. Using FLOAT to store integers will reduce compression’s effectiveness significantly.
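For instance, in a flights table (an illustrative sketch), narrow types suffice for calendar fields:

CREATE TABLE flights (
   dayofmonth TINYINT NOT NULL,   -- 1-31 fits in TINYINT
   flight_year SMALLINT NOT NULL, -- years fit comfortably in SMALLINT
   distance_miles INT NOT NULL    -- BIGINT or FLOAT would waste space here
);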