Release Notes 2020.1
SQream DB v2020.1 contains lots of new features, improved performance, and bug fixes.
This is the first release of 2020, with a strong focus on integration into existing environments. The release includes connectivity to Hadoop and other legacy data warehouse ecosystems. We’re also bringing lots of new capabilities to our analytics engine, to empower data users to analyze more data with less friction.
The latest release vastly improves reliability and performance, and makes getting more data into SQream DB easier than ever.
The core of SQream DB v2020.1 contains new integration features, more analytics capabilities, and better drivers and connectors.
New features
Integrations
Load files directly from S3 buckets. Customers with columnar data in S3 data lakes can now access the data directly. All that is needed is to simply point an external table to an S3 bucket with Parquet, ORC, or CSV objects. This feature is available on all deployments of SQream DB – in the cloud and on-prem.
Load files directly from HDFS. SQream DB now comes with built-in, native HDFS support for directly loading data from Hadoop-based data lakes. Our focus on helping Hadoop customers do more with their data led us to develop this feature, which works out of the box. As a result, SQream DB can now not only read but also write data, and intermediate results back to HDFS for HIVE and other data consumers. SQream DB now fits seamlessly into a Hadoop data pipeline.
Import ORC files, through external_tables. ORC files join Parquet as files that can be natively accessed and inserted into SQream DB tables.
Python driver (pysqream) is now DB-API v2.0 compliant. Customers can write high-performance Python applications that make full use of SQream DB - connect, query, delete, and insert data. Data scientists can use pysqream with Pandas, Numpy, and AI/ML frameworks like TensorFlow for direct queries of huge datasets.
Certified Tableau JDBC connector (taco), now also supported on MacOS. Users are encouraged to install the new JDBC connector.
All logs are now unified into one log, which can be analyzed with SQream DB directly. See Logging for more information.
SQL support
Added frames and frame exclusions to Window Functions. This is available for preview, with more features coming in the next version.
The new frames and frame exclusionsfeature adds complex analytics capabilities to the already powerful window functions.
New datatype -
TEXT
, which replacesNVARCHAR
directly with UTF-8 support and improved performance.Unlike
VARCHAR
, the newTEXT
data type has no restrictions on size, and carries no performance overhead as the text sizes grow.TEXT
join keys are now supportedAdded lots of new aggregate functions, including
VAR_SAMP
,VAR_POP
,COVAR_POP
, etc.
Improvements and fixes
SQream DB v2020.1 includes hundreds of small new features and tunable parameters that improve performance, reliability, and stability. Existing SQream DB users can expect to see a general speedup of around 10% on most statements and queries!
207 bug fixes, including:
Improved performance of both inner and outer joins
Fixed wrong results on STDDEV (0 instead of
NULL
)Fixed wrong results on nested Parquet files
Fixed failing cast from
VARCHAR
toFLOAT
Fix
INSERT
that would fail on nullable values and non-nullable columns in some scenariosImproved memory consumption, so
Out of GPU memory
errors should not occur anymoreReduced long compilation times for very complex queries
Improved ODBC reliability
Fixed situation where some logs would clip very long queries
Improved error messages when dropping a schema with many objects
Fixed situation where Spotfire would not show table names
Fixed situation where some queries with UTF-8 literals wouldn’t run through Tableau over ODBC
Significantly improved cache freeing and memory allocation
Fixed situation in which a malformed time (
24:00:00
) would get incorrectly inserted from a CSVFixed race condition in which loading thousands of small files from HDFScaused a memory leak
The saved query feature can now be used with INSERT statements
Faster “Deferred gather” algorithm for joins with text keys
Faster filtering when using DATEPART
Faster metadata tagging during load
Fixed situation where some queries would get compiled twice
Saved Queries now support INSERT statements
highCardinalityColumns
can be configured to tell the system about high selectivity columnssqream sql starts up faster, can run on any Linux machine
Additional CSV date formats (date parsers) added for compatibility
Behaviour changes
ClientCmd
is now known as sqream sqlNVARCHAR
columns are now known asTEXT
internallyDeprecated the ability to run
SELECT
andCOPY
at the same time on the same worker. This change is designed to protect againstout of GPU memory
issues. This comes with a configuration change, namely thelimitQueryMemoryGB
setting. See the operations section for more information.All logs are now unified into one log. See Logging for more information
Compression changes:
The latest version of SQream DB could select a different compression scheme if data is reloaded, compared to previous versions of SQream DB. This internal change improves performance.
With
LZ4
compression, the maximum chunk size is limited to 2.1GB. If the chunk size is bigger, another compression may be selected - primarilySNAPPY
.
The following configuration flags have been deprecated:
addStatementRechunkerAfterGpuToHost
increasedChunkSizeFactor
gpuReduceMergeOutputFactor
fullSortInputMemFactor
reduceInputMemFactor
distinctInputMemFactor
useAutoMemFactors
autoMemFactorsVramFactor
catchNotEnoughVram
useNetworkRechunker
useMemFactorInJoinOutput
Operations
The client-server protocol has been updated to support faster data flow, and more reliable memory allocations on the client side. End users are required to use only the latest sqream sql, JDBC, and ODBC drivers delivered with this version. See the client driver download page for the latest drivers and connectors.
When upgrading from a previous version of SQream DB (for example, v2019.2), the storage version must be upgraded using the upgrade_storage utility:
./bin/upgrade_storage /path/to/storage/sqreamdb/
A change in memory allocation behaviour in this version sees the introduction of a new setting,
limitQueryMemoryGB
. This is an addition to the previousspoolMemoryGB
setting.A good rule-of-thumb is to allow 5% system memory for other processes. The spool memory allocation should be around 90% of the total memory allocated.
limitQueryMemoryGB
defines how much total system memory is used by the worker. The recommended setting is (total host memory
- 5%) /sqreamd workers on host
.spoolMemoryGB
defines how much memory is set aside for spooling, out of the total system memory allocated inlimitQueryMemoryGB
. The recommended setting is 90% of thelimitQueryMemoryGB
.
This setting must be set lower than the
limitQueryMemoryGB
setting.For example, for a machine with 512GB of RAM and 4 workers, the recommended settings are:
limitQueryMemoryGB
-⌊(512 * 0.95 / 4)⌋ → ~ 486 / 4 → 121
.spoolMemoryGB
-⌊( 0.9 * limitQueryMemoryGB )⌋ → ⌊( 0.9 * 121 )⌋ → 108
Example settings per-worker, for 512GB of RAM and 4 workers:
"runtimeGlobalFlags": { "limitQueryMemoryGB" : 121, "spoolMemoryGB" : 108
Known Issues & Limitations
An invalid formatted CSV can cause an
insufficient memory
error on a COPY FROM statement if a quote isn’t closed and the file is much larger than system memory.TEXT
columns cannot be used in a window functions’ partitionParsing errors are sometimes hard to read - the location points to the wrong part of the statement
LZ4 compression may not be applied correctly on very large
VARCHAR
columns, which decreases performanceUsing
SUM
on very large numbers in window functions can error (overflow
) when not used with anORDER BY
clauseSlight performance decrease with DATEADD in this version (<4%)
Operations on Snappy-compressed ORC files are slower than their Parquet equivalents.
Upgrading to v2020.1
Versions are available for IBM POWER9, RedHat (CentOS) 7, Ubuntu 18.04, and other OSs via Docker.
Contact your account manager to get the latest release of SQream.