What’s new in 2020.1

SQream DB v2020.1 contains lots of new features, improved performance, and bug fixes.

This is the first release of 2020, with a strong focus on integration into existing environments. The release includes connectivity to Hadoop and other legacy data warehouse ecosystems. We’re also bringing lots of new capabilities to our analytics engine, to empower data users to analyze more data with less friction.

The latest release vastly improves reliability and performance, and makes getting more data into SQream DB easier than ever.

The core of SQream DB v2020.1 contains new integration features, more analytics capabilities, and better drivers and connectors.

New features

Integrations

  • Load files directly from S3 buckets. Customers with columnar data in S3 data lakes can now access the data directly. All that is needed is to simply point an external table to an S3 bucket with Parquet, ORC, or CSV objects. This feature is available on all deployments of SQream DB – in the cloud and on-prem.
  • Load files directly from HDFS. SQream DB now comes with built-in, native HDFS support for directly loading data from Hadoop-based data lakes. Our focus on helping Hadoop customers do more with their data led us to develop this feature, which works out of the box. As a result, SQream DB can now not only read but also write data, and intermediate results back to HDFS for HIVE and other data consumers. SQream DB now fits seamlessly into a Hadoop data pipeline.
  • Import ORC files, through External Tables. ORC files join Parquet as files that can be natively accessed and inserted into SQream DB tables.
  • Python driver (pysqream) is now DB-API v2.0 compliant. Customers can write high-performance Python applications that make full use of SQream DB - connect, query, delete, and insert data. Data scientists can use pysqream with Pandas, Numpy, and AI/ML frameworks like TensorFlow for direct queries of huge datasets.
  • Certified Tableau JDBC connector (taco), now also supported on MacOS. Users are encouraged to install the new JDBC connector.
  • All logs are now unified into one log, which can be analyzed with SQream DB directly. See Logging for more information.

SQL support

  • Added frames and frame exclusions to Window functions. This is available for preview, with more features coming in the next version.

    The new frames and frame exclusionsfeature adds complex analytics capabilities to the already powerful window functions.

  • New datatype - TEXT, which replaces NVARCHAR directly with UTF-8 support and improved performance.

    Unlike VARCHAR, the new TEXT data type has no restrictions on size, and carries no performance overhead as the text sizes grow.

  • TEXT join keys are now supported

  • Added lots of new aggregate functions, including VAR_SAMP, VAR_POP, COVAR_POP, etc.

Improvements and fixes

SQream DB v2020.1 includes hundreds of small new features and tunable parameters that improve performance, reliability, and stability. Existing SQream DB users can expect to see a general speedup of around 10% on most statements and queries!

  • 207 bug fixes, including:

    • Improved performance of both inner and outer joins
    • Fixed wrong results on STDDEV (0 instead of NULL)
    • Fixed wrong results on nested Parquet files
    • Fixed failing cast from VARCHAR to FLOAT
    • Fix INSERT that would fail on nullable values and non-nullable columns in some scenarios
    • Improved memory consumption, so Out of GPU memory errors should not occur anymore
    • Reduced long compilation times for very complex queries
    • Improved ODBC reliability
    • Fixed situation where some logs would clip very long queries
    • Improved error messages when dropping a schema with many objects
    • Fixed situation where Spotfire would not show table names
    • Fixed situation where some queries with UTF-8 literals wouldn’t run through Tableau over ODBC
    • Significantly improved cache freeing and memory allocation
    • Fixed situation in which a malformed time (24:00:00) would get incorrectly inserted from a CSV
    • Fixed race condition in which loading thousands of small files from HDFScaused a memory leak
  • The saved query feature can now be used with INSERT statements

  • Faster “Deferred gather” algorithm for joins with text keys

  • Faster filtering when using DATEPART

  • Faster metadata tagging during load

  • Fixed situation where some queries would get compiled twice

  • Saved queries now support INSERT statements

  • highCardinalityColumns can be configured to tell the system about high selectivity columns

  • sqream sql starts up faster, can run on any Linux machine

  • Additional CSV date formats (date parsers) added for compatibility

Behaviour changes

  • ClientCmd is now known as sqream sql

  • NVARCHAR columns are now known as TEXT internally

  • Deprecated the ability to run SELECT and COPY at the same time on the same worker. This change is designed to protect against out of GPU memory issues. This comes with a configuration change, namely the limitQueryMemoryGB setting. See the operations section for more information.

  • All logs are now unified into one log. See Logging for more information

  • Compression changes:

    • The latest version of SQream DB could select a different compression scheme if data is reloaded, compared to previous versions of SQream DB. This internal change improves performance.
    • With LZ4 compression, the maximum chunk size is limited to 2.1GB. If the chunk size is bigger, another compression may be selected - primarily SNAPPY.
  • The following configuration flags have been deprecated:

    • addStatementRechunkerAfterGpuToHost
    • increasedChunkSizeFactor
    • gpuReduceMergeOutputFactor
    • fullSortInputMemFactor
    • reduceInputMemFactor
    • distinctInputMemFactor
    • useAutoMemFactors
    • autoMemFactorsVramFactor
    • catchNotEnoughVram
    • useNetworkRechunker
    • useMemFactorInJoinOutput

Operations

  • The client-server protocol has been updated to support faster data flow, and more reliable memory allocations on the client side. End users are required to use only the latest sqream sql, JDBC, and ODBC drivers delivered with this version. See the client driver download page for the latest drivers and connectors.

  • When upgrading from a previous version of SQream DB (for example, v2019.2), the storage version must be upgraded using the upgrade_storage utility: ./bin/upgrade_storage /path/to/storage/sqreamdb/

  • A change in memory allocation behaviour in this version sees the introduction of a new setting, limitQueryMemoryGB. This is an addition to the previous spoolMemoryGB setting.

    A good rule-of-thumb is to allow 5% system memory for other processes. The spool memory allocation should be around 90% of the total memory allocated.

    • limitQueryMemoryGB defines how much total system memory is used by the worker. The recommended setting is (total host memory - 5%) / sqreamd workers on host.
    • spoolMemoryGB defines how much memory is set aside for spooling, out of the total system memory allocated in limitQueryMemoryGB. The recommended setting is 90% of the limitQueryMemoryGB.

    This setting must be set lower than the limitQueryMemoryGB setting.

    For example, for a machine with 512GB of RAM and 4 workers, the recommended settings are:

    • limitQueryMemoryGB - ⌊(512 * 0.95 / 4)⌋ ~ 486 / 4 121.
    • spoolMemoryGB - ⌊( 0.9 * limitQueryMemoryGB )⌋ ⌊( 0.9 * 121 )⌋ 108

    Example settings per-worker, for 512GB of RAM and 4 workers:

    "runtimeGlobalFlags": {
       "limitQueryMemoryGB" : 121,
       "spoolMemoryGB" : 108
    

Known Issues & Limitations

  • An invalid formatted CSV can cause an insufficient memory error on a COPY FROM statement if a quote isn’t closed and the file is much larger than system memory.
  • TEXT columns cannot be used in a window functions’ partition
  • Parsing errors are sometimes hard to read - the location points to the wrong part of the statement
  • LZ4 compression may not be applied correctly on very large VARCHAR columns, which decreases performance
  • Using SUM on very large numbers in window functions can error (overflow) when not used with an ORDER BY clause
  • Slight performance decrease with DATEADD in this version (<4%)
  • Operations on Snappy-compressed ORC files are slower than their Parquet equivalents.

Upgrading to v2020.1

Versions are available for IBM POWER9, RedHat (CentOS) 7, Ubuntu 18.04, and other OSs via Docker.

Contact your account manager to get the latest release of SQream DB.