Optimization and best practices¶
This topic explains some best practices of working with SQream DB.
In this topic:
- Table design
- Query best practices
- Data loading considerations
This section describes best practices and guidelines for designing tables.
When creating tables with dates or timestamps, using the purpose-built
DATETIME types over integer types or
VARCHAR will bring performance and storage footprint improvements, and in many cases huge performance improvements (as well as data integrity benefits). SQream DB stores dates and datetimes very efficiently and can strongly optimize queries using these specific types.
VARCHAR type, the length has a direct effect on query performance.
If the size of your column is predictable, by defining an appropriate column length (no longer than the maximum actual value) you will get the following benefits:
- Data loading issues can be identified more quickly
- SQream DB can reserve less memory for decompression operations
- Third-party tools that expect a data size are less likely to over-allocate memory
SQream DB executes JOIN operations very effectively. It is almost always better to JOIN tables at query-time rather than flatten/denormalize your tables.
This will also reduce storage size and reduce row-lengths.
We highly suggest using
BIGINT as join keys, rather than a text/string type.
SQream DB’s native storage is heavily optimized for analytic workloads. It is always faster for querying than other formats, even columnar ones such as Parquet. It also enables the use of additional metadata to help speed up queries, in some cases by many orders of magnitude.
CREATE TABLE native_table AS SELECT * FROM external_table
The one situation when this wouldn’t be as useful is when data will be only queried once.
Knowing the data types and their ranges can help design a better table.
For example, if a value can’t be missing (or
NULL), specify a
NOT NULL constraint on the columns.
Not only does specifying
NOT NULL save on data storage, it lets the query compiler know that a column cannot have a
NULL value, which can improve query performance.
Data sorting is an important factor in minimizing storage size and improving query performance.
- Minimizing storage saves on physical resources and increases performance by reducing overall disk I/O. Prioritize the sorting of low-cardinality columns. This reduces the number of chunks and extents that SQream DB reads during query execution.
- Where possible, sort columns with the lowest cardinality first. Avoid sorting
TEXT/NVARCHARcolumns with lengths exceeding 50 characters.
- For longer-running queries that run on a regular basis, performance can be improved by sorting data based on the
GROUP BYparameters. Data can be sorted during insert by using Foreign Tables or by using CREATE TABLE AS.
This section describes best practices for writing SQL queries.
Reducing the input to a
JOIN clause can increase performance.
Some queries benefit from retreiving a reduced dataset as a subquery prior to a join.
SELECT store_name, SUM(amount) FROM store_dim AS dim INNER JOIN store_fact AS fact ON dim.store_id=fact.store_id WHERE p_date BETWEEN '2018-07-01' AND '2018-07-31' GROUP BY 1;
Can be rewritten as
SELECT store_name, sum_amount FROM store_dim AS dim INNER JOIN (SELECT SUM(amount) AS sum_amount, store_id FROM store_fact WHERE p_date BETWEEN '2018-07-01' AND '2018-07-31' group by 2) AS fact ON dim.store_id=fact.store_id;
Use the high selectivity hint when you expect a predicate to filter out most values.
SELECT store_name, SUM(amount) FROM store_dim WHERE HIGH_SELECTIVITY(p_date = '2018-07-01') GROUP BY 1;
This hint tells the query compiler that the
WHERE condition is expected to filter out more than 50% of values. It does not affect the query results, but when used correctly can improve query performance.
When using an
INT or smaller type, the
COUNT operations return a value of the same type.
To avoid overflow on large results, cast the column up to a larger type.
SELECT store_name, SUM(amount :: BIGINT) FROM store_dim GROUP BY 1;
SQream DB optimizes
COUNT(*) queries very strongly. This also applies to
COUNT(column_name) on non-nullable columns. Using
COUNT(column_name) on a nullable column will operate quickly, but much slower than the previous variations.
Returning only the columns you need to client programs can improve overall query performance. This also reduces the overall result set, which can improve performance in third-party tools.
SQream is able to optimize out unneeded columns very strongly due to its columnar storage.
Saved queries are compiled when they are created. The query plan is saved in SQream DB’s metadata for later re-use.
Because the query plan is saved, they can be used to reduce compilation overhead, especially with very complex queries, such as queries with lots of values in an IN predicate.
When executed, the saved query plan is recalled and executed on the up-to-date data stored on disk.
See how to use saved queries in the saved queries guide.
Pre-filter to reduce JOIN complexity¶
Filter and reduce table sizes prior to joining on them
SELECT store_name, SUM(amount) FROM dimention dim JOIN fact ON dim.store_id = fact.store_id WHERE p_date BETWEEN '2019-07-01' AND '2019-07-31' GROUP BY store_name;
Can be rewritten as:
SELECT store_name, sum_amount FROM dimention AS dim INNER JOIN (SELECT SUM(amount) AS sum_amount, store_id FROM fact WHERE p_date BETWEEN '2019-07-01' AND '2019-07-31' GROUP BY store_id) AS fact ON dim.store_id = fact.store_id;
Very often, tabular data is already naturally ordered along a dimension such as a timestamp or area.
This natural order is a major factor for query performance later on, as data that is naturally sorted can be more easily compressed and analyzed with SQream DB’s metadata collection.
For example, when data is sorted by timestamp, filtering on this timestamp is more effective than filtering on an unordered column.
Natural ordering can also be used for effective DELETE operations.