Query Split

The split query operation optimizes long-running queries by executing them in parallel on different GPUs and/or Workers, reducing overall runtime. This involves breaking down a complex query into parallel executions on small data subsets. To ensure an ordered result set aligned with the original complex query, two prerequisites are essential. First, create an empty table mirroring the original result set’s structure. Second, define the @@SetResult operator to split the query using an INTEGER, DATE, DATETIME or DATETIME2 column, as these types are compatible with the operator’s min and max variables.

Splitting is exclusive to the CLI and UI, utilizing Meta-scripting, a unique CLI & UI feature. Keep in mind that not all queries benefit, as this method introduces overhead runtime.

Syntax 

Creating an empty table mirroring the original query result set’s structure using the same DDL:

CREATE TABLE <final_result_table>
AS
(
  SELECT
   -- Original query..
  WHERE
    <false_filter>
)
   -- A false_filter example: 1=2

Defining the @@setresult operator to split the original query using an INTEGER, BIGINT, DATE, DATETIME or DATETIME2 column with min and max variables. If the column you’re splitting by is used in a WHERE clause in the original query, use a WHERE clause when setting the SetResult operator as well. The name you alias in the @@SetResult section is the name you reference in the @@SplitQueryBy… section. For example, aliasing as minMax allows you to reference it as minMax[0].min. The @@SetResult operator has a single-row limitation; an error is thrown for 0 or more than 1 row to prevent a huge memory buffer.

@@SetResult minMax
SELECT
  MIN(<column>) AS min,
  MAX(<column>) AS max
FROM
  <my_table>
[WHERE
  <column> BETWEEN
   -- Integer Range:
   1 AND 100
   | -- Date Range:
   'yyyy-mm-dd' AND 'yyyy-mm-dd'
   | -- DateTime Range:
   'yyyy-mm-dd hh:mm:ss:SSS' AND 'yyyy-mm-dd hh:mm:ss:SSS']

Defining the operator that determines the number of instances (splits) based on the data type of the column by which the query is split:

INTEGER column: use the @@SplitQueryByNumber operator

@@SplitQueryByNumber instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
  SELECT
   -- Original query..
  WHERE
    <column_to_split_by> BETWEEN '${from}' and '${to}'
)

DATE column: use the @@SplitQueryByDate operator

@@SplitQueryByDate instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
  SELECT
   -- Original query..
  WHERE
    <column_to_split_by> BETWEEN '${from}' and '${to}'
)

DATETIME column: use the @@SplitQueryByDateTime operator

@@SplitQueryByDateTime instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
  SELECT
   -- Original query..
  WHERE <column_to_split_by> BETWEEN '${from}' and '${to}'
)

DATETIME2 column: use the @@SplitQueryByDateTime2 operator

@@SplitQueryByDateTime2 instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
  SELECT
   -- Original query..
  WHERE <column_to_split_by> BETWEEN '${from}' and '${to}'
)

Gathering results:

-- Gathering results for queries without aggregations:

SELECT *
FROM
  <final_result_table>
;

-- Gathering results for queries with aggregations and/or AVERAGE statement:

- AVERAGE:

SELECT
  <column1>, [,...],
  [SUM([DISTINCT] expr) AS <sum_column>],
  [SUM(count_column) AS <sum_count_column>],
  [SUM(avg_column1) / SUM(avg_column2) AS <avg_column>]
FROM
  <final_result_table>
GROUP BY
  <column1>, <column2> [,...]
ORDER BY
  <column4>

-- Do not use a WHERE clause

Example 

Creating a Sample Table and Query 

To split your first query, create the following table and insert data into it:

CREATE TABLE MyTable (
  id INT,
  name TEXT NOT NULL,
  age INT,
  salary INT,
  quantity INT
  );

-- Inserting data into the table
INSERT INTO MyTable (id, name, age, salary, quantity)
VALUES
  (1, 'John', 25, 50000, 10),
  (2, 'Jane', 30, 60000, 20),
  (3, 'Bob', 28, 55000, 15),
  (4, 'Emily', 35, 70000, 18),
  (5, 'David', 32, 62000, 22),
  (6, 'Sarah', 27, 52000, 12),
  (7, 'Michael', 40, 75000, 17),
  (8, 'Olivia', 22, 48000, 25),
  (9, 'William', 31, 58000, 14),
  (10, 'Sophia', 29, 56000, 19),
  (11, 'Liam', 26, 51000, 13),
  (12, 'Emma', 33, 64000, 16),
  (13, 'Daniel', 24, 49000, 23),
  (14, 'Ava', 37, 69000, 21),
  (15, 'Matthew', 23, 47000, 28),
  (16, 'Ella', 34, 67000, 24),
  (17, 'James', 28, 55000, 11),
  (18, 'Grace', 39, 72000, 26),
  (19, 'Benjamin', 30, 60000, 18),
  (20, 'Chloe', 25, 50000, 14),
  (21, 'Logan', 38, 71000, 20),
  (22, 'Mia', 27, 52000, 16),
  (23, 'Christopher', 32, 62000, 22),
  (24, 'Aiden', 29, 56000, 19),
  (25, 'Lily', 36, 68000, 15),
  (26, 'Jackson', 31, 58000, 23),
  (27, 'Harper', 24, 49000, 12),
  (28, 'Ethan', 35, 70000, 17),
  (29, 'Isabella', 22, 48000, 25),
  (30, 'Carter', 37, 69000, 14),
  (31, 'Amelia', 26, 51000, 21),
  (32, 'Lucas', 33, 64000, 19),
  (33, 'Abigail', 28, 55000, 16),
  (34, 'Mason', 39, 72000, 18),
  (35, 'Evelyn', 30, 60000, 25),
  (36, 'Alexander', 23, 47000, 13),
  (37, 'Addison', 34, 67000, 22),
  (38, 'Henry', 25, 50000, 20),
  (39, 'Avery', 36, 68000, 15),
  (40, 'Sebastian', 29, 56000, 24),
  (41, 'Layla', 31, 58000, 11),
  (42, 'Wyatt', 38, 71000, 26),
  (43, 'Nora', 27, 52000, 19),
  (44, 'Grayson', 32, 62000, 17),
  (45, 'Scarlett', 24, 49000, 14),
  (46, 'Gabriel', 35, 70000, 23),
  (47, 'Hannah', 22, 48000, 16),
  (48, 'Eli', 37, 69000, 25),
  (49, 'Paisley', 28, 55000, 18),
  (50, 'Owen', 33, 64000, 12);

Next, we’ll split the following query:

SELECT
  age,
  COUNT(*) AS total_people,
  AVG(salary) AS avg_salary,
  SUM(quantity) AS total_quantity,
  SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
  SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
  MyTable
WHERE
  salary > 55000
GROUP BY
  age
ORDER BY
  age;

Splitting the Query 

Prepare an empty table mirroring the original query result set’s structure with the same DDL, using a false filter under the WHERE clause.

An empty table named FinalResult is created.

CREATE OR REPLACE TABLE FinalResult
AS
(
  SELECT
  age,
  COUNT(*) AS total_people,
  SUM(salary) AS avg_salary,
  COUNT(salary) AS avg_salary2,
  SUM(quantity) AS total_quantity,
  SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
  SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
  MyTable
WHERE
  1=0
  AND salary > 55000
GROUP BY
  age
ORDER BY
  age
  );

Set the @@setresult operator to split the original query using min and max variables.

@@ SetResult minMax
SELECT min(age) as min, max(age) as max
FROM mytable
;

Set the @@SplitQueryByNumber operator with the number of instances (splits) of your query (here based on an INTEGER column), and set the between ${from} and ${to} clause with the name of the column by which you wish to split your query (here the query is split by the age column.

@@SplitQueryByNumber instances = 4, from = minMax[0].min, to = minMax[0].max
INSERT INTO FinalResult
(
SELECT
  age,
  COUNT(*) AS total_people,
  SUM(salary) AS avg_salary,
  COUNT(salary) AS avg_salary2,
  SUM(quantity) AS total_quantity,
  SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
  SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
  MyTable
WHERE
  age between '${from}' and '${to}'
  AND salary > 55000
GROUP BY
  age
ORDER BY
  age
  );

Gather the results of your query.

SELECT * FROM FinalResult ;

If we were to split the query Create a query that gathers the results of all instances (splits) into the empty table you created in step 1.

SELECT
  age,
  SUM(total_people) AS total_people,
  SUM(avg_salary) / SUM(avg_salary2) AS avg_salary,
  SUM(total_quantity) AS total_quantity,
  SUM(high_quantity_count) AS high_quantity_count,
  SUM(total_salary_age_25_30) AS total_salary_age_25_30
FROM
  FinalResult
GROUP BY
  age
ORDER BY
  age
  ;

Arrange ALL sequential scripts on one Editor tab.
Ensure that EACH script ends with a ;.
Ensure that the Execute button is set to All so that all queries are consecutively executed.
Select the Execute button.

All scripts are executed, resulting in the splitting of the initial query and a table containing the final result set.

Best Practices 

General

When incorporating the LIMIT clause or any aggregate function in your query, split the query based only on a GROUP BY column. If no relevant columns are present in the GROUP BY clause, the query might not be suitable for splitting.
If you are not using aggregations, it’s best to split the query using a column that appears in the a WHERE or JOIN clause.
When using the JOIN key, it is usually better to use the key of the smaller table.

Choosing a Column to Split by

The column you split by must be sorted or mostly sorted. Meaning, that even if the column values may not be perfectly ordered, they still follow a general sequence or trend.

Aggregation Best Practices

Aggregation functions, or special functions need to have adjustments in the query that gathers the results of all instances (splits) into the empty table:

COUNT becomes SUM
The following statement and functions are split into two columns in the query split and then merged to be executed as one statement or function in the final query:

AVERAGE

User defined functions

Variance functions

Standard deviation functions

Date as Number best practices

When date is stored as number, using the number of workers as the instances number may not result in the expected way. e.g. if date run from 20210101 to 20210630 splitting to 8 will result in 6 relevant splits, as SQream only checks min and max and splits accordingly (20210630-20210101)/8. we get an instance of empty data with dates ranging from 20210432 to 20210499 (not really dates, but real numbers). In this case, we need to adjust the number of instance to get the right size splits. In the above example we need to split to 64, and each worker will run 3 splits with actual data.

Known Issue

When running queries with the -c flag in jdbc-console, adding a space between -c and the query string causes the command to fail.

Example – works without space

java -jar jdbc-console-3.5.jar --user sqream --pass sqream \
--port 5001 -d master --host=192.168.4.122 \
-c"@@SetResult minMax SELECT 1 as min, 2 as max;"

Example – fails with space

java -jar jdbc-console-3.5.jar --user sqream --pass sqream \
--port 5001 -d master --host=192.168.4.122 \
-c "@@SetResult minMax SELECT 1 as min, 2 as max;"

Usage Notes & Limitations 

Limitation of split query instances to 1000.
The number of splits should not exceed the number of available workers. If you split a query to more instances than the number of distinct values in the split column, you will have idle splits.
Stopping a splitted query can leave the database in an inconsistent state. Since each split is a separate, parallel process, halting the main command may not stop all of them immediately. This can result in a partial data load, where some data is successfully inserted while other parts are not. Consequently, a user would need to manually clean up the incomplete data or re-run the entire operation from scratch, which adds significant overhead and complexity to the process.
For documenting the usage of date and datetime types, it is essential to always enclose the from and to values in single quotes, such as ‘${from}’ and ‘${to}’. This is a crucial best practice as it ensures the database correctly interprets the string values as dates or datetimes, preventing syntax errors and guaranteeing the queries execute successfully. The single quotes are a requirement for these specific data types and should always be included in the examples and instructions provided to users.

Query Split

Syntax

Example

Creating a Sample Table and Query

Splitting the Query

Best Practices