Query Split
The split query operation optimizes long-running queries by executing them in parallel on different GPUs and/or Workers, reducing overall runtime. This involves breaking down a complex query into parallel executions on small data subsets. To ensure an ordered result set aligned with the original complex query, two prerequisites are essential. First, create an empty table mirroring the original result set’s structure. Second, define the @@SetResult
operator to split the query using an INTEGER
, DATE
, or DATETIME
column, as these types are compatible with the operator’s min
and max
variables.
Splitting is exclusive to the CLI and UI, utilizing Meta-scripting, a unique CLI & UI feature. Keep in mind that not all queries benefit, as this method introduces overhead runtime.
Syntax
Creating an empty table mirroring the original query result set’s structure using the same DDL:
CREATE TABLE <final_result_table>
AS
(
SELECT
-- Original query..
WHERE
<false_filter>
)
-- A false_filter example: 1=2
Defining the @@setresult operator to split the original query using an INTEGER, BIGINT, DATE, or DATETIME column with min and max variables. If the column you’re splitting by is used in a WHERE clause in the original query, use a WHERE clause when setting the SetResult operator as well. The name you alias in the @@SetResult section is the name you reference in the @@SplitQueryBy… section. For example, aliasing as minMax allows you to reference it as minMax[0].min. The @@SetResult operator has a single-row limitation; an error is thrown for 0 or more than 1 row to prevent a huge memory buffer.
@@SetResult minMax
SELECT
MIN(<column>) AS min,
MAX(<column>) AS max
FROM
<my_table>
[WHERE
<column> BETWEEN
-- Integer Range:
1 AND 100
| -- Date Range:
'yyyy-mm-dd' AND 'yyyy-mm-dd'
| -- DateTime Range:
'yyyy-mm-dd hh:mm:ss:SSS' AND 'yyyy-mm-dd hh:mm:ss:SSS']
Defining the operator that determines the number of instances (splits) based on the data type of the column by which the query is split:
INTEGER column: use the
@@SplitQueryByNumber
operator
@@SplitQueryByNumber instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE
<column_to_split_by> BETWEEN '${from}' and '${to}'
)
DATE column: use the
@@SplitQueryByDate
operator
@@SplitQueryByDate instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE
<column_to_split_by> BETWEEN '${from}' and '${to}'
)
DATETIME column: use the
@@SplitQueryByDateTime
operator
@@SplitQueryByDateTime instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE <column_to_split_by> BETWEEN '${from}' and '${to}'
)
Gathering results:
-- Gathering results for queries without aggregations:
SELECT *
FROM
<final_result_table>
;
-- Gathering results for queries with aggregations and/or AVERAGE statement:
- AVERAGE:
SELECT
<column1>, [,...],
[SUM([DISTINCT] expr) AS <sum_column>],
[SUM(count_column) AS <sum_count_column>],
[SUM(avg_column1) / SUM(avg_column2) AS <avg_column>]
FROM
<final_result_table>
GROUP BY
<column1>, <column2> [,...]
ORDER BY
<column4>
-- Do not use a WHERE clause
Example
Creating a Sample Table and Query
To split your first query, create the following table and insert data into it:
CREATE TABLE MyTable (
id INT,
name TEXT NOT NULL,
age INT,
salary INT,
quantity INT
);
-- Inserting data into the table
INSERT INTO MyTable (id, name, age, salary, quantity)
VALUES
(1, 'John', 25, 50000, 10),
(2, 'Jane', 30, 60000, 20),
(3, 'Bob', 28, 55000, 15),
(4, 'Emily', 35, 70000, 18),
(5, 'David', 32, 62000, 22),
(6, 'Sarah', 27, 52000, 12),
(7, 'Michael', 40, 75000, 17),
(8, 'Olivia', 22, 48000, 25),
(9, 'William', 31, 58000, 14),
(10, 'Sophia', 29, 56000, 19),
(11, 'Liam', 26, 51000, 13),
(12, 'Emma', 33, 64000, 16),
(13, 'Daniel', 24, 49000, 23),
(14, 'Ava', 37, 69000, 21),
(15, 'Matthew', 23, 47000, 28),
(16, 'Ella', 34, 67000, 24),
(17, 'James', 28, 55000, 11),
(18, 'Grace', 39, 72000, 26),
(19, 'Benjamin', 30, 60000, 18),
(20, 'Chloe', 25, 50000, 14),
(21, 'Logan', 38, 71000, 20),
(22, 'Mia', 27, 52000, 16),
(23, 'Christopher', 32, 62000, 22),
(24, 'Aiden', 29, 56000, 19),
(25, 'Lily', 36, 68000, 15),
(26, 'Jackson', 31, 58000, 23),
(27, 'Harper', 24, 49000, 12),
(28, 'Ethan', 35, 70000, 17),
(29, 'Isabella', 22, 48000, 25),
(30, 'Carter', 37, 69000, 14),
(31, 'Amelia', 26, 51000, 21),
(32, 'Lucas', 33, 64000, 19),
(33, 'Abigail', 28, 55000, 16),
(34, 'Mason', 39, 72000, 18),
(35, 'Evelyn', 30, 60000, 25),
(36, 'Alexander', 23, 47000, 13),
(37, 'Addison', 34, 67000, 22),
(38, 'Henry', 25, 50000, 20),
(39, 'Avery', 36, 68000, 15),
(40, 'Sebastian', 29, 56000, 24),
(41, 'Layla', 31, 58000, 11),
(42, 'Wyatt', 38, 71000, 26),
(43, 'Nora', 27, 52000, 19),
(44, 'Grayson', 32, 62000, 17),
(45, 'Scarlett', 24, 49000, 14),
(46, 'Gabriel', 35, 70000, 23),
(47, 'Hannah', 22, 48000, 16),
(48, 'Eli', 37, 69000, 25),
(49, 'Paisley', 28, 55000, 18),
(50, 'Owen', 33, 64000, 12);
Next, we’ll split the following query:
SELECT
age,
COUNT(*) AS total_people,
AVG(salary) AS avg_salary,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
salary > 55000
GROUP BY
age
ORDER BY
age;
Splitting the Query
Prepare an empty table mirroring the original query result set’s structure with the same DDL, using a false filter under the
WHERE
clause.An empty table named
FinalResult
is created.
CREATE OR REPLACE TABLE FinalResult
AS
(
SELECT
age,
COUNT(*) AS total_people,
SUM(salary) AS avg_salary,
COUNT(salary) AS avg_salary2,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
1=0
AND salary > 55000
GROUP BY
age
ORDER BY
age
);
Set the
@@setresult
operator to split the original query usingmin
andmax
variables.
@@ SetResult minMax
SELECT min(age) as min, max(age) as max
FROM mytable
;
Set the
@@SplitQueryByNumber
operator with the number of instances (splits) of your query (here based on anINTEGER
column), and set thebetween ${from} and ${to}
clause with the name of the column by which you wish to split your query (here the query is split by theage
column.
@@SplitQueryByNumber instances = 4, from = minMax[0].min, to = minMax[0].max
INSERT INTO FinalResult
(
SELECT
age,
COUNT(*) AS total_people,
SUM(salary) AS avg_salary,
COUNT(salary) AS avg_salary2,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
age between '${from}' and '${to}'
AND salary > 55000
GROUP BY
age
ORDER BY
age
);
Gather the results of your query.
SELECT * FROM FinalResult ;
If we were to split the query Create a query that gathers the results of all instances (splits) into the empty table you created in step 1.
SELECT
age,
SUM(total_people) AS total_people,
SUM(avg_salary) / SUM(avg_salary2) AS avg_salary,
SUM(total_quantity) AS total_quantity,
SUM(high_quantity_count) AS high_quantity_count,
SUM(total_salary_age_25_30) AS total_salary_age_25_30
FROM
FinalResult
GROUP BY
age
ORDER BY
age
;
Arrange ALL sequential scripts on one Editor tab.
Ensure that EACH script ends with a
;
.Ensure that the Execute button is set to All so that all queries are consecutively executed.
Select the Execute button.
All scripts are executed, resulting in the splitting of the initial query and a table containing the final result set.
Best Practices
General
When incorporating the
LIMIT
clause or any aggregate function in your query, split the query based only on aGROUP BY
column. If no relevant columns are present in theGROUP BY
clause, the query might not be suitable for splitting.If you are not using aggregations, it’s best to split the query using a column that appears in the a
WHERE
orJOIN
clause.When using the
JOIN
key, it is usually better to use the key of the smaller table.
Choosing a Column to Split by
The column you split by must be sorted or mostly sorted. Meaning, that even if the column values may not be perfectly ordered, they still follow a general sequence or trend.
Aggregation Best Practices
Aggregation functions, or special functions need to have adjustments in the query that gathers the results of all instances (splits) into the empty table:
COUNT
becomesSUM
The following statement and functions are split into two columns in the query split and then merged to be executed as one statement or function in the final query:
AVERAGE
User defined functions
Variance functions
Standard deviation functions
Date as Number best practices
When date is stored as number, using the number of workers as the instances number may not result in the expected way. e.g. if date run from 20210101 to 20210630 splitting to 8 will result in 6 relevant splits, as SQream only checks min and max and splits accordingly (20210630-20210101)/8. we get an instance of empty data with dates ranging from 20210432 to 20210499 (not really dates, but real numbers). In this case, we need to adjust the number of instance to get the right size splits. In the above example we need to split to 64, and each worker will run 3 splits with actual data.
Known Issue
When running queries with the -c flag in jdbc-console, adding a space between -c and the query string causes the command to fail.
Example – works without space
java -jar jdbc-console-3.5.jar --user sqream --pass sqream \
--port 5001 -d master --host=192.168.4.122 \
-c"@@SetResult minMax SELECT 1 as min, 2 as max;"
Example – fails with space
java -jar jdbc-console-3.5.jar --user sqream --pass sqream \
--port 5001 -d master --host=192.168.4.122 \
-c "@@SetResult minMax SELECT 1 as min, 2 as max;"
Usage Notes & Limitations
Limitation of split query instances to 1000.
The number of splits should not exceed the number of available workers. If you split a query to more instances than the number of distinct values in the split column, you will have idle splits.
Stopping a splitted query can leave the database in an inconsistent state. Since each split is a separate, parallel process, halting the main command may not stop all of them immediately. This can result in a partial data load, where some data is successfully inserted while other parts are not. Consequently, a user would need to manually clean up the incomplete data or re-run the entire operation from scratch, which adds significant overhead and complexity to the process.
For documenting the usage of date and datetime types, it is essential to always enclose the from and to values in single quotes, such as ‘${from}’ and ‘${to}’. This is a crucial best practice as it ensures the database correctly interprets the string values as dates or datetimes, preventing syntax errors and guaranteeing the queries execute successfully. The single quotes are a requirement for these specific data types and should always be included in the examples and instructions provided to users.