Query Split
The split query operation optimizes long-running queries by executing them in parallel on different GPUs and/or Workers, reducing overall runtime. This involves breaking down a complex query into parallel executions on small data subsets. To ensure an ordered result set aligned with the original complex query, two prerequisites are essential. First, create an empty table mirroring the original result set’s structure. Second, define the @@SetResult
operator to split the query using an INTEGER
, DATE
, or DATETIME
column, as these types are compatible with the operator’s min
and max
variables.
Splitting is exclusive to the UI, utilizing Meta-scripting, a unique UI feature. Keep in mind that not all queries benefit, as this method introduces overhead runtime.
Syntax
Creating an empty table mirroring the original query result set’s structure using the same DDL:
CREATE TABLE <final_result_table>
AS
(
SELECT
-- Original query..
WHERE
<false_filter>
)
-- A false_filter example: 1=2
Defining the @@setresult
operator to split the original query using an INTEGER
, DATE
, or DATETIME
column with min
and max
variables. If the column you’re splitting by is used in a WHERE
clause in the original query, use a WHERE
clause when setting the SetResult
operator as well.
@@SetResult minMax
SELECT
MIN(<column>) AS min,
MAX(<column>) AS max
FROM
<my_table>
[WHERE
<column> BETWEEN
-- Integer Range:
1 AND 100
| -- Date Range:
'yyyy-mm-dd' AND 'yyyy-mm-dd'
| -- DateTime Range:
'yyyy-mm-dd hh:mm:ss:SSS' AND 'yyyy-mm-dd hh:mm:ss:SSS']
Defining the operator that determines the number of instances (splits) based on the data type of the column by which the query is split:
INTEGER column: use the
@@SplitQueryByNumber
operator
@@SplitQueryByNumber instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE
<column_to_split_by> BETWEEN '${from}' and '${to}'
)
DATE column: use the
@@SplitQueryByDate
operator
@@SplitQueryByDate instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE
<column_to_split_by> BETWEEN '${from}' and '${to}'
)
DATETIME column: use the
@@SplitQueryByDateTime
operator
@@SplitQueryByDateTime instances = <number of instances>, from = minMax[0].min, to = minMax[0].max
INSERT INTO <final_result_table>
(
SELECT
-- Original query..
WHERE <column_to_split_by> BETWEEN '${from}' and '${to}'
)
Gathering results:
-- Gathering results for queries without aggregations:
SELECT *
FROM
<final_result_table>
;
-- Gathering results for queries with aggregations and/or AVERAGE statement:
- AVERAGE:
SELECT
<column1>, [,...],
[SUM([DISTINCT] expr) AS <sum_column>],
[SUM(count_column) AS <sum_count_column>],
[SUM(avg_column1) / SUM(avg_column2) AS <avg_column>]
FROM
<final_result_table>
GROUP BY
<column1>, <column2> [,...]
ORDER BY
<column4>
-- Do not use a WHERE clause
Example
Creating a Sample Table and Query
To split your first query, create the following table and insert data into it:
CREATE TABLE MyTable (
id INT,
name TEXT NOT NULL,
age INT,
salary INT,
quantity INT
);
-- Inserting data into the table
INSERT INTO MyTable (id, name, age, salary, quantity)
VALUES
(1, 'John', 25, 50000, 10),
(2, 'Jane', 30, 60000, 20),
(3, 'Bob', 28, 55000, 15),
(4, 'Emily', 35, 70000, 18),
(5, 'David', 32, 62000, 22),
(6, 'Sarah', 27, 52000, 12),
(7, 'Michael', 40, 75000, 17),
(8, 'Olivia', 22, 48000, 25),
(9, 'William', 31, 58000, 14),
(10, 'Sophia', 29, 56000, 19),
(11, 'Liam', 26, 51000, 13),
(12, 'Emma', 33, 64000, 16),
(13, 'Daniel', 24, 49000, 23),
(14, 'Ava', 37, 69000, 21),
(15, 'Matthew', 23, 47000, 28),
(16, 'Ella', 34, 67000, 24),
(17, 'James', 28, 55000, 11),
(18, 'Grace', 39, 72000, 26),
(19, 'Benjamin', 30, 60000, 18),
(20, 'Chloe', 25, 50000, 14),
(21, 'Logan', 38, 71000, 20),
(22, 'Mia', 27, 52000, 16),
(23, 'Christopher', 32, 62000, 22),
(24, 'Aiden', 29, 56000, 19),
(25, 'Lily', 36, 68000, 15),
(26, 'Jackson', 31, 58000, 23),
(27, 'Harper', 24, 49000, 12),
(28, 'Ethan', 35, 70000, 17),
(29, 'Isabella', 22, 48000, 25),
(30, 'Carter', 37, 69000, 14),
(31, 'Amelia', 26, 51000, 21),
(32, 'Lucas', 33, 64000, 19),
(33, 'Abigail', 28, 55000, 16),
(34, 'Mason', 39, 72000, 18),
(35, 'Evelyn', 30, 60000, 25),
(36, 'Alexander', 23, 47000, 13),
(37, 'Addison', 34, 67000, 22),
(38, 'Henry', 25, 50000, 20),
(39, 'Avery', 36, 68000, 15),
(40, 'Sebastian', 29, 56000, 24),
(41, 'Layla', 31, 58000, 11),
(42, 'Wyatt', 38, 71000, 26),
(43, 'Nora', 27, 52000, 19),
(44, 'Grayson', 32, 62000, 17),
(45, 'Scarlett', 24, 49000, 14),
(46, 'Gabriel', 35, 70000, 23),
(47, 'Hannah', 22, 48000, 16),
(48, 'Eli', 37, 69000, 25),
(49, 'Paisley', 28, 55000, 18),
(50, 'Owen', 33, 64000, 12);
Next, we’ll split the following query:
SELECT
age,
COUNT(*) AS total_people,
AVG(salary) AS avg_salary,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
salary > 55000
GROUP BY
age
ORDER BY
age;
Splitting the Query
Prepare an empty table mirroring the original query result set’s structure with the same DDL, using a false filter under the
WHERE
clause.An empty table named
FinalResult
is created.
CREATE OR REPLACE TABLE FinalResult
AS
(
SELECT
age,
COUNT(*) AS total_people,
SUM(salary) AS avg_salary,
COUNT(salary) AS avg_salary2,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
1=0
AND salary > 55000
GROUP BY
age
ORDER BY
age
);
Set the
@@setresult
operator to split the original query usingmin
andmax
variables.
@@ SetResult minMax
SELECT min(age) as min, max(age) as max
FROM mytable
;
Set the
@@SplitQueryByNumber
operator with the number of instances (splits) of your query (here based on anINTEGER
column), and set thebetween ${from} and ${to}
clause with the name of the column by which you wish to split your query (here the query is split by theage
column.
@@SplitQueryByNumber instances = 4, from = minMax[0].min, to = minMax[0].max
INSERT INTO FinalResult
(
SELECT
age,
COUNT(*) AS total_people,
SUM(salary) AS avg_salary,
COUNT(salary) AS avg_salary2,
SUM(quantity) AS total_quantity,
SUM(CASE WHEN quantity > 20 THEN 1 ELSE 0 END) AS high_quantity_count,
SUM(CASE WHEN age BETWEEN 25 AND 30 THEN salary ELSE 0 END) AS total_salary_age_25_30
FROM
MyTable
WHERE
age between '${from}' and '${to}'
AND salary > 55000
GROUP BY
age
ORDER BY
age
);
Gather the results of your query.
SELECT * FROM FinalResult ;
If we were to split the query Create a query that gathers the results of all instances (splits) into the empty table you created in step 1.
SELECT
age,
SUM(total_people) AS total_people,
SUM(avg_salary) / SUM(avg_salary2) AS avg_salary,
SUM(total_quantity) AS total_quantity,
SUM(high_quantity_count) AS high_quantity_count,
SUM(total_salary_age_25_30) AS total_salary_age_25_30
FROM
FinalResult
GROUP BY
age
ORDER BY
age
;
Arrange ALL sequential scripts on one Editor tab.
Ensure that EACH script ends with a
;
.Ensure that the Execute button is set to All so that all queries are consecutively executed.
Select the Execute button.
All scripts are executed, resulting in the splitting of the initial query and a table containing the final result set.
Best Practices
General
When incorporating the
LIMIT
clause or any aggregate function in your query, split the query based only on aGROUP BY
column. If no relevant columns are present in theGROUP BY
clause, the query might not be suitable for splitting.If you are not using aggregations, it’s best to split the query using a column that appears in the a
WHERE
orJOIN
clause.When using the
JOIN
key, it is usually better to use the key of the smaller table.
Choosing a Column to Split by
The column you split by must be sorted or mostly sorted. Meaning, that even if the column values may not be perfectly ordered, they still follow a general sequence or trend.
Aggregation Best Practices
Aggregation functions, or special functions need to have adjustments in the query that gathers the results of all instances (splits) into the empty table:
COUNT
becomesSUM
The following statement and functions are split into two columns in the query split and then merged to be executed as one statement or function in the final query:
AVERAGE
User defined functions
Variance functions
Standard deviation functions
Date as Number best practices
When date is stored as number, using the number of workers as the instances number may not result in the expected way. e.g. if date run from 20210101 to 20210630 splitting to 8 will result in 6 relevant splits, as SQream only checks min and max and splits accordingly (20210630-20210101)/8. we get an instance of empty data with dates ranging from 20210432 to 20210499 (not really dates, but real numbers). In this case, we need to adjust the number of instance to get the right size splits. In the above example we need to split to 64, and each worker will run 3 splits with actual data.