Inserting data¶
This guide covers inserting data into SQream DB, with subguides on inserting data from a variety of sources and locations.
In this topic:
Data loading overview¶
SQream DB supports importing data from the following sources:
Using INSERT with a client driver
Using COPY FROM:
Using Foreign Tables:
SQream DB supports loading files in the following formats:
- Text - CSV, TSV, PSV
- Parquet
- ORC
Data loading considerations¶
Verify data and performance after load¶
Like other RDBMSs, SQream DB has its own set of best practcies for table design and query optimization.
SQream therefore recommends:
- Verify that the data is as you expect it (e.g. row counts, data types, formatting, content)
- The performance of your queries is adequate
- Best practices were followed for table design
- Applications such as Tableau and others have been tested, and work
- Data types were not over-provisioned (e.g. don’t use VARCHAR(2000) to store a short string)
File source location for loading¶
During loading using COPY FROM, the statement can run on any worker. If you are running multiple nodes, make sure that all nodes can see the source the same. If you load from a local file which is only on 1 node and not on shared storage, it will fail some of the time. (If you need to, you can also control which node a statement runs on using the Workload manager).
Supported load methods¶
SQream DB’s COPY FROM syntax can be used to load CSV files, but can’t be used for Parquet and ORC.
FOREIGN TABLE can be used to load text files, Parquet, and ORC files, and can also transform the data prior to materialization as a full table.
Method / File type | Text (CSV) | Parquet | ORC | Streaming data |
---|---|---|---|---|
COPY FROM | ✓ | ✗ | ✗ | ✗ |
Foreign Tables | ✓ | ✓ | ✓ | ✗ |
INSERT | ✗ | ✗ | ✗ | ✓ (Python, JDBC, Node.JS) |
Unsupported data types¶
SQream DB doesn’t support the entire set of features that some other database systems may have, such as ARRAY
, BLOB
, ENUM
, SET
, etc.
These data types will have to be converted before load. For example, ENUM
can often be stored as a VARCHAR
.
Extended error handling¶
While external tables can be used to load CSVs, the COPY FROM
statement provides more fine-grained error handling options, as well as extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.
Best practices for CSV¶
Text files like CSV rarely conform to RFC 4180 , so alterations may be required:
Use
OFFSET 2
for files containing header rowsFailed rows can be captured in a log file for later analysis, or just to skip them. See Capturing rejected rows for information on skipping rejected rows.
Record delimiters (new lines) can be modified with the RECORD DELIMITER syntax.
If the date formats differ from ISO 8601, refer to the Supported date formats section to see how to override default parsing.
Fields in a CSV can be optionally quoted with double-quotes (
"
). However, any field containing a newline or another double-quote character must be quoted.If a field is quoted, any double quote that appears must be double-quoted (similar to the string literals quoting rules. For example, to encode
What are "birds"?
, the field should appear as"What are ""birds""?"
.Field delimiters don’t have a to be a displayable ASCII character. See Supported field delimiters for all options.
Best practices for Parquet¶
- Parquet files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
- Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream DB uses row-group metadata to determine which row-groups in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of rows.
Type support and behavior notes¶
- Unlike ORC, the column types should match the data types exactly (see table below).
SQream DB type → Parquet source |
BOOL |
TINYINT |
SMALLINT |
INT |
BIGINT |
REAL |
DOUBLE |
Text [1] | DATE |
DATETIME |
---|---|---|---|---|---|---|---|---|---|---|
BOOLEAN |
✓ | |||||||||
INT16 |
✓ | |||||||||
INT32 |
✓ | |||||||||
INT64 |
✓ | |||||||||
FLOAT |
✓ | |||||||||
DOUBLE |
✓ | |||||||||
BYTE_ARRAY [2] |
✓ | |||||||||
INT96 [3] |
✓ [4] |
- If a Parquet file has an unsupported type like
enum
,uuid
,time
,json
,bson
,lists
,maps
, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.
Best practices for ORC¶
- ORC files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
- ORC files support predicate pushdown. When a query is issued over ORC files, SQream DB uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.
Type support and behavior notes¶
- ORC files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
- The types should match to some extent within the same “class” (see table below).
SQream DB type → ORC source |
BOOL |
TINYINT |
SMALLINT |
INT |
BIGINT |
REAL |
DOUBLE |
Text [1] | DATE |
DATETIME |
---|---|---|---|---|---|---|---|---|---|---|
boolean |
✓ | ✓ [5] | ✓ [5] | ✓ [5] | ✓ [5] | |||||
tinyint |
○ [6] | ✓ | ✓ | ✓ | ✓ | |||||
smallint |
○ [6] | ○ [7] | ✓ | ✓ | ✓ | |||||
int |
○ [6] | ○ [7] | ○ [7] | ✓ | ✓ | |||||
bigint |
○ [6] | ○ [7] | ○ [7] | ○ [7] | ✓ | |||||
float |
✓ | ✓ | ||||||||
double |
✓ | ✓ | ||||||||
string / char / varchar |
✓ | |||||||||
date |
✓ | ✓ | ||||||||
timestamp , timestamp with timezone |
✓ |
- If an ORC file has an unsupported type like
binary
,list
,map
, andunion
, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.
Further reading and migration guides¶
Data loading guides
Migration guides
See also:
Footnotes
[1] | (1, 2) Text values include TEXT , VARCHAR , and NVARCHAR |
[2] | With UTF8 annotation |
[3] | With TIMESTAMP_NANOS or TIMESTAMP_MILLIS annotation |
[4] | Any microseconds will be rounded down to milliseconds. |
[5] | (1, 2, 3, 4) Boolean values are cast to 0, 1 |
[6] | (1, 2, 3, 4) Will succeed if all values are 0, 1 |
[7] | (1, 2, 3, 4, 5, 6) Will succeed if all values fit the destination type |