Inserting Data Overview¶
The Inserting Data Overview page describes how to insert data into SQream, specifically how to insert data from a variety of sources and locations.
In this topic:
Getting Started¶
SQream supports importing data from the following sources:
Using INSERT with a client driver
Using COPY FROM:
Local filesystem and locally mounted network filesystems
Using external_tables:
Local filesystem and locally mounted network filesystems
SQream DB supports loading files in the following formats:
Text - CSV, TSV, PSV
Parquet
ORC
Data Loading Considerations¶
Verifying Data and Performance after Loading¶
Like other RDBMSs, SQream DB has its own set of best practcies for table design and query optimization.
SQream therefore recommends:
Verify that the data is as you expect it (e.g. row counts, data types, formatting, content)
The performance of your queries is adequate
Best practices were followed for table design
Applications such as Tableau and others have been tested, and work
Data types were not over-provisioned (e.g. don’t use TEXT to store a short string)
File Soure Location when Loading¶
During loading using COPY FROM, the statement can run on any worker. If you are running multiple nodes, make sure that all nodes can see the source the same. If you load from a local file which is only on 1 node and not on shared storage, it will fail some of the time. (If you need to, you can also control which node a statement runs on using the Workload Manager).
Supported load methods¶
SQream DB’s COPY FROM syntax can be used to load CSV files, but can’t be used for Parquet and ORC.
FOREIGN TABLE can be used to load text files, Parquet, and ORC files, and can also transform the data prior to materialization as a full table.
Method / File type |
Text (CSV) |
Parquet |
ORC |
Streaming data |
---|---|---|---|---|
Supported |
Not supported |
Not supported |
Not supported |
|
Supported |
Supported |
Supported |
Not supported |
|
Not supported |
Not supported |
Not supported |
Supported (Python, JDBC, Node.JS) |
Unsupported Data Types¶
SQream DB doesn’t support the entire set of features that some other database systems may have, such as ARRAY
, BLOB
, ENUM
, SET
, etc.
These data types will have to be converted before load. For example, ENUM
can often be stored as a VARCHAR
.
Handing Extended Errors¶
While foreign tables can be used to load CSVs, the COPY FROM
statement provides more fine-grained error handling options, as well as extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.
Best Practices for CSV¶
Text files like CSV rarely conform to RFC 4180 , so alterations may be required:
Use
OFFSET 2
for files containing header rowsFailed rows can be captured in a log file for later analysis, or just to skip them. See Unsupported Field Delimiters for information on skipping rejected rows.
Record delimiters (new lines) can be modified with the RECORD DELIMITER syntax.
If the date formats differ from ISO 8601, refer to the Supported Date Formats section to see how to override default parsing.
Fields in a CSV can be optionally quoted with double-quotes (
"
). However, any field containing a newline or another double-quote character must be quoted.If a field is quoted, any double quote that appears must be double-quoted (similar to the string literals quoting rules. For example, to encode
What are "birds"?
, the field should appear as"What are ""birds""?"
.Field delimiters don’t have a to be a displayable ASCII character. See Supported Field Delimiters for all options.
Best Practices for Parquet¶
Parquet files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream DB uses row-group metadata to determine which row-groups in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of rows.
Type Support and Behavior Notes¶
Unlike ORC, the column types should match the data types exactly (see table below).
SQream DB type → Parquet source |
|
|
|
|
|
|
|
Text 1 |
|
|
---|---|---|---|---|---|---|---|---|---|---|
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported |
|||||||||
|
Supported 4 |
If a Parquet file has an unsupported type like
enum
,uuid
,time
,json
,bson
,lists
,maps
, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.
Best Practices for ORC¶
ORC files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
ORC files support predicate pushdown. When a query is issued over ORC files, SQream DB uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.
Type Support and Behavior Notes¶
ORC files are loaded through Foreign Tables. The destination table structure has to match in number of columns between the source files.
The types should match to some extent within the same “class” (see table below).
SQream DB type → ORC source |
|
|
|
|
|
|
|
Text 1 |
|
|
---|---|---|---|---|---|---|---|---|---|---|
|
Supported |
Supported 5 |
Supported 5 |
Supported 5 |
Supported 5 |
|||||
|
○ 6 |
Supported |
Supported |
Supported |
Supported |
|||||
|
○ 6 |
○ 7 |
Supported |
Supported |
Supported |
|||||
|
○ 6 |
○ 7 |
○ 7 |
Supported |
Supported |
|||||
|
○ 6 |
○ 7 |
○ 7 |
○ 7 |
Supported |
|||||
|
Supported |
Supported |
||||||||
|
Supported |
Supported |
||||||||
|
Supported |
|||||||||
|
Supported |
Supported |
||||||||
|
Supported |
If an ORC file has an unsupported type like
binary
,list
,map
, andunion
, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.
Further Reading and Migration Guides¶
See also:
Footnotes
- 1(1,2)
Text values include
TEXT
,VARCHAR
, andNVARCHAR
.- 2
With UTF8 annotation.
- 3
With
TIMESTAMP_NANOS
orTIMESTAMP_MILLIS
annotation.- 4
Any microseconds will be rounded down to milliseconds.
- 5(1,2,3,4)
Boolean values are cast to 0, 1.
- 6(1,2,3,4)
Will succeed if all values are 0, 1.
- 7(1,2,3,4,5,6)
Will succeed if all values fit the destination type.