Inserting data

This guide covers inserting data into SQream DB, with subguides on inserting data from a variety of sources and locations.

Data loading overview

SQream DB supports importing data from the following sources:

SQream DB supports loading files in the following formats:

  • Text - CSV, TSV, PSV
  • Parquet
  • ORC

Data loading considerations

Verify data and performance after load

Like other RDBMSs, SQream DB has its own set of best practices for table design and query optimization.

SQream therefore recommends verifying, after each load, that:

  • The data is as you expect it (e.g. row counts, data types, formatting, content)
  • The performance of your queries is adequate
  • Best practices were followed for table design
  • Applications such as Tableau and others have been tested and work
  • Data types were not over-provisioned (e.g. don’t use VARCHAR(2000) to store a short string)

File source location for loading

A COPY FROM statement can run on any worker. If you are running multiple nodes, make sure that every node has the same view of the source files. Loading from a local file that exists on only one node and not on shared storage will fail whenever the statement happens to run on a different node. (If needed, you can control which node a statement runs on using the workload manager.)

Supported load methods

SQream DB’s COPY FROM syntax can be used to load CSV files, but not Parquet or ORC files.

EXTERNAL TABLE can be used to load text files, Parquet, and ORC files, and can also transform the data prior to materialization as a full table.

| Method / File type | Text (CSV) | Parquet | ORC | Streaming data |
|--------------------|------------|---------|-----|----------------|
| COPY FROM          | ✓          |         |     |                |
| External tables    | ✓          | ✓       | ✓   |                |
| INSERT             |            |         |     | ✓ (Python, JDBC, Node.JS) |
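As a sketch of the two main load paths, the following shows a direct CSV load with COPY FROM and a Parquet load through an external table. The table names, column definitions, and file paths are hypothetical, and the exact option spelling may vary between SQream DB versions:

```sql
-- Load a CSV file directly with COPY FROM (skipping a header row):
COPY nba FROM '/home/sqream/nba.csv' WITH OFFSET 2;

-- Load a Parquet file by declaring an external table over it,
-- then materializing it as a regular table:
CREATE EXTERNAL TABLE ext_nba (
    name VARCHAR(40),
    age  INT
) USING FORMAT PARQUET
  WITH PATH '/home/sqream/nba.parquet';

CREATE TABLE nba2 AS SELECT * FROM ext_nba;
```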

Unsupported data types

SQream DB doesn’t support some data types found in other database systems, such as ARRAY, BLOB, ENUM, and SET.

These data types must be converted before loading. For example, an ENUM column can often be stored as a VARCHAR.
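For instance, a source column that was an ENUM ('low', 'medium', 'high') in another system can be declared as text in the destination table. The table and column names below are hypothetical:

```sql
CREATE TABLE alerts (
    id       INT NOT NULL,
    severity VARCHAR(10)  -- stores the former ENUM values as plain text
);
```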

Extended error handling

While external tables can be used to load CSVs, the COPY FROM statement provides more fine-grained error handling options, as well as extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.
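A sketch of COPY FROM with error handling enabled; the option names follow the SQream DB COPY FROM reference but may differ between versions, and the table and paths are hypothetical:

```sql
-- Keep loading past bad rows, writing rejects to a log file
-- for later analysis:
COPY t FROM '/data/input.csv'
  WITH CONTINUE_ON_ERROR TRUE
       ERROR_LOG '/data/rejected_rows.log';
```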

Best practices for CSV

Text files like CSV rarely conform to RFC 4180, so alterations may be required:

  • Use OFFSET 2 for files containing header rows

  • Failed rows can be captured in a log file for later analysis, or simply skipped. See Capturing rejected rows for information on skipping rejected rows.

  • Record delimiters (new lines) can be modified with the RECORD DELIMITER syntax.

  • If the date formats differ from ISO 8601, refer to the Supported date formats section to see how to override default parsing.

  • Fields in a CSV can be optionally quoted with double-quotes ("). However, any field containing a newline or another double-quote character must be quoted.

    If a field is quoted, any double quote that appears in it must be doubled (similar to the string literal quoting rules). For example, to encode What are "birds"?, the field should appear as "What are ""birds""?".

  • Field delimiters don’t have to be a printable ASCII character. See Supported field delimiters for all options.
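The options above can be combined in a single statement. This is a sketch with a hypothetical table and file; the option spelling (in particular the PARSERS clause for date formats) may vary by SQream DB version:

```sql
COPY sales FROM '/data/sales.psv'
  WITH OFFSET 2                  -- skip the header row
       DELIMITER '|'             -- pipe-separated fields
       RECORD DELIMITER '\r\n'   -- Windows-style line endings
       PARSERS 'sale_date=dmy';  -- non-ISO dates like 31/12/2019
```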

Best practices for Parquet

  • Parquet files are loaded through external tables. The destination table must match the source files in the number of columns.
  • Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream DB uses row-group metadata to determine which row groups in a file need to be read for a particular query, and row indexes can narrow the search to a particular set of rows.
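Because the external table is just a view over the files, the materializing SELECT can also transform the data during load. The names, paths, and filter below are hypothetical:

```sql
-- Declare an external table over a set of Parquet files:
CREATE EXTERNAL TABLE ext_events (
    event_id   BIGINT,
    event_name VARCHAR(100),
    ts         DATETIME
) USING FORMAT PARQUET
  WITH PATH '/data/events/*.parquet';

-- Transform during materialization, e.g. normalize and filter:
CREATE TABLE events AS
SELECT event_id, UPPER(event_name) AS event_name, ts
FROM ext_events
WHERE ts >= '2020-01-01';
```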

Type support and behavior notes

  • Unlike ORC, Parquet column types must match the destination column types exactly (see table below).

SQream DB type → (columns), Parquet source ↓ (rows):

| Parquet source | BOOL | TINYINT | SMALLINT | INT | BIGINT | REAL | DOUBLE | Text [1] | DATE | DATETIME |
|----------------|------|---------|----------|-----|--------|------|--------|----------|------|----------|
| BOOLEAN        | ✓    |         |          |     |        |      |        |          |      |          |
| INT16          |      |         | ✓        |     |        |      |        |          |      |          |
| INT32          |      |         |          | ✓   |        |      |        |          |      |          |
| INT64          |      |         |          |     | ✓      |      |        |          |      |          |
| FLOAT          |      |         |          |     |        | ✓    |        |          |      |          |
| DOUBLE         |      |         |          |     |        |      | ✓      |          |      |          |
| BYTE_ARRAY [2] |      |         |          |     |        |      |        | ✓        |      |          |
| INT96 [3]      |      |         |          |     |        |      |        |          |      | ✓ [4]    |
  • If a Parquet file contains an unsupported type such as enum, uuid, time, json, bson, lists, or maps, but that column is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error is returned explaining that the type is not supported and that the column may be omitted.
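In practice, this means an unsupported column can be declared with a placeholder type and simply left out of the materializing SELECT. A sketch with hypothetical names and paths:

```sql
-- 'tags' is a Parquet LIST column we cannot load; declare it
-- to keep the column count matching, but never reference it:
CREATE EXTERNAL TABLE ext_products (
    id   INT,
    name VARCHAR(50),
    tags VARCHAR(10)   -- placeholder for the unsupported column
) USING FORMAT PARQUET
  WITH PATH '/data/products.parquet';

-- Succeeds, because 'tags' is omitted from the SELECT list:
CREATE TABLE products AS
SELECT id, name FROM ext_products;
```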

Best practices for ORC

  • ORC files are loaded through external tables. The destination table must match the source files in the number of columns.
  • ORC files support predicate pushdown. When a query is issued over ORC files, SQream DB uses ORC metadata to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows.
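The ORC load path mirrors the Parquet one: declare an external table over the files, then materialize it. Names and paths below are hypothetical, and the syntax may vary by SQream DB version:

```sql
CREATE EXTERNAL TABLE ext_logs (
    log_id  BIGINT,
    message VARCHAR(200)
) USING FORMAT ORC
  WITH PATH '/data/logs/*.orc';

CREATE TABLE logs AS SELECT * FROM ext_logs;
```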

Type support and behavior notes

  • The types should match within the same type “class” (see table below).

SQream DB type → (columns), ORC source ↓ (rows):

| ORC source                         | BOOL  | TINYINT | SMALLINT | INT   | BIGINT | REAL | DOUBLE | Text [1] | DATE | DATETIME |
|------------------------------------|-------|---------|----------|-------|--------|------|--------|----------|------|----------|
| boolean                            | ✓     | ✓ [5]   | ✓ [5]    | ✓ [5] | ✓ [5]  |      |        |          |      |          |
| tinyint                            | ✓ [6] | ✓       | ✓        | ✓     | ✓      |      |        |          |      |          |
| smallint                           | ✓ [6] | ✓ [7]   | ✓        | ✓     | ✓      |      |        |          |      |          |
| int                                | ✓ [6] | ✓ [7]   | ✓ [7]    | ✓     | ✓      |      |        |          |      |          |
| bigint                             | ✓ [6] | ✓ [7]   | ✓ [7]    | ✓ [7] | ✓      |      |        |          |      |          |
| float                              |       |         |          |       |        | ✓    | ✓      |          |      |          |
| double                             |       |         |          |       |        | ✓    | ✓      |          |      |          |
| string / char / varchar            |       |         |          |       |        |      |        | ✓        |      |          |
| date                               |       |         |          |       |        |      |        |          | ✓    | ✓        |
| timestamp, timestamp with timezone |       |         |          |       |        |      |        |          |      | ✓        |
  • If an ORC file contains an unsupported type such as binary, list, map, or union, but that column is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error is returned explaining that the type is not supported and that the column may be omitted.

Further reading and migration guides

See also:

Footnotes

[1] Text values include TEXT, VARCHAR, and NVARCHAR
[2] With UTF8 annotation
[3] With TIMESTAMP_NANOS or TIMESTAMP_MILLIS annotation
[4] Any microseconds will be rounded down to milliseconds.
[5] Boolean values are cast to 0, 1
[6] Will succeed if all values are 0, 1
[7] Will succeed if all values fit the destination type