Overview

The Ingesting Data Overview page provides basic information useful when ingesting data into SQream from a variety of sources and locations, and describes the following:

Getting Started

SQream supports ingesting data using the following methods:

  • Executing the INSERT statement using a client driver.

  • Executing the COPY FROM statement or ingesting data from foreign tables:

    • Local filesystem and locally mounted network filesystems

    • Ingesting Data using the Amazon S3 object storage service

    • Ingesting Data using an HDFS data storage system

SQream supports loading files from the following formats:

  • Text - CSV, TSV, and PSV

  • Parquet

  • ORC

  • Avro

  • JSON

For more information, see the following:

Data Loading Considerations

The Data Loading Considerations section describes the following:

Verifying Data and Performance after Loading

Like many RDBMSs, SQream recommends its own set of best practices for table design and query optimization. When using SQream, verify the following:

  • That your data is structured as you expect (row counts, data types, formatting, content).

  • That your query performance is adequate.

  • That you followed the table design best practices (Optimization and Best Practices).

  • That you’ve tested and verified that your applications work.

  • That your data types have not been not over-provisioned.

File Soure Location when Loading

While you are loading data, you can use the COPY FROM command to let statements run on any worker. If you are running multiple nodes, verify that all nodes can see the source the same. Loading data from a local file that is only on one node and not on shared storage may cause it to fail. If required, you can also control which node a statement runs on using the Workload Manager).

For more information, see the following:

Supported Load Methods

You can use the COPY FROM syntax to load CSV files.

Note

The COPY FROM cannot be used for loading data from Parquet and ORC files.

You can use foreign tables to load text files, Parquet, and ORC files, and to transform your data before generating a full table, as described in the following table:

Method/File Type

Text (CSV)

Parquet

ORC

Streaming Data

COPY FROM

Supported

Not supported

Not supported

Not supported

Foreign tables

Supported

Supported

Supported

Not supported

INSERT

Not supported

Not supported

Not supported

Supported (Python, JDBC, Node.JS)

For more information, see the following:

Unsupported Data Types

SQream does not support certain features that are supported by other databases, such as ARRAY, BLOB, ENUM, and SET. You must convert these data types before loading them. For example, you can store ENUM as TEXT.

Handing Extended Errors

While you can use foreign tables to load CSVs, the COPY FROM statement provides more fine-grained error handling options and extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.

For more information, see foreign tables.

Foreign Data Wrapper Best Practice

A recommended approach when working with Foreign Tables and Foreign Data Wrapper (FDW) is storing files belonging to distinct file families and files with similar schemas in separate folders.

Best Practices for CSV

Text files, such as CSV, rarely conform to RFC 4180 , so you may need to make the following modifications:

  • Use OFFSET 2 for files containing header rows.

  • You can capture failed rows in a log file for later analysis, or skip them. See Unsupported Field Delimiters for information on skipping rejected rows.

  • You can modify record delimiters (new lines) using the RECORD DELIMITER syntax.

  • If the date formats deviate from ISO 8601, refer to the Supported Date Formats section for overriding the default parsing.

  • (Optional) You can quote fields in a CSV using double-quotes (").

Note

You must quote any field containing a new line or another double-quote character.

  • If a field is quoted, you must double quote any double quote, similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?". For more information, see string literals quoting rules.

  • Field delimiters do not have to be a displayable ASCII character. For all supported field delimiters, see Supported Field Delimiters.

Best Practices for Parquet

The following list shows the best practices when ingesting data from Parquet files:

  • You must load Parquet files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.

  • Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream uses row-group metadata to determine which row-groups in a file must be read for a particular query and the row indexes can narrow the search to a particular set of rows.

Supported Types and Behavior Notes

Unlike the ORC format, the column types should match the data types exactly, as shown in the table below:

SQream DB type →

Parquet source

BOOL

TINYINT

SMALLINT

INT

BIGINT

REAL

DOUBLE

Text [#f0]_

DATE

DATETIME

BOOLEAN

Supported

INT16

Supported

INT32

Supported

INT64

Supported

FLOAT

Supported

DOUBLE

Supported

BYTE_ARRAY [1]

Supported

INT96 [2]

Supported [3]

If a Parquet file has an unsupported type, such as enum, uuid, time, json, bson, lists, maps, but the table does not reference this data (i.e., the data does not appear in the SELECT query), the statement will succeed. If the table does reference a column, an error will be displayed explaining that the type is not supported, but the column may be omitted.

Best Practices for ORC

The following list shows the best practices when ingesting data from ORC files:

  • You must load ORC files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.

  • ORC files support predicate pushdown. When a query is issued over ORC files, SQream uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.

Type Support and Behavior Notes

You must load ORC files through a foreign table. Note that the destination table structure must be identical to the number of columns between the source files.

For more information, see Foreign Tables.

The types should match to some extent within the same “class”, as shown in the following table:

SQream DB Type →

ORC Source

BOOL

TINYINT

SMALLINT

INT

BIGINT

REAL

DOUBLE

TEXT

DATE

DATETIME

boolean

Supported

Supported [4]

Supported [4]

Supported [4]

Supported [4]

tinyint

[5]

Supported

Supported

Supported

Supported

smallint

[5]

[6]

Supported

Supported

Supported

int

[5]

[6]

[6]

Supported

Supported

bigint

[5]

[6]

[6]

[6]

Supported

float

Supported

Supported

double

Supported

Supported

string / char / varchar

Supported

date

Supported

Supported

timestamp, timestamp with timezone

Supported

  • If an ORC file has an unsupported type like binary, list, map, and union, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be omitted.

Further Reading and Migration Guides

For more information, see the following:

Footnotes