Overview

The Ingesting Data Overview page provides basic information useful when ingesting data into SQream from a variety of sources and locations, and describes the following:

Getting Started 

SQream supports ingesting data using the following methods:

Executing the INSERT statement using a client driver.
Executing the COPY FROM statement or ingesting data from foreign tables:
- Local filesystem and locally mounted network filesystems
- Ingesting Data using the Amazon S3 object storage service
- Ingesting Data using an HDFS data storage system

SQream supports loading files from the following formats:

Text - CSV, TSV, and PSV
Parquet
ORC
Avro
JSON

For more information, see the following:

Using the INSERT statement - INSERT
Using client drivers - Client drivers
Using the COPY FROM statement - COPY FROM
Using the Amazon S3 object storage service - Amazon Web Services
Using the HDFS data storage system - HDFS Environment
Loading data from foreign tables - Foreign Tables

SQream does not support certain features that are supported by other databases, such as ARRAY, BLOB, ENUM, and SET. You must convert these data types before loading them. For example, you can store ENUM as TEXT.

Handing Extended Errors 

While you can use foreign tables to load CSVs, the COPY FROM statement provides more fine-grained error handling options and extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.

For more information, see foreign tables.

Foreign Data Wrapper Best Practice 

A recommended approach when working with Foreign Tables and Foreign Data Wrapper (FDW) is storing files belonging to distinct file families and files with similar schemas in separate folders.

Best Practices for CSV

Text files, such as CSV, rarely conform to RFC 4180 , so you may need to make the following modifications:

Use OFFSET 2 for files containing header rows.
You can capture failed rows in a log file for later analysis, or skip them. See Unsupported Field Delimiters for information on skipping rejected rows.
You can modify record delimiters (new lines) using the RECORD DELIMITER syntax.
If the date formats deviate from ISO 8601, refer to the Supported Date Formats section for overriding the default parsing.
(Optional) You can quote fields in a CSV using double-quotes (").

Note

You must quote any field containing a new line or another double-quote character.

If a field is quoted, you must double quote any double quote, similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?". For more information, see string literals quoting rules.
Field delimiters do not have to be a displayable ASCII character. For all supported field delimiters, see Supported Field Delimiters.

Best Practices for Parquet

The following list shows the best practices when ingesting data from Parquet files:

You must load Parquet files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.
Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream uses row-group metadata to determine which row-groups in a file must be read for a particular query and the row indexes can narrow the search to a particular set of rows.

Supported Types and Behavior Notes

Unlike the ORC format, the column types should match the data types exactly, as shown in the table below:

SQream DB type → Parquet source	`BOOL`	`SMALLINT`	`INT`	`BIGINT`	`REAL`	`DOUBLE`	Text [#f0]_	`DATETIME`
`BOOLEAN`	Supported
`INT16`		Supported
`INT32`			Supported
`INT64`				Supported
`FLOAT`					Supported
`DOUBLE`						Supported
`BYTE_ARRAY` [1]							Supported
`INT96` [2]								Supported [3]

If a Parquet file has an unsupported type, such as enum, uuid, time, json, bson, lists, maps, but the table does not reference this data (i.e., the data does not appear in the SELECT query), the statement will succeed. If the table does reference a column, an error will be displayed explaining that the type is not supported, but the column may be omitted.

Best Practices for ORC

The following list shows the best practices when ingesting data from ORC files:

You must load ORC files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.
ORC files support predicate pushdown. When a query is issued over ORC files, SQream uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.

Type Support and Behavior Notes

You must load ORC files through a foreign table. Note that the destination table structure must be identical to the number of columns between the source files.

For more information, see Foreign Tables.

The types should match to some extent within the same “class”, as shown in the following table:

SQream DB Type → ORC Source	`BOOL`	`TINYINT`	`SMALLINT`	`INT`	`BIGINT`	`REAL`	`DOUBLE`	`TEXT`	`DATE`	`DATETIME`
`boolean`	Supported	Supported [4]	Supported [4]	Supported [4]	Supported [4]
`tinyint`	○ [5]	Supported	Supported	Supported	Supported
`smallint`	○ [5]	○ [6]	Supported	Supported	Supported
`int`	○ [5]	○ [6]	○ [6]	Supported	Supported
`bigint`	○ [5]	○ [6]	○ [6]	○ [6]	Supported
`float`						Supported	Supported
`double`						Supported	Supported
`string` / `char` / `varchar`								Supported
`date`									Supported	Supported
`timestamp`, `timestamp` with timezone										Supported

If an ORC file has an unsupported type like binary, list, map, and union, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be omitted.

Method/File Type	Text (CSV)	Parquet	ORC	Streaming Data
COPY FROM	Supported	Not supported	Not supported	Not supported
Foreign tables	Supported	Supported	Supported	Not supported
INSERT	Not supported	Not supported	Not supported	Supported (Python, JDBC, Node.JS)

Overview

Getting Started 

Data Loading Considerations 

Verifying Data and Performance after Loading 

File Soure Location when Loading 

Supported Load Methods 

Unsupported Data Types 