Overview
The Ingesting Data Overview page provides basic information useful when ingesting data into SQream from a variety of sources and locations, and describes the following:
Getting Started
SQream supports ingesting data using the following methods:
- Executing the INSERT statement using a client driver.
- Executing the COPY FROM statement or ingesting data from foreign tables, using any of the following sources:
  - Local filesystem and locally mounted network filesystems
  - The Amazon S3 object storage service
  - An HDFS data storage system
SQream supports loading files from the following formats:
- Text - CSV, TSV, and PSV
- Parquet
- ORC
- Avro
- JSON
For more information, see the following:
- Using the INSERT statement - INSERT
- Using client drivers - Client drivers
- Using the COPY FROM statement - COPY FROM
- Using the Amazon S3 object storage service - Amazon Web Services
- Using the HDFS data storage system - HDFS Environment
- Loading data from foreign tables - Foreign Tables
Data Loading Considerations
The Data Loading Considerations section describes the following:
Verifying Data and Performance after Loading
Like many RDBMSs, SQream recommends its own set of best practices for table design and query optimization. When using SQream, verify the following:
- That your data is structured as you expect (row counts, data types, formatting, content).
- That your query performance is adequate.
- That you followed the table design best practices (Optimization and Best Practices).
- That you’ve tested and verified that your applications work.
- That your data types have not been over-provisioned.
File Source Location when Loading
When loading data, the COPY FROM statement can run on any worker. If you are running multiple nodes, verify that all nodes have the same view of the source files. Loading data from a local file that exists on only one node, rather than on shared storage, may fail. If required, you can also control which node a statement runs on using the Workload Manager.
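As an illustrative sketch (the table and paths here are hypothetical, and COPY FROM option syntax varies by SQream version), loading from storage that every worker can see might look like this:

```sql
-- Source file on shared storage (e.g., an NFS mount visible to
-- every worker node), so the load succeeds no matter which
-- worker the statement is scheduled on:
COPY events FROM '/mnt/shared_nfs/loads/events.csv';

-- By contrast, a node-local path such as '/home/etl/events.csv'
-- may fail if the statement lands on a worker that cannot see it.
```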
Supported Load Methods
You can use the COPY FROM statement to load CSV files.
Note
The COPY FROM statement cannot be used for loading data from Parquet and ORC files.
You can use foreign tables to load text files, Parquet, and ORC files, and to transform your data before generating a full table, as described in the following table:
| Method/File Type | Text (CSV) | Parquet | ORC | Streaming Data |
|---|---|---|---|---|
| COPY FROM | Supported | Not supported | Not supported | Not supported |
| Foreign tables | Supported | Supported | Supported | Not supported |
| INSERT | Not supported | Not supported | Not supported | Supported (Python, JDBC, Node.JS) |
Unsupported Data Types
SQream does not support certain data types that are supported by other databases, such as ARRAY, BLOB, ENUM, and SET. You must convert these data types before loading them. For example, you can store ENUM values as TEXT.
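For example, a column declared as an ENUM in the source system can simply be declared as TEXT in SQream (hypothetical table; the value conversion itself happens in your export/ETL step):

```sql
-- Source (e.g., MySQL) definition:
--   status ENUM('new', 'active', 'closed')
-- SQream equivalent, storing the enum labels as TEXT:
CREATE TABLE tickets (
  id     BIGINT NOT NULL,
  status TEXT   NOT NULL   -- holds 'new', 'active', or 'closed'
);
```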
Handling Extended Errors
While you can use foreign tables to load CSVs, the COPY FROM statement provides more fine-grained error handling options and extended support for non-standard CSVs with multi-character delimiters, alternate timestamp formats, and more.
For more information, see foreign tables.
Foreign Data Wrapper Best Practice
A recommended approach when working with Foreign Tables and Foreign Data Wrapper (FDW) is storing files belonging to distinct file families and files with similar schemas in separate folders.
Best Practices for CSV
Text files, such as CSV, rarely conform to RFC 4180, so you may need to make the following modifications:
- Use OFFSET 2 for files containing header rows.
- You can capture failed rows in a log file for later analysis, or skip them. See Unsupported Field Delimiters for information on skipping rejected rows.
- You can modify record delimiters (new lines) using the RECORD DELIMITER syntax.
- If the date formats deviate from ISO 8601, refer to the Supported Date Formats section for overriding the default parsing.
- (Optional) You can quote fields in a CSV using double-quotes (").
- Field delimiters do not have to be a displayable ASCII character. For all supported field delimiters, see Supported Field Delimiters.
Note
You must quote any field containing a new line or another double-quote character. If a field is quoted, any double-quote within it must itself be doubled, similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?". For more information, see string literals quoting rules.
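Putting a few of these options together, a CSV load might look like the following sketch (hypothetical table and path; exact option placement may vary by SQream version):

```sql
-- Skip the header row with OFFSET 2 and declare a Windows-style
-- record delimiter; quoted fields may contain doubled quotes,
-- e.g. "What are ""birds""?".
COPY reviews
  FROM '/mnt/shared/loads/reviews.csv'
  WITH OFFSET 2
       RECORD DELIMITER '\r\n';
```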
Best Practices for Parquet
The following list shows the best practices when ingesting data from Parquet files:
You must load Parquet files through Foreign Tables. Note that the destination table structure must match that of the source files, including the number of columns.
Parquet files support predicate pushdown. When a query is issued over Parquet files, SQream uses row-group metadata to determine which row-groups in a file must be read for a particular query and the row indexes can narrow the search to a particular set of rows.
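As a sketch of the foreign-table route (hypothetical table and location; the exact DDL, such as CREATE FOREIGN TABLE with a Parquet wrapper versus older external-table syntax, depends on your SQream version):

```sql
-- Declare a foreign table whose columns match the Parquet schema
-- exactly, then query it; row-group metadata lets SQream skip
-- row-groups whose min/max ranges cannot satisfy the filter.
CREATE FOREIGN TABLE sales (
  sale_id BIGINT,
  amount  DOUBLE
)
WRAPPER parquet_fdw
OPTIONS (LOCATION = '/mnt/shared/sales/*.parquet');

SELECT sale_id FROM sales WHERE amount > 1000;  -- benefits from predicate pushdown
```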
Supported Types and Behavior Notes
Unlike the ORC format, Parquet column types must match the SQream column types exactly (for example, a SQream BIGINT column can only be loaded from a 64-bit integer Parquet column).
If a Parquet file has an unsupported type, such as enum, uuid, time, json, bson, lists, or maps, but the table does not reference this data (i.e., the data does not appear in the SELECT query), the statement will succeed. If the table does reference such a column, an error will be displayed explaining that the type is not supported, but the column may be omitted.
Best Practices for ORC
The following list shows the best practices when ingesting data from ORC files:
You must load ORC files through Foreign Tables. Note that the destination table structure must match that of the source files, including the number of columns.
ORC files support predicate pushdown. When a query is issued over ORC files, SQream uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.
Type Support and Behavior Notes
For more information, see Foreign Tables.
The types should match to some extent within the same type "class": for example, an ORC integer column can generally be loaded into any SQream integer column of equal or greater width, while loading into a narrower integer type or into BOOL depends on the actual values fitting the destination type.
If an ORC file has an unsupported type, such as binary, list, map, or union, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be omitted.
Further Reading and Migration Guides
For more information, see the following: