Best Practices¶

Foreign Data Wrapper Best Practice¶

A recommended approach when working with Foreign Tables and Foreign Data Wrapper (FDW) is storing files belonging to distinct file families and files with similar schemas in separate folders.

Best Practices for CSV¶

Text files, such as CSV, rarely conform to RFC 4180 , so you may need to make the following modifications:

Use OFFSET 2 for files containing header rows.
You can capture failed rows in a log file for later analysis, or skip them. See Unsupported ASCII Field Delimiters for information on skipping rejected rows.
You can modify record delimiters (new lines) using the RECORD DELIMITER syntax.
If the date formats deviate from ISO 8601, refer to the Supported Date Formats section for overriding the default parsing.
(Optional) You can quote fields in a CSV using double-quotes (").

Note

You must quote any field containing a new line or another double-quote character.

If a field is quoted, you must double quote any double quote, similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?". For more information, see string literals quoting rules.
Field delimiters do not have to be a displayable ASCII character. For all supported field delimiters, see Field Delimiters.

Best Practices for Parquet¶

The following list shows the best practices when ingesting data from Parquet files:

You must load Parquet files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.
Parquet files support predicate pushdown. When a query is issued over Parquet files, BLUE uses row-group metadata to determine which row-groups in a file must be read for a particular query and the row indexes can narrow the search to a particular set of rows.

Supported Types and Behavior Notes¶

Unlike the ORC format, the column types should match the data types exactly, as shown in the table below:

BLUE type → Parquet source	`BOOL`	`SMALLINT`	`INT`	`BIGINT`	`REAL`	`DOUBLE`	`TEXT`	`DATETIME`
`BOOLEAN`	Supported
`INT16`		Supported
`INT32`			Supported
`INT64`				Supported
`FLOAT`					Supported
`DOUBLE`						Supported
`BYTE_ARRAY` [1]							Supported
`INT96` [2]								Supported [3]

If a Parquet file has an unsupported type, such as enum, uuid, time, json, bson, lists, maps, but the table does not reference this data (i.e., the data does not appear in the SELECT query), the statement will succeed. If the table does reference a column, an error will be displayed explaining that the type is not supported, but the column may be omitted.

Best Practices for ORC¶

The following list shows the best practices when ingesting data from ORC files:

You must load ORC files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.
ORC files support predicate pushdown. When a query is issued over ORC files, BLUE uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.

Type Support and Behavior Notes¶

You must load ORC files through foreign table. Note that the destination table structure must be identical to the number of columns between the source files.

For more information, see Foreign Tables.

The types should match to some extent within the same “class”, as shown in the following table:

BLUE Type → ORC Source	`BOOL`	`TINYINT`	`SMALLINT`	`INT`	`BIGINT`	`REAL`	`DOUBLE`	`TEXT`	`DATE`	`DATETIME`
`boolean`	Supported	Supported [4]	Supported [4]	Supported [4]	Supported [4]
`tinyint`	○ [5]	Supported	Supported	Supported	Supported
`smallint`	○ [5]	○ [6]	Supported	Supported	Supported
`int`	○ [5]	○ [6]	○ [6]	Supported	Supported
`bigint`	○ [5]	○ [6]	○ [6]	○ [6]	Supported
`float`						Supported	Supported
`double`						Supported	Supported
`string` / `char` / `varchar`								Supported
`date`									Supported	Supported
`timestamp`, `timestamp` with timezone										Supported

If an ORC file has an unsupported type like binary, list, map, and union, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.

Unsupported Data Types¶

BLUE does not support certain features that are supported by other databases, such as BLOB, ENUM, and SET. You must convert these data types before loading them. For example, you can store ENUM as TEXT.

Best Practices¶

Foreign Data Wrapper Best Practice¶

Best Practices for CSV¶

Best Practices for Parquet¶

Supported Types and Behavior Notes¶

Best Practices for ORC¶

Type Support and Behavior Notes¶

Further Reading and Migration Guides¶

Unsupported Data Types¶