Best Practices

Foreign Data Wrapper Best Practice

A recommended approach when working with Foreign Tables and Foreign Data Wrapper (FDW) is storing files belonging to distinct file families and files with similar schemas in separate folders.

Best Practices for CSV

Text files, such as CSV, rarely conform to RFC 4180 , so you may need to make the following modifications:

  • Use OFFSET 2 for files containing header rows.

  • You can capture failed rows in a log file for later analysis, or skip them. See Unsupported ASCII Field Delimiters for information on skipping rejected rows.

  • You can modify record delimiters (new lines) using the RECORD DELIMITER syntax.

  • If the date formats deviate from ISO 8601, refer to the Supported Date Formats section for overriding the default parsing.

  • (Optional) You can quote fields in a CSV using double-quotes (").

Note

You must quote any field containing a new line or another double-quote character.

  • If a field is quoted, you must double quote any double quote, similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?". For more information, see string literals quoting rules.

  • Field delimiters do not have to be a displayable ASCII character. For all supported field delimiters, see Field Delimiters.

Best Practices for Parquet

The following list shows the best practices when ingesting data from Parquet files:

  • You must load Parquet files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.

  • Parquet files support predicate pushdown. When a query is issued over Parquet files, BLUE uses row-group metadata to determine which row-groups in a file must be read for a particular query and the row indexes can narrow the search to a particular set of rows.

Supported Types and Behavior Notes

Unlike the ORC format, the column types should match the data types exactly, as shown in the table below:

BLUE type →

Parquet source

BOOL

TINYINT

SMALLINT

INT

BIGINT

REAL

DOUBLE

TEXT

DATE

DATETIME

BOOLEAN

Supported

INT16

Supported

INT32

Supported

INT64

Supported

FLOAT

Supported

DOUBLE

Supported

BYTE_ARRAY [1]

Supported

INT96 [2]

Supported [3]

If a Parquet file has an unsupported type, such as enum, uuid, time, json, bson, lists, maps, but the table does not reference this data (i.e., the data does not appear in the SELECT query), the statement will succeed. If the table does reference a column, an error will be displayed explaining that the type is not supported, but the column may be omitted.

Best Practices for ORC

The following list shows the best practices when ingesting data from ORC files:

  • You must load ORC files through Foreign Tables. Note that the destination table structure must be identical to the number of columns between the source files.

  • ORC files support predicate pushdown. When a query is issued over ORC files, BLUE uses ORC metadata to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows.

Type Support and Behavior Notes

You must load ORC files through foreign table. Note that the destination table structure must be identical to the number of columns between the source files.

For more information, see Foreign Tables.

The types should match to some extent within the same “class”, as shown in the following table:

BLUE Type →

ORC Source

BOOL

TINYINT

SMALLINT

INT

BIGINT

REAL

DOUBLE

TEXT

DATE

DATETIME

boolean

Supported

Supported [4]

Supported [4]

Supported [4]

Supported [4]

tinyint

[5]

Supported

Supported

Supported

Supported

smallint

[5]

[6]

Supported

Supported

Supported

int

[5]

[6]

[6]

Supported

Supported

bigint

[5]

[6]

[6]

[6]

Supported

float

Supported

Supported

double

Supported

Supported

string / char / varchar

Supported

date

Supported

Supported

timestamp, timestamp with timezone

Supported

  • If an ORC file has an unsupported type like binary, list, map, and union, but the data is not referenced in the table (it does not appear in the SELECT query), the statement will succeed. If the column is referenced, an error will be thrown to the user, explaining that the type is not supported, but the column may be ommited.

Further Reading and Migration Guides

For more information, see the following:

Footnotes

Unsupported Data Types

BLUE does not support certain features that are supported by other databases, such as BLOB, ENUM, and SET. You must convert these data types before loading them. For example, you can store ENUM as TEXT.