COPY FROM

COPY ... FROM is a statement that allows loading data from files on the filesystem and importing them into SQreamDB tables. This is the recommended way for bulk loading CSV files into SQreamDB. In general, COPY moves data between filesystem files and SQreamDB tables.

Note

Permissions

The role must have the INSERT permission to the destination table.

Syntax

 COPY [schema name.]table_name [ (<column_name>) [, ...] ]
   FROM WRAPPER fdw_name
   OPTIONS
   (
     [ copy_from_option [, ...] ]
   )


 schema_name ::= identifer

 table_name ::= identifier

 copy_from_option ::=

    LOCATION = { filename | S3 URI | HDFS URI }

    | QUOTE = {'C' | E'\ooo')

    | OFFSET = { offset }

    | LIMIT = { limit }

    | DELIMITER = '{ delimiter }'

    | RECORD_DELIMITER = '{ record delimiter }'

    | ERROR_LOG = '{ local filepath }'

    | REJECTED_DATA = '{ local filepath }'

    | CONTINUE_ON_ERROR = { true | false }

    | ERROR_COUNT = '{ error count }'

    | DATETIME_FORMAT = '{ parser format }'

    | AWS_ID = '{ AWS ID }'

    | AWS_SECRET = '{ AWS Secret }'

offset ::= positive integer

limit ::= positive integer

delimiter ::= string

record delimiter ::= string

error count ::= integer

parser_format ::= see supported parser table below

AWS ID ::= string

AWS Secret ::= string

Note

Some options are applicable to CSVs only.

These include: OFFSET, LIMIT, DELIMITER, RECORD_DELIMITER, REJECTED_DATA, DATETIME_FORMAT

Elements

Parameter

Default value

Value range

Description

[schema_name.]table_name

None

Table to copy data into

QUOTE

"

Specifies an alternative quote character. The quote character must be a single, 1-byte printable ASCII character, and the equivalent octal syntax of the copy command can be used. The quote character cannot be contained in the field delimiter, the record delimiter, or the null marker. QUOTE can be used with csv_fdw in COPY FROM and foreign tables. The following characters cannot be an alternative quote character: "-.:\\0123456789abcdefghijklmnopqrstuvwxyzN"

fdw_name

csv_fdw, orc_fdw, parquet_fdw, json_fdw, or avro_fdw

The name of the Foreign Data Wrapper to use

LOCATION

None

A path on the local filesystem, S3, or HDFS URI. For example, /tmp/foo.csv, s3://my-bucket/foo.csv, or hdfs://my-namenode:8020/foo.csv. The local path must be an absolute path that SQream DB can access. Wildcards are premitted in this field.

OFFSET

1

>1, but no more than the number of lines in the first file

The row number to start with. The first row is 1.

LIMIT

unlimited

1 to 2147483647.

When specified, tells SQream DB to stop loading after the specified number of rows. Unlimited if unset.

DELIMITER

','

Almost any ASCII character, See field delimiters section below

Specifies the field terminator - the character (or characters) that separates fields or columns within each row of the file.

RECORD_DELIMITER

\n (UNIX style newline)

\n, \r\n, \r

Specifies the row terminator - the character that separates lines or rows, also known as a new line separator.

ERROR_LOG

No error log

When used, the COPY process will write error information from unparsable rows to the file specified by this parameter. ERROR_LOG requires CONTINUE_ON_ERROR to be set to true

  • If an existing file path is specified, it will be overwritten.

  • Specifying the same file for ERROR_LOG and REJECTED_DATA is not allowed and will result in error.

  • Specifing an error log when creating a foreign table will write a new error log for every query on the foreign table.

REJECTED_DATA

Inactive

When used, the COPY process will write the rejected record lines to this file.

  • If an existing file path is specified, it will be overwritten.

  • Specifying the same file for ERROR_LOG and REJECTED_DATA is not allowed and will result in error.

  • Specifing an error log when creating a foreign table will write a new error log for every query on the foreign table.

CONTINUE_ON_ERROR

false

true, false

Specifies if errors should be ignored or skipped. When set to true, the transaction will continue despite rejected data.

This parameter should be set together with ERROR_COUNT When reading multiple files, if an entire file can’t be opened it will be skipped.

ERROR_COUNT

unlimited

1 to 2147483647

Specifies the threshold for the maximum number of faulty records that will be ignored.

This setting must be used in conjunction with CONTINUE_ON_ERROR.

DATETIME_FORMAT

ISO8601 for all columns

See table below

Allows specifying a non-default date formats for specific columns

AWS_ID, AWS_SECRET

None

Specifies the authentication details for secured S3 buckets

Supported Date Formats

Supported date parsers

Name

Pattern

Examples

ISO8601, DEFAULT

YYYY-MM-DD [hh:mm:ss[.SSS]]

2017-12-31 11:12:13.456, 2018-11-02 11:05:00, 2019-04-04

ISO8601C

YYYY-MM-DD [hh:mm:ss[:SSS]]

2017-12-31 11:12:13:456

DMY

DD/MM/YYYY [hh:mm:ss[.SSS]]

31/12/2017 11:12:13.123

YMD

YYYY/MM/DD [hh:mm:ss[.SSS]]

2017/12/31 11:12:13.678

MDY

MM/DD/YYYY [hh:mm:ss[.SSS]]

12/31/2017 11:12:13.456

YYYYMMDD

YYYYMMDD[hh[mm[ss[SSS]]]]

20171231111213456

YYYY-M-D

YYYY-M-D[ h:m[:s[.S]]]

2017-9-10 10:7:21.1 (optional leading zeroes)

YYYY/M/D

YYYY/M/D[ h:m[:s[.S]]]

2017/9/10 10:7:21.1 (optional leading zeroes)

DD-mon-YYYY

DD-mon-YYYY[ hh:mm[:ss[.SSS]]]

31-Dec-2017 11:12:13.456

YYYY-mon-DD

YYYY-mon-DD[ hh:mm[:ss[.SSS]]]

2017-Dec-31 11:12:13.456

Pattern

Description

YYYY

four digit year representation (0000-9999)

MM

two digit month representation (01-12)

DD

two digit day of month representation (01-31)

m

short month representation (Jan-Dec)

a

short day of week representation (Sun-Sat).

hh

two digit 24 hour representation (00-23)

h

two digit 12 hour representation (00-12)

P

uppercase AM/PM representation

mm

two digit minute representation (00-59)

ss

two digit seconds representation (00-59)

SSS

3 digits fraction representation for milliseconds (000-999)

Note

These date patterns are not the same as date parts used in the DATEPART function.

Supported Field Delimiters

Field delimiters can be one or more characters.

Customizing Quotations Using Alternative Characters

Syntax:

COPY t FROM wrapper csv_fdw OPTIONS (location = '/tmp/source_file.csv', quote='@');
COPY t TO wrapper csv_fdw OPTIONS (location = '/tmp/destination_file.csv', quote='@');

Example:

The following is an example of line taken from a CSV when customizing quotations using a character:

Pepsi-"Cola",@Coca-"Cola"@,Sprite,Fanta

Customizing Quotations Using ASCII Character Codes

Syntax:

copy t from wrapper csv_fdw options (location = '/tmp/source_file.csv', quote=E'\064');
copy t to wrapper csv_fdw options (location = '/tmp/destination_file.csv', quote=E'\064');

Example:

The following is an example of line taken from a CSV when customizing quotations using an ASCII character code:

Pepsi-"Cola",@Coca-"Cola"@,Sprite,Fanta

Multi-Character Delimiters

SQreamDB supports multi-character field delimiters, sometimes found in non-standard files.

A multi-character delimiter can be specified. For example, DELIMITER '%%', DELIMITER '{~}', etc.

Printable Characters

All printable ASCII character (except for N) can be used as a delimiter without special syntax. The default CSV field delimiter is a comma (,).

The following table shows the supported printable ASCII characters:

Character

Description

ASCII

Octal

Hex

Binary

HTML Code

HTML Name

(Space)

Space

32

40

20

100000

&#32;

!

Exclamation Mark

33

41

21

100001

&#33;

&excl;

#

Hash or Number

35

43

23

100011

&#35;

&num;

$

Dollar Sign

36

44

24

100100

&#36;

&dollar;

%

Percentage

37

45

25

100101

&#37;

&percnt;

&

Ampersand

38

46

26

100110

&#38;

&amp;

(

Left Parenthesis

40

50

28

101000

&#40;

&lpar;

)

Right Parenthesis

41

51

29

101001

&#41;

&rpar;

*

Asterisk

42

52

2A

101010

&#42;

&ast;

+

Plus Sign

43

53

2B

101011

&#43;

&plus;

,

Comma

44

54

2C

101100

&#44;

&comma;

/

Slash

47

57

2F

101111

&#47;

&sol;

;

Semicolon

59

73

3B

111011

&#59;

&semi;

<

Less Than

60

74

3C

111100

&#60;

&lt;

=

Equals Sign

61

75

3D

111101

&#61;

&equals;

>

Greater Than

62

76

3E

111110

&#62;

&gt;

?

Question Mark

63

77

3F

111111

&#63;

&quest;

@

At Sign

64

100

40

1000000

&#64;

&commat;

[

Left Square Bracket

91

133

5B

1011011

&#91;

&lsqb;

\

Backslash

92

134

5C

1011100

&#92;

&bsol;

]

Right Square Bracket

93

135

5D

1011101

&#93;

&rsqb;

^

Caret or Circumflex

94

136

5E

1011110

&#94;

&hat;

_

Underscore

95

137

5F

1011111

&#95;

&lowbar;

`

Grave Accent

96

140

60

1100000

`

&grave;

{

Left Curly Bracket

123

173

7B

1111011

&#123;

&lcub;

|

Vertical Bar

124

174

7C

1111100

&#124;

&verbar;

}

Right Curly Bracket

125

175

7D

1111101

&#125;

&rcub;

~

Tilde

126

176

7E

1111110

&#126;

&tilde;

58

:

Colon

72

3A

111010

&#58;

&colon;

65

A

A

101

41

1000001

&#65;

66

B

B

102

42

1000010

&#66;

67

C

C

103

43

1000011

&#67;

68

D

D

104

44

1000100

&#68;

69

E

E

105

45

1000101

&#69;

70

F

F

106

46

1000110

&#70;

71

G

G

107

47

1000111

&#71;

72

H

H

110

48

1001000

&#72;

73

I

I

111

49

1001001

&#73;

74

J

J

112

4A

1001010

&#74;

75

K

K

113

4B

1001011

&#75;

76

L

L

114

4C

1001100

&#76;

77

M

M

115

4D

1001101

&#77;

79

O

O

117

4F

1001111

&#79;

80

P

P

120

50

1010000

&#80;

81

Q

Q

121

51

1010001

&#81;

82

R

R

122

52

1010010

&#82;

83

S

S

123

53

1010011

&#83;

84

T

T

124

54

1010100

&#84;

85

U

U

125

55

1010101

&#85;

86

V

V

126

56

1010110

&#86;

87

W

W

127

57

1010111

&#87;

88

X

X

130

58

1011000

&#88;

89

Y

Y

131

59

1011001

&#89;

90

Z

Z

132

5A

1011010

&#90;

92

\

Backslash

134

5C

01011100

&#92;

97

a

a

141

61

1100001

&#97;

98

b

b

142

62

1100010

&#98;

99

c

c

143

63

1100011

&#99;

100

d

d

144

64

1100100

&#100;

101

e

e

145

65

1100101

&#101;

102

f

f

146

66

1100110

&#102;

103

g

g

147

67

1100111

&#103;

104

h

h

150

68

1101000

&#104;

105

i

i

151

69

1101001

&#105;

106

j

j

152

6A

1101010

&#106;

107

k

k

153

6B

1101011

&#107;

108

l

l

154

6C

1101100

&#108;

109

m

m

155

6D

1101101

&#109;

110

n

n

156

6E

1101110

&#110;

111

o

o

157

6F

1101111

&#111;

112

p

p

160

70

1110000

&#112;

113

q

q

161

71

1110001

&#113;

114

r

r

162

72

1110010

&#114;

115

s

s

163

73

1110011

&#115;

116

t

t

164

74

1110100

&#116;

117

u

u

165

75

1110101

&#117;

118

v

v

166

76

1110110

&#118;

119

w

w

167

77

1110111

&#119;

120

x

x

170

78

1111000

&#120;

121

y

y

171

79

1111001

&#121;

122

z

z

172

7A

1111010

&#122;

Non-Printable Characters

A tab can be specified by escaping it, for example \t. Other non-printable characters can be specified using their octal representations, by using the E'\000' format, where 000 is the octal value of the character.

For example, ASCII character 15, known as “shift in”, can be specified using E'\017'.

The following table shows the supported non-printable ASCII characters:

Character

Description

Octal

ASCII

Hex

Binary

HTML Code

HTML Name

NUL

Null

0

0

0

0

&#0;

SOH

Start of Heading

1

1

1

1

&#1;

STX

Start of Text

2

2

2

10

&#2;

ETX

End of Text

3

3

3

11

&#3;

EOT

End of Transmission

4

4

4

100

&#4;

ENQ

Enquiry

5

5

5

101

&#5;

ACK

Acknowledge

6

6

6

110

&#6;

BEL

Bell

7

7

7

111

&#7;

BS

Backspace

10

8

8

1000

&#8;

HT

Horizontal Tab

11

9

9

1001

&#9;

VT

Vertical Tab

13

11

0B

1011

&#11;

FF

NP Form Feed, New Page

14

12

0C

1100

&#12;

SO

Shift Out

16

14

0E

1110

&#14;

SI

Shift In

17

15

0F

1111

&#15;

DLE

Data Link Escape

20

16

10

10000

&#16;

DC1

Device Control 1

21

17

11

10001

&#17;

DC2

Device Control 2

22

18

12

10010

&#18;

DC3

Device Control 3

23

19

13

10011

&#19;

DC4

Device Control 4

24

20

14

10100

&#20;

NAK

Negative Acknowledge

25

21

15

10101

&#21;

SYN

Synchronous Idle

26

22

16

10110

&#22;

ETB

End of Transmission Block

27

23

17

10111

&#23;

CAN

Cancel

30

24

18

11000

&#24;

EM

End of Medium

31

25

19

11001

&#25;

SUB

Substitute

32

26

1A

11010

&#26;

ESC

Escape

33

27

1B

11011

&#27;

FS

File Separator

34

28

1C

11100

&#28;

GS

Group Separator

35

29

1D

11101

&#29;

RS

Record Separator

36

30

1E

11110

&#30;

US

Unit Separator

37

31

1F

11111

&#31;

DEL

Delete

177

127

7F

1111111

&#127;

Unsupported Field Delimiters

The following ASCII field delimiters (octal range 001 - 176) are not supported:

The following table shows the unsupported ASCII field delimiters:

ASCII

Character

Description

Octal

Hex

Binary

HTML Code

HTML Name

10

LF

NL Line Feed, New Line

12

0A

1010

&#10;

13

CR

Carriage Return

15

0D

1101

&#13;

34

Double Quote

42

22

100010

"

&quot;

45

-

Minus Sign

55

2D

101101

-

&minus;

46

.

Period

56

2E

101110

.

&period;

48

0

Zero

60

30

110000

&#48;

49

1

Number One

61

31

110001

&#49;

50

2

Number Two

62

32

110010

&#50;

51

3

Number Three

63

33

110011

&#51;

52

4

Number Four

64

34

110100

&#52;

53

5

Number Five

65

35

110101

&#53;

54

6

Number Six

66

36

110110

&#54;

55

7

Number Seven

67

37

110111

&#55;

56

8

Number Eight

70

38

111000

&#56;

57

9

Number Nine

71

39

111001

&#57;

58

:

Colon

72

3A

111010

&#58;

&colon;

92

\

Backslash

134

5C

01011100

&#92;

78

N

N

116

4E

1001110

&#78;

Capturing Rejected Rows

Prior to the column process and storage, the COPY command parses the data. Whenever the data can’t be parsed because it is improperly formatted or doesn’t match the data structure, the entire record (or row) will be rejected.

When ERROR_LOG is not used, the COPY command will stop and roll back the transaction upon the first error.

../../../../_images/copy_from_rejected_rows.png

CSV Support

By default, SQream DB’s CSV parser can handle RFC 4180 standard CSVs , but can also be modified to support non-standard CSVs (with multi-character delimiters, unquoted fields, etc).

All CSV files should be prepared according to these recommendations:

  • Files are UTF-8 or ASCII encoded

  • Field delimiter is an ASCII character or characters

  • Record delimiter, also known as a new line separator, is a Unix-style newline (\n), DOS-style newline (\r\n), or Mac style newline (\r).

  • Fields are optionally enclosed by double-quotes, or mandatory quoted if they contain one of the following characters:

    • The record delimiter or field delimiter

    • A double quote character

    • A newline

  • If a field is quoted, any double quote that appears must be double-quoted (similar to the string literals quoting rules. For example, to encode What are "birds"?, the field should appear as "What are ""birds""?".

    Other modes of escaping are not supported (e.g. 1,"What are \"birds\"?" is not a valid way of escaping CSV values).

Marking Null Markers

NULL values can be marked in two ways in the CSV:

  • An explicit null marker. For example, col1,\N,col3

  • An empty field delimited by the field delimiter. For example, col1,,col3

Note

If a text field is quoted but contains no content ("") it is considered an empty text field. It is not considered NULL.

Examples

Loading a Standard CSV File

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.csv');

Skipping Faulty Rows

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.csv', continue_on_error = true);

Skipping a Maximum of 100 Faulty Rows

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.csv', continue_on_error = true, error_count = 100);

Loading a Pipe Separated Value (PSV) File

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.psv', delimiter = '|');

Loading a Tab Separated Value (TSV) File

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.tsv', delimiter = '\t');

Loading an ORC File

COPY table_name FROM WRAPPER orc_fdw OPTIONS (location = '/tmp/file.orc');

Loading a Parquet File

COPY table_name FROM WRAPPER parquet_fdw OPTIONS (location = '/tmp/file.parquet');

Loading a JSON File

COPY t FROM WRAPPER json_fdw OPTIONS (location = 'somefile.json');

Loading an AVRO File

COPY t FROM WRAPPER fdw_name OPTIONS ([ copy_from_option [, ...] ]);

Loading a Text File with Non-Printable Delimiters

In the file below, the separator is DC1, which is represented by ASCII 17 decimal or 021 octal.

COPY table_name FROM WRAPPER psv_fdw OPTIONS (location = '/tmp/file.txt', delimiter = E'\021');

Loading a Text File with Multi-Character Delimiters

In the file below, the separator is ^|.

COPY table_name FROM WRAPPER psv_fdw OPTIONS (location = '/tmp/file.txt', delimiter = '^|');

In the file below, the separator is '|. The quote character has to be repeated, as per the literal quoting rules.

COPY table_name FROM WRAPPER psv_fdw OPTIONS (location = '/tmp/file.txt', delimiter = ''''|');

Loading Files with a Header Row

Use OFFSET to skip rows.

Note

When loading multiple files (e.g. with wildcards), this setting affects each file separately.

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.psv', delimiter = '|', offset = 2);

Loading Files Formatted for Windows (\r\n)

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/file.psv', delimiter = '\r\n');

Loading a File from a Public S3 Bucket

Note

The bucket must be publicly available and objects can be listed

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = 's3://sqream-demo-data/file.csv', delimiter = '\r\n', offset = 2);

Loading a File From a Google Cloud Platform Bucket

To access a Google Cloud Platform (GCP) Bucket it is required that your environment be authorized.

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = 'gs://<gcs path>/<gcs_bucket>/*');

Loading a File From Azure

To access Azure it is required that your environment be authorized.

COPY table_name FROM WRAPPER csv_fdw OPTIONS(location = 'azure://sqreamrole.core.windows.net/sqream-demo-data/file.csv');

Loading Files from an Authenticated S3 Bucket

COPY table_name FROM WRAPPER psv_fdw OPTIONS (location = 's3://secret-bucket/*.csv', DELIMITER = '\r\n', OFFSET = 2, AWS_ID = '12345678', AWS_SECRET = 'super_secretive_secret');

Saving Rejected Rows to a File

Note

When loading multiple files (e.g. with wildcards), this error threshold is for the entire transaction.

COPY table_name FROM WRAPPER csv_fdw
                OPTIONS
                (
                location = '/tmp/file.csv'
                ,continue_on_error  = true
                ,error_log  = '/temp/load_error.log'
                );
COPY table_name FROM WRAPPER csv_fdw
                OPTIONS
                (
                location = '/tmp/file.psv'
                ,delimiter '|'
                ,error_log = '/temp/load_error.log' -- Save error log
                ,rejected_data = '/temp/load_rejected.log' -- Only save rejected rows
                ,limit = 100 -- Only load 100 rows
                ,error_count = 5 -- Stop the load if 5 errors reached
                );

Loading CSV Files from a Set of Directories

COPY table_name FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/2019_08_*/*.csv');

Rearranging Destination Columns

When the source of the files does not match the table structure, tell the COPY command what the order of columns should be

COPY table_name (fifth, first, third) FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/*.csv');

Note

Any column not specified will revert to its default value or NULL value if nullable

Loading Non-Standard Dates

If your files contain dates in a format other than ISO8601, you can specify a parsing format to convert them during the import process. This ensures the dates are stored internally as ISO8601 within the database.

In this example, date_col1 and date_col2 in the table are non-standard. date_col3 is mentioned explicitly, but can be left out. Any column that is not specified is assumed to be ISO8601.

COPY my_table (date_col1, date_col2, date_col3) FROM WRAPPER csv_fdw OPTIONS (location = '/tmp/my_data.csv', offset = 2, datetime_format 'DMY');

Loading Specific Columns

Loading specific columns using the COPY FROM command:

  • Does not support CSV files

  • Requires that the target table columns be nullable

COPY
  new_nba (name, salary)
FROM
WRAPPER
  parquet_fdw
OPTIONS
  (
   LOCATION = '/tmp/nba.parquet'
  );