S3¶
SQream DB has a native S3 connector for inserting data. The s3://
URI specifies an external file path on an S3 bucket.
File names may contain wildcard characters and the files can be a CSV or columnar format like Parquet and ORC.
In this topic:
S3 configuration¶
Any SQream DB host that has access to S3 endpoints can access S3 without any configuration.
To read files from an S3 bucket, it must have listable files.
S3 URI format¶
With S3, specify a location for a file (or files) when using COPY FROM or Foreign Tables.
The general S3 syntax is:
s3://bucket_name/path
Authentication¶
SQream DB supports AWS ID
and AWS SECRET
authentication.
These should be specified when executing a statement.
Examples¶
Use an external table to stage data from S3 before loading from CSV, Parquet or ORC files.
Planning for data staging¶
For the following examples, we will want to interact with a CSV file. Here’s a peek at the table contents:
Name | Team | Number | Position | Age | Height | Weight | College | Salary |
---|---|---|---|---|---|---|---|---|
Avery Bradley | Boston Celtics | 0.0 | PG | 25.0 | 6-2 | 180.0 | Texas | 7730337.0 |
Jae Crowder | Boston Celtics | 99.0 | SF | 25.0 | 6-6 | 235.0 | Marquette | 6796117.0 |
John Holland | Boston Celtics | 30.0 | SG | 27.0 | 6-5 | 205.0 | Boston University | |
R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1148640.0 |
Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | 5000000.0 | |
Amir Johnson | Boston Celtics | 90.0 | PF | 29.0 | 6-9 | 240.0 | 12000000.0 | |
Jordan Mickey | Boston Celtics | 55.0 | PF | 21.0 | 6-8 | 235.0 | LSU | 1170960.0 |
Kelly Olynyk | Boston Celtics | 41.0 | C | 25.0 | 7-0 | 238.0 | Gonzaga | 2165160.0 |
Terry Rozier | Boston Celtics | 12.0 | PG | 22.0 | 6-2 | 190.0 | Louisville | 1824360.0 |
The file is stored on S3, at s3://sqream-demo-data/nba_players.csv
.
This bucket is public and listable.
We will make note of the file structure, to create a matching CREATE FOREIGN TABLE
statement.
Creating the external table¶
Based on the source file structure, we we create a foreign table with the appropriate structure, and point it to the file.
CREATE FOREIGN TABLE nba
(
Name varchar(40),
Team varchar(40),
Number tinyint,
Position varchar(2),
Age tinyint,
Height varchar(4),
Weight real,
College varchar(40),
Salary float
)
WRAPPER csv_fdw
OPTIONS
(
LOCATION = 's3://sqream-demo-data/nba_players.csv',
RECORD_DELIMITER '\r\n' -- DOS delimited file
)
;
The file format in this case is CSV, and it is stored as an S3 object (if the path is on HDFS, change the URI accordingly).
We also took note that the record delimiter was a DOS newline (\r\n
).
Querying external tables¶
Let’s peek at the data from the external table:
t=> SELECT * FROM nba LIMIT 10;
name | team | number | position | age | height | weight | college | salary
--------------+----------------+--------+----------+-----+--------+--------+-------------------+---------
Avery Bradley | Boston Celtics | 0 | PG | 25 | 6-2 | 180 | Texas | 7730337
Jae Crowder | Boston Celtics | 99 | SF | 25 | 6-6 | 235 | Marquette | 6796117
John Holland | Boston Celtics | 30 | SG | 27 | 6-5 | 205 | Boston University |
R.J. Hunter | Boston Celtics | 28 | SG | 22 | 6-5 | 185 | Georgia State | 1148640
Jonas Jerebko | Boston Celtics | 8 | PF | 29 | 6-10 | 231 | | 5000000
Amir Johnson | Boston Celtics | 90 | PF | 29 | 6-9 | 240 | | 12000000
Jordan Mickey | Boston Celtics | 55 | PF | 21 | 6-8 | 235 | LSU | 1170960
Kelly Olynyk | Boston Celtics | 41 | C | 25 | 7-0 | 238 | Gonzaga | 2165160
Terry Rozier | Boston Celtics | 12 | PG | 22 | 6-2 | 190 | Louisville | 1824360
Marcus Smart | Boston Celtics | 36 | PG | 22 | 6-4 | 220 | Oklahoma State | 3431040
Bulk loading a file from a public S3 bucket¶
The COPY FROM command can also be used to load data without staging it first.
Note
The bucket must be publicly available and objects can be listed
COPY nba FROM 's3://sqream-demo-data/nba.csv' WITH OFFSET 2 RECORD DELIMITER '\r\n';
Loading files from an authenticated S3 bucket¶
COPY nba FROM 's3://secret-bucket/*.csv' WITH OFFSET 2 RECORD DELIMITER '\r\n'
AWS_ID '12345678'
AWS_SECRET 'super_secretive_secret';