HDFS

SQream DB has a native HDFS connector for inserting data. The hdfs:// URI specifies an external file path on a Hadoop Distributed File System.

File names may contain wildcard characters, and the files can be in CSV or a columnar format such as Parquet or ORC.

Verifying HDFS configuration

SQream DB’s built-in HDFS support relies on the host’s Hadoop HDFS configuration.

Before you can use HDFS, you should verify that all SQream DB hosts are configured correctly.

Using the built-in Hadoop libraries

SQream DB comes with Hadoop libraries built-in. In a typical SQream DB installation, you’ll find Hadoop and JDK libraries in the hdfs subdirectory of the package.

If you are using the built-in libraries, it’s important to note where they are.
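
For example, with a default installation under /opt/sqream, you can list the bundled directories directly (the exact contents may vary by version):

$ ls /opt/sqream/hdfs
hadoop  jdk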

For example, if SQream DB was installed to /opt/sqream, here’s how to set up the environment variables from the shell:

$ export JAVA_HOME=/opt/sqream/hdfs/jdk
$ export HADOOP_INSTALL=/opt/sqream/hdfs/hadoop

$ export PATH=$PATH:${HADOOP_INSTALL}/bin:${HADOOP_INSTALL}/sbin
$ export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_INSTALL}/lib/native
$ export CLASSPATH=$CLASSPATH:`${HADOOP_INSTALL}/bin/hadoop classpath --glob`
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_LIB_NATIVE_DIR

$ export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
$ export HADOOP_COMMON_HOME=$HADOOP_INSTALL
$ export HADOOP_HDFS_HOME=$HADOOP_INSTALL
$ export YARN_HOME=$HADOOP_INSTALL

$ export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
$ export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
$ export HADOOP_HOME=$HADOOP_INSTALL
$ export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=${HADOOP_COMMON_LIB_NATIVE_DIR}"
$ export ARROW_LIBHDFS_DIR=${HADOOP_COMMON_LIB_NATIVE_DIR}

You’ll find core-site.xml and other configuration files in /opt/sqream/hdfs/hadoop/etc/hadoop.

To persist these settings, place them in a ‘run commands’ file such as .bashrc. Test this by examining the output of $ echo $ARROW_LIBHDFS_DIR.
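
For example, a minimal way to persist and verify the settings (a sketch, assuming the default /opt/sqream paths used above):

$ # append the export lines above to the SQream DB host user's ~/.bashrc, then reload it
$ source ~/.bashrc
$ echo $ARROW_LIBHDFS_DIR
/opt/sqream/hdfs/hadoop/lib/native
$ hadoop version   # should print the bundled Hadoop version if PATH is set correctly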

Note

  • This process needs to be repeated on every host in the SQream DB cluster, as the operating system user that runs SQream DB (often sqream).
  • Restart SQream DB workers on the host after setting these parameters for them to take effect.

(Optional) Overriding the Hadoop environment

If you have an existing Hadoop environment set-up on the host, you can override SQream DB’s built-in Hadoop by setting the environment variables accordingly.

For example,

$ export JAVA_HOME=/usr/local/java-1.8.0/
$ export HADOOP_INSTALL=/usr/local/hadoop-3.2.1

$ export PATH=$PATH:${HADOOP_INSTALL}/bin:${HADOOP_INSTALL}/sbin
$ export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_INSTALL}/lib/native
$ export CLASSPATH=$CLASSPATH:`${HADOOP_INSTALL}/bin/hadoop classpath --glob`
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_LIB_NATIVE_DIR

$ export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
$ export HADOOP_COMMON_HOME=$HADOOP_INSTALL
$ export HADOOP_HDFS_HOME=$HADOOP_INSTALL
$ export YARN_HOME=$HADOOP_INSTALL

$ export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
$ export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
$ export HADOOP_HOME=$HADOOP_INSTALL
$ export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=${HADOOP_COMMON_LIB_NATIVE_DIR}"
$ export ARROW_LIBHDFS_DIR=${HADOOP_COMMON_LIB_NATIVE_DIR}

To persist these settings, place them in a ‘run commands’ file such as .bashrc. Test this by examining the output of $ echo $ARROW_LIBHDFS_DIR.

Note

  • This process needs to be repeated on every host in the SQream DB cluster, as the operating system user that runs SQream DB (often sqream).
  • Restart SQream DB workers on the host after setting these parameters for them to take effect.

Configuring the node

A Hadoop administrator will want to edit the configuration XMLs to allow access to your Hadoop cluster.

If using the SQream DB Hadoop libraries, modify the following files to match your cluster settings:

  • /opt/sqream/hdfs/hadoop/etc/hadoop/core-site.xml
  • /opt/sqream/hdfs/hadoop/etc/hadoop/yarn-site.xml
  • /opt/sqream/hdfs/hadoop/etc/hadoop/hdfs-site.xml

If using the system Hadoop libraries, be sure to override JAVA_HOME, CLASSPATH, HADOOP_HOME, and ARROW_LIBHDFS_DIR as described above.
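
As a quick sanity check (a sketch; the paths shown match the example above and will differ in your environment), confirm that the hadoop binary and the environment variables resolve to the installation you intend:

$ which hadoop
/usr/local/hadoop-3.2.1/bin/hadoop
$ echo $JAVA_HOME
/usr/local/java-1.8.0/
$ echo $HADOOP_HOME
/usr/local/hadoop-3.2.1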

Verifying Hadoop configuration

To test HDFS access, try accessing files using the HDFS shell:

$ hdfs dfs -ls
Found 2 items
-rw-r--r--   3 hdfs supergroup      63446 2020-02-29 16:37 MD1.csv
-rw-r--r--   3 hdfs supergroup      63906 2020-02-29 16:37 MD2.csv
$ hdfs dfs -tail MD1.csv
985,Obediah,Reith,oreithrc@time.com,Male,Colombia,859.28
986,Lennard,Hairesnape,lhairesnaperd@merriam-webster.com,Male,North Korea,687.60
987,Valaree,Pieper,vpieperre@tinyurl.com,Female,Kazakhstan,1116.23
988,Rosemaria,Legan,rleganrf@slideshare.net,Female,Indonesia,62.19
989,Rafaellle,Hartill,rhartillrg@marketwatch.com,Male,Albania,1308.17
990,Symon,Edmett,sedmettrh@tinyurl.com,Male,China,1216.97
991,Hiram,Slayton,hslaytonri@amazon.de,Male,China,510.55
992,Sylvan,Dalgliesh,sdalglieshrj@booking.com,Male,China,1503.60
993,Alys,Sedgebeer,asedgebeerrk@va.gov,Female,Moldova,1947.58
994,Ninette,Hearl,nhearlrl@sakura.ne.jp,Female,Palau,917.66
995,Tommy,Atterley,tatterleyrm@homestead.com,Female,Philippines,1660.22
996,Sean,Mully,smullyrn@rakuten.co.jp,Female,Brunei,938.04
997,Gabe,Lytell,glytellro@cnn.com,Male,China,491.12
998,Clementius,Battison,cbattisonrp@dedecms.com,Male,Norway,1781.92
999,Kyle,Vala,kvalarq@paginegialle.it,Male,France,11.26
1000,Korrie,Odd,koddrr@bigcartel.com,Female,China,471.96

If the command succeeds and the file is read correctly, your HDFS has been configured correctly and can now be used with SQream DB.

If an access error occurs, check your Hadoop configuration or contact SQream support.

Configuring HDFS for Kerberos access

This section describes how to configure SQream DB to access HDFS secured with Kerberos.

When a Hadoop cluster is Kerberized, the SQream DB user must be configured to authenticate through Kerberos.

Prerequisites

This section assumes you already have Java and Hadoop installed on your SQream DB hosts.

  • SQream DB hosts and the Kerberos server should have the same JCE (Java Cryptography Extension) policy files. If needed, you can copy the JCE files from the Kerberos server to $JAVA_HOME/jre/lib/security on each SQream DB host (see the check after this list).

  • Install the Kerberos clients

    CentOS / RHEL: $ sudo yum install krb5-libs krb5-workstation

    Ubuntu: $ sudo apt-get install krb5-user

  • Configure Hadoop as per your distribution.
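
For example, one way to check that the Kerberos client tools and the JCE policy files are in place (a sketch; the paths and JAR names assume the bundled JDK under /opt/sqream and the standard unlimited-strength policy files):

$ which kinit klist
/usr/bin/kinit
/usr/bin/klist
$ ls $JAVA_HOME/jre/lib/security/*policy*.jar
/opt/sqream/hdfs/jdk/jre/lib/security/local_policy.jar
/opt/sqream/hdfs/jdk/jre/lib/security/US_export_policy.jar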

Creating keytabs

  1. Sign into your Kerberos Key Distribution Center (KDC) as a root user

  2. Create a new principal for the SQream DB OS user (typically sqream):

    # kadmin.local -q "addprinc -randkey sqream@KRLM.PIEDPIPER.COM"
    

    Make sure to replace the realm (KRLM.PIEDPIPER.COM) with your actual Kerberos realm.

  3. Create a Kerberos service principal for each SQream DB host in the cluster.

    In this example, there are three cluster hosts:

    # kadmin.local -q "addprinc -randkey sqream/sqreamdb-01.piedpiper.com@KRLM.PIEDPIPER.COM"
    # kadmin.local -q "addprinc -randkey sqream/sqreamdb-02.piedpiper.com@KRLM.PIEDPIPER.COM"
    # kadmin.local -q "addprinc -randkey sqream/sqreamdb-03.piedpiper.com@KRLM.PIEDPIPER.COM"
    

    The format for each principal is user/host@realm, where:

    • user is the OS username
    • host is the hostname (typically the output of hostname -f)
    • realm is the Kerberos realm
  4. Generate a keytab for each principal.

    # kadmin.local -q "xst -k /etc/security/keytabs/sqreamdb-01.service.keytab sqream/sqreamdb-01 sqream/sqreamdb-01.piedpiper.com@KRLM.PIEDPIPER.COM"
    # kadmin.local -q "xst -k /etc/security/keytabs/sqreamdb-02.service.keytab sqream/sqreamdb-02 sqream/sqreamdb-02.piedpiper.com@KRLM.PIEDPIPER.COM"
    # kadmin.local -q "xst -k /etc/security/keytabs/sqreamdb-03.service.keytab sqream/sqreamdb-03 sqream/sqreamdb-03.piedpiper.com@KRLM.PIEDPIPER.COM"
    

    You can now exit kadmin.

  5. Change permissions and ownership on each keytab:

    # chown sqream:sqream /etc/security/keytabs/sqreamdb*
    # chmod 440 /etc/security/keytabs/sqreamdb*
    
  6. Copy the keytab files for each service principal to its respective SQream DB host:

    # scp /etc/security/keytabs/sqreamdb-01.service.keytab sqreamdb-01.piedpiper.com:/home/sqream/sqreamdb-01.service.keytab
    # scp /etc/security/keytabs/sqreamdb-02.service.keytab sqreamdb-02.piedpiper.com:/home/sqream/sqreamdb-02.service.keytab
    # scp /etc/security/keytabs/sqreamdb-03.service.keytab sqreamdb-03.piedpiper.com:/home/sqream/sqreamdb-03.service.keytab
    

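On each SQream DB host, you can then check its keytab by obtaining a ticket with it (shown here for the first host, using the example realm above):

$ kinit -kt /home/sqream/sqreamdb-01.service.keytab sqream/sqreamdb-01.piedpiper.com@KRLM.PIEDPIPER.COM
$ klist   # should show a ticket for the principal above
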
Configuring HDFS for Kerberos

  1. Edit the core-site.xml configuration file on each SQream DB host to enable authorization.

    For example, editing /opt/sqream/hdfs/hadoop/etc/hadoop/core-site.xml:

    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
    
  2. Edit the yarn-site.xml configuration file on each SQream DB host to set the YARN ResourceManager address and Kerberos principal.

    For example, editing /opt/sqream/hdfs/hadoop/etc/hadoop/yarn-site.xml:

    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-nn.piedpiper.com:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.principal</name>
        <value>yarn/_hostname@KRLM.PIEDPIPER.COM</value>
    </property>
    
  3. Edit the hdfs-site.xml configuration file on each SQream DB host to set the NameNode Kerberos principals (and, if required by your setup, the Kerberos keytab file location):

    For example, editing /opt/sqream/hdfs/hadoop/etc/hadoop/hdfs-site.xml on the first host (sqreamdb-01):

    <property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>sqream/sqreamdb-01.piedpiper.com@KRLM.PIEDPIPER.COM</value>
    </property>
    <property>
        <name>dfs.namenode.https.principal</name>
        <value>sqream/sqreamdb-01.piedpiper.com@KRLM.PIEDPIPER.COM</value>
    </property>
    

Testing access

To confirm that Kerberized HDFS is accessible on all SQream DB hosts, run the following command to list a directory:

$ hdfs dfs -ls hdfs://hadoop-nn.piedpiper.com:8020

Repeat the command on all hosts. If the command succeeds and you see a directory listing, Kerberized HDFS has been configured correctly and can now be used in SQream DB.
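
To run the same check on every host in one pass, a small loop over the example hosts works (a sketch; replace the host names and NameNode address with your own):

$ for h in sqreamdb-01 sqreamdb-02 sqreamdb-03; do
>   ssh sqream@${h}.piedpiper.com 'hdfs dfs -ls hdfs://hadoop-nn.piedpiper.com:8020'
> done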

If an error occurs, check your configuration or contact SQream support.

Troubleshooting HDFS access

class not found error

If you get a class not found error that looks like this:

java.lang.ClassNotFoundException: Class org.apache.hadoop.hdfs.DistributedFileSystem not found

  1. Verify that CLASSPATH and ARROW_LIBHDFS_DIR are set correctly (see the environment variable settings above).
  2. Try restarting SQream DB after setting the environment variables.
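
A quick way to confirm that the Hadoop HDFS JARs and the native libhdfs library are visible (a sketch; paths depend on your installation):

$ echo $ARROW_LIBHDFS_DIR
/opt/sqream/hdfs/hadoop/lib/native
$ ls $ARROW_LIBHDFS_DIR/libhdfs.so*
/opt/sqream/hdfs/hadoop/lib/native/libhdfs.so
$ hadoop classpath --glob | tr ':' '\n' | grep -c hadoop-hdfs

If the last command prints 0, the hadoop-hdfs JARs (which contain org.apache.hadoop.hdfs.DistributedFileSystem) are not on the CLASSPATH.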