Embedded ML Models

SQream offers support for embedded and Python-based ML models. AISQream empowers users to build and train machine learning models, including Linear Regression and XGBoost, directly within the database using SQL.

Linear Regression
XGBoost

Linear Regression

Syntax

--# Create and train model
CREATE [OR REPLACE] MODEL [database.schema.]model_name
OPTIONS(model_option_list)
AS {query_statement};

model_option_list:
  MODEL_TYPE = { 'LINEAR_REG' | 'XGBOOST' }
, MODEL_PATH = 'PATH/TO/MODEL/JSON/TO/CREATE/WITH_NAME_OF_MODEL_INCLUDED.json'
[, INITIAL_ALGORITHM = { 'SVD' | 'EIGENDECOMPOSITION' | 'QR_DECOMPOSITION' } ]
[, GD_OPTIMIZER = { 'SGD' | 'ADAM' } ]
[, STANDARDIZATION = TRUE | FALSE ]
[, LEARNING_RATE = FLOAT ]
[, EPOCH_COUNT = INT ]
[, LOSS_FUNCTION = { 'Mse', 'L1', 'SmoothL1', 'Huber' } ]
[, TOLERANCE = FLOAT ]

--# Inference
SELECT model_predict(
    [database.schema.]model_name,
    feature_col1 [,feature_column2, ...])
FROM {query_statement};

--# DROP
DROP MODEL [database.schema.]model_name;

Parameters

Linear Regression Parameters
Parameter Name	Description	Default Value
MODEL_TYPE	Specify the algorithm.	Mandatory
MODEL_PATH	Specify the path to save the model to.	Mandatory
INITIAL_ALGORITHM	Algorithm to fit linear model.	SVD
GD_OPTIMIZER	Optimizer used in second phase (gradient descent).	ADAM
STANDARDIZATION	Normalize features (mean = 0, std = 1).	FALSE
LEARNING_RATE	Gradient descent step size.	0.01
EPOCH_COUNT	Number of full passes over data.	Optional
LOSS_FUNCTION	Distance between real and predicted values.	Mse
TOLERANCE	Minimal loss reduction needed to continue training.	0.01

Usage Notes & Limitations

Based on Nvidia RAPIDS Linear Regression.
Training and inference can be read directly from a table or a query expression.
At least 2 columns should be provided for training (feature column and a label).
The label column is the last column in the chunk’s input for training.
The model will be saved under database.schema hierarchy, like any other SQream object.
model_predict doesn’t work within sub-query.
Model export functionality is in development.
Up to 8K features limit is recommended in this current version.

XGBoost

Create and train model

CREATE [OR REPLACE] MODEL [database.schema.]model_name
OPTIONS(model_option_list)
AS {query_statement};

model_option_list:

MODEL_TYPE = 'XGBOOST',

MODEL_PATH = 'PATH/TO/MODEL/JSON/TO/CREATE/WITH_NAME_OF_MODEL_INCLUDED.json' -- must for xgboost

[VERBOSITY] = {silent | info | warning | debug}

[BOOSTER] = { 'GBTREE' | 'GBLINEAR' | 'DART' }

[,treeBoosterParams]

[,dartParams]

[,linearBoosterParams]

treeBoosterParams: (applicable when BOOSTER = { ‘GBTREE’ | ‘DART’ }`)

[,ETA | LEARNING_RATE] : float
[,GAMMA | MIN_SPLIT_LOSS]
[...]
-- please view the parameters table under Booster Parameters section

dartParams:

[,SAMPLE_TYPE] = { 'UNIFORM' | 'WEIGHTED' }
[...]
-- please view the parameters table under Additional parameters for Dart Booster section

linearBoosterParams:

[UPDATER] = { 'SHOTGUN' | 'COORD_DESCENT' }
[...]
-- please view the parameters table under Additional parameters for Linear Booster section

Inference

SELECT model_predict(
[database.schema.]model_name,
feature_col1 [,feature_column2, ...])
FROM {query_statement};

DROP

DROP MODEL [database.schema.]model_name;

CREATE MODEL mod
OPTIONS
( model_type = 'xgboost'
, path='/home/sqream/mod.json' --full path
, max_depth = 10, eta = 0.3, max_bin=100
, verbosity= 'SILENT', tree_method='APPROX'
) AS
SELECT * from t_train;

Parameters

XGBoost Parameters — XGBoost 2.1.3 documentation. The parameters can be divided into global parameters (appear in the syntax), general parameters (available per booster) and learning task parameters.

Global Configuration

Parameter Name	Values	Comments
verbosity	“silent, info, warning, debug”	“Optional, default is silent”

General Parameters

Parameter Name	Values	Comments
booster	“gbtree, gblinear or dart”	“gbtree and dart use tree based models while gblinear uses linear functions. Default is gbtree.”
disable_default_eval_metric	boolean	“false by default.”

Booster Parameters (applicable when booster is set to gbtree or dart)

Parameter Name	Values	Comments
“eta, learning_rate”	“float [0, 1]”	“Step size shrinkage used in update to prevent overfitting. Optional, default = 0.3.”
“gamma, min_split_loss”	“float [0, MAX_FLOAT]”	“Minimum loss reduction required to make a further partition on a leaf node of the tree. Optional, default = 0.”
max_depth	“int [0, MAX_INT]”	“Maximum depth of a tree. Optional, default = 6.”
min_child_weight	“int [0, MAX_INT]”	“Minimum sum of instance weight (hessian) needed in a child. Optional, default = 1.”
max_delta_step	“int [0, MAX_INT]”	“Maximum delta step we allow each leaf output to be. Optional, default = 0. If the value is set to 0, it means there is no constraint.”
subsample	“float (0,1]”	“Subsample ratio of the training instances. Optional, default = 1. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees.”
sampling_method	“uniform, gradient_based”	“Optional, default= uniform.”
colsample_bytree	“float (0, 1]”	“This is a family of parameters for subsampling of columns. All colsample_by* parameters have a range of (0, 1] specify the fraction of columns to be subsampled. Optional, default = 1.”
colsample_bylevel	“float (0, 1]”
colsample_bynode	“float (0, 1]”
“lambda, reg_lambda”	“float [0, MAX_FLOAT]”	“L1 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 1.”
“alpha, reg_alpha”	“float [0, MAX_FLOAT]”	“L2 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 1.”
tree_method	“auto, exact, approx, hist”	“The tree construction algorithm used in XGBoost. See description in the reference paper and Tree Methods. Optional, default = auto.”
scale_pos_weight	float	“Control the balance of positive and negative weights. Optional, default = 1.”
refresh_leaf	boolean	“This is a parameter of the refresh updater. When this flag is True, tree leafs as well as tree nodes’ stats are updated. When it is False, only node stats are updated. Optional, default = True.”
process_type	“default, update”	“A type of boosting process to run. Optional, default = default. default: The normal boosting process which creates new trees. update: Starts from an existing model and only updates its trees.”
grow_policy	“depthwise, lossguide”	“Controls a way new nodes are added to the tree. Currently supported only if tree_method is set to hist or approx. Optional, default= depthwise.”
max_leaves	int	“Maximum number of nodes to be added. Not used by exact tree method. Optional, default = 0.”
max_bin	int	“Only used if tree_method is set to hist or approx. Maximum number of discrete bins to bucket continuous features. Optional, default = 256.”
num_parallel_tree	int	“Number of parallel trees constructed during each iteration. This option is used to support boosted random forest. Optional, default = 1.”

Additional parameters for Dart Booster

Parameter Name	Values	Comments
sample_type	“uniform, weighted”	“Type of sampling algorithm. uniform: dropped trees are selected uniformly. weighted: dropped trees are selected in proportion to weight. Optional, default = uniform.”
normalize_type	“tree, forest”	“Type of normalization algorithm. Optional, default = tree.”
rate_drop	float	“Dropout rate. Optional, default = 0.0.”
one_drop	bool	“When this flag is enabled, at least one tree is always dropped during the dropout. Optional, default = false.”
skip_drop	float	“Probability of skipping the dropout procedure during a boosting iteration. Optional, default = 0.0.”

Parameters for Linear Booster

Parameter Name	Values	Comments
“lambda, reg_lambda”	“float [0, MAX_FLOAT]”	“L2 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 0.”
“alpha, reg_alpha”	“float [0, MAX_FLOAT]”	“L1 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 0.”
updater	“shotgun, coord_descent”	“Optional, default= shotgun.”
feature_selector	“cyclic, shuffle, random, greedy”	“Feature selection and ordering method. Optional, default = cyclic.”
top_k	int	“The number of top features to select in greedy and thrifty feature selector. The value of 0 means using all the features. Optional, default = 0.”

Learning Task Parameters

Parameter Name	Values	Comments
objective	“reg_squarederror, reg_squaredlogerror, reg_logistic, reg_pseudohubererror, reg_absoluteerror, reg_quantileerror, binary_logistic, binary_logitraw, binary_hinge, count_poisson, survival_cox, survival_aft, rank_ndcg, rank_map, rank_pairwise, reg_gamma, reg_tweedie”	“Optional, default = reg_squarederror.”
base_score	float	“The initial prediction score of all instances, global bias. If base_margin is supplied, base_score will not be added. Default is not passing any params.”
eval_metric	“reg_squarederror, reg_squaredlogerror, reg_logistic, reg_pseudohubererror, reg_absoluteerror, reg_quantileerror, binary_logistic, binary_logitraw, binary_hinge, count_poisson, survival_cox, survival_aft, rank_ndcg, rank_map, rank_pairwise, reg_gamma, reg_tweedie, rmse (root mean square error)”	“Optional, default according to objective: rmsle default metric of reg_squaredlogerror, mphe default metric of reg_pseudohubererror objective.”
seed		“Random number seed. Optional, default = 0.”
seed_per_iteration	boolean	“Seed PRNG deterministically via iterator number. Optional, default = false.”

Parameters for Tweedie Regression

Parameter Name	Values	Comments
tweedie_variance_power	“float (1,2)”	“Parameter that controls the variance of the Tweedie distribution range. Optional, default=1.5.”

Parameter for using Pseudo-Huber

Parameter Name	Values	Comments
huber_slope	float	“A parameter used for Pseudo-Huber loss to define the δ term. Optional, default = 1.0.”

Parameter for using Quantile Loss

Parameter Name	Values	Comments
quantile_alpha	“float [0, 1]”	“A scalar or a list of targeted quantiles. Optional, default = 0.5.”

Parameter for using AFT Survival Loss survival: AFT and Negative Log Likelihood of AFT metric

Parameter Name	Values	Comments
aft_loss_distribution	“normal, logistic or extreme”	Optional

Parameters for learning to rank

Parameter Name	Values	Comments
lambdarank_pair_method	“mean, topk”	“Optional, default = topk.”
lambdarank_num_pair_per_sample	“int [1, MAX_INT]”	“It specifies the number of pairs sampled for each document when pair method is mean. Optional.”
lambdarank_normalization		“Specify whether we need to debias input click data. Optional, default = true.”
lambdarank_bias_norm	float	“Lp normalization for position debiasing, default is L2. Only relevant when lambdarank_unbiased is set to true. Optional, default = 2.0.”
ndcg_exp_gain		“Whether we should use the exponential gain function for NDCG. Optional, default = true.”

Command Line Parameters

Parameter Name	Values	Comments
epoch_count	int	“The number of rounds for boosting. Optional, default 10.”

Usage Notes & Limitations

Based on DMLC’s XGBoost.
Training and inference can be read directly from a table or a query expression.
At least 2 columns should be provided for training (feature column and a label).
The label column is the last column in the chunk’s input for training.
The model will be saved under database.schema hierarchy, like any other SQream object.
model_predict doesn’t work within sub-query.
Model export functionality is in development.
Input feature types are Nullable Float.
Support only single label at this stage, label should appear last.
Up to 8K features limit is recommended in this current version.