Embedded ML Models
SQream offers support for embedded and Python-based ML models. AISQream empowers users to build and train machine learning models, including Linear Regression and XGBoost, directly within the database using SQL.
Linear Regression
Syntax
--# Create and train model
CREATE [OR REPLACE] MODEL [database.schema.]model_name
OPTIONS(model_option_list)
AS {query_statement};
model_option_list:
MODEL_TYPE = { 'LINEAR_REG' | 'XGBOOST' }
, MODEL_PATH = 'PATH/TO/MODEL/JSON/TO/CREATE/WITH_NAME_OF_MODEL_INCLUDED.json'
[, INITIAL_ALGORITHM = { 'SVD' | 'EIGENDECOMPOSITION' | 'QR_DECOMPOSITION' } ]
[, GD_OPTIMIZER = { 'SGD' | 'ADAM' } ]
[, STANDARDIZATION = TRUE | FALSE ]
[, LEARNING_RATE = FLOAT ]
[, EPOCH_COUNT = INT ]
[, LOSS_FUNCTION = { 'Mse', 'L1', 'SmoothL1', 'Huber' } ]
[, TOLERANCE = FLOAT ]
--# Inference
SELECT model_predict(
[database.schema.]model_name,
feature_col1 [,feature_column2, ...])
FROM {query_statement};
--# DROP
DROP MODEL [database.schema.]model_name;
Parameters
Parameter Name |
Description |
Default Value |
---|---|---|
MODEL_TYPE |
Specify the algorithm. |
Mandatory |
MODEL_PATH |
Specify the path to save the model to. |
Mandatory |
INITIAL_ALGORITHM |
Algorithm to fit linear model. |
SVD |
GD_OPTIMIZER |
Optimizer used in second phase (gradient descent). |
ADAM |
STANDARDIZATION |
Normalize features (mean = 0, std = 1). |
FALSE |
LEARNING_RATE |
Gradient descent step size. |
0.01 |
EPOCH_COUNT |
Number of full passes over data. |
Optional |
LOSS_FUNCTION |
Distance between real and predicted values. |
Mse |
TOLERANCE |
Minimal loss reduction needed to continue training. |
0.01 |
Usage Notes & Limitations
Based on Nvidia RAPIDS Linear Regression.
Training and inference can be read directly from a table or a query expression.
At least 2 columns should be provided for training (feature column and a label).
The label column is the last column in the chunk’s input for training.
The model will be saved under database.schema hierarchy, like any other SQream object.
model_predict doesn’t work within sub-query.
Model export functionality is in development.
Up to 8K features limit is recommended in this current version.
XGBoost
Create and train model
CREATE [OR REPLACE] MODEL [database.schema.]model_name
OPTIONS(model_option_list)
AS {query_statement};
model_option_list
:
MODEL_TYPE = 'XGBOOST',
MODEL_PATH = 'PATH/TO/MODEL/JSON/TO/CREATE/WITH_NAME_OF_MODEL_INCLUDED.json' -- must for xgboost
[VERBOSITY] = {silent | info | warning | debug}
[BOOSTER] = { 'GBTREE' | 'GBLINEAR' | 'DART' }
[,treeBoosterParams]
[,dartParams]
[,linearBoosterParams]
treeBoosterParams
: (applicable when BOOSTER = { ‘GBTREE’ | ‘DART’ }`)
[,ETA | LEARNING_RATE] : float
[,GAMMA | MIN_SPLIT_LOSS]
[...]
-- please view the parameters table under Booster Parameters section
dartParams
:
[,SAMPLE_TYPE] = { 'UNIFORM' | 'WEIGHTED' }
[...]
-- please view the parameters table under Additional parameters for Dart Booster section
linearBoosterParams
:
[UPDATER] = { 'SHOTGUN' | 'COORD_DESCENT' }
[...]
-- please view the parameters table under Additional parameters for Linear Booster section
Inference
SELECT model_predict(
[database.schema.]model_name,
feature_col1 [,feature_column2, ...])
FROM {query_statement};
DROP
DROP MODEL [database.schema.]model_name;
CREATE MODEL mod
OPTIONS
( model_type = 'xgboost'
, path='/home/sqream/mod.json' --full path
, max_depth = 10, eta = 0.3, max_bin=100
, verbosity= 'SILENT', tree_method='APPROX'
) AS
SELECT * from t_train;
Parameters
XGBoost Parameters — XGBoost 2.1.3 documentation. The parameters can be divided into global parameters (appear in the syntax), general parameters (available per booster) and learning task parameters.
Global Configuration
Parameter Name |
Values |
Comments |
---|---|---|
verbosity |
“silent, info, warning, debug” |
“Optional, default is silent” |
General Parameters
Parameter Name |
Values |
Comments |
---|---|---|
booster |
“gbtree, gblinear or dart” |
“gbtree and dart use tree based models while gblinear uses linear functions. Default is gbtree.” |
disable_default_eval_metric |
boolean |
“false by default.” |
Booster Parameters (applicable when booster is set to gbtree or dart)
Parameter Name |
Values |
Comments |
---|---|---|
“eta, learning_rate” |
“float [0, 1]” |
“Step size shrinkage used in update to prevent overfitting. Optional, default = 0.3.” |
“gamma, min_split_loss” |
“float [0, MAX_FLOAT]” |
“Minimum loss reduction required to make a further partition on a leaf node of the tree. Optional, default = 0.” |
max_depth |
“int [0, MAX_INT]” |
“Maximum depth of a tree. Optional, default = 6.” |
min_child_weight |
“int [0, MAX_INT]” |
“Minimum sum of instance weight (hessian) needed in a child. Optional, default = 1.” |
max_delta_step |
“int [0, MAX_INT]” |
“Maximum delta step we allow each leaf output to be. Optional, default = 0. If the value is set to 0, it means there is no constraint.” |
subsample |
“float (0,1]” |
“Subsample ratio of the training instances. Optional, default = 1. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees.” |
sampling_method |
“uniform, gradient_based” |
“Optional, default= uniform.” |
colsample_bytree |
“float (0, 1]” |
“This is a family of parameters for subsampling of columns. All colsample_by* parameters have a range of (0, 1] specify the fraction of columns to be subsampled. Optional, default = 1.” |
colsample_bylevel |
“float (0, 1]” |
|
colsample_bynode |
“float (0, 1]” |
|
“lambda, reg_lambda” |
“float [0, MAX_FLOAT]” |
“L1 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 1.” |
“alpha, reg_alpha” |
“float [0, MAX_FLOAT]” |
“L2 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 1.” |
tree_method |
“auto, exact, approx, hist” |
“The tree construction algorithm used in XGBoost. See description in the reference paper and Tree Methods. Optional, default = auto.” |
scale_pos_weight |
float |
“Control the balance of positive and negative weights. Optional, default = 1.” |
refresh_leaf |
boolean |
“This is a parameter of the refresh updater. When this flag is True, tree leafs as well as tree nodes’ stats are updated. When it is False, only node stats are updated. Optional, default = True.” |
process_type |
“default, update” |
“A type of boosting process to run. Optional, default = default. default: The normal boosting process which creates new trees. update: Starts from an existing model and only updates its trees.” |
grow_policy |
“depthwise, lossguide” |
“Controls a way new nodes are added to the tree. Currently supported only if tree_method is set to hist or approx. Optional, default= depthwise.” |
max_leaves |
int |
“Maximum number of nodes to be added. Not used by exact tree method. Optional, default = 0.” |
max_bin |
int |
“Only used if tree_method is set to hist or approx. Maximum number of discrete bins to bucket continuous features. Optional, default = 256.” |
num_parallel_tree |
int |
“Number of parallel trees constructed during each iteration. This option is used to support boosted random forest. Optional, default = 1.” |
Additional parameters for Dart Booster
Parameter Name |
Values |
Comments |
---|---|---|
sample_type |
“uniform, weighted” |
“Type of sampling algorithm. uniform: dropped trees are selected uniformly. weighted: dropped trees are selected in proportion to weight. Optional, default = uniform.” |
normalize_type |
“tree, forest” |
“Type of normalization algorithm. Optional, default = tree.” |
rate_drop |
float |
“Dropout rate. Optional, default = 0.0.” |
one_drop |
bool |
“When this flag is enabled, at least one tree is always dropped during the dropout. Optional, default = false.” |
skip_drop |
float |
“Probability of skipping the dropout procedure during a boosting iteration. Optional, default = 0.0.” |
Parameters for Linear Booster
Parameter Name |
Values |
Comments |
---|---|---|
“lambda, reg_lambda” |
“float [0, MAX_FLOAT]” |
“L2 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 0.” |
“alpha, reg_alpha” |
“float [0, MAX_FLOAT]” |
“L1 regularization term on weights. Increasing this value will make the model more conservative. Optional, default = 0.” |
updater |
“shotgun, coord_descent” |
“Optional, default= shotgun.” |
feature_selector |
“cyclic, shuffle, random, greedy” |
“Feature selection and ordering method. Optional, default = cyclic.” |
top_k |
int |
“The number of top features to select in greedy and thrifty feature selector. The value of 0 means using all the features. Optional, default = 0.” |
Learning Task Parameters
Parameter Name |
Values |
Comments |
---|---|---|
objective |
“reg_squarederror, reg_squaredlogerror, reg_logistic, reg_pseudohubererror, reg_absoluteerror, reg_quantileerror, binary_logistic, binary_logitraw, binary_hinge, count_poisson, survival_cox, survival_aft, rank_ndcg, rank_map, rank_pairwise, reg_gamma, reg_tweedie” |
“Optional, default = reg_squarederror.” |
base_score |
float |
“The initial prediction score of all instances, global bias. If base_margin is supplied, base_score will not be added. Default is not passing any params.” |
eval_metric |
“reg_squarederror, reg_squaredlogerror, reg_logistic, reg_pseudohubererror, reg_absoluteerror, reg_quantileerror, binary_logistic, binary_logitraw, binary_hinge, count_poisson, survival_cox, survival_aft, rank_ndcg, rank_map, rank_pairwise, reg_gamma, reg_tweedie, rmse (root mean square error)” |
“Optional, default according to objective: rmsle default metric of reg_squaredlogerror, mphe default metric of reg_pseudohubererror objective.” |
seed |
“Random number seed. Optional, default = 0.” |
|
seed_per_iteration |
boolean |
“Seed PRNG deterministically via iterator number. Optional, default = false.” |
Parameters for Tweedie Regression
Parameter Name |
Values |
Comments |
---|---|---|
tweedie_variance_power |
“float (1,2)” |
“Parameter that controls the variance of the Tweedie distribution range. Optional, default=1.5.” |
Parameter for using Pseudo-Huber
Parameter Name |
Values |
Comments |
---|---|---|
huber_slope |
float |
“A parameter used for Pseudo-Huber loss to define the δ term. Optional, default = 1.0.” |
Parameter for using Quantile Loss
Parameter Name |
Values |
Comments |
---|---|---|
quantile_alpha |
“float [0, 1]” |
“A scalar or a list of targeted quantiles. Optional, default = 0.5.” |
Parameter for using AFT Survival Loss survival: AFT and Negative Log Likelihood of AFT metric
Parameter Name |
Values |
Comments |
---|---|---|
aft_loss_distribution |
“normal, logistic or extreme” |
Optional |
Parameters for learning to rank
Parameter Name |
Values |
Comments |
---|---|---|
lambdarank_pair_method |
“mean, topk” |
“Optional, default = topk.” |
lambdarank_num_pair_per_sample |
“int [1, MAX_INT]” |
“It specifies the number of pairs sampled for each document when pair method is mean. Optional.” |
lambdarank_normalization |
“Specify whether we need to debias input click data. Optional, default = true.” |
|
lambdarank_bias_norm |
float |
“Lp normalization for position debiasing, default is L2. Only relevant when lambdarank_unbiased is set to true. Optional, default = 2.0.” |
ndcg_exp_gain |
“Whether we should use the exponential gain function for NDCG. Optional, default = true.” |
Command Line Parameters
Parameter Name |
Values |
Comments |
---|---|---|
epoch_count |
int |
“The number of rounds for boosting. Optional, default 10.” |
Usage Notes & Limitations
Based on DMLC’s XGBoost.
Training and inference can be read directly from a table or a query expression.
At least 2 columns should be provided for training (feature column and a label).
The label column is the last column in the chunk’s input for training.
The model will be saved under database.schema hierarchy, like any other SQream object.
model_predict doesn’t work within sub-query.
Model export functionality is in development.
Input feature types are Nullable Float.
Support only single label at this stage, label should appear last.
Up to 8K features limit is recommended in this current version.