Decision Tree

DTRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("DTRegressor")
@Description("A building stage for an Apache Spark ML Decision Tree regressor model. "
+ "This stage expects a dataset with at least two fields to train the model: One as "
+ "an array of numeric values, and, another that describes the class or label value "
+ "as numeric value.")
public class DTRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Impurity The criterion used to calculate information gain. Supported values: 'entropy' and 'gini'. Default is 'gini'.
Minimum Gain The minimum information gain for a split to be considered at a tree node. The value should be at least 0.0. Default is 0.0.
Maximum Bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth The maximum depth of the tree. Must be nonnegative. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 5.
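
The Data Split parameter above is a ratio string such as 70:30. As a rough illustration of how such a value could be turned into fractional weights (for example, the weights that Spark's Dataset.randomSplit accepts), here is a minimal sketch. The class DataSplit and the method toWeights are hypothetical names and are not taken from the plugin source; the plugin may parse the value differently.

```java
// Hypothetical helper, not part of the plugin: converts a "train:test"
// ratio string such as "70:30" into normalized weights.
final class DataSplit {

    private DataSplit() {
    }

    /** Returns {trainFraction, testFraction}; the two values sum to 1.0. */
    static double[] toWeights(String split) {
        String[] parts = split.trim().split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "Expected 'train:test', e.g. '70:30', but got: " + split);
        }
        double train = Double.parseDouble(parts[0].trim());
        double test = Double.parseDouble(parts[1].trim());
        if (train <= 0 || test <= 0) {
            throw new IllegalArgumentException(
                "Both parts of the split must be positive: " + split);
        }
        // Normalize so that "70:30" and "7:3" yield the same weights.
        double total = train + test;
        return new double[] {train / total, test / total};
    }
}
```

With Spark, the resulting array could then be passed to dataset.randomSplit(weights) to obtain the training and test sets.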

DTPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("DTPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Decision Tree classifier "
+ "or regressor model. The model type parameter determines whether this stage predicts from a "
+ "classifier or regressor model.")
public class DTPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the classifier or regressor model that is used for predictions.
Model Type The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Generalized Linear Regression

GLRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("GLRegressor")
@Description("A building stage for an Apache Spark ML Generalized Linear regressor model. This stage "
+ "expects a dataset with at least two fields to train the model: One as an array of numeric values, "
+ "and, another that describes the class or label value as numeric value.")
public class GLRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Maximum Iterations The maximum number of iterations to train the Generalized Linear Regression model. Default is 20.
Regularization Parameter The nonnegative regularization parameter. Default is 0.0.
Conversion Tolerance The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-6.
Distribution The name of the family, which describes the error distribution used in this model. Supported values are: 'gaussian', 'binomial', 'poisson' and 'gamma'. Each family supports only certain link functions, so the Distribution and Link Function values must be compatible. Default is 'gaussian'.
Link Function The name of the link function which provides the relationship between the linear predictor and the mean of the distribution function. Supported values are: 'identity', 'log', 'inverse', 'logit', 'probit', 'cloglog' and 'sqrt'. Default is 'identity'.
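
Not every link function is valid for every distribution. The sketch below encodes the family/link combinations listed in the Apache Spark ML documentation for GeneralizedLinearRegression and checks a pair before training. The class GlrLinks and its method are hypothetical illustrations; whether the plugin performs such a check itself is not stated here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the plugin: checks whether a link
// function is supported for a given distribution family.
final class GlrLinks {

    private static final Map<String, List<String>> SUPPORTED = new HashMap<>();

    static {
        // Combinations as documented for Spark ML GeneralizedLinearRegression.
        SUPPORTED.put("gaussian", Arrays.asList("identity", "log", "inverse"));
        SUPPORTED.put("binomial", Arrays.asList("logit", "probit", "cloglog"));
        SUPPORTED.put("poisson", Arrays.asList("log", "identity", "sqrt"));
        SUPPORTED.put("gamma", Arrays.asList("inverse", "identity", "log"));
    }

    private GlrLinks() {
    }

    /** Returns true if the link function may be combined with the family. */
    static boolean isSupported(String family, String link) {
        List<String> links = SUPPORTED.get(family);
        return links != null && links.contains(link);
    }
}
```

For example, the defaults 'gaussian' and 'identity' are compatible, whereas 'binomial' with 'identity' is not in the list above.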

GLPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("GLPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Generalized Linear "
+ "Regression (regressor) model.")
public class GLPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the regressor model that is used for predictions.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Gradient-Boosted Tree

GBTRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("GBTRegressor")
@Description("A building stage for an Apache Spark ML Gradient-Boosted Tree regressor model. "
+ "This stage expects a dataset with at least two fields to train the model: One as an array "
+ "of numeric values, and, another that describes the class or label value as numeric value.")
public class GBTRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Loss Type The type of the loss function the Gradient-Boosted Trees algorithm tries to minimize. Default is 'logistic'.
Minimum Gain The minimum information gain for a split to be considered at a tree node. The value should be at least 0.0. Default is 0.0.
Maximum Bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth The maximum depth of the tree. Must be nonnegative. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 5.
Maximum Iterations The maximum number of iterations to train the Gradient-Boosted Trees model. Default is 20.
Learning Rate The learning rate for shrinking the contribution of each estimator. Must be in interval (0, 1]. Default is 0.1.
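
The Maximum Depth examples above generalize: a binary tree of depth d has at most 2^d leaf nodes and at most 2^(d+1) - 1 nodes in total. The following sketch computes both bounds; TreeSize is a hypothetical helper and not part of the plugin.

```java
// Hypothetical helper, not part of the plugin: upper bounds on the size
// of a single binary decision tree for a given maximum depth.
final class TreeSize {

    private TreeSize() {
    }

    /** Maximum number of leaf nodes: 2^depth. */
    static long maxLeaves(int depth) {
        checkDepth(depth);
        return 1L << depth;
    }

    /** Maximum number of nodes (internal + leaf): 2^(depth + 1) - 1. */
    static long maxNodes(int depth) {
        checkDepth(depth);
        return (1L << (depth + 1)) - 1;
    }

    private static void checkDepth(int depth) {
        // The upper limit only guards against overflow of the long result.
        if (depth < 0 || depth > 61) {
            throw new IllegalArgumentException("Depth must be in [0, 61]: " + depth);
        }
    }
}
```

With the default depth of 5, each tree can therefore have up to 32 leaves and 63 nodes.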

GBTPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("GBTPredictor")
@Description("A prediction stage that leverages a trained Apache Spark based Gradient-Boosted Trees "
+ "classifier or regressor model. The model type parameter determines whether this stage predicts "
+ "from a classifier or regressor model.")
public class GBTPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the classifier or regressor model that is used for predictions.
Model Type The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Isotonic Regression

IsotonicRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("IsotonicRegressor")
@Description("A building stage for an Apache Spark ML Isotonic Regression (regressor) model. "
+ "This stage expects a dataset with at least two fields to train the model: One as an array "
+ "of numeric values, and, another that describes the class or label value as numeric value.")
public class IsotonicRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Isotonic Indicator This indicator determines whether the output sequence should be 'isotonic' (increasing) or 'antitonic' (decreasing). Default is 'isotonic'.
Feature Index The nonnegative index of the feature. Default is 0.
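
To make the two Isotonic Indicator settings concrete: an isotonic fit yields predictions that never decrease as the feature value grows, and an antitonic fit yields predictions that never increase. The sketch below only checks this property on an array of predictions that is assumed to be ordered by feature value. It is not the fitting algorithm, and Monotonicity is a hypothetical class name.

```java
// Hypothetical helper, not part of the plugin: checks the ordering
// property that isotonic and antitonic regression outputs satisfy.
final class Monotonicity {

    private Monotonicity() {
    }

    /** True if no value is smaller than its predecessor (non-decreasing). */
    static boolean isIsotonic(double[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i] < values[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** True if no value is larger than its predecessor (non-increasing). */
    static boolean isAntitonic(double[] values) {
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

A constant sequence satisfies both checks, since ties are allowed in either direction.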

IsotonicPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("IsotonicPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Isotonic Regression "
+ "(regressor) model.")
public class IsotonicPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the regressor model that is used for predictions.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Linear Regression

LinearRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LinearRegressor")
@Description("A building stage for an Apache Spark ML Linear Regression (regressor) model. "
+ "This stage expects a dataset with at least two fields to train the model: One as an "
+ "array of numeric values, and, another that describes the class or label value as "
+ "numeric value.")
public class LinearRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Maximum Iterations The maximum number of iterations to train the Linear Regression model. Default is 100.
ElasticNet Mixing The ElasticNet mixing parameter. For value = 0.0, the penalty is an L2 penalty. For value = 1.0, it is an L1 penalty. For 0.0 < value < 1.0, the penalty is a combination of L1 and L2. Default is 0.0.
Regularization Parameter The nonnegative regularization parameter. Default is 0.0.
Conversion Tolerance The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-6.
Solver Algorithm The solver algorithm for optimization. Supported options are 'auto', 'normal' and 'l-bfgs'. Default is 'auto'.
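
The ElasticNet Mixing and Regularization Parameter values above combine into a single penalty term. In the formulation given in the Apache Spark ML documentation, the penalty on the coefficient vector w is regParam * (mixing * ||w||_1 + (1 - mixing) / 2 * ||w||_2^2). The sketch below evaluates that expression; ElasticNetPenalty is a hypothetical class used for illustration only, and it is assumed that the plugin passes both values through to Spark unchanged.

```java
// Hypothetical helper, not part of the plugin: evaluates the elastic net
// penalty regParam * (mixing * L1 + (1 - mixing) / 2 * L2^2).
final class ElasticNetPenalty {

    private ElasticNetPenalty() {
    }

    static double penalty(double[] coefficients, double regParam, double mixing) {
        if (regParam < 0.0) {
            throw new IllegalArgumentException("regParam must be nonnegative");
        }
        if (mixing < 0.0 || mixing > 1.0) {
            throw new IllegalArgumentException("mixing must be in [0, 1]");
        }
        double l1 = 0.0;        // sum of absolute values
        double l2Squared = 0.0; // sum of squares
        for (double c : coefficients) {
            l1 += Math.abs(c);
            l2Squared += c * c;
        }
        return regParam * (mixing * l1 + (1.0 - mixing) / 2.0 * l2Squared);
    }
}
```

For coefficients (3, -4) and a regularization parameter of 0.1, a mixing value of 0.0 gives the pure L2 penalty 0.1 * 25 / 2 = 1.25, and a mixing value of 1.0 gives the pure L1 penalty 0.1 * 7 = 0.7.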

LinearPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("LinearPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Linear Regression "
+ "(regressor) model.")
public class LinearPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the regressor model that is used for predictions.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Random Forest Tree

RFRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("RFRegressor")
@Description("A building stage for an Apache Spark ML Random Forest Trees regressor model. "
+ "This stage expects a dataset with at least two fields to train the model: One as an "
+ "array of numeric values, and, another that describes the class or label value as numeric value.")
public class RFRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Model Configuration
Impurity Impurity is a criterion how to calculate information gain. Supported values: 'entropy' and 'gini'. Default is 'gini'.
Minimum Gain The minimum information gain for a split to be considered at a tree node. The value should be at least 0.0. Default is 0.0.
Maximum Bins The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth Nonnegative value that maximum depth of the tree. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 5.
Number of Trees The number of trees to train the model. Default is 20.

RFPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("RFPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Random Forest "
+ "classifier or regressor model. The model type parameter determines whether this stage "
+ "predicts from a classifier or regressor model.")
public class RFPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the classifier or regressor model that is used for predictions.
Model Type The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.

Survival Regression

SurvivalRegressor

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("SurvivalRegressor")
@Description("A building stage for an Apache Spark ML Survival (AFT) regressor model. "
+ "This stage expects a dataset with at least two fields to train the model: One as "
+ "an array of numeric values, and, another that describes the class or label value as "
+ "numeric value.")
public class SurvivalRegressor extends RegressorSink {

    ...

}

Parameters

Model Name The unique name of the regressor model.
Features Field The name of the field in the input schema that contains the feature vector.
Label Field The name of the field in the input schema that contains the label.
Data Split The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.
Censor Field The name of the field in the input schema that contains the censor value. The censor value can be 0 or 1. A value of 1 means the event has occurred (uncensored); a value of 0 means the observation is censored.
Model Configuration
Maximum Iterations The maximum number of iterations to train the Survival Regression (AFT) model. Default is 100.
Conversion Tolerance The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-6.
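
The censor convention above is easy to invert by accident, so it can help to validate the column before training. A minimal sketch under that assumption follows; Censor and eventOccurred are hypothetical names, and the plugin is not known to expose such a helper.

```java
// Hypothetical helper, not part of the plugin: interprets a censor value
// following the convention 1 = event occurred, 0 = censored.
final class Censor {

    private Censor() {
    }

    /** Returns true for 1 (uncensored), false for 0 (censored). */
    static boolean eventOccurred(double censor) {
        if (censor == 1.0) {
            return true;
        }
        if (censor == 0.0) {
            return false;
        }
        throw new IllegalArgumentException("Censor value must be 0 or 1: " + censor);
    }
}
```

Rejecting any other value early avoids training on rows whose censoring status is ambiguous.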

SurvivalPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("SurvivalPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Survival (AFT) "
+ "regressor model.")
public class SurvivalPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name The unique name of the regressor model that is used for predictions.
Features Field The name of the field in the input schema that contains the feature vector.
Prediction Field The name of the field in the output schema that contains the predicted label.