Decision Tree

DTClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("DTClassifer")
@Description("A building stage for an Apache Spark ML Decision Tree classifier model. This stage expects "
+ "a dataset with at least two fields to train the model: One as an array of numeric values, and, "
+ "another that describes the class or label value as numeric value.")
public class DTClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Impurity: The criterion used to calculate information gain. Supported values: 'entropy' and 'gini'. Default is 'gini'.
Minimum Gain: The minimum information gain for a split to be considered at a tree node. The value must be at least 0.0. Default is 0.0.
Maximum Bins: The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth: The nonnegative maximum depth of the tree. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node and 2 leaf nodes. Default is 5.
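
For orientation, the sketch below shows how these parameters map onto the underlying Apache Spark ML DecisionTreeClassifier, using the defaults listed above. The field names 'features' and 'label' and the variables dataset and trainer are illustrative assumptions, not part of the plugin API; a Dataset<Row> named dataset is assumed to be in scope.

import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Data Split: the default 70:30 train & test split
Dataset<Row>[] splits = dataset.randomSplit(new double[] { 0.7, 0.3 });

DecisionTreeClassifier trainer = new DecisionTreeClassifier()
    .setFeaturesCol("features") // Features Field
    .setLabelCol("label")       // Label Field
    .setImpurity("gini")        // Impurity
    .setMinInfoGain(0.0)        // Minimum Gain
    .setMaxBins(32)             // Maximum Bins
    .setMaxDepth(5);            // Maximum Depth

// Train on the first split; the plugin would then persist the result under Model Name.
DecisionTreeClassificationModel model = trainer.fit(splits[0]);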

DTPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("DTPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Decision Tree classifier "
+ "or regressor model. The model type parameter determines whether this stage predicts from a classifier "
+ "or regressor model.")
public class DTPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name: The unique name of the classifier or regressor model that is used for predictions.
Model Type: The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.
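
Conceptually, a predictor stage loads the trained model and transforms the input, roughly as sketched below. The variables modelPath and input and the field names are illustrative assumptions; resolving the Model Name to a storage location is handled by the plugin. For Model Type 'regressor', the analogous DecisionTreeRegressionModel would be loaded instead.

import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Load the persisted model (modelPath is assumed to be resolved from Model Name)
DecisionTreeClassificationModel model = DecisionTreeClassificationModel.load(modelPath);

// Read the Features Field, append the Prediction Field
Dataset<Row> predicted = model
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
    .transform(input);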

Gradient-Boosted Tree

GBTClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("GBTClassifer")
@Description("A building stage for an Apache Spark ML Gradient-Boosted Tree classifier model. This stage expects "
+ "a dataset with at least two fields to train the model: One as an array of numeric values, and, "  
+ "another that describes the class or label value as numeric value.")
public class GBTClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Loss Type: The type of the loss function the Gradient-Boosted Trees algorithm tries to minimize. Default is 'logistic'.
Minimum Gain: The minimum information gain for a split to be considered at a tree node. The value must be at least 0.0. Default is 0.0.
Maximum Bins: The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth: The nonnegative maximum depth of the tree. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node and 2 leaf nodes. Default is 5.
Maximum Iterations: The maximum number of iterations to train the Gradient-Boosted Trees model. Default is 20.
Learning Rate: The learning rate for shrinking the contribution of each estimator. Must be in the interval (0, 1]. Default is 0.1.
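
As with the Decision Tree stage, the following is a hedged sketch of how these parameters map onto Spark ML's GBTClassifier; field and variable names are again illustrative. Note that Spark ML exposes the learning rate as the step size.

import org.apache.spark.ml.classification.GBTClassifier;

GBTClassifier trainer = new GBTClassifier()
    .setFeaturesCol("features") // Features Field
    .setLabelCol("label")       // Label Field
    .setLossType("logistic")    // Loss Type
    .setMinInfoGain(0.0)        // Minimum Gain
    .setMaxBins(32)             // Maximum Bins
    .setMaxDepth(5)             // Maximum Depth
    .setMaxIter(20)             // Maximum Iterations
    .setStepSize(0.1);          // Learning Rate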

GBTPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("GBTPredictor")
@Description("A prediction stage that leverages a trained Apache Spark based Gradient-Boosted Trees "
+ "classifier or regressor model. The model type parameter determines whether this stage predicts "
+ "from a classifier or regressor model.")

    ...

}

Parameters

Model Name: The unique name of the classifier or regressor model that is used for predictions.
Model Type: The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.
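
The Model Type parameter decides which Spark ML model class is loaded, as this hypothetical sketch illustrates; modelType and modelPath are assumed variables, not plugin API:

import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.regression.GBTRegressionModel;

// Model Type: 'classifier' or 'regressor'
Transformer model = modelType.equals("classifier")
    ? GBTClassificationModel.load(modelPath)
    : GBTRegressionModel.load(modelPath);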

Logistic Regression

LRClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LRClassifer")
@Description("A building stage for an Apache Spark ML Logistic Regression classifier model. This stage expects "
+ "a dataset with at least two fields to train the model: One as an array of numeric values, and, "  
+ "another that describes the class or label value as numeric value.")
public class LRClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Maximum Iterations: The maximum number of iterations to train the Logistic Regression model. Default is 20.
ElasticNet Mixing: The ElasticNet mixing parameter. For value = 0.0, the penalty is an L2 penalty. For value = 1.0, it is an L1 penalty. For 0.0 < value < 1.0, the penalty is a combination of L1 and L2. Default is 0.0.
Regularization Parameter: The nonnegative regularization parameter. Default is 0.0.
Convergence Tolerance: The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-6.
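
A sketch of the corresponding Spark ML LogisticRegression configuration with the defaults above; field and variable names are illustrative assumptions:

import org.apache.spark.ml.classification.LogisticRegression;

LogisticRegression trainer = new LogisticRegression()
    .setFeaturesCol("features") // Features Field
    .setLabelCol("label")       // Label Field
    .setMaxIter(20)             // Maximum Iterations
    .setElasticNetParam(0.0)    // ElasticNet Mixing
    .setRegParam(0.0)           // Regularization Parameter
    .setTol(1e-6);              // Convergence Tolerance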

LRPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("LRPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Logistic "
+ "Regression classifier model.")
public class LRPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name: The unique name of the classifier model that is used for predictions.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.

Multi-Layer Perceptron

MLPClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("MLPClassifer")
@Description("A building stage for an Apache Spark ML Multi-Layer Perceptron classifier model. "
+ "This stage expects a dataset with at least two fields to train the model: One as an array of "
+ "numeric values, and, another that describes the class or label value as numeric value.")
public class MLPClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Solver Algorithm: The solver algorithm for optimization. Supported options are 'gd' (minibatch gradient descent) and 'l-bfgs'. Default is 'l-bfgs'.
Layer Sizes: The comma-separated list of the sizes of the layers from the input to the output layer. For example: 780,100,10 means 780 inputs, one hidden layer with 100 neurons, and an output layer with 10 neurons. At least 2 layers (input, output) must be specified.
Block Size: The nonnegative block size for stacking input data in matrices to speed up the computation. Data is stacked within partitions. If the block size is larger than the remaining data in a partition, it is adjusted to the size of that data. Recommended size is between 10 and 1000. Default is 128.
Maximum Iterations: The maximum number of iterations to train the Multi-Layer Perceptron model. Default is 100.
Learning Rate: The learning rate for shrinking the contribution of each estimator. Must be in the interval (0, 1]. Default is 0.03.
Convergence Tolerance: The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-6.
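
A sketch of the corresponding Spark ML MultilayerPerceptronClassifier configuration, using the example layer sizes from above; field and variable names are illustrative assumptions:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;

MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
    .setFeaturesCol("features")            // Features Field
    .setLabelCol("label")                  // Label Field
    .setSolver("l-bfgs")                   // Solver Algorithm
    .setLayers(new int[] { 780, 100, 10 }) // Layer Sizes
    .setBlockSize(128)                     // Block Size
    .setMaxIter(100)                       // Maximum Iterations
    .setStepSize(0.03)                     // Learning Rate
    .setTol(1e-6);                         // Convergence Tolerance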

MLPPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("MLPPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Multi-Layer Perceptron "
+ "classifier model.")
public class MLPPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name: The unique name of the classifier model that is used for predictions.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.

Naive Bayes

NBClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("NBClassifer")
@Description("A building stage for an Apache Spark ML Naive Bayes classifier model. This stage expects "
+ "a dataset with at least two fields to train the model: One as an array of numeric values, and, "  
+ "another that describes the class or label value as numeric value.")
public class NBClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Model Type: The model type of the Naive Bayes classifier. Supported values are 'bernoulli' and 'multinomial'. Choosing the Bernoulli version of Naive Bayes requires the feature values to be binary (0 or 1). Default is 'multinomial'.
Smoothing: The smoothing parameter of the Naive Bayes classifier. Default is 1.0.
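
A sketch of the corresponding Spark ML NaiveBayes configuration with the defaults above; field and variable names are illustrative assumptions:

import org.apache.spark.ml.classification.NaiveBayes;

NaiveBayes trainer = new NaiveBayes()
    .setFeaturesCol("features")  // Features Field
    .setLabelCol("label")        // Label Field
    .setModelType("multinomial") // Model Type
    .setSmoothing(1.0);          // Smoothing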

NBPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("NBPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Naive Bayes classifier model.")
public class NBPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name: The unique name of the classifier model that is used for predictions.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.

Random Forest

RFClassifier

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("RFClassifer")
@Description("A building stage for an Apache Spark ML Random Forest Tree classifier model. This stage expects "
+ "a dataset with at least two fields to train the model: One as an array of numeric values, and, "  
+ "another that describes the class or label value as numeric value.")
public class RFClassifier extends ClassifierSink {

    ...

}

Parameters

Model Name: The unique name of the classifier model.
Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 70:30.

Model Configuration

Impurity: The criterion used to calculate information gain. Supported values: 'entropy' and 'gini'. Default is 'gini'.
Minimum Gain: The minimum information gain for a split to be considered at a tree node. The value must be at least 0.0. Default is 0.0.
Maximum Bins: The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. Must be at least 2. Default is 32.
Maximum Depth: The nonnegative maximum depth of the tree. E.g. depth 0 means 1 leaf node; depth 1 means 1 internal node and 2 leaf nodes. Default is 5.
Number of Trees: The number of trees used to train the model. Default is 20.
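
A sketch of the corresponding Spark ML RandomForestClassifier configuration with the defaults above; field and variable names are illustrative assumptions:

import org.apache.spark.ml.classification.RandomForestClassifier;

RandomForestClassifier trainer = new RandomForestClassifier()
    .setFeaturesCol("features") // Features Field
    .setLabelCol("label")       // Label Field
    .setImpurity("gini")        // Impurity
    .setMinInfoGain(0.0)        // Minimum Gain
    .setMaxBins(32)             // Maximum Bins
    .setMaxDepth(5)             // Maximum Depth
    .setNumTrees(20);           // Number of Trees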

RFPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("RFPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Random Forest "
+ "classifier or regressor model. The model type parameter determines whether this stage "
+ "predicts from a classifier or regressor model.")		
public class RFPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name: The unique name of the classifier or regressor model that is used for predictions.
Model Type: The type of the model that is used for prediction, either 'classifier' or 'regressor'.
Features Field: The name of the field in the input schema that contains the feature vector.
Prediction Field: The name of the field in the output schema that contains the predicted label.