Bisecting K-Means
BisectingKMeansSink
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("BisectingKMeansSink")
@Description("A building stage for an Apache Spark ML Bisecting K-Means "
+ "clustering model. This stage expects a dataset with at least one "
+ "feature field as an array of numeric values to train the model.")
public class BisectingKMeansSink extends ClusterSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Model Configuration | |
Clusters | The desired number of leaf clusters. Must be > 1. Default is 4. |
Maximum Iterations | The maximum number of iterations to train the Bisecting K-Means model. Default is 20. |
Minimum Points | The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster. Default is 1.0. |
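Internally this sink builds a Spark ML BisectingKMeans estimator, so the parameters above map directly onto that API. The following Java sketch illustrates such a training step, assuming the feature field has already been assembled into a Spark ML vector column named "features"; class and column names are illustrative, not part of the plugin.

import org.apache.spark.ml.clustering.BisectingKMeans;
import org.apache.spark.ml.clustering.BisectingKMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BisectingKMeansTrainSketch {

  // 'train' is assumed to hold a Spark ML vector column named "features",
  // i.e. the assembled feature vector described by the Features Field parameter.
  public static BisectingKMeansModel train(Dataset<Row> train) {
    BisectingKMeans trainer = new BisectingKMeans()
        .setFeaturesCol("features")          // Features Field
        .setK(4)                             // Clusters (default 4)
        .setMaxIter(20)                      // Maximum Iterations (default 20)
        .setMinDivisibleClusterSize(1.0);    // Minimum Points (default 1.0)

    return trainer.fit(train);
  }
}

The fitted model would then be persisted under the configured Model Name so that the predictor stage below can reference it.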
BisectingKMeansPredictor
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("BisectingKMeansPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML "
+ "Bisecting K-Means clustering model.")
public class BisectingKMeansPredictor extends PredictorCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model that is used for predictions. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Prediction Field | The name of the field in the output schema that contains the predicted label. |
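A minimal prediction sketch: the stored model is resolved from the Model Name (here via a hypothetical file-system path) and applied with transform, which appends the cluster index as the configured Prediction Field.

import org.apache.spark.ml.clustering.BisectingKMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BisectingKMeansPredictSketch {

  // 'modelPath' is a hypothetical location resolved from the Model Name parameter;
  // 'input' is assumed to contain the feature vector column named "features".
  public static Dataset<Row> predict(String modelPath, Dataset<Row> input) {
    BisectingKMeansModel model = BisectingKMeansModel.load(modelPath);

    // transform(...) appends the assigned cluster index; the output column
    // corresponds to the Prediction Field parameter.
    return model
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        .transform(input);
  }
}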
Gaussian Mixture Model (GMM)
GaussianMixtureSink
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("GaussianMixtureSink")
@Description("A building stage for an Apache Spark ML Gaussian Mixture clustering "
+ "model. This stage expects a dataset with at least one feature field as an "
+ "array of numeric values to train the model.")
public class GaussianMixtureSink extends ClusterSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Model Configuration | |
Clusters | The number of independent Gaussian distributions in the dataset. Must be > 1. Default is 2. |
Maximum Iterations | The maximum number of iterations to train the Gaussian Mixture model. Default is 100. |
Convergence Tolerance | The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 0.01. |
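As with the Bisecting K-Means sink, the configuration translates directly to Spark ML, in this case the GaussianMixture estimator. A sketch with the same assumption of an assembled "features" vector column:

import org.apache.spark.ml.clustering.GaussianMixture;
import org.apache.spark.ml.clustering.GaussianMixtureModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class GaussianMixtureTrainSketch {

  public static GaussianMixtureModel train(Dataset<Row> train) {
    GaussianMixture trainer = new GaussianMixture()
        .setFeaturesCol("features")  // Features Field
        .setK(2)                     // Clusters (default 2)
        .setMaxIter(100)             // Maximum Iterations (default 100)
        .setTol(0.01);               // Convergence Tolerance (default 0.01)

    return trainer.fit(train);
  }
}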
GaussianMixturePredictor
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("GaussianMixturePredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Gaussian "
+ "Mixture clustering model.")
public class GaussianMixturePredictor extends PredictorCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model that is used for predictions. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Prediction Field | The name of the field in the output schema that contains the predicted label. |
Probability Field | The name of the field in the output schema that contains the probability vector, i.e. the probability for each cluster. |
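A corresponding prediction sketch; besides the hard cluster assignment, the Gaussian Mixture model emits the per-cluster probability vector referenced by the Probability Field. The model path is again hypothetical:

import org.apache.spark.ml.clustering.GaussianMixtureModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class GaussianMixturePredictSketch {

  public static Dataset<Row> predict(String modelPath, Dataset<Row> input) {
    GaussianMixtureModel model = GaussianMixtureModel.load(modelPath);

    // transform(...) appends both the hard cluster assignment (Prediction Field)
    // and the per-cluster probability vector (Probability Field).
    return model
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        .setProbabilityCol("probability")
        .transform(input);
  }
}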
K-Means
KMeansSink
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("KMeansSink")
@Description("A building stage for an Apache Spark ML K-Means clustering model. "
+ "This stage expects a dataset with at least one feature field as an array "
+ "of numeric values to train the model.")
public class KMeansSink extends ClusterSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Model Configuration | |
Clusters | The number of clusters that have to be created. |
Maximum Iterations | The maximum number of iterations to train the K-Means model. Default is 20. |
Initialization Mode | The initialization mode of the algorithm. This can be either 'random', to choose random points as the initial cluster centers, or 'parallel', to use the parallel variant of k-means++. Default is 'parallel'. |
Initialization Steps | The number of steps for the initialization mode of the parallel K-Means algorithm. Default is 2. |
Convergence Tolerance | The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-4. |
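A sketch of how these parameters could be passed to Spark ML's KMeans estimator. The 'parallel' initialization mode is assumed to correspond to Spark's "k-means||" setting, and the value chosen for Clusters is arbitrary since no default is documented:

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class KMeansTrainSketch {

  public static KMeansModel train(Dataset<Row> train) {
    KMeans trainer = new KMeans()
        .setFeaturesCol("features")   // Features Field
        .setK(2)                      // Clusters (illustrative value)
        .setMaxIter(20)               // Maximum Iterations (default 20)
        .setInitMode("k-means||")     // Initialization Mode: 'parallel' presumably maps to "k-means||", 'random' to "random"
        .setInitSteps(2)              // Initialization Steps (default 2)
        .setTol(1e-4);                // Convergence Tolerance (default 1e-4)

    return trainer.fit(train);
  }
}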
KMeansPredictor
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("KMeansPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML K-Means "
+ "clustering model.")
public class KMeansPredictor extends PredictorCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model that is used for predictions. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Prediction Field | The name of the field in the output schema that contains the predicted label. |
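A prediction sketch for K-Means; each record is assigned the index of its nearest cluster center, and the centers of the loaded model can be inspected as well. The model path is hypothetical:

import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class KMeansPredictSketch {

  public static Dataset<Row> predict(String modelPath, Dataset<Row> input) {
    KMeansModel model = KMeansModel.load(modelPath);

    // The learned cluster centers are available for inspection.
    Vector[] centers = model.clusterCenters();
    System.out.println("Number of cluster centers: " + centers.length);

    // transform(...) appends the index of the nearest center as the Prediction Field.
    return model
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        .transform(input);
  }
}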
Latent Dirichlet Allocation (LDA)
LDASink
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LDASink")
@Description("A building stage for an Apache Spark ML Latent Dirichlet Allocation (LDA) "
+ "clustering model. This stage expects a dataset with at least one feature field as "
+ "an array of numeric values to train the model.")
public class LDASink extends ClusterSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Data Split | The split of the dataset into train & test data, e.g. 80:20. Default is 90:10. |
Model Configuration | |
Topics | The number of topics that have to be created. Default is 10. |
Maximum Iterations | The maximum number of iterations to train the LDA model. Default is 20. |
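A rough sketch of the training step behind this sink against Spark ML's LDA estimator, including the Data Split parameter. How the plugin actually uses the held-out split is an assumption; perplexity is shown only as one plausible evaluation:

import org.apache.spark.ml.clustering.LDA;
import org.apache.spark.ml.clustering.LDAModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class LDATrainSketch {

  public static LDAModel train(Dataset<Row> dataset) {
    // Data Split: 90:10 split into train & test data (default).
    Dataset<Row>[] splits = dataset.randomSplit(new double[] {0.9, 0.1});
    Dataset<Row> train = splits[0];
    Dataset<Row> test = splits[1];

    LDA trainer = new LDA()
        .setFeaturesCol("features")  // Features Field
        .setK(10)                    // Topics (default 10)
        .setMaxIter(20);             // Maximum Iterations (default 20)

    LDAModel model = trainer.fit(train);

    // The held-out split can be used to evaluate the model, e.g. via perplexity.
    double perplexity = model.logPerplexity(test);
    System.out.println("Log perplexity on test data: " + perplexity);

    return model;
  }
}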
LDAPredictor
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("LDAPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Latent Dirichlet "
+ "Allocation (LDA) clustering model.")
public class LDAPredictor extends PredictorCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the clustering model that is used for predictions. |
Features Field | The name of the field in the input schema that contains the feature vector. |
Prediction Field | The name of the field in the output schema that contains the predicted label. |
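Spark ML's LDA model emits a per-document topic-distribution vector rather than a single label, so the value written to the Prediction Field is presumably derived from that vector, e.g. as the index of the most probable topic. A sketch under that assumption, with the trained model passed in directly instead of being resolved by name:

import org.apache.spark.ml.clustering.LDAModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class LDAPredictSketch {

  // 'model' is assumed to be the trained LDA model resolved from the Model Name;
  // 'input' is assumed to contain the feature vector column named "features".
  public static Dataset<Row> predict(LDAModel model, Dataset<Row> input) {
    // transform(...) appends the per-document topic distribution as "topicDistribution".
    Dataset<Row> scored = model.setFeaturesCol("features").transform(input);

    // Derive a single predicted label as the index of the most probable topic
    // (an assumption about how the Prediction Field is populated).
    scored.sparkSession().udf().register("mostLikelyTopic",
        (UDF1<Vector, Integer>) Vector::argmax, DataTypes.IntegerType);

    return scored.withColumn("prediction",
        callUDF("mostLikelyTopic", col("topicDistribution")));
  }
}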