Clustering

Bisecting K-Means

BisectingKMeansSink

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("BisectingKMeansSink")
@Description("A building stage for an Apache Spark ML Bisecting K-Means "
+ "clustering model. This stage expects a dataset with at least one "
+ "feature field as an array of numeric values to train the model.")
public class BisectingKMeansSink extends ClusterSink {

    ...

}

Parameters

Model Name	The unique name of the clustering model.
Features Field	The name of the field in the input schema that contains the feature vector.
Model Configuration
Clusters	The desired number of leaf clusters. Must be > 1. Default is 4.
Maximum Iterations	The maximum number of iterations to train the Bisecting K-Means model. Default is 20.
Minimum Points	The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster. Default is 1.0.
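
The dual meaning of Minimum Points can be sketched as follows. This is a minimal illustration, not the plugin's code: the class `MinimumPoints` and the method `resolve` are hypothetical names, and rounding up is an assumption about how a fractional threshold would be turned into a point count.

```java
final class MinimumPoints {
    private MinimumPoints() {}

    /**
     * Resolves the "Minimum Points" setting to an absolute number of points.
     * Values >= 1.0 are read as a count, values < 1.0 as a proportion of the dataset.
     */
    static long resolve(double minimumPoints, long totalPoints) {
        if (minimumPoints <= 0.0) {
            throw new IllegalArgumentException("Minimum Points must be positive");
        }
        if (minimumPoints >= 1.0) {
            // Interpreted as an absolute count (assumption: rounded up).
            return (long) Math.ceil(minimumPoints);
        }
        // Interpreted as a proportion of all points (assumption: rounded up).
        return (long) Math.ceil(minimumPoints * totalPoints);
    }
}
```

With the default of 1.0, any cluster holding at least one point counts as divisible; a value of 0.25 on a dataset of 1000 points would require 250 points.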

BisectingKMeansPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("BisectingKMeansPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML "
+ "Bisecting K-Means clustering model.")
public class BisectingKMeansPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name	The unique name of the clustering model that is used for predictions.
Features Field	The name of the field in the input schema that contains the feature vector.
Prediction Field	The name of the field in the output schema that contains the predicted label.

Gaussian Mixture Model (GMM)

GaussianMixtureSink

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("GaussianMixtureSink")
@Description("A building stage for an Apache Spark ML Gaussian Mixture clustering "
+ "model. This stage expects a dataset with at least one feature field as an "
+ "array of numeric values to train the model.")
public class GaussianMixtureSink extends ClusterSink {

    ...

}

Parameters

Model Name	The unique name of the clustering model.
Features Field	The name of the field in the input schema that contains the feature vector.
Model Configuration
Clusters	The number of independent Gaussian distributions in the dataset. Must be > 1. Default is 2.
Maximum Iterations	The maximum number of iterations to train the Gaussian Mixture model. Default is 100.
Conversion Tolerance	The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 0.01.
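
How Maximum Iterations and the tolerance might interact can be sketched as a stopping rule. This assumes that training stops once the change in log-likelihood between two iterations falls to or below the tolerance, which is how the setting is commonly understood for Gaussian mixtures; the class `LogLikelihoodStop` and its method are hypothetical and not part of the plugin.

```java
final class LogLikelihoodStop {
    private LogLikelihoodStop() {}

    /**
     * Returns true when training should stop: either the iteration budget is used up
     * or the log-likelihood changed by no more than the tolerance.
     */
    static boolean shouldStop(int iteration, int maxIterations,
                              double previousLogLikelihood, double currentLogLikelihood,
                              double tolerance) {
        if (iteration >= maxIterations) {
            return true; // budget exhausted
        }
        // Assumed convergence test: absolute change in log-likelihood.
        return Math.abs(currentLogLikelihood - previousLogLikelihood) <= tolerance;
    }
}
```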

GaussianMixturePredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("GaussianMixturePredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Gaussian "
+ "Mixture clustering model.")
public class GaussianMixturePredictor extends PredictorCompute {

  ...

}

Parameters

Model Name	The unique name of the clustering model that is used for predictions.
Features Field	The name of the field in the input schema that contains the feature vector.
Prediction Field	The name of the field in the output schema that contains the predicted label.
Probability Field	The name of the field in the output schema that contains the probability vector, i.e. the probability for each cluster.
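
The relation between the probability vector and the predicted label can be illustrated with a small sketch. It assumes that the predicted label is the index of the most probable cluster; `ClusterAssignment` and `mostLikelyCluster` are hypothetical names used only for illustration.

```java
final class ClusterAssignment {
    private ClusterAssignment() {}

    /** Returns the index of the largest entry in the per-cluster probability vector. */
    static int mostLikelyCluster(double[] probabilities) {
        if (probabilities == null || probabilities.length == 0) {
            throw new IllegalArgumentException("Probability vector must not be empty");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            // Ties keep the lowest index.
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }
}
```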

K-Means

KMeansSink

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("KMeansSink")
@Description("A building stage for an Apache Spark ML K-Means clustering model. "
+ "This stage expects a dataset with at least one feature field as an array "
+ "of numeric values to train the model.")
public class KMeansSink extends ClusterSink {

    ...

}

Parameters

Model Name	The unique name of the clustering model.
Features Field	The name of the field in the input schema that contains the feature vector.
Model Configuration
Clusters	The number of clusters to create.
Maximum Iterations	The maximum number of iterations to train the K-Means model. Default is 20.
Initialization Mode	The initialization mode of the algorithm. This can be either 'random' to choose random points as initial cluster centers, or 'parallel' to use the parallel variant of KMeans++. Default is 'parallel'.
Initialization Steps	The number of steps for the initialization mode of the parallel KMeans algorithm. Default value: 2.
Conversion Tolerance	The positive convergence tolerance of iterations. Smaller values lead to higher accuracy at the cost of more iterations. Default is 1e-4.
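
For K-Means, the tolerance is usually applied to how far the cluster centers move between iterations. The sketch below assumes a Euclidean distance test per center; the class `CenterShift` and its method are hypothetical and only illustrate the idea, they are not taken from the plugin.

```java
final class CenterShift {
    private CenterShift() {}

    /**
     * Returns true when no cluster center moved farther than the tolerance
     * between two consecutive iterations.
     */
    static boolean converged(double[][] previous, double[][] current, double tolerance) {
        for (int k = 0; k < previous.length; k++) {
            double squaredDistance = 0.0;
            for (int d = 0; d < previous[k].length; d++) {
                double diff = current[k][d] - previous[k][d];
                squaredDistance += diff * diff;
            }
            // Compare squared values to avoid a square root per center.
            if (squaredDistance > tolerance * tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

With the default of 1e-4, a center shift of 0.00005 counts as converged, while a shift of 0.5 triggers another iteration (as long as Maximum Iterations is not reached).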

KMeansPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("KMeansPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML K-Means "
+ "clustering model.")
public class KMeansPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name	The unique name of the clustering model that is used for predictions.
Features Field	The name of the field in the input schema that contains the feature vector.
Prediction Field	The name of the field in the output schema that contains the predicted label.

Latent Dirichlet Allocation (LDA)

LDASink

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LDASink")
@Description("A building stage for an Apache Spark ML Latent Dirichlet Allocation (LDA) "
+ "clustering model. This stage expects a dataset with at least one feature field as "
+ "an array of numeric values to train the model.")
public class LDASink extends ClusterSink {

    ...

}

Parameters

Model Name	The unique name of the clustering model.
Features Field	The name of the field in the input schema that contains the feature vector.
Data Split	The split of the dataset into train & test data, e.g. 80:20. Default is 90:10.
Model Configuration
Topics	The number of topics that have to be created. Default is 10.
Maximum Iterations	The maximum number of iterations to train the LDA model. Default is 20.
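
A Data Split value such as 90:10 can be turned into train and test fractions as sketched below. The class `DataSplit` and the method `parse` are hypothetical names, and normalizing by the sum of both parts is an assumption about how the ratio is interpreted.

```java
final class DataSplit {
    private DataSplit() {}

    /** Parses "train:test" (e.g. "90:10") into fractions that sum to 1.0. */
    static double[] parse(String split) {
        String[] parts = split.trim().split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'train:test', e.g. 90:10");
        }
        double train = Double.parseDouble(parts[0].trim());
        double test = Double.parseDouble(parts[1].trim());
        if (train <= 0.0 || test <= 0.0) {
            throw new IllegalArgumentException("Both parts of the split must be positive");
        }
        double total = train + test;
        // Index 0 holds the train fraction, index 1 the test fraction.
        return new double[] {train / total, test / total};
    }
}
```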

LDAPredictor

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("LDAPredictor")
@Description("A prediction stage that leverages a trained Apache Spark ML Latent Dirichlet "
+ "Allocation (LDA) clustering model.")
public class LDAPredictor extends PredictorCompute {

    ...

}

Parameters

Model Name	The unique name of the clustering model that is used for predictions.
Features Field	The name of the field in the input schema that contains the feature vector.
Prediction Field	The name of the field in the output schema that contains the predicted label.

Table of Contents

Bisecting K-Means
- BisectingKMeansSink
- BisectingKMeansPredictor
Gaussian Mixture Model (GMM)
- GaussianMixtureSink
- GaussianMixturePredictor
K-Means
- KMeansSink
- KMeansPredictor
Latent Dirichlet Allocation (LDA)
- LDASink
- LDAPredictor