Overview

In machine learning, imbalanced datasets are common. A dataset intended for classification, or for another discrete prediction task, is said to be imbalanced when its classes have unequal numbers of samples.

Classes with comparatively few instances are called minority classes; those with comparatively many are called majority classes. Training ML models on imbalanced datasets often biases the models towards the majority classes.

The SMOTE algorithm, short for Synthetic Minority Over-sampling Technique, addresses the issue of imbalanced classes. It is based on nearest neighbors, determined by the Euclidean distance between data points in the feature space.

The feature values of nearest-neighbor samples are used to interpolate synthetic feature values, producing a predefined percentage of additional synthetic samples (over-sampling). Applying this algorithm to the minority-class samples of a training dataset yields a more balanced training set and reduces the model bias.
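The interpolation step can be illustrated with a minimal sketch. It assumes plain `double[]` feature vectors and a brute-force neighbor search; the class `SmoteSketch` and its methods are hypothetical names chosen for illustration and are not the plugin's API, which may compute neighbors differently.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Hypothetical helper for illustration only; not part of the plugin.
final class SmoteSketch {

    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of the k minority samples closest to minority[index] (brute force).
    static int[] nearestNeighbors(double[][] minority, int index, int k) {
        Integer[] candidates = new Integer[minority.length - 1];
        int pos = 0;
        for (int i = 0; i < minority.length; i++) {
            if (i != index) {
                candidates[pos++] = i;
            }
        }
        Arrays.sort(candidates,
            Comparator.comparingDouble(i -> distance(minority[index], minority[i])));
        int[] result = new int[Math.min(k, candidates.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = candidates[i];
        }
        return result;
    }

    // Synthetic point on the line segment between sample and neighbor.
    // gap = 0 returns the sample itself, gap = 1 returns the neighbor.
    static double[] interpolate(double[] sample, double[] neighbor, double gap) {
        double[] synthetic = new double[sample.length];
        for (int i = 0; i < sample.length; i++) {
            synthetic[i] = sample[i] + gap * (neighbor[i] - sample[i]);
        }
        return synthetic;
    }

    // One synthetic sample for minority[index]: pick a random neighbor among
    // the k nearest and a random gap in [0, 1), then interpolate.
    static double[] synthesize(double[][] minority, int index, int k, Random random) {
        int[] neighbors = nearestNeighbors(minority, index, k);
        double[] neighbor = minority[neighbors[random.nextInt(neighbors.length)]];
        return interpolate(minority[index], neighbor, random.nextDouble());
    }
}
```

Calling `SmoteSketch.synthesize` repeatedly for each minority sample until the desired over-sampling percentage is reached would give the additional records; every synthetic record inherits the minority label of the sample it was derived from.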

Apache Spark ML does not support SMOTE out of the box.

The SMOTE algorithm is implemented as an Apache Spark ML Transformer and also made available as a plugin component for CDAP data pipelines.

SMOTESampler


@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("SMOTESampler")
@Description("A preparation stage for either building Apache Spark based classification or "
+ "regression models. This stage leverages the SMOTE algorithm to extend a training "
+ "dataset containing features & labels with synthetic data records.")
public class SMOTESampler extends SMOTECompute {

    ...

}

Parameters

Features Field: The name of the field in the input schema that contains the feature vector.
Label Field: The name of the field in the input schema that contains the label.

Model Configuration

Number of Hash Tables: The number of hash tables used in LSH OR-amplification. LSH OR-amplification can be used to reduce the false negative rate. Higher values for this parameter lead to a reduced false negative rate, at the expense of added computational complexity. Default is 1.
Bucket Length: The length of each hash bucket. A larger bucket lowers the false negative rate. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
Nearest Neighbors: The number of nearest neighbors that are taken into account by the SMOTE algorithm to interpolate synthetic feature values. Default is 4.
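
How the hash-table and bucket-length parameters interact can be shown with a small sketch. It assumes a bucketed random projection scheme, where each hash table projects a vector onto a random unit vector and divides by the bucket length; the parameter names suggest this, but the text above does not confirm which LSH family the plugin uses. The class `LshSketch` and its methods are hypothetical.

```java
import java.util.Random;

// Hypothetical illustration of bucketed random projection hashing with
// OR-amplification; names are assumptions, not the plugin's API.
final class LshSketch {

    // Draws one random unit vector per hash table.
    static double[][] randomUnitVectors(int numHashTables, int dim, Random random) {
        double[][] vectors = new double[numHashTables][dim];
        for (double[] v : vectors) {
            double norm = 0.0;
            for (int i = 0; i < dim; i++) {
                v[i] = random.nextGaussian();
                norm += v[i] * v[i];
            }
            norm = Math.sqrt(norm);
            for (int i = 0; i < dim; i++) {
                v[i] /= norm;
            }
        }
        return vectors;
    }

    // One bucket index per hash table: floor((x . v) / bucketLength).
    static long[] hash(double[] x, double[][] unitVectors, double bucketLength) {
        long[] hashes = new long[unitVectors.length];
        for (int t = 0; t < unitVectors.length; t++) {
            double dot = 0.0;
            for (int i = 0; i < x.length; i++) {
                dot += x[i] * unitVectors[t][i];
            }
            hashes[t] = (long) Math.floor(dot / bucketLength);
        }
        return hashes;
    }

    // OR-amplification: two vectors are candidate neighbors if they share
    // a bucket in at least one hash table.
    static boolean isCandidatePair(long[] a, long[] b) {
        for (int t = 0; t < a.length; t++) {
            if (a[t] == b[t]) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, more hash tables give a pair of true neighbors more chances to collide, and a larger bucket length makes each collision more likely. Both lower the false negative rate, while producing more candidate pairs to check by exact Euclidean distance.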