Aggregation

VectorAssembler

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("VectorAssembler")
@Description("A transformation stage that leverages the Apache Spark ML VectorAssembler "
+ "to merge multiple numeric (or numeric vector) fields into a single feature vector.")
public class VectorAssembler extends FeatureCompute {

    ...

}

Parameters

Input Fields The comma-separated list of numeric (or numeric vector) fields that have to be assembled.
Output Field The name of the field in the output schema that contains the transformed features.
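
For orientation, a minimal sketch of the Spark ML call this stage wraps, assuming a Dataset<Row> named input with hypothetical numeric columns 'age' and 'income':

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Merge the listed numeric columns into a single feature vector column.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] { "age", "income" })  // Input Fields
    .setOutputCol("features");                       // Output Field

Dataset<Row> assembled = assembler.transform(input);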

Hashing

BucketedLSHBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("BucketedLSHBuilder")
@Description("A building stage for an Apache Spark ML Bucketed Random Projection LSH model.")
public class BucketedLSHBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Number of Hash Tables The number of hash tables used in LSH OR-amplification. LSH OR-amplification can be used to reduce the false negative rate. Higher values for this parameter lead to a reduced false negative rate, at the expense of added computational complexity. Default is 1.
Bucket Length The length of each hash bucket, a larger bucket lowers the false negative rate. The number of buckets will be (max L2 norm of input vectors) / bucketLength.
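
The parameters above map onto Apache Spark ML's BucketedRandomProjectionLSH estimator. A minimal sketch, assuming a Dataset<Row> named training with a vector column 'features':

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH;
import org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel;

BucketedRandomProjectionLSH lsh = new BucketedRandomProjectionLSH()
    .setInputCol("features")   // Input Field
    .setOutputCol("hashes")
    .setNumHashTables(1)       // Number of Hash Tables
    .setBucketLength(2.0);     // Bucket Length

// The builder stage fits the model and persists it under the given Model Name;
// the BucketedLSH compute stage later applies model.transform(...).
BucketedRandomProjectionLSHModel model = lsh.fit(training);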

BucketedLSH

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("BucketedLSH")
@Description("A transformation stage that leverages a trained Bucketed Random Projection LSH model "
+ "to project feature vectors onto hash value vectors. Similar feature vectors are mapped onto "
+ "the same hash value vector.")
public class BucketedLSH extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

HashingTF

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("HashingTF")
@Description("A transformation stage that leverages the Apache Spark ML HashingTF and maps "
+ "a sequence of terms to their term frequencies using the hashing trick. Currently the "
+ "Austin Appleby's MurmurHash 3 algorithm is used.")
public class HashingTF extends FeatureCompute {

      ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Number of Features The nonnegative number of features to transform a sequence of terms into.
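
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named tokenized with an array-of-strings column 'terms':

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

HashingTF hashingTF = new HashingTF()
    .setInputCol("terms")         // Input Field
    .setOutputCol("rawFeatures")  // Output Field
    .setNumFeatures(1 << 18);     // Number of Features

Dataset<Row> tf = hashingTF.transform(tokenized);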

MinHashLSHBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("MinHashLSHBuilder")
@Description("A building stage for an Apache Spark ML MinHash LSH model.")
public class MinHashLSHBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Number of Hash Tables The number of hash tables used in LSH OR-amplification. LSH OR-amplification can be used to reduce the false negative rate. Higher values for this parameter lead to a reduced false negative rate, at the expense of added computational complexity. Default is 1.
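
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with a (binary) vector column 'features':

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;

MinHashLSH minHash = new MinHashLSH()
    .setInputCol("features")  // Input Field
    .setOutputCol("hashes")
    .setNumHashTables(1);     // Number of Hash Tables

// The builder fits the model; the MinHashLSH compute stage later applies model.transform(...).
MinHashLSHModel model = minHash.fit(training);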

MinHashLSH

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("MinHashLSH")
@Description("A transformation stage that leverages a trained MinHash LSH model "
+ "to project feature vectors onto hash value vectors.")
public class MinHashLSH extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

Indexing

StringIndexerBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("StringIndexerBuilder")
@Description("A building stage for an Apache Spark ML StringIndexer model.")
public class StringIndexerBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
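
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with a hypothetical string column 'category':

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;

StringIndexer indexer = new StringIndexer()
    .setInputCol("category")        // Input Field
    .setOutputCol("categoryIndex");

// The StringToIndex compute stage later applies model.transform(...).
StringIndexerModel model = indexer.fit(training);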

IndexToString

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("IndexToString")
@Description("A transformation stage that leverages the Apache Spark ML IndexToString transformer. "
+ "This stage requires a trained StringIndexer model.")
public class IndexToString extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
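
A minimal sketch of the wrapped Spark ML transformer, assuming a trained StringIndexerModel named model and an indexed Dataset<Row> named indexed:

import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

IndexToString converter = new IndexToString()
    .setInputCol("categoryIndex")   // Input Field
    .setOutputCol("categoryLabel")  // Output Field
    .setLabels(model.labels());     // labels recovered from the trained StringIndexer model

Dataset<Row> decoded = converter.transform(indexed);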

StringToIndex

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("StringToIndex")
@Description("A transformation stage that leverages the Apache Spark ML StringIndexer. "
+ "This stage requires a trained StringIndexer model.")
public class StringToIndex extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

VectorIndexerBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("VectorIndexerBuilder")
@Description("A building stage for an Apache Spark ML VectorIndexer model.")
public class VectorIndexerBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Maximum Categories The threshold for the number of values a categorical feature can take. If a feature is found to have more category values than this threshold, then it is declared continuous. Must be greater than or equal to 2. Default is 20.
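
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with a vector column 'features':

import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;

VectorIndexer indexer = new VectorIndexer()
    .setInputCol("features")  // Input Field
    .setOutputCol("indexed")
    .setMaxCategories(20);    // Maximum Categories

// The VectorIndexer compute stage later applies model.transform(...).
VectorIndexerModel model = indexer.fit(training);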

VectorIndexer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("VectorIndexer")
@Description("A transformation stage that leverages the Apache Spark ML VectorIndexer to "
+ "decide which features are categorical and converts the original values into category "
+ "indices. This stage requires a trained VectorIndexer model.")
public class VectorIndexer extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

Reduction

PCABuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("PCABuilder")
@Description("A building stage for an Apache Spark based Principal Component Analysis "
+ "feature model.")
public class PCABuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Number of Components The positive number of principal components.
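
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with a vector column 'features':

import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;

PCA pca = new PCA()
    .setInputCol("features")      // Input Field
    .setOutputCol("pcaFeatures")
    .setK(3);                     // Number of Components

// The PCA compute stage later projects feature vectors via model.transform(...).
PCAModel model = pca.fit(training);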

PCA

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("PCA")
@Description("A transformation stage that leverages a trained PCA model to project "
+ "feature vectors onto a lower dimensional vector space.")
public class PCA extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

Scaling

ScalerBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("ScalerBuilder")
@Description("A building stage for an Apache Spark ML feature scaling model. Supported models "
+ "are Min-Max, Max-Abs and Standard Scaler.")
public class ScalerBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Model Type The type of the scaler model. Supported values are 'maxabs', 'minmax' and 'standard'. Default is 'standard'.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Lower Bound The lower bound of the feature range after transformation. This parameter is restricted to the model type 'minmax'. Default is 0.0.
Upper Bound The upper bound of the feature range after transformation. This parameter is restricted to the model type 'minmax'. Default is 1.0.
With Mean Indicator to determine whether to center the data with mean before scaling. This parameter applies to the model type 'standard'. Default is 'false'.
With Std Indicator to determine whether to scale the data to unit standard deviation. This parameter applies to the model type 'standard'. Default is 'true'.
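
A minimal sketch of the wrapped Spark ML estimators for two of the supported model types, assuming a Dataset<Row> named training with a vector column 'features':

import org.apache.spark.ml.feature.MinMaxScaler;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.StandardScalerModel;

// Model Type 'standard'
StandardScaler standard = new StandardScaler()
    .setInputCol("features")  // Input Field
    .setOutputCol("scaled")
    .setWithMean(false)       // With Mean
    .setWithStd(true);        // With Std

StandardScalerModel model = standard.fit(training);

// Model Type 'minmax'
MinMaxScaler minmax = new MinMaxScaler()
    .setInputCol("features")
    .setOutputCol("scaled")
    .setMin(0.0)              // Lower Bound
    .setMax(1.0);             // Upper Bound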

Scaler

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Scaler")
@Description("A transformation stage that leverages a trained Scaler model to project feature "
+ "vectors onto scaled vectors. Supported models are 'Max-Abs', 'Min-Max' and 'Standard'.")
public class Scaler extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Model Type The type of the scaler model. Supported values are 'maxabs', 'minmax' and 'standard'. Default is 'standard'.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

Selection

ChiSquaredBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("ChiSquaredBuilder")
@Description("A building stage for an Apache Spark ML Chi-Squared Selector model.")
public class ChiSquaredBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Label Field The name of the field in the input schema that contains the label.
Model Configuration
Selector Type The selector type. Supported values are 'numTopFeatures', 'percentile' and 'fpr'. Default is 'numTopFeatures'.
Top Features The number of features that the selector will select, ordered by ascending p-value. If the number of features is less than this parameter value, then all features are selected. Only applicable when selectorType = 'numTopFeatures'. Default value is 50.
Percentile of Features The percentile of features that the selector will select, ordered by descending statistic value. Only applicable when selectorType = 'percentile'. Must be in range (0, 1). Default value is 0.1.
Highest P-Value The highest p-value for features to be kept. Only applicable when selectorType = 'fpr'. Must be in range (0, 1). Default value is 0.05.
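
A minimal sketch of the wrapped Spark ML estimator (assuming Spark 2.1+ for the selector type parameter) and a Dataset<Row> named training with a vector column 'features' and a numeric column 'label':

import org.apache.spark.ml.feature.ChiSqSelector;
import org.apache.spark.ml.feature.ChiSqSelectorModel;

ChiSqSelector selector = new ChiSqSelector()
    .setFeaturesCol("features")         // Input Field
    .setLabelCol("label")               // Label Field
    .setOutputCol("selected")
    .setSelectorType("numTopFeatures")  // Selector Type
    .setNumTopFeatures(50);             // Top Features

// The ChiSquaredSelector compute stage later applies model.transform(...).
ChiSqSelectorModel model = selector.fit(training);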

ChiSquaredSelector

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("ChiSquaredSelector")
@Description("A transformation stage that leverages a trained Chi-Squared Selector model "
+ "to select categorical features to use for predicting categorical labels.")
public class ChiSquaredSelector extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

Transformation

Binarizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Binarizer")
@Description("A transformation stage that leverages the Apache Spark ML Binarizer "
+ "to map continuous features onto binary values.")
public class Binarizer extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Threshold The nonnegative threshold used to binarize continuous features. Features greater than the threshold are binarized to 1.0; features equal to or less than the threshold are binarized to 0.0. Default is 0.0.
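
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named input with a hypothetical numeric column 'score':

import org.apache.spark.ml.feature.Binarizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Binarizer binarizer = new Binarizer()
    .setInputCol("score")    // Input Field
    .setOutputCol("binary")  // Output Field
    .setThreshold(0.5);      // Threshold

Dataset<Row> binarized = binarizer.transform(input);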

Bucketizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Bucketizer")
@Description("A transformation stage that leverages the Apache Spark ML Feature "
+ "Bucketizer to map continuous features onto feature buckets.")
public class Bucketizer extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Splits A comma-separated list of split points (Double values) for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x, y), except the last bucket, which also includes y. The splits must be of length >= 3 and strictly increasing. Values at -infinity and infinity must be explicitly provided to cover all Double values; otherwise, values outside the specified splits are treated as errors.
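
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named input with a hypothetical numeric column 'value':

import org.apache.spark.ml.feature.Bucketizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Splits: -infinity and infinity are provided explicitly to cover all Double values.
double[] splits = { Double.NEGATIVE_INFINITY, 0.0, 10.0, Double.POSITIVE_INFINITY };

Bucketizer bucketizer = new Bucketizer()
    .setInputCol("value")    // Input Field
    .setOutputCol("bucket")  // Output Field
    .setSplits(splits);      // Splits

Dataset<Row> bucketed = bucketizer.transform(input);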

DCT

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("DCT")
@Description("A transformation stage that leverages the Apache Spark ML Discrete "
+ "Cosine Tranform to map a feature vector in the time domain into a feature "
+ "vector in the frequency domain.")
public class DCT extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Inverse An indicator to determine whether to perform the inverse DCT (true) or forward DCT (false). Default is 'false'.
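
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named input with a vector column 'features':

import org.apache.spark.ml.feature.DCT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

DCT dct = new DCT()
    .setInputCol("features")      // Input Field
    .setOutputCol("dctFeatures")  // Output Field
    .setInverse(false);           // Inverse

Dataset<Row> transformed = dct.transform(input);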

NGram

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("NGram")
@Description("A transformation stage that leverages Apache Spark ML N-Gram "
+ "transformer to convert the input array of string into an array of n-grams.")
public class NGram extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
N-gram Length Minimum n-gram length, greater than or equal to 1. Default is 2.
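
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named tokenized with an array-of-strings column 'tokens':

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

NGram ngram = new NGram()
    .setInputCol("tokens")   // Input Field
    .setOutputCol("ngrams")  // Output Field
    .setN(2);                // N-gram Length

Dataset<Row> ngrams = ngram.transform(tokenized);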

Normalizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Normalizer")
@Description("A transformation stage to normalize a feature vector to have unit "
+ "norm using the given p-norm.")
public class Normalizer extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
P-Norm The p-norm to use for normalization. Supported values are '1' and '2'. Default is 2.
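
A minimal sketch of the wrapped Spark ML transformer, assuming a Dataset<Row> named input with a vector column 'features':

import org.apache.spark.ml.feature.Normalizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Normalizer normalizer = new Normalizer()
    .setInputCol("features")       // Input Field
    .setOutputCol("normFeatures")  // Output Field
    .setP(2.0);                    // P-Norm

Dataset<Row> normalized = normalizer.transform(input);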

OneHotEncoder

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("OneHotEncoder")
@Description("A transformation stage to map input labels (indices) to binary vectors. "
+ "This encoding allows algorithms which expect continuous features to use categorical "
+ "features. This transformer expects a numeric input and generates an array of double "
+ "as output.")
public class OneHotEncoder extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Drop Last Category An indicator to specify whether to drop the last category in the encoder vectors. Default is 'true'.
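
A minimal sketch of the wrapped transformer as exposed by the Spark 2.x API (where OneHotEncoder is a plain transformer; later Spark versions replace it with an estimator), assuming a Dataset<Row> named indexed with a numeric column 'categoryIndex':

import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

OneHotEncoder encoder = new OneHotEncoder()
    .setInputCol("categoryIndex")  // Input Field
    .setOutputCol("categoryVec")   // Output Field
    .setDropLast(true);            // Drop Last Category

Dataset<Row> encoded = encoder.transform(indexed);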

QuantileDiscretizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("QuantileDiscretizer")
@Description("A transformation stage that leverages the Apache Spark ML Quantile Discretizer "
+ "to map continuous features of a certain input field onto binned categorical feature.")
public class QuantileDiscretizer extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Number of Buckets The number of buckets (quantiles, or categories) into which data points are grouped. Must be greater than or equal to 2. Default is 2.
Relative Error The relative target precision for the approximate quantile algorithm used to generate buckets. Must be in the range [0, 1]. Default is 0.001.
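
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named input with a hypothetical numeric column 'value'; fitting a QuantileDiscretizer yields a Bucketizer:

import org.apache.spark.ml.feature.Bucketizer;
import org.apache.spark.ml.feature.QuantileDiscretizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

QuantileDiscretizer discretizer = new QuantileDiscretizer()
    .setInputCol("value")      // Input Field
    .setOutputCol("bucket")    // Output Field
    .setNumBuckets(2)          // Number of Buckets
    .setRelativeError(0.001);  // Relative Error

Bucketizer bucketizer = discretizer.fit(input);
Dataset<Row> discretized = bucketizer.transform(input);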

Tokenizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Tokenizer")
@Description("A transformation stage that leverages the Apache Spark Regex "
+ "ML Tokenizer to split an input text into a sequence of tokens.")
public class Tokenizer extends FeatureCompute {

    ...

}

Parameters

Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.
Regex Pattern The regex pattern used to split the input text. The pattern matches delimiters if 'gaps' is true, or tokens if 'gaps' is false. Default is '\\s+'.
Token Length Minimum token length, greater than or equal to 0, to avoid returning empty strings. Default is 1.
Gaps Indicator to determine whether regex splits on gaps (true) or matches tokens (false). Default is 'true'.
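
A minimal sketch of the wrapped Spark ML RegexTokenizer, assuming a Dataset<Row> named input with a hypothetical string column 'text':

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

RegexTokenizer tokenizer = new RegexTokenizer()
    .setInputCol("text")     // Input Field
    .setOutputCol("tokens")  // Output Field
    .setPattern("\\s+")      // Regex Pattern
    .setGaps(true)           // Gaps
    .setMinTokenLength(1);   // Token Length

Dataset<Row> tokenized = tokenizer.transform(input);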

Text Vectorization

CountVecBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("CountVecBuilder")
@Description("A building stage for an Apache Spark ML CountVectorizer model.")
public class CountVecBuilder extends FeatureSink {

      ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Vocabulary Size The maximum size of the vocabulary. If this value is smaller than the total number of different terms, the vocabulary will contain the top terms ordered by term frequency across the corpus.
Minimum Document Frequency Specifies the minimum nonnegative number of different documents a term must appear in to be included in the vocabulary. Default is 1.
Minimum Term Frequency A filter to ignore rare words in a document. For each document, terms with a frequency (count) below the given threshold are ignored. Default is 1.
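
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with an array-of-strings column 'tokens':

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;

CountVectorizer vectorizer = new CountVectorizer()
    .setInputCol("tokens")  // Input Field
    .setOutputCol("counts")
    .setVocabSize(10000)    // Vocabulary Size
    .setMinDF(1.0)          // Minimum Document Frequency
    .setMinTF(1.0);         // Minimum Term Frequency

// The CountVec compute stage later applies model.transform(...).
CountVectorizerModel model = vectorizer.fit(training);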

CountVec

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("CountVec")
@Description("A transformation stage that leverages the Apache Spark ML CountVectorizer. "
+ "This stage requires a trained CountVectorizer model.")
public class CountVec extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

TFIDFBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("TFIDFBuilder")
@Description("A building stage for an Apache Spark ML TF-IDF feature model.")
public class TFIDFBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Number of Features The nonnegative number of features to transform a sequence of terms into.
Minimum Document Frequency The minimum number of documents in which a term should appear. Default is 0.
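
The parameters above suggest the usual Spark ML combination of HashingTF and IDF; a minimal sketch under that assumption, with a Dataset<Row> named training and an array-of-strings column 'tokens':

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

HashingTF hashingTF = new HashingTF()
    .setInputCol("tokens")
    .setOutputCol("rawFeatures")
    .setNumFeatures(1 << 18);  // Number of Features

Dataset<Row> featurized = hashingTF.transform(training);

IDF idf = new IDF()
    .setInputCol("rawFeatures")
    .setOutputCol("features")
    .setMinDocFreq(0);         // Minimum Document Frequency

// The TFIDF compute stage later applies model.transform(...).
IDFModel model = idf.fit(featurized);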

TFIDF

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TFIDF")
@Description("A transformation stage that leverages an Apache Spark ML TF-IDF model "
+ "to map a sequence of words into its feature vector. Term frequency-inverse "
+ "document frequency (TF-IDF) is a feature vectorization method widely used in "
+ "text mining to reflect the importance of a term to a document in the corpus.")
public class TFIDF extends FeatureCompute {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.

W2VecBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("W2VecBuilder")
@Description("A building stage for an Apache Spark ML Word2Vec feature model.")
public class W2VecBuilder extends FeatureSink {

    ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features to build the model from.
Model Configuration
Maximum Iterations The maximum number of iterations to train the Word-to-Vector model. Default is 1.
Learning Rate The learning rate used to train the Word-to-Vector model. Default is 0.025.
Vector Size The positive dimension of the feature vector to represent a certain word. Default is 100.
Window Size The positive window size. Default is 5.
Minimum Word Frequency The minimum number of times a word must appear to be included in the Word-to-Vector vocabulary. Default is 5.
Maximum Sentence Length The maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of this length. Default is 1000.
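
A minimal sketch of the wrapped Spark ML estimator, assuming a Dataset<Row> named training with an array-of-strings column 'tokens':

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;

Word2Vec word2Vec = new Word2Vec()
    .setInputCol("tokens")        // Input Field
    .setOutputCol("wordVectors")
    .setMaxIter(1)                // Maximum Iterations
    .setStepSize(0.025)           // Learning Rate
    .setVectorSize(100)           // Vector Size
    .setWindowSize(5)             // Window Size
    .setMinCount(5)               // Minimum Word Frequency
    .setMaxSentenceLength(1000);  // Maximum Sentence Length

// The W2Vec compute stage later averages word vectors per sentence via model.transform(...).
Word2VecModel model = word2Vec.fit(training);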

W2Vec

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("W2Vec")
@Description("A transformation stage that turns a sentence into a vector to represent "
+ "the whole sentence. The transform is performed by averaging all word vectors it "
+ "contains, based on a trained Word2Vec model.")
public class W2Vec extends FeatureCompute {

      ...

}

Parameters

Model Name The unique name of the feature model.
Input Field The name of the field in the input schema that contains the features.
Output Field The name of the field in the output schema that contains the transformed features.