N-Grams

Plugins

Token Ngrams

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenNgrams")
@Description("A transformation stage that leverages the Spark NLP N-gram generator to map an input "
  + "text field onto an output field that contains its associated N-grams.")
public class TokenNgrams extends TextCompute {

  ...

}

Parameters

Text Field The name of the field in the input schema that contains the text document.
N-Gram Field The name of the field in the output schema that contains the N-grams.
N-Gram Length Minimum n-gram length, greater than or equal to 1. Default is 2.
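
To make the N-Gram Length parameter concrete, here is a minimal plain-Java sketch of how n-grams can be derived from a token list. It assumes that the length is the number of consecutive tokens in each n-gram. `NgramSketch` and `ngrams` are hypothetical names for illustration and are not part of the plugin, which delegates this work to Spark NLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the plugin: builds n-grams from a token list. */
final class NgramSketch {

  /** Returns every run of n consecutive tokens, joined by a single space. */
  static List<String> ngrams(List<String> tokens, int n) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be >= 1");
    }
    List<String> result = new ArrayList<>();
    // A document with fewer than n tokens yields no n-grams.
    for (int i = 0; i + n <= tokens.size(); i++) {
      result.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
    // With the default length of 2 this prints the bigrams:
    System.out.println(ngrams(tokens, 2)); // [the quick, quick brown, brown fox]
  }
}
```

With a length of 1 the output is simply the token list itself.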

Vectorization

Plugins

Word2VecBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("Word2VecBuilder")
@Description("A building stage for an Apache Spark-NLP based Word2Vec embedding model.")
public class Word2VecBuilder extends TextSink {

    ...

}

Parameters

Model Name The unique name of the Word2Vec embeddings model.
Text Field The name of the field in the input schema that contains the text document.
Normalization Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'.
Model Configuration
Maximum Iterations The maximum number of iterations to train the Word2Vec model. Default is 1.
Learning Rate The initial learning rate (step size) used to train the Word2Vec model. Must be in the interval (0, 1]. Default is 0.025.
Vector Size The dimension of the vector that represents each word. Must be positive. Default is 100.
Window Size The size of the context window, i.e. the number of words considered before and after the current word. Must be positive. Default is 5.
Minimum Word Frequency The minimum number of times a word must appear to be included in the Word-to-Vector vocabulary. Default is 5.
Maximum Sentence Length The maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of this length. Default is 1000.
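
The effect of Maximum Sentence Length and Minimum Word Frequency can be illustrated with a small plain-Java sketch. It only mirrors the behavior described in the parameter list: long sentences are split into chunks, and rare words are excluded from the vocabulary. `Word2VecPrepSketch`, `chunk` and `vocabulary` are hypothetical names, and the actual training is performed by the underlying Spark library, not by code like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of the plugin. */
final class Word2VecPrepSketch {

  /** Splits a tokenized sentence into consecutive chunks of at most maxLength words. */
  static List<List<String>> chunk(List<String> words, int maxLength) {
    if (maxLength < 1) {
      throw new IllegalArgumentException("maxLength must be >= 1");
    }
    List<List<String>> chunks = new ArrayList<>();
    for (int start = 0; start < words.size(); start += maxLength) {
      int end = Math.min(start + maxLength, words.size());
      chunks.add(new ArrayList<>(words.subList(start, end)));
    }
    return chunks;
  }

  /** Returns the words that occur at least minCount times across all sentences. */
  static Set<String> vocabulary(List<List<String>> sentences, int minCount) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> sentence : sentences) {
      for (String word : sentence) {
        counts.merge(word, 1, Integer::sum);
      }
    }
    Set<String> vocab = new TreeSet<>(); // sorted for stable output
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() >= minCount) {
        vocab.add(entry.getKey());
      }
    }
    return vocab;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "a", "c", "a");
    List<List<String>> chunks = chunk(words, 2);
    System.out.println(chunks);                // [[a, b], [a, c], [a]]
    System.out.println(vocabulary(chunks, 2)); // [a]
  }
}
```

With the defaults of 1000 and 5, only very long sentences are split, and a word needs at least five occurrences in the corpus to receive an embedding.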

Word2Vec

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Word2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
  + "text field onto an output token & word embedding field.")
public class Word2Vec extends TextCompute {

    ...

}

Parameters

Model Name The unique name of the Word2Vec embeddings model.
Text Field The name of the field in the input schema that contains the text document.
Token Field The name of the field in the output schema that contains the extracted tokens.
Embedding Field The name of the field in the output schema that contains the word embeddings.
Normalization Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'.
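
The Normalization setting can be pictured as a character filter over each token. The following plain-Java sketch removes every character outside [A-Za-z0-9-]. It is an approximation under stated assumptions: `NormalizationSketch` is a hypothetical name, and dropping tokens that end up empty is an assumption, not documented plugin behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the plugin. */
final class NormalizationSketch {

  // Matches every run of characters outside the allowed set [A-Za-z0-9-].
  private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9-]+");

  /** Strips disallowed characters from each token. */
  static List<String> normalize(List<String> tokens) {
    List<String> result = new ArrayList<>();
    for (String token : tokens) {
      String cleaned = DISALLOWED.matcher(token).replaceAll("");
      // Assumption: tokens that are empty after cleanup are dropped.
      if (!cleaned.isEmpty()) {
        result.add(cleaned);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("Hello,", "world!", "state-of-the-art", "(2020)", "...");
    System.out.println(normalize(tokens)); // [Hello, world, state-of-the-art, 2020]
  }
}
```

Because the model is trained on tokens, applying the same Normalization value at training time (Word2VecBuilder) and at embedding time is likely to give the best vocabulary coverage.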

Sent2Vec

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Sent2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
  + "text field onto an output sentence & sentence embedding field with a user-specific "
  + "pooling strategy.")
public class Sent2Vec extends TextCompute {

    ...

}

Parameters

Model Name The unique name of the Word2Vec embeddings model.
Text Field The name of the field in the input schema that contains the text document.
Sentence Field The name of the field in the output schema that contains the extracted sentences.
Embedding Field The name of the field in the output schema that contains the sentence embeddings.
Pooling Strategy The strategy used to merge word embeddings into sentence embeddings. Supported values are 'average' and 'sum'. Default is 'average'.
Normalization Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'.
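
The two pooling strategies differ only in a final division. Below is a minimal plain-Java sketch of merging the word vectors of one sentence into a single sentence vector. `PoolingSketch` and `pool` are hypothetical names for illustration. The plugin itself relies on Spark NLP for this step.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the plugin. */
final class PoolingSketch {

  /** Merges word vectors into one sentence vector using 'average' or 'sum'. */
  static double[] pool(List<double[]> wordVectors, String strategy) {
    if (wordVectors.isEmpty()) {
      throw new IllegalArgumentException("at least one word vector is required");
    }
    if (!strategy.equals("average") && !strategy.equals("sum")) {
      throw new IllegalArgumentException("unsupported pooling strategy: " + strategy);
    }
    int size = wordVectors.get(0).length;
    double[] pooled = new double[size];
    // Element-wise sum over all word vectors of the sentence.
    for (double[] vector : wordVectors) {
      if (vector.length != size) {
        throw new IllegalArgumentException("all word vectors must have the same size");
      }
      for (int i = 0; i < size; i++) {
        pooled[i] += vector[i];
      }
    }
    // 'average' additionally divides by the number of words.
    if (strategy.equals("average")) {
      for (int i = 0; i < size; i++) {
        pooled[i] /= wordVectors.size();
      }
    }
    return pooled;
  }

  public static void main(String[] args) {
    List<double[]> vectors = Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 6.0});
    System.out.println(Arrays.toString(pool(vectors, "sum")));     // [4.0, 8.0]
    System.out.println(Arrays.toString(pool(vectors, "average"))); // [2.0, 4.0]
  }
}
```

Averaging keeps the magnitude of the sentence vector independent of sentence length, whereas summing lets longer sentences produce larger vectors.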