N-Grams
Plugins
Token Ngrams
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenNgrams")
@Description("A transformation stage that leverages the Spark NLP N-gram generator to map an input "
+ "text field onto an output field that contains its associated N-grams.")
public class TokenNgrams extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Text Field | The name of the field in the input schema that contains the text document. |
N-Gram Field | The name of the field in the output schema that contains the N-grams. |
N-Gram Length | The length (in tokens) of each generated N-gram. Must be greater than or equal to 1. Default is 2. |
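Under the hood, the stage tokenizes the text field and turns consecutive tokens into N-grams. The sketch below illustrates the equivalent transformation with Apache Spark ML's NGram feature transformer; it is a hedged illustration rather than the plugin's actual implementation, and the column names and toy input are assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TokenNgramsSketch {

  public static void main(String[] args) {

    SparkSession spark = SparkSession.builder()
        .appName("TokenNgramsSketch").master("local[*]").getOrCreate();

    /* A single tokenized document; the plugin tokenizes the configured 'Text Field' first. */
    StructType schema = new StructType(new StructField[] {
        new StructField("tokens", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });
    List<Row> rows = Arrays.asList(
        RowFactory.create(Arrays.asList("spark", "nlp", "ngram", "generator")));

    Dataset<Row> input = spark.createDataFrame(rows, schema);

    /* 'N-Gram Length' corresponds to the transformer's `n` parameter (default 2). */
    NGram ngram = new NGram()
        .setN(2)
        .setInputCol("tokens")   // the (tokenized) 'Text Field'
        .setOutputCol("ngrams"); // the 'N-Gram Field'

    /* Yields ["spark nlp", "nlp ngram", "ngram generator"] in the 'ngrams' column. */
    ngram.transform(input).show(false);

    spark.stop();
  }
}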
Vectorization
Plugins
Word2VecBuilder
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("Word2VecBuilder")
@Description("A building stage for an Apache Spark-NLP based Word2Vec embedding model.")
public class Word2VecBuilder extends TextSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
Model Configuration
Parameter | Description |
--- | --- |
Maximum Iterations | The maximum number of iterations used to train the Word2Vec model. Default is 1. |
Learning Rate | The learning rate (step size) used in each training iteration. Must be in the interval (0, 1]. Default is 0.025. |
Vector Size | The positive dimension of the feature vector that represents a word. Default is 100. |
Window Size | The positive window size. Default is 5. |
Minimum Word Frequency | The minimum number of times a word must appear to be included in the Word2Vec vocabulary. Default is 5. |
Maximum Sentence Length | The maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of this length. Default is 1000. |
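The model configuration above maps onto the hyperparameters of Spark ML's Word2Vec estimator. The sketch below shows how such a model could be trained and persisted; it is not the plugin code, and the toy corpus, column names and model path are assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Word2VecBuilderSketch {

  public static void main(String[] args) throws Exception {

    SparkSession spark = SparkSession.builder()
        .appName("Word2VecBuilderSketch").master("local[*]").getOrCreate();

    /* Tokenized (and optionally normalized) documents derived from the 'Text Field'. */
    StructType schema = new StructType(new StructField[] {
        new StructField("tokens", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });
    List<Row> corpus = Arrays.asList(
        RowFactory.create(Arrays.asList("spark", "learns", "word", "embeddings")),
        RowFactory.create(Arrays.asList("word2vec", "maps", "words", "onto", "vectors")));

    Dataset<Row> documents = spark.createDataFrame(corpus, schema);

    Word2Vec word2vec = new Word2Vec()
        .setInputCol("tokens")
        .setOutputCol("embedding")
        .setMaxIter(1)               // Maximum Iterations (default 1)
        .setStepSize(0.025)          // Learning Rate (default 0.025)
        .setVectorSize(100)          // Vector Size (default 100)
        .setWindowSize(5)            // Window Size (default 5)
        .setMinCount(1)              // Minimum Word Frequency; default is 5, lowered here so the toy corpus is not filtered out
        .setMaxSentenceLength(1000); // Maximum Sentence Length (default 1000)

    Word2VecModel model = word2vec.fit(documents);

    /* The builder persists the fitted model under the configured 'Model Name';
     * the path below is a placeholder. */
    model.write().overwrite().save("/tmp/models/word2vec/my-model");

    spark.stop();
  }
}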
Word2Vec
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Word2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
+ "text field onto an output token & word embedding field.")
public class Word2Vec extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Token Field | The name of the field in the output schema that contains the extracted tokens. |
Embedding Field | The name of the field in the output schema that contains the word embeddings. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
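Conceptually, this stage looks every extracted token up in the vocabulary of the trained model. The sketch below shows such a lookup with Spark ML's Word2VecModel; the model path, column names and sample tokens are assumptions, and the plugin's actual implementation may differ.

import java.util.Arrays;

import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Word2VecApplySketch {

  public static void main(String[] args) {

    SparkSession spark = SparkSession.builder()
        .appName("Word2VecApplySketch").master("local[*]").getOrCreate();

    /* Load the model persisted by Word2VecBuilder; the path is a placeholder. */
    Word2VecModel model = Word2VecModel.load("/tmp/models/word2vec/my-model");

    /* Tokens extracted from the 'Text Field' (one row per token for this illustration). */
    Dataset<Row> tokens = spark
        .createDataset(Arrays.asList("spark", "learns", "word", "embeddings"), Encoders.STRING())
        .toDF("token");

    /* Per-token embeddings: join the tokens with the model's word/vector vocabulary. */
    Dataset<Row> embeddings = tokens.join(
        model.getVectors().withColumnRenamed("word", "token"), "token");

    embeddings.show(false);

    spark.stop();
  }
}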
Sent2Vec
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Sent2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
+ "text field onto an output sentence & sentence embedding field with a user-specific "
+ "pooling strategy.")
public class Sent2Vec extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Sentence Field | The name of the field in the output schema that contains the extracted sentences. |
Embedding Field | The name of the field in the output schema that contains the sentence embeddings. |
Pooling Strategy | The pooling strategy used to merge word embeddings into sentence embeddings. Supported values are 'average' and 'sum'. Default is 'average'. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
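The pooling strategy decides how the word embeddings of a sentence are merged into a single sentence embedding. The sketch below illustrates 'sum' and 'average' pooling on plain Java arrays; the vectors are made-up values and the code is not the plugin's implementation.

import java.util.Arrays;

public class PoolingSketch {

  public static void main(String[] args) {

    /* Hypothetical word embeddings for a three-token sentence (vector size 4 for brevity). */
    double[][] wordEmbeddings = {
        { 0.1, 0.2, 0.3, 0.4 },
        { 0.5, 0.6, 0.7, 0.8 },
        { 0.9, 1.0, 1.1, 1.2 }
    };

    int size = wordEmbeddings[0].length;

    /* 'sum' pooling: element-wise sum of all word embeddings. */
    double[] sumPooled = new double[size];
    for (double[] vector : wordEmbeddings) {
      for (int i = 0; i < size; i++) sumPooled[i] += vector[i];
    }

    /* 'average' pooling (the default): divide the sum by the number of tokens. */
    double[] avgPooled = new double[size];
    for (int i = 0; i < size; i++) avgPooled[i] = sumPooled[i] / wordEmbeddings.length;

    System.out.println("sum     : " + Arrays.toString(sumPooled));
    System.out.println("average : " + Arrays.toString(avgPooled));
  }
}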