N-Grams
Plugins
Token Ngrams
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenNgrams")
@Description("A transformation stage that leverages the Spark NLP N-gram generator to map an input "
+ "text field onto an output field that contains its associated N-grams.")
public class TokenNgrams extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Text Field | The name of the field in the input schema that contains the text document. |
N-Gram Field | The name of the field in the output schema that contains the N-grams. |
N-Gram Length | The length (in tokens) of each generated N-gram. Must be greater than or equal to 1. Default is 2. |
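Under the hood, the stage tokenizes the text field and turns consecutive tokens into N-grams. The sketch below illustrates the equivalent transformation with Apache Spark ML's NGram feature transformer; it is a hedged illustration rather than the plugin's actual implementation, and the column names and toy input are assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TokenNgramsSketch {

  public static void main(String[] args) {

    SparkSession spark = SparkSession.builder()
        .appName("TokenNgramsSketch").master("local[*]").getOrCreate();

    /* A single tokenized document; the plugin tokenizes the configured 'Text Field' first. */
    StructType schema = new StructType(new StructField[] {
        new StructField("tokens", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });
    List<Row> rows = Arrays.asList(
        RowFactory.create(Arrays.asList("spark", "nlp", "ngram", "generator")));

    Dataset<Row> input = spark.createDataFrame(rows, schema);

    /* 'N-Gram Length' corresponds to the transformer's `n` parameter (default 2). */
    NGram ngram = new NGram()
        .setN(2)
        .setInputCol("tokens")   // the (tokenized) 'Text Field'
        .setOutputCol("ngrams"); // the 'N-Gram Field'

    /* Yields ["spark nlp", "nlp ngram", "ngram generator"] in the 'ngrams' column. */
    ngram.transform(input).show(false);

    spark.stop();
  }
}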
Vectorization
Plugins
Word2VecBuilder
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("Word2VecBuilder")
@Description("A building stage for an Apache Spark-NLP based Word2Vec embedding model.")
public class Word2VecBuilder extends TextSink {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
Model Configuration
Parameter | Description |
--- | --- |
Maximum Iterations | The maximum number of iterations used to train the Word2Vec model. Default is 1. |
Learning Rate | The learning rate (step size) used in each training iteration. Must be in the interval (0, 1]. Default is 0.025. |
Vector Size | The positive dimension of the feature vector that represents a word. Default is 100. |
Window Size | The positive window size. Default is 5. |
Minimum Word Frequency | The minimum number of times a word must appear to be included in the Word2Vec vocabulary. Default is 5. |
Maximum Sentence Length | The maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of this length. Default is 1000. |
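The model configuration above maps onto the hyperparameters of Spark ML's Word2Vec estimator. The sketch below shows how such a model could be trained and persisted; it is not the plugin code, and the toy corpus, column names and model path are assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Word2VecBuilderSketch {

  public static void main(String[] args) throws Exception {

    SparkSession spark = SparkSession.builder()
        .appName("Word2VecBuilderSketch").master("local[*]").getOrCreate();

    /* Tokenized (and optionally normalized) documents derived from the 'Text Field'. */
    StructType schema = new StructType(new StructField[] {
        new StructField("tokens", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
    });
    List<Row> corpus = Arrays.asList(
        RowFactory.create(Arrays.asList("spark", "learns", "word", "embeddings")),
        RowFactory.create(Arrays.asList("word2vec", "maps", "words", "onto", "vectors")));

    Dataset<Row> documents = spark.createDataFrame(corpus, schema);

    Word2Vec word2vec = new Word2Vec()
        .setInputCol("tokens")
        .setOutputCol("embedding")
        .setMaxIter(1)               // Maximum Iterations (default 1)
        .setStepSize(0.025)          // Learning Rate (default 0.025)
        .setVectorSize(100)          // Vector Size (default 100)
        .setWindowSize(5)            // Window Size (default 5)
        .setMinCount(1)              // Minimum Word Frequency; default is 5, lowered here so the toy corpus is not filtered out
        .setMaxSentenceLength(1000); // Maximum Sentence Length (default 1000)

    Word2VecModel model = word2vec.fit(documents);

    /* The builder persists the fitted model under the configured 'Model Name';
     * the path below is a placeholder. */
    model.write().overwrite().save("/tmp/models/word2vec/my-model");

    spark.stop();
  }
}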
Word2Vec
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Word2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
+ "text field onto an output token & word embedding field.")
public class Word2Vec extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Token Field | The name of the field in the output schema that contains the extracted tokens. |
Embedding Field | The name of the field in the output schema that contains the word embeddings. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
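Conceptually, this stage looks every extracted token up in the vocabulary of the trained model. The sketch below shows such a lookup with Spark ML's Word2VecModel; the model path, column names and sample tokens are assumptions, and the plugin's actual implementation may differ.

import java.util.Arrays;

import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Word2VecApplySketch {

  public static void main(String[] args) {

    SparkSession spark = SparkSession.builder()
        .appName("Word2VecApplySketch").master("local[*]").getOrCreate();

    /* Load the model persisted by Word2VecBuilder; the path is a placeholder. */
    Word2VecModel model = Word2VecModel.load("/tmp/models/word2vec/my-model");

    /* Tokens extracted from the 'Text Field' (one row per token for this illustration). */
    Dataset<Row> tokens = spark
        .createDataset(Arrays.asList("spark", "learns", "word", "embeddings"), Encoders.STRING())
        .toDF("token");

    /* Per-token embeddings: join the tokens with the model's word/vector vocabulary. */
    Dataset<Row> embeddings = tokens.join(
        model.getVectors().withColumnRenamed("word", "token"), "token");

    embeddings.show(false);

    spark.stop();
  }
}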
Sent2Vec
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Sent2Vec")
@Description("An embedding stage that leverages a trained Word2Vec model to map an input "
+ "text field onto an output sentence & sentence embedding field with a user-specific "
+ "pooling strategy.")
public class Sent2Vec extends TextCompute {
...
}
Parameters
Parameter | Description |
--- | --- |
Model Name | The unique name of the Word2Vec embeddings model. |
Text Field | The name of the field in the input schema that contains the text document. |
Sentence Field | The name of the field in the output schema that contains the extracted sentences. |
Embedding Field | The name of the field in the output schema that contains the sentence embeddings. |
Pooling Strategy | The pooling strategy used to merge word embeddings into sentence embeddings. Supported values are 'average' and 'sum'. Default is 'average'. |
Normalization | Indicates whether token normalization is applied. Normalization restricts the characters of a token to [A-Za-z0-9-]. Supported values are 'true' and 'false'. Default is 'true'. |
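The pooling strategy decides how the word embeddings of a sentence are merged into a single sentence embedding. The sketch below illustrates 'sum' and 'average' pooling on plain Java arrays; the vectors are made-up values and the code is not the plugin's implementation.

import java.util.Arrays;

public class PoolingSketch {

  public static void main(String[] args) {

    /* Hypothetical word embeddings for a three-token sentence (vector size 4 for brevity). */
    double[][] wordEmbeddings = {
        { 0.1, 0.2, 0.3, 0.4 },
        { 0.5, 0.6, 0.7, 0.8 },
        { 0.9, 1.0, 1.1, 1.2 }
    };

    int size = wordEmbeddings[0].length;

    /* 'sum' pooling: element-wise sum of all word embeddings. */
    double[] sumPooled = new double[size];
    for (double[] vector : wordEmbeddings) {
      for (int i = 0; i < size; i++) sumPooled[i] += vector[i];
    }

    /* 'average' pooling (the default): divide the sum by the number of tokens. */
    double[] avgPooled = new double[size];
    for (int i = 0; i < size; i++) avgPooled[i] = sumPooled[i] / wordEmbeddings.length;

    System.out.println("sum     : " + Arrays.toString(sumPooled));
    System.out.println("average : " + Arrays.toString(avgPooled));
  }
}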