Any text that is not relevant to the context of the data can be considered noise. Noise includes frequently used but uninformative words of a language, misspellings, and multiple surface variations of a word that all reduce to the same meaning.
Noise reduction is therefore an important pre-processing phase for any kind of text analysis.
Multiple variations of a single word, say “plays”, “played” and “playing”, are contextually similar. A lemmatizer converts these variations into their normalized form (also known as the lemma) and thereby serves as an instrument of noise reduction and text standardization.
Lemmatization is based on a trained lemma model. Each line of the training corpus assigns a lemma to the term variations that map onto it, for example:
monorail -> monorail monorails
monosaccharide -> monosaccharides monosaccharide
monosaturate -> monosaturates monosaturate
monospar -> monospar monospars
monosyllable -> monosyllables monosyllable
monotheist -> monotheist monotheists
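As a minimal sketch of how a corpus in this format could be read, the Java snippet below builds a token-to-lemma lookup from such lines. The class `LemmaCorpusParser` and its `parse` method are hypothetical names used for illustration and are not part of the plugin; the delimiters `->` and a single space are taken from the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper (not the plugin's code): turns corpus lines such as
// "monorail -> monorail monorails" into a token -> lemma lookup table.
final class LemmaCorpusParser {

    static Map<String, String> parse(Iterable<String> lines,
                                     String lemmaDelimiter,
                                     String tokenDelimiter) {
        // The two delimiters must differ, otherwise a line cannot be split unambiguously.
        if (lemmaDelimiter.equals(tokenDelimiter)) {
            throw new IllegalArgumentException("Lemma and token delimiters must be different");
        }
        Map<String, String> tokenToLemma = new LinkedHashMap<>();
        for (String line : lines) {
            // Split once into the lemma (left) and its variations (right).
            String[] parts = line.split(Pattern.quote(lemmaDelimiter), 2);
            if (parts.length != 2) {
                continue; // skip lines without a lemma delimiter
            }
            String lemma = parts[0].trim();
            for (String raw : parts[1].trim().split(Pattern.quote(tokenDelimiter))) {
                String token = raw.trim();
                if (!token.isEmpty()) {
                    tokenToLemma.put(token, lemma);
                }
            }
        }
        return tokenToLemma;
    }
}
```

Calling `LemmaCorpusParser.parse(lines, "->", " ")` on the sample maps both `monorail` and `monorails` onto `monorail`.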
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Description("A building stage for a Lemmatization model. The training corpus assigns "
+ "each lemma to a set of term variations that all map onto this lemma.")
public class LemmatizerBuilder extends TextSink {
Model Name | The unique name of the Lemmatization model. |
Corpus Field | The name of the field in the input schema that contains the lemma and assigned tokens. |
Lemma Delimiter | The delimiter that separates a lemma from its associated tokens in the corpus. It must differ from the token delimiter. |
Token Delimiter | The delimiter that separates the individual tokens in the corpus. It must differ from the lemma delimiter. |
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Description("A transformation stage that requires a trained Lemmatization model. It extracts "
+ "normalized terms from a text document and maps each term onto its trained lemma. "
+ "This stage adds an extra field to the input schema that contains the whitespace "
+ "separated set of lemmas.")
Model Name | The unique name of the Lemmatization model. |
Text Field | The name of the field in the input schema that contains the text document. |
Lemma Field | The name of the field in the output schema that contains the detected lemmas. |
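The transformation described above can be pictured as a dictionary lookup per term. The following Java sketch is an illustration only, not the Spark NLP implementation: `LemmaLookup` is a hypothetical class, the tokenization (lower-casing and splitting on non-letters) is a simplifying assumption, and terms without a trained lemma are assumed to pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: maps every term of a text onto its lemma and
// returns the lemmas as one whitespace-separated string.
final class LemmaLookup {

    static String lemmatize(String text, Map<String, String> tokenToLemma) {
        List<String> lemmas = new ArrayList<>();
        // Assumed normalization: lower-case, split on anything that is not a letter.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            // Assumption: unknown terms are kept as they are.
            lemmas.add(tokenToLemma.getOrDefault(term, term));
        }
        return String.join(" ", lemmas);
    }
}
```

The returned string corresponds to the content of the Lemma Field in the output schema.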
Spell Correction
The training corpus of the Spell Checking model is an Apache Spark dataset with a single text column. Each row of this column contains one or more correctly spelled terms.
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Description("A building stage for a Spell Checking model based on Norvig's algorithm. "
+ "The training corpus provides correctly spelled terms with one or multiple terms per "
+ "document.")
public class NorvigBuilder extends TextSink {
Model Name | The unique name of the Spell Checking model. |
Corpus Field | The name of the field in the input schema that contains the correctly spelled tokens. |
Token Delimiter | The delimiter to separate the tokens in the corpus. |
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Description("A transformation stage that checks the spelling of each normalized term "
+ "in a text document, leveraging a trained Norvig Spelling model. This stage adds an "
+ "extra field to the input schema that contains the whitespace separated set of suggested "
+ "spelling corrections.")
public class NorvigChecker extends TextCompute {
Model Name | The unique name of the Spell Checking model. |
Text Field | The name of the field in the input schema that contains the text document. |
Suggestion Field | The name of the field in the output schema that contains the suggested spelling corrections. |
Probability Threshold | The probability threshold above which a suggested term spelling is accepted. Default is 0.75. |
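To give an intuition for Norvig-style spell checking, here is a simplified Java sketch. It is not the Spark NLP implementation: `NorvigSketch` is a hypothetical class, candidates are limited to edit distance 1 over the lowercase Latin alphabet, the corpus is assumed to be whitespace-delimited, and the probability threshold is not modeled. The idea is to count term frequencies in the correctly spelled corpus and, for an unknown word, pick the most frequent known word that is one edit away.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified illustration of Norvig's approach (edit distance 1 only).
final class NorvigSketch {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private final Map<String, Integer> counts = new HashMap<>();

    // "Training": count how often each correctly spelled term occurs.
    void train(Iterable<String> documents) {
        for (String doc : documents) {
            for (String term : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
    }

    // All strings one edit away: deletions, transpositions, replacements, insertions.
    static Set<String> edits1(String w) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= w.length(); i++) {
            String left = w.substring(0, i);
            String right = w.substring(i);
            if (!right.isEmpty()) {
                out.add(left + right.substring(1));
            }
            if (right.length() > 1) {
                out.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                if (!right.isEmpty()) {
                    out.add(left + c + right.substring(1));
                }
                out.add(left + c + right);
            }
        }
        return out;
    }

    // Known words are returned as-is; otherwise the most frequent known candidate wins.
    String correct(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (counts.containsKey(w)) {
            return w;
        }
        String best = w;
        int bestCount = 0;
        for (String cand : edits1(w)) {
            int c = counts.getOrDefault(cand, 0);
            // Ties are broken alphabetically to keep the result deterministic.
            if (c > bestCount || (c == bestCount && c > 0 && cand.compareTo(best) < 0)) {
                best = cand;
                bestCount = c;
            }
        }
        return best;
    }
}
```

Norvig's original formulation also considers candidates at edit distance 2; that step is left out here to keep the sketch short.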
Stemming
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Description("A transformation stage that leverages the Spark NLP Stemmer to map an input "
+ "text field onto its normalized terms and reduce each term to its linguistic stem. "
+ "This stage adds an extra field to the input schema that contains the whitespace "
+ "separated set of stems.")
public class TokenStemmer extends TextCompute {
Text Field | The name of the field in the input schema that contains the text document. |
Stem Field | The name of the field in the output schema that contains the stemmed tokens. |
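Stemming reduces a term to its stem by rule-based suffix removal rather than by dictionary lookup. The Java sketch below is a deliberately naive illustration of that idea: `SuffixStemmerSketch` and its short suffix list are assumptions made for this example, and the behavior will differ from the algorithm the Spark NLP Stemmer actually applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, naive suffix stripper; NOT the algorithm used by Spark NLP.
final class SuffixStemmerSketch {
    // Assumed suffix list, checked in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Only strip when at least three characters remain.
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 3) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Returns the whitespace-separated stems of all terms in the text.
    static String stemText(String text) {
        List<String> stems = new ArrayList<>();
        for (String term : text.split("[^\\p{L}]+")) {
            if (!term.isEmpty()) {
                stems.add(stem(term));
            }
        }
        return String.join(" ", stems);
    }
}
```

Unlike a lemmatizer, a stemmer of this kind needs no training corpus, which is why the stage has no Model Name parameter.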
Stopword Removal
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Description("A transformation stage that leverages the Spark NLP Stopword Cleaner to map an input "
+ "text field onto its normalized terms and remove each term that is defined as a stop word. "
+ "This stage adds an extra field to the input schema that contains the whitespace "
+ "separated set of remaining tokens.")
public class TokenCleaner extends TextCompute {
Text Field | The name of the field in the input schema that contains the text document. |
Cleaned Field | The name of the field in the output schema that contains the cleaned tokens. |
Stop Words | A delimiter-separated list of stop words, i.e. words that are removed from the extracted tokens. |
Word Delimiter | The delimiter used to separate the stop words in the list. Default is a comma. |
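The effect of the Stop Words and Word Delimiter parameters can be sketched in plain Java as follows. This is an illustration under assumptions, not the Spark NLP Stopword Cleaner: `StopwordFilterSketch` is a hypothetical class, matching is assumed to be case-insensitive, and the text is tokenized by splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration: removes stop words from a text and returns
// the remaining tokens as one whitespace-separated string.
final class StopwordFilterSketch {

    static String clean(String text, String stopWords, String wordDelimiter) {
        // Parse the delimiter-separated stop word list into a set.
        Set<String> stop = new HashSet<>();
        for (String raw : stopWords.split(Pattern.quote(wordDelimiter))) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stop.add(word);
            }
        }
        // Keep every term that is not a stop word (assumed case-insensitive).
        List<String> kept = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!term.isEmpty() && !stop.contains(term)) {
                kept.add(term);
            }
        }
        return String.join(" ", kept);
    }
}
```

For example, `StopwordFilterSketch.clean("The monorail is a train", "the, is, a", ",")` keeps only `monorail` and `train`.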