Any piece of text that is not relevant to the context of the data can be regarded as noise. Noise comprises commonly used but less relevant words of a language, misspellings, and multiple variations of word representation that all reduce to the same semantic context.

Noise reduction is an important pre-processing phase for any kind of text analysis.

Lemmatization

Multiple variations of a single word, say “player”, “played” and “playing”, are contextually similar. A lemmatizer converts all these variants into their normalized form (also known as the lemma) and is thereby an instrument of noise reduction and text standardization.

Corpus

Lemmatization is based on a trained Lemmatization model. The respective training corpus can be formatted as shown below:

monorail	->	monorail	monorails
monosaccharide	->	monosaccharides	monosaccharide
monosaturate	->	monosaturates	monosaturate
monospar	->	monospar	monospars
monosyllable	->	monosyllables	monosyllable
monotheist	->	monotheist	monotheists
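
The following sketch, a minimal illustration in plain Java, shows how a single corpus line of this format can be split into its lemma and the assigned token variations. It assumes "->" as the lemma delimiter and whitespace as the token delimiter, mirroring the example above; both are configurable via the builder parameters described below.

import java.util.Arrays;
import java.util.List;

public class LemmaCorpusLine {

    public static void main(String[] args) {

        /* Hypothetical corpus line: "->" separates the lemma (key) from its
         * token variations (value), which are themselves whitespace separated. */
        String line = "monosaccharide\t->\tmonosaccharides\tmonosaccharide";

        String[] parts = line.split("->");
        String lemma = parts[0].trim();

        /* The remaining tokens are the term variations that all map onto the lemma */
        List<String> tokens = Arrays.asList(parts[1].trim().split("\\s+"));

        System.out.println(lemma + " <- " + tokens);
    }
}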

Plugins

LemmatizerBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LemmatizerBuilder")
@Description("A building stage for a Lemmatization model. The training corpus assigns "
  + "each lemma to a set of term variations that all map onto this lemma.")
public class LemmatizerBuilder extends TextSink {

    ...

}

Parameters

Model Name: The unique name of the Lemmatization model.
Corpus Field: The name of the field in the input schema that contains the lemma and its assigned tokens.
Lemma Delimiter: The delimiter that separates the lemma from its associated tokens in the corpus. It must differ from the token delimiter.
Token Delimiter: The delimiter that separates the tokens in the corpus. It must differ from the lemma delimiter.

Lemmatizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Lemmatizer")
@Description("A transformation stage requires a trained Lemmatization model. It extracts "
  + "normalized terms from a text document and maps each term onto its trained lemma. "
  + "This stage adds an extra field to the input schema that contains the whitespace "
  + "separated set of lemmas.")

    ...

}

Parameters

Model Name: The unique name of the Lemmatization model.
Text Field: The name of the field in the input schema that contains the text document.
Lemma Field: The name of the field in the output schema that contains the detected lemmas.
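
To illustrate what this stage does conceptually, the plain Java sketch below maps the normalized terms of a text document onto their lemmas with a simple lookup and joins the result into the whitespace separated string that would populate the Lemma Field. The lookup entries are hypothetical and borrowed from the corpus example above; the actual stage applies the trained model to a Spark dataset.

import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

public class LemmaMapping {

    public static void main(String[] args) {

        /* Hypothetical lookup derived from the training corpus: token variation -> lemma */
        Map<String, String> lemmas = new HashMap<>();
        lemmas.put("monosaccharide", "monosaccharide");
        lemmas.put("monosaccharides", "monosaccharide");

        String document = "Glucose is a monosaccharide and maltose consists of two monosaccharides";

        /* Normalize terms (here simply lower-cased) and map each term onto its lemma */
        StringJoiner joiner = new StringJoiner(" ");
        for (String term : document.toLowerCase().split("\\s+")) {
            joiner.add(lemmas.getOrDefault(term, term));
        }

        /* The whitespace separated lemmas that would populate the Lemma Field */
        System.out.println(joiner.toString());
    }
}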

Spell Correction

Corpus

The training corpus of the Spell Checking model is an Apache Spark dataset with a single text column. Each row of this column comprises one or more correctly spelled terms.

Plugins

NorvigBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("NorvigBuilder")
@Description("A building stage for a Spell Checking model based on Norvig's algorithm. "
  + "The training corpus provides correctly spelled terms with one or multiple terms per "
  + "document.")
public class NorvigBuilder extends TextSink {

    ...

}

Parameters

Model Name: The unique name of the Spell Checking model.
Corpus Field: The name of the field in the input schema that contains the correctly spelled tokens.
Token Delimiter: The delimiter to separate the tokens in the corpus.
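
As a rough indication of what the builder derives from such a corpus, the plain Java sketch below counts how often each correctly spelled term occurs. The corpus rows and the whitespace token delimiter are assumptions for illustration; the actual stage operates on a Spark dataset and persists the result as a Spell Checking model.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SpellCorpusFrequencies {

    public static void main(String[] args) {

        /* Hypothetical corpus rows, each holding one or more correctly spelled terms */
        List<String> rows = Arrays.asList(
            "monorail monosaccharide monospar",
            "monosyllable monotheist monosaccharide");

        /* Term frequencies are the statistical backbone of Norvig's algorithm */
        Map<String, Integer> frequencies = new HashMap<>();
        for (String row : rows) {
            for (String token : row.split("\\s+")) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }

        System.out.println(frequencies);
    }
}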

NorvigChecker

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("NorvigChecker")
@Description("A transformation stage that checks the spelling of each normalized term "
  + "in a text document, leveraging a trained Norvig Spelling model. This stage adds an "
  + "extra field to the input schema that contains the whitespace separated set of suggested "
  + "spelling corrections.")
public class NorvigChecker extends TextCompute {

    ...

}

Parameters

Model Name: The unique name of the Spell Checking model.
Text Field: The name of the field in the input schema that contains the text document.
Suggestion Field: The name of the field in the output schema that contains the suggested spelling corrections.
Probability Threshold: The probability threshold above which a suggested term spelling is accepted. Default is 0.75.
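
To clarify how the probability threshold is applied, the plain Java sketch below implements a strongly simplified variant of Norvig's idea: it generates all candidates that are a single edit (delete, transpose, replace, insert) away from an unknown term, picks the most frequent candidate from the trained frequency dictionary and accepts it only if its relative frequency exceeds the threshold. The frequency values are hypothetical, and the actual Spark NLP implementation uses a more elaborate scoring scheme.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NorvigSketch {

    /* Hypothetical frequency dictionary built by the NorvigBuilder stage */
    private static final Map<String, Integer> FREQ = new HashMap<>();
    static {
        FREQ.put("monorail", 8);
        FREQ.put("monosaccharide", 2);
    }

    /* All candidates that are exactly one edit away from the given word */
    private static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty())
                edits.add(left + right.substring(1));                                     // delete
            if (right.length() > 1)
                edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2)); // transpose
            for (char c : alphabet.toCharArray()) {
                if (!right.isEmpty())
                    edits.add(left + c + right.substring(1));                             // replace
                edits.add(left + c + right);                                              // insert
            }
        }
        return edits;
    }

    /* Suggest a correction if its relative frequency exceeds the probability threshold */
    public static String suggest(String word, double threshold) {
        if (FREQ.containsKey(word)) return word;

        int total = FREQ.values().stream().mapToInt(Integer::intValue).sum();

        String best = null;
        int bestFreq = 0;
        for (String candidate : edits1(word)) {
            Integer freq = FREQ.get(candidate);
            if (freq != null && freq > bestFreq) {
                best = candidate;
                bestFreq = freq;
            }
        }
        if (best == null) return word;

        double probability = (double) bestFreq / total;
        return probability >= threshold ? best : word;
    }

    public static void main(String[] args) {
        /* "monorial" is one transposition away from "monorail" (probability 0.8 >= 0.75) */
        System.out.println(suggest("monorial", 0.75));
    }
}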

Stemming

Plugins

TokenStemmer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenStemmer")
@Description("A transformation stage that leverages the Spark NLP Stemmer to map an input "
  + "text field onto its normalized terms and reduce each terms to its linguistic stem. "
  + "This stage adds an extra field to the input schema that contains the whitespace "
  + "separated set of stems.")		
public class TokenStemmer extends TextCompute {

    ...

}

Parameters

Text Field: The name of the field in the input schema that contains the text document.
Stem Field: The name of the field in the output schema that contains the stemmed tokens.
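
As a rough indication of what stemming does, the plain Java sketch below strips a handful of common English suffixes and joins the resulting stems into the whitespace separated string that would populate the Stem Field. This is a deliberate simplification and not the Porter-style algorithm the Spark NLP Stemmer actually implements.

import java.util.StringJoiner;

public class SuffixStemming {

    /* A deliberately naive suffix stripper; only meant to illustrate the idea
     * of reducing a term to its linguistic stem. */
    private static String stem(String term) {
        String[] suffixes = { "ingly", "edly", "ing", "ed", "ly", "s" };
        for (String suffix : suffixes) {
            if (term.length() > suffix.length() + 2 && term.endsWith(suffix)) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    public static void main(String[] args) {

        String document = "The player played while playing happily";

        /* Normalize (lower-case) the terms and collect the whitespace separated stems */
        StringJoiner stems = new StringJoiner(" ");
        for (String term : document.toLowerCase().split("\\s+")) {
            stems.add(stem(term));
        }
        System.out.println(stems.toString());
    }
}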

Stopword Removal

Plugins

TokenCleaner

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenCleaner")
@Description("A transformation stage that leverages the Spark NLP Stopword Cleaner to map an input "
  + "text field onto its normalized terms and remove each term that is defined as stop word. "
  + "This stage adds an extra field to the input schema that contains the whitespace "
  + "separated set of remaining tokens.")		
public class TokenCleaner extends TextCompute {

    ...

}

Parameters

Text Field: The name of the field in the input schema that contains the text document.
Cleaned Field: The name of the field in the output schema that contains the cleaned tokens.
Stop Words: A delimiter-separated list of stop words, i.e. words that have to be removed from the extracted tokens.
Word Delimiter: The delimiter used to separate the stop words. Default is a comma.
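
The plain Java sketch below illustrates the effect of this stage: it splits a comma-separated stop word list (the default Word Delimiter), normalizes the terms of a text document and keeps only those terms that are not stop words, producing the whitespace separated string that would populate the Cleaned Field. The stop word list and the document are made up for illustration.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class StopwordRemoval {

    public static void main(String[] args) {

        /* Hypothetical stop word list, separated by the default word delimiter (comma) */
        String stopWords = "a,an,and,the,is,of";
        Set<String> stopSet = new HashSet<>(Arrays.asList(stopWords.split(",")));

        String document = "The lemma is the normalized form of a term";

        /* Normalize (lower-case) the terms and keep only those that are not stop words */
        StringJoiner cleaned = new StringJoiner(" ");
        for (String term : document.toLowerCase().split("\\s+")) {
            if (!stopSet.contains(term)) {
                cleaned.add(term);
            }
        }

        /* The whitespace separated tokens that would populate the Cleaned Field */
        System.out.println(cleaned.toString());
    }
}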