PredictiveWorks. comes with a variety of text analysis plugins, but every text processing pipeline begins with more fundamental steps: detecting sentence boundaries, extracting terms or tokens, and normalizing them.

For user convenience, the following stages are integrated with each text processing plugin and need not be invoked explicitly:

  • Sentence Detection,

  • Sentence Tokenization, and

  • Token Normalization.

For users who want to leverage these stages on their own, PredictiveWorks. also offers them as standalone plugins.
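Under the hood these stages rely on the Spark NLP annotators named in the plugin descriptions below. For orientation, here is a minimal, hypothetical sketch of the equivalent pipeline assembled by hand with plain Spark NLP, assuming the library is on the classpath. The class name, the sample text, and the column names document, sentence, token and normal are illustrative choices (the column names follow Spark NLP's customary defaults), not plugin parameters.

import java.util.Collections;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import com.johnsnowlabs.nlp.DocumentAssembler;
import com.johnsnowlabs.nlp.annotators.Normalizer;
import com.johnsnowlabs.nlp.annotators.Tokenizer;
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector;

public class TextStagesSketch {

    public static void main(String[] args) {

        SparkSession spark = SparkSession.builder()
            .appName("TextStagesSketch").master("local[*]").getOrCreate();

        /* A single-column DataFrame that stands in for the input records. */
        StructType schema = new StructType(new StructField[] {
            new StructField("text", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> input = spark.createDataFrame(Collections.singletonList(
            RowFactory.create("Sentence one is short. Sentence two, however, is not?")),
            schema);

        /* Wrap the raw text column into Spark NLP's document annotation. */
        DocumentAssembler assembler = new DocumentAssembler();
        assembler.setInputCol("text");
        assembler.setOutputCol("document");

        /* Stage 1: sentence detection. */
        SentenceDetector detector = new SentenceDetector();
        detector.setInputCols(new String[] { "document" });
        detector.setOutputCol("sentence");

        /* Stage 2: sentence tokenization. */
        Tokenizer tokenizer = new Tokenizer();
        tokenizer.setInputCols(new String[] { "sentence" });
        tokenizer.setOutputCol("token");

        /* Stage 3: token normalization. */
        Normalizer normalizer = new Normalizer();
        normalizer.setInputCols(new String[] { "token" });
        normalizer.setOutputCol("normal");

        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {
            assembler, detector, tokenizer, normalizer
        });

        Dataset<Row> annotated = pipeline.fit(input).transform(input);

        /* `result` extracts the plain strings from each annotation column. */
        annotated.selectExpr("sentence.result", "token.result", "normal.result")
            .show(false);
    }
}

Running the sketch prints one array of plain strings per stage: the detected sentences, the raw tokens, and the normalized tokens.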

Plugins

SentenceDetector

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("SentenceDetector")
@Description("A transformation stage that leverages the Spark NLP Sentence Detector to map an input "
	+ "text field onto an output field that contains detected sentences.")
public class SentenceDetector extends TextCompute {

    ...

}

Parameters

  • Text Field: The name of the field in the input schema that contains the text document.
  • Sentence Field: The name of the field in the output schema that contains the extracted sentences.
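The plugin hides Spark NLP's annotation structs: the Sentence Field conceptually carries plain sentence strings. Continuing the hand-assembled sketch above (the column names remain that sketch's assumptions), the strings can be unpacked with an explode, one row per detected sentence:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

/* Each annotation column is an array of structs whose `result` field
 * holds the annotated text; explode yields one row per sentence. */
annotated
    .withColumn("sentence_text", explode(col("sentence.result")))
    .select("sentence_text")
    .show(false);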

SentenceTokenizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("SentenceTokenizer")
@Description("A transformation stage that leverages the Spark NLP Tokenizer to map an input "
	+ "text field into an output field that contains detected sentence tokens.")
public class SentenceTokenizer extends TextCompute {

    ...

}

Parameters

  • Text Field: The name of the field in the input schema that contains the text document.
  • Token Field: The name of the field in the output schema that contains the extracted tokens.
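When downstream stages expect plain string arrays rather than Spark NLP annotations, the library's Finisher transformer does the flattening. A sketch that extends the pipeline above; the output column name tokens is an arbitrary choice:

import com.johnsnowlabs.nlp.Finisher;

/* Flatten the token annotations into a plain array<string> column
 * and drop the intermediate annotation structs. */
Finisher finisher = new Finisher();
finisher.setInputCols(new String[] { "token" });
finisher.setOutputCols(new String[] { "tokens" });
finisher.setCleanAnnotations(true);

Appending finisher to the PipelineStage array of the sketch yields a tokens column of type array<string>.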

TokenNormalizer

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("TokenNormalizer")
@Description("A transformation stage that leverages the Spark NLP Normalizer to map an input "
	+ "text field onto an output field that contains normalized tokens. The Normalizer "
	+ "will clean up each token, taking as input column token out from the Tokenizer, "
	+ "and putting normalized tokens in the normal column. Cleaning up includes removing "
	+ "any non-character strings.")

    ...

}

Parameters

  • Text Field: The name of the field in the input schema that contains the text document.
  • Norm-Token Field: The name of the field in the output schema that contains the normalized tokens.
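The cleanup behavior described in the plugin description is Spark NLP's default. When the Normalizer is assembled by hand, as in the sketch above, it can be tuned; the cleanup regex and the lowercase flag below are illustrative assumptions, not parameters exposed by the plugin:

/* Keep letters only and lowercase the result, so a token such as
 * "However," comes out as "however". Matches of each cleanup
 * pattern are removed from the token. */
Normalizer normalizer = new Normalizer();
normalizer.setInputCols(new String[] { "token" });
normalizer.setOutputCol("normal");
normalizer.setCleanupPatterns(new String[] { "[^\\p{L}]" });
normalizer.setLowercase(true);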