Part of Speech Analysis is the automated process of associating each word in a sentence with its part-of-speech tag, e.g. noun, verb, adjective, or adverb. These tags describe the usage and function of a word in the sentence.

Part of Speech Analysis can be used to learn (and preserve) the different contexts of a word, which pure bag-of-words models ignore, e.g. book as a noun and book as a verb. This helps to build stronger word features.

Corpus

The provided corpus must be organized as one sentence per line, with each token followed by a delimiter and its POS tag:

We|PRP want|VBP you|PRP to|TO Know|VBP why|WRB your|PRP$ support|NN of|IN Goodwill|NNP is|VBZ so|RB important|JJ .|.
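
The line format above can be read back into token and tag pairs with a few lines of Java. This is a minimal sketch, not the plugin's own parsing code: the class `CorpusLineParser` and its method `parse` are hypothetical names, and it assumes that items are separated by whitespace and that the last delimiter in each item separates the token from its tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the plugin): splits one annotated corpus line. */
final class CorpusLineParser {

    /**
     * Returns one {token, tag} pair per whitespace-separated item.
     * Assumption: the LAST delimiter in an item separates token and tag,
     * so a token that itself contains the delimiter stays intact.
     */
    static List<String[]> parse(String line, String delimiter) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank input line
            }
            int idx = item.lastIndexOf(delimiter);
            int tagStart = idx + delimiter.length();
            if (idx <= 0 || tagStart >= item.length()) {
                throw new IllegalArgumentException("Expected token" + delimiter + "tag but got: " + item);
            }
            pairs.add(new String[] {item.substring(0, idx), item.substring(tagStart)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "We|PRP want|VBP you|PRP to|TO Know|VBP why|WRB your|PRP$ support|NN .|.";
        for (String[] pair : parse(line, "|")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting at the last delimiter is a design choice of this sketch; it keeps an item such as `.|.` unambiguous as long as the tag itself never contains the delimiter.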

Plugins

POSBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("POSBuilder")
@Description("A building stage for a Part of Speech model.")
public class POSBuilder extends TextSink {

    ...

}

Parameters

Model Name The unique name of the Part of Speech model.
Corpus Field Name of the input text field that contains the annotated sentences for training purposes.
Delimiter The delimiter in the input text line to separate tokens and POS tags. Default is '|'.
Model Configuration
Maximum Iterations The maximum number of iterations the training algorithm executes. Default is 5.

POSTagger


@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("POSTagger")
@Description("A transformation stage that requires a trained Part-of-Speech model. This stage appends "
		+ "two fields to the input schema, one that contains the extracted terms per document, and "
		+ "another that contains their POS tags.")
public class POSTagger extends TextCompute {

    ...

}

Parameters

Model Name The unique name of the Part of Speech model.
Text Field The name of the field in the input schema that contains the text document.
Output Field The name of the field in the output schema that contains the combination of extracted tokens and predicted POS tags.

POSChunker

Sample regex rules: ["(?:<JJ|DT>)(?:<NN|VBG>)+"]

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("POSChunker")
@Description("A transformation stage that extracts meaningful phrases from text documents. "
  + "Phrase extraction is based on patterns of part-of-speech tags. This stage requires a "
  + "trained Part-of-Speech model.")
public class POSChunker extends TextCompute {

    ...

}

Parameters

Model Name The unique name of the Part of Speech model.
Text Field The name of the field in the input schema that contains the text document.
Chunk Field The name of the field in the output schema that contains the extracted chunks.
Topics The number of topics that have to be created. Default is 10.
Regex Rules A delimiter-separated list of chunking rules.
Rule Delimiter The delimiter used to separate the different chunking rules.
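
Phrase extraction from tag patterns can be illustrated with plain `java.util.regex`. The following is a sketch under stated assumptions, not the plugin's implementation: the class `TagPatternChunker` and its methods are hypothetical, and the rule is written in an explicit form in which every tag appears as `<TAG>`, e.g. `(?:<JJ>|<DT>)(?:<NN>|<VBG>)+`. How the plugin itself interprets a rule such as the sample above is not shown here. The sketch encodes the tag sequence as one string, matches the rule against it, and maps each match back to the tokens it covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the plugin): extracts chunks by matching POS tag patterns. */
final class TagPatternChunker {

    /** Splits a rule list at a literal delimiter; the delimiter must not occur inside a rule. */
    static String[] splitRules(String rules, String ruleDelimiter) {
        return rules.split(Pattern.quote(ruleDelimiter));
    }

    /** Returns the token phrases whose tag sequence matches the given rule. */
    static List<String> chunk(String[] tokens, String[] tags, String rule) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        // Encode the tags as "<T1><T2>..." and remember where each tag starts.
        StringBuilder encoded = new StringBuilder();
        int[] starts = new int[tags.length];
        for (int i = 0; i < tags.length; i++) {
            starts[i] = encoded.length();
            encoded.append('<').append(tags[i]).append('>');
        }
        List<String> chunks = new ArrayList<>();
        Matcher matcher = Pattern.compile(rule).matcher(encoded);
        while (matcher.find()) {
            // Collect every token whose tag begins inside the matched span.
            List<String> phrase = new ArrayList<>();
            for (int i = 0; i < tags.length; i++) {
                if (starts[i] >= matcher.start() && starts[i] < matcher.end()) {
                    phrase.add(tokens[i]);
                }
            }
            if (!phrase.isEmpty()) {
                chunks.add(String.join(" ", phrase));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "running", "water", "is", "cold"};
        String[] tags = {"DT", "VBG", "NN", "VBZ", "JJ"};
        // Assumed delimiter ";" because "|" already occurs inside the rules.
        for (String rule : splitRules("(?:<JJ>|<DT>)(?:<NN>|<VBG>)+;(?:<VBZ>)(?:<JJ>)", ";")) {
            System.out.println(rule + " -> " + chunk(tokens, tags, rule));
        }
    }
}
```

Because chunking rules typically contain `|`, a rule delimiter other than `|` (the sketch uses `;`) avoids splitting a rule in the middle.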