Sentiment analysis is a classic use case of text classification: text units (documents, paragraphs or sentences) are each mapped onto one of two categories, positive or negative.

The approach to sentiment classification supported by PredictiveWorks implements an algorithm proposed by Vivek Narayanan. It combines negation handling, word n-grams and feature selection by mutual information, a combination that results in a significant improvement in accuracy.
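
To make two of these ingredients concrete, here is a minimal sketch of negation handling and word n-gram generation. It is not the PredictiveWorks implementation: the class name, the list of negation cues, the "not_" prefix and the rule that punctuation ends a negation scope are all assumptions chosen for illustration, and feature selection by mutual information is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class NegationNgramSketch {

    // Assumed negation cues; the actual list used by the plugin is not documented here.
    private static final Set<String> NEGATIONS =
        new HashSet<>(Arrays.asList("not", "no", "never"));

    // A token is either a lowercase word (apostrophes allowed) or one punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("[a-z']+|[.,;:!?]");

    // Prefixes words that follow a negation cue with "not_" until punctuation is reached.
    static List<String> markNegation(String text) {
        List<String> out = new ArrayList<>();
        boolean negated = false;
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            String tok = m.group();
            if (tok.length() == 1 && ".,;:!?".contains(tok)) {
                negated = false;                 // punctuation closes the negation scope
            } else if (NEGATIONS.contains(tok) || tok.endsWith("n't")) {
                out.add(tok);
                negated = !negated;              // a second cue cancels the first
            } else {
                out.add(negated ? "not_" + tok : tok);
            }
        }
        return out;
    }

    // Builds word n-grams by joining each window of n consecutive tokens with a space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = markNegation("The acting was not good, but fun.");
        System.out.println(tokens);
        System.out.println(ngrams(tokens, 2));
    }
}
```

Marking negated words as separate features lets a bag-of-words classifier distinguish "good" from "not good", which plain unigram counts cannot do.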

Corpus

The training corpus consists of sets of sentiment tokens, each assigned to a sentiment label. Every label and its token set must be provided on a line of its own, as shown below:

positive -> amazing voice acting
negative -> horrible acting
negative -> very bad
positive -> very fantastic
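
A corpus line in this format can be split into its label and tokens with a few lines of Java. The sketch below is only an illustration of the line format, not the plugin's parser; the class and method names are hypothetical, and the delimiter is passed in explicitly.

```java
// Hypothetical helper for illustration only.
class CorpusLineSketch {

    // Splits one corpus line into {label, tokens} at the first occurrence of the delimiter.
    static String[] parse(String line, String delimiter) {
        int idx = line.indexOf(delimiter);
        if (idx < 0) {
            throw new IllegalArgumentException(
                "Missing delimiter '" + delimiter + "' in line: " + line);
        }
        String label = line.substring(0, idx).trim();
        String tokens = line.substring(idx + delimiter.length()).trim();
        return new String[] {label, tokens};
    }

    public static void main(String[] args) {
        String[] corpus = {
            "positive -> amazing voice acting",
            "negative -> horrible acting"
        };
        for (String line : corpus) {
            String[] parts = parse(line, "->");
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```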

Plugins

SentimentBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("SentimentBuilder")
@Description("A building stage for a Sentiment Analysis model based on the sentiment algorithm "
	+ "introduced by Vivek Narayanan. The training corpus comprises a labeled set of sentiment "
	+ "tokens.")
public class SentimentBuilder extends TextSink {

    ...

}

Parameters

Model Name: The unique name of the Sentiment model.
Corpus Field: The name of the field in the input schema that contains the labeled sentiment tokens.
Sentiment Delimiter: The delimiter to separate labels and associated tokens in the corpus. Default is '->'.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 90:10.
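
As an illustration of the Data Split notation, the following sketch turns a value such as 80:20 into normalized train and test fractions. How the plugin itself interprets the value is not documented here; the class and method names are hypothetical, and normalizing by the sum of both parts is an assumption.

```java
// Hypothetical helper for illustration only.
class DataSplitSketch {

    // Parses "train:test" (e.g. "90:10") into fractions that sum to 1.
    static double[] parseSplit(String split) {
        String[] parts = split.trim().split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'train:test', got: " + split);
        }
        double train = Double.parseDouble(parts[0].trim());
        double test = Double.parseDouble(parts[1].trim());
        double total = train + test;
        if (train < 0 || test < 0 || total <= 0) {
            throw new IllegalArgumentException("Split parts must be non-negative: " + split);
        }
        // Assumption: the two parts are relative weights, so normalize by their sum.
        return new double[] {train / total, test / total};
    }

    public static void main(String[] args) {
        double[] fractions = parseSplit("90:10");
        System.out.println("train=" + fractions[0] + ", test=" + fractions[1]);
    }
}
```

Weights of this kind could, for example, be handed to Spark's Dataset.randomSplit, although whether the plugin splits the data that way is not stated.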

Sentiment


@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Sentiment")
@Description("A transformation stage that predicts sentiment labels (positive or negative) for "
	+ "text documents, leveraging a trained Sentiment Analysis model.")
public class Sentiment extends TextCompute {

    ...

}

Parameters

Model Name: The unique name of the Sentiment model.
Embedding Name: The unique name of a trained Word2Vec embedding model.
Text Field: The name of the field in the input schema that contains the text document.
Pooling Strategy: The strategy used to merge word embeddings into document embeddings. Supported values are 'average' and 'sum'. Default is 'average'.
Topic Field: The name of the field in the output schema that contains the prediction result.
Topic Strategy: The indicator to determine whether the trained LDA model is used to predict a topic label or vector. Supported values are 'label' & 'vector'. Default is 'vector'.
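
The two Pooling Strategy values can be pictured as element-wise operations over the word vectors of a document. The sketch below is a plain-Java illustration under that assumption, with hypothetical class and method names; it does not show how the plugin computes document embeddings internally.

```java
// Hypothetical helper for illustration only.
class PoolingSketch {

    // Merges word vectors into one document vector using 'average' or 'sum'.
    static double[] pool(double[][] wordVectors, String strategy) {
        if (wordVectors.length == 0) {
            throw new IllegalArgumentException("No word vectors to pool");
        }
        int dim = wordVectors[0].length;
        double[] doc = new double[dim];
        // Element-wise sum over all word vectors.
        for (double[] vector : wordVectors) {
            if (vector.length != dim) {
                throw new IllegalArgumentException("Word vectors differ in dimension");
            }
            for (int i = 0; i < dim; i++) {
                doc[i] += vector[i];
            }
        }
        if ("average".equals(strategy)) {
            // Averaging divides the sum by the number of words.
            for (int i = 0; i < dim; i++) {
                doc[i] /= wordVectors.length;
            }
        } else if (!"sum".equals(strategy)) {
            throw new IllegalArgumentException("Unsupported pooling strategy: " + strategy);
        }
        return doc;
    }

    public static void main(String[] args) {
        double[][] words = {{1.0, 2.0}, {3.0, 4.0}};
        System.out.println(java.util.Arrays.toString(pool(words, "average")));
        System.out.println(java.util.Arrays.toString(pool(words, "sum")));
    }
}
```

Averaging keeps document vectors on a comparable scale regardless of document length, whereas summing lets longer documents produce larger vectors.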