Sentiment analysis is a classic text classification use case: text units (whole documents, paragraphs or sentences) are mapped onto one of two categories, positive or negative.
The approach to sentiment classification supported by PredictiveWorks. implements an algorithm proposed by Vivek Narayanan. It combines negation handling, word n-grams and feature selection by mutual information, which results in a significant improvement in accuracy.
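To make the negation handling step more concrete, here is a minimal sketch in Java. It assumes the commonly described variant of the technique: tokens that follow a negation cue are prefixed with `not_` until the next punctuation mark. The class name `NegationSketch`, the cue list and the tokenization are illustrative assumptions and are not taken from the PredictiveWorks. code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegationSketch {
    // Illustrative negation cues (assumption); the cues used by the plugin are not documented here.
    private static final Set<String> NEGATIONS = Set.of("not", "no", "never");

    // A token is either a run of letters/apostrophes or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("[a-z']+|[.,;:!?]");

    /** Prefixes every token after a negation cue with "not_" until the next punctuation mark. */
    static List<String> markNegation(String text) {
        List<String> out = new ArrayList<>();
        boolean negated = false;
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            String tok = m.group();
            if (tok.matches("[.,;:!?]")) {
                negated = false;          // punctuation closes the negation scope
            } else if (NEGATIONS.contains(tok) || tok.endsWith("n't")) {
                negated = true;           // the cue itself is dropped from the output
            } else {
                out.add(negated ? "not_" + tok : tok);
            }
        }
        return out;
    }
}
```

With this sketch, "This is not good, but fine." yields the tokens `this`, `is`, `not_good`, `but`, `fine`, so that "good" in a negated context becomes a separate feature.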
Corpus
The training corpus consists of sets of sentiment tokens, each assigned to a sentiment label. Each label and its token set have to be provided on a separate line, with the label first, as shown below:
positive -> amazing voice acting
negative -> horrible acting
negative -> very bad
positive -> very fantastic
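The line format above can be read with a few lines of Java. The following is a minimal sketch, assuming the label precedes the delimiter and the tokens are separated by whitespace; `CorpusLineSketch` and `parse` are hypothetical names, not part of the plugin API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class CorpusLineSketch {
    /**
     * Splits one corpus line into its label (key) and its sentiment tokens (value).
     * The delimiter is passed in, mirroring a configurable delimiter such as "->".
     */
    static Map.Entry<String, List<String>> parse(String line, String delimiter) {
        int idx = line.indexOf(delimiter);
        if (idx < 0) {
            throw new IllegalArgumentException("Missing delimiter '" + delimiter + "' in: " + line);
        }
        String label = line.substring(0, idx).trim();
        String rest = line.substring(idx + delimiter.length()).trim();
        // Tokens are assumed to be whitespace-separated.
        List<String> tokens = rest.isEmpty() ? List.of() : Arrays.asList(rest.split("\\s+"));
        return Map.entry(label, tokens);
    }
}
```

For example, `CorpusLineSketch.parse("positive -> amazing voice acting", "->")` returns the label `positive` with the tokens `amazing`, `voice` and `acting`.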
Plugins
SentimentBuilder
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("SentimentBuilder")
@Description("A building stage for a Sentiment Analysis model based on the sentiment algorithm "
+ "introduced by Vivek Narayanan. The training corpus comprises a labeled set of sentiment "
+ "tokens.")
public class SentimentBuilder extends TextSink {
...
}
Parameters
Model Name | The unique name of the Sentiment model. |
Corpus Field | The name of the field in the input schema that contains the labeled sentiment tokens. |
Sentiment Delimiter | The delimiter to separate labels and associated tokens in the corpus. Default is '->'. |
Data Split | The split of the dataset into train & test data, e.g. 80:20. Default is 90:10. |
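As an illustration of the Data Split format, the sketch below turns a value such as 80:20 into normalized train and test fractions. It is an assumption about how such a value could be interpreted, not the plugin's actual parsing code; `DataSplitSketch` and `parseSplit` are hypothetical names.

```java
final class DataSplitSketch {
    /**
     * Parses a split such as "80:20" and returns {trainFraction, testFraction},
     * normalized so that both fractions sum to 1.0.
     */
    static double[] parseSplit(String split) {
        String[] parts = split.trim().split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<train>:<test>', got: " + split);
        }
        double train = Double.parseDouble(parts[0].trim());
        double test = Double.parseDouble(parts[1].trim());
        double total = train + test;
        if (train < 0 || test < 0 || total <= 0) {
            throw new IllegalArgumentException("Split values must be non-negative and not both zero: " + split);
        }
        return new double[] { train / total, test / total };
    }
}
```

In a Spark setting, an array like this could be handed to `Dataset.randomSplit(double[])` to obtain the train and test datasets; whether the plugin does exactly this is not stated here.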
Sentiment
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Sentiment")
@Description("A transformation stage that predicts sentiment labels (positive or negative) for "
+ "text documents, leveraging a trained Sentiment Analysis model.")
public class Sentiment extends TextCompute {
...
}
Parameters
Model Name | The unique name of the Sentiment model. |
Embedding Name | The unique name of a trained Word2Vec embedding model. |
Text Field | The name of the field in the input schema that contains the text document. |
Pooling Strategy | The pooling strategy used to merge word embeddings into document embeddings. Supported values are 'average' and 'sum'. Default is 'average'. |
Topic Field | The name of the field in the output schema that contains the prediction result. |
Topic Strategy | The indicator to determine whether the trained LDA model is used to predict a topic label or vector. Supported values are 'label' & 'vector'. Default is 'vector'. |
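The two values of the Pooling Strategy parameter can be illustrated with a small Java sketch. It assumes that every word of a document has already been mapped to an embedding vector of the same dimension; `PoolingSketch` and `pool` are hypothetical names and do not reflect the plugin's internal implementation.

```java
import java.util.List;

final class PoolingSketch {
    /**
     * Merges word vectors into one document vector.
     * strategy "sum" adds the vectors element-wise; "average" divides that sum by the number of words.
     */
    static double[] pool(List<double[]> wordVectors, String strategy) {
        if (wordVectors.isEmpty()) {
            throw new IllegalArgumentException("At least one word vector is required");
        }
        int dim = wordVectors.get(0).length;
        double[] doc = new double[dim];
        for (double[] v : wordVectors) {
            if (v.length != dim) {
                throw new IllegalArgumentException("All word vectors must have the same dimension");
            }
            for (int i = 0; i < dim; i++) {
                doc[i] += v[i];
            }
        }
        if ("average".equals(strategy)) {
            for (int i = 0; i < dim; i++) {
                doc[i] /= wordVectors.size();
            }
        } else if (!"sum".equals(strategy)) {
            throw new IllegalArgumentException("Unsupported pooling strategy: " + strategy);
        }
        return doc;
    }
}
```

For the two word vectors (1, 2) and (3, 4), 'sum' gives (4, 6) and 'average' gives (2, 3). Averaging keeps the document vector on the same scale regardless of document length, which is a plausible reason for it being the default.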