Topic modeling is the process of automatically identifying hidden topics in a text corpus, where topics are defined as repeating patterns of co-occurring terms. One of the most popular modeling techniques is Latent Dirichlet Allocation (LDA).

Topic Clustering

Plugins

LDABuilder

This plugin is responsible for model building. The LDA Estimator of Spark ML is used to train the respective Latent Dirichlet Allocation (LDA) model.

An LDA Model can be used for text clustering or labeling (see LDAText).

The estimator expects the provided text documents to be represented as feature vectors. Documents are mapped onto the feature space with a pre-trained word embedding model and a pooling strategy that derives a document vector from the word vectors of the document.
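
The pooling step can be illustrated with a minimal sketch. It assumes that the word vectors of a document are already available as plain arrays; the class `PoolingSketch` and its method `pool` are hypothetical names used for illustration and are not taken from the plugin source.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of the plugin source.
final class PoolingSketch {

    /** Merges word vectors into one document vector using 'average' or 'sum'. */
    static double[] pool(List<double[]> wordVectors, String strategy) {
        if (wordVectors.isEmpty()) {
            throw new IllegalArgumentException("At least one word vector is required.");
        }
        int dim = wordVectors.get(0).length;
        double[] doc = new double[dim];

        // Sum the word vectors component by component.
        for (double[] vec : wordVectors) {
            for (int i = 0; i < dim; i++) {
                doc[i] += vec[i];
            }
        }

        if ("average".equals(strategy)) {
            // Divide each component by the number of word vectors.
            for (int i = 0; i < dim; i++) {
                doc[i] /= wordVectors.size();
            }
        } else if (!"sum".equals(strategy)) {
            throw new IllegalArgumentException("Unsupported pooling strategy: " + strategy);
        }
        return doc;
    }
}
```

With the word vectors (1, 2) and (3, 6), 'sum' yields (4, 8) and 'average' yields (2, 4).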

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("LDABuilder")
@Description("A building stage for a Latent Dirichlet Allocation (LDA) model. An LDA model "
	+ "can be used for text clustering or labeling. This model training stage requires a "
	+ "pre-trained Word Embedding model.")
public class LDABuilder extends TextSink {

    ...

}

Parameters

Model Name: The unique name of the LDA model.
Embedding Name: The unique name of a trained Word2Vec embedding model.
Text Field: The name of the field in the input schema that contains the text document.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 90:10.
Model Configuration
Topics: The number of topics to create. Default is 10.
Maximum Iterations: The maximum number of iterations the algorithm executes. Default is 20.
Pooling Strategy: The pooling strategy used to merge word embeddings into document embeddings. Supported values are 'average' and 'sum'. Default is 'average'.
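
As an illustration of the Data Split format, the following sketch turns a value such as 80:20 into normalized fractions. The class `DataSplitSketch` and its method `parse` are hypothetical; how the plugin actually parses and applies the split is an assumption here. Fractions of this kind are what a weight-based split such as Spark's `Dataset.randomSplit(double[])` accepts.

```java
// Hypothetical helper for illustration; not part of the plugin source.
final class DataSplitSketch {

    /** Parses a split such as "80:20" into fractions {train, test} that sum to 1. */
    static double[] parse(String split) {
        String[] parts = split.trim().split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected format 'train:test', e.g. 80:20.");
        }
        double train = Double.parseDouble(parts[0].trim());
        double test = Double.parseDouble(parts[1].trim());
        if (train <= 0 || test <= 0) {
            throw new IllegalArgumentException("Both parts of the split must be positive.");
        }
        // Normalize so that the two fractions add up to 1.
        double total = train + test;
        return new double[] {train / total, test / total};
    }
}
```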

LDA

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("LDA")
@Description("A transformation stage to map text documents on their topic vectors or most likely topic label. "
	+ "This stage is based on two trained models, an LDA Topic model and a Word Embedding model.")
public class LDA extends TextCompute {

    ...

}

Parameters

Model Name: The unique name of the LDA model.
Embedding Name: The unique name of a trained Word2Vec embedding model.
Text Field: The name of the field in the input schema that contains the text document.
Pooling Strategy: The pooling strategy used to merge word embeddings into document embeddings. Supported values are 'average' and 'sum'. Default is 'average'.
Topic Field: The name of the field in the output schema that contains the prediction result.
Topic Strategy: The indicator that determines whether the trained LDA model is used to predict a topic label or a topic vector. Supported values are 'label' & 'vector'. Default is 'vector'.
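
The difference between the two Topic Strategy values can be sketched as follows. The sketch assumes that 'vector' returns the topic distribution of a document as is, and that 'label' returns the index of the most likely topic in that distribution; the class `TopicStrategySketch` is a hypothetical name and not taken from the plugin source.

```java
// Hypothetical helper for illustration; not part of the plugin source.
final class TopicStrategySketch {

    /** Returns the index of the topic with the highest probability. */
    static int mostLikelyTopic(double[] topicDistribution) {
        if (topicDistribution.length == 0) {
            throw new IllegalArgumentException("The topic distribution must not be empty.");
        }
        int best = 0;
        // Linear scan; the first maximum wins in case of ties.
        for (int i = 1; i < topicDistribution.length; i++) {
            if (topicDistribution[i] > topicDistribution[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

For the topic vector (0.1, 0.7, 0.2), the label under this assumption is 1.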

Topic Description

Plugins

TopicBuilder

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("TopicBuilder")
@Description("A building stage for a Latent Dirichlet Allocation (LDA) model. In contrast to "
	+ "the LDABuilder plugin, this stage leverages an implicit document vectorization based "
	+ "on the term counts of the provided corpus. The trained model can be used to either "
	+ "determine the topic-distribution per document or term-distribution per topic.")
public class TopicBuilder extends TextSink {

    ...

}

Parameters

Model Name: The unique name of the Topic model.
Text Field: The name of the field in the input schema that contains the text document.
Data Split: The split of the dataset into train & test data, e.g. 80:20. Default is 90:10.
Model Configuration
Topics: The number of topics to create. Default is 10.
Maximum Iterations: The maximum number of iterations the algorithm executes. Default is 20.
Vocabulary Size: The size of the vocabulary used to build vector representations. Default is 10000.
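
To make the role of the vocabulary size concrete, here is a minimal sketch of a term-count vectorization. It assumes documents are already tokenized and that the vocabulary keeps the most frequent terms of the corpus; the class `TermCountSketch` is hypothetical and does not reproduce the plugin's implementation (Spark ML offers a `CountVectorizer` with a `vocabSize` parameter for this purpose, but whether the plugin uses it is an assumption).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the plugin source.
final class TermCountSketch {

    /** Builds a vocabulary of the most frequent corpus terms, capped at vocabSize. */
    static List<String> buildVocabulary(List<List<String>> corpus, int vocabSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(counts.keySet());
        // Sort by descending frequency; ties are broken alphabetically.
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(vocabSize, terms.size())));
    }

    /** Counts how often each vocabulary term occurs in the document. */
    static double[] vectorize(List<String> doc, List<String> vocabulary) {
        double[] vec = new double[vocabulary.size()];
        for (String term : doc) {
            int idx = vocabulary.indexOf(term);
            if (idx >= 0) {
                vec[idx] += 1.0; // terms outside the vocabulary are ignored
            }
        }
        return vec;
    }
}
```

A smaller vocabulary shortens the document vectors but drops rare terms, which then no longer contribute to any topic.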

Topic

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Topic")
@Description("A transformation stage to either determine the topic-distribution per document "
	+ "or term-distribution per topic. This stage is based on a trained Topic model.")
public class Topic extends TextCompute {

    ...

}

Parameters

Model Name: The unique name of the Topic model.
Text Field: The name of the field in the input schema that contains the text document.
Topic Strategy: The indicator that determines whether to retrieve the document-topic or the topic-term description. Supported values are 'document-topic' and 'topic-term'. Default is 'document-topic'.
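
A topic-term description can be pictured as the highest-weighted terms of each topic. The sketch below assumes that a topic is given as one weight per vocabulary term; the class `TopicTermSketch`, its method `topTerms` and the parameter `k` are hypothetical and only illustrate the idea, not the output schema of the plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the plugin source.
final class TopicTermSketch {

    /** Returns the k vocabulary terms with the highest weight for one topic. */
    static List<String> topTerms(double[] termWeights, String[] vocabulary, int k) {
        Integer[] order = new Integer[termWeights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort term indices by descending weight.
        Arrays.sort(order, (a, b) -> Double.compare(termWeights[b], termWeights[a]));

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.length); i++) {
            result.add(vocabulary[order[i]]);
        }
        return result;
    }
}
```

The document-topic description is the complementary view: one weight per topic for each document.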