Purpose-built Connectors

Objective

Google CDAP offers many connector plugins to integrate with data stores, streaming platforms and cloud applications.

PredictiveWorks. complements CDAP’s built-in data connectors with purpose-built plugins, designed to support Customer, User & Entity Behavior Analytics (CUEBA). The main focus is on (but not limited to) low latency data stores and real-time streaming platforms:

Data Streaming

Eclipse Ditto

Eclipse Ditto

Eclipse Ditto is an IoT technology that implements the "digital twin" pattern. It is an abstraction layer that unifies access to millions of physical devices.

PredictiveWorks. connects to Eclipse Ditto via its web socket interface and exposes twin events and live messages as an Apache Spark real-time stream.

Eclipse Ditto is integrated into IoT platforms such as Aloxy IIOT Hub and Bosch IoT Suite.

Use Cases

(Industrial) Internet-of-Things

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("DittoSource")
@Description("An Eclipse Ditto IoT streaming source that supports "
  + "real-time websocket events from a digital twin service.")
public class DittoStreamSource extends StreamingSource<StructuredRecord>{

  ...

}      

Eclipse Paho

Eclipse Paho

Eclipse Paho is an open-source MQTT 3.1.1 and MQTT-SN client.

PredictiveWorks. exposes MQTT broker events as an Apache Spark real-time stream. The current implementation of this plugin supports JSON schema inference for MQTT streams (default), and explicitly transforms uplink messages from The Things Network server.

Use Cases

(Industrial) Internet-of-Things

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("MQTTSource")
@Description("An MQTT streaming source that listens to an MQTT 3.1.1 broker "
  + "and subscribes to a given topic.")
public class MqttSource extends StreamingSource<StructuredRecord> {

  ...

}      

HiveMQ

HiveMQ

HiveMQ is an Enterprise-ready MQTT Broker that fully supports MQTT 3.x and MQTT 5. PredictiveWorks. exposes HiveMQ broker events as Apache Spark real-time stream.

The current implementation also supports publishing of device-specific anomalies, forecasts, predictions and other behavior analytics events and thereby seamlessly integrates machine intelligence into an (Industrial) IoT environment.

Use Cases

(Industrial) Internet-of-Things

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("HiveMQSource")
@Description("An MQTT streaming source that listens to a HiveMQ MQTT broker "
  + "and subscribes to a given topic.")
public class HiveMQSource extends StreamingSource<StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("HiveMQSink")
@Description("Batch sink plugin to send messages to a HiveMQ MQTT server.")
public class HiveMQSink extends BatchSink<StructuredRecord, Void, Void> {

  ...

}      

Kafka

Apache Kafka

This Kafka connector is an easy-to-use alternative for Google CDAP's plugin. PredictiveWorks. connector comes with automated schema inference. Users no longer have to explicitly specify the data format of a real-time event stream.

Use Cases

Cyber Defense, Internet-of-Things and more

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("KafkaSource")
@Description("A real-time connector plugin to consume events from Apache Kafka topics "
+ "and transform them into structured pipeline records.")
public class KafkaSource extends StreamingSource<StructuredRecord> {

  ...

}      

Kolide

Kolide Fleet

Kolide Fleet is an open-source Osquery Manager and extends Osquery's capabilities from a single machine to the entire fleet of endpoints.

Osquery results sent to Kolide are currently forwarded to Google PubSub. PredictiveWorks. exposes osquery-based Google PubSub events as Apache Spark real-time stream.

Use Cases

Cyber Defense

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("KolideSource")
@Description("A Kolide streaming source to read osquery messages from Google PubSub.")
public class KolideSource extends StreamingSource<StructuredRecord> {

  ...

}      

PubSub

Google PubSub

PubSub is Google's real-time messaging and streaming platform.

PredictiveWorks. exposes Google PubSub events as Apache Spark real-time stream.

Use Cases

Cyber Defense, Internet-of-Things and more

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("PubSubSource")
@Description("A PubSub streaming source to read messages from Google PubSub.")
public class PubSubSource extends StreamingSource<StructuredRecord> {

  ...

}      

Data Stores

PredictiveWorks. connector plugins were built with a strong focus on low latency data stores for Cyber Defense and Internet-of-Things use cases.

Aerospike

Aerospike

A Next-Generation NoSQL Data Platform for low latency solutions at extreme scale, with predictable performance at any scale and unmatched uptime and reliability.

Use Cases

Advertising, Cyber Defense, Internet-of-Things

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("AerospikeSource")
@Description("A batch connector plugin to read data records from Aerospike namespaces and "
+ "sets and transform them into structured pipeline records.")
public class AerospikeSource extends BatchSource<AerospikeKey, AerospikeRecord, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("AerospikeSink")
@Description("A batch connector plugin to write structured pipelines records to "
+ "Aerospike namespaces and sets.")
public class AerospikeSink extends BatchSink<StructuredRecord, AerospikeKey, AerospikeRecord> {

  ...

}

Crate

Crate DB

Crate is a distributed SQL database at IoT-scale built on top of a NoSQL foundation. It combines the familiarity of SQL with the scalability and data flexibility of NoSQL.

Use Cases

Cyber Defense, Internet-of-Things

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("CrateSource")
@Description("A batch connector plugin to read data rows from Crate tables "
+ "and transform them into structured pipeline records.")
public class CrateSource extends BatchSource<NullWritable, CrateWritable, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("CrateSink")
@Description("A batch connector plugin to write structured pipelines records to "
+ "Crate tables.")
public class CrateSink extends BatchSink<StructuredRecord, NullWritable, CrateWritable> {

  ...

}

Druid

Apache Druid

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. Druid leverages the columnar storage paradigm.

Use Cases

Advertising, Manufacturing, Marketing, Network Telemetry

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("DruidSource")
@Description("A batch connector plugin to read data from Druid datasources "
+ "and transform them into structured pipeline records.")
public class DruidSource extends BatchSource<NullWritable, DruidWritable, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("CrateSink")
@Description("A batch connector plugin to write structured pipelines records to "
  + "Druid datasources.")
public class DruidSink extends BatchSink<StructuredRecord, NullWritable, DruidWritable> {

  ...

}

Elastic

Elasticsearch

This Elastic connector is an easy-to-use alternative for Google CDAP's plugin. PredictiveWorks. connector comes with automated schema inference. Users no longer have to explicitly specify the data format of Elastic documents.

Use Cases

Cyber Defense, Internet-of-Things and more

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("ElasticSource")
@Description("A batch connector plugin to read documents from Elastic indices and mappings "
+ "and transform them into structured pipeline records.")
public class ElasticSource extends BatchSource<NullWritable, ElasticWritable, StructuredRecord> {

  ...

}      

InfluxDB

InfluxDB

InfluxDB is an open-source time series database. It is optimized for fast, high-availability storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics.

Use Cases

Internet-of-Things

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("InfluxSink")
@Description("A batch connector to write data records to an InfluxDB time series database. "
+ "As structured data record is divided into numeric and other fields. Numeric field "
+ "values are transformed into Doubles and persisted as a measurement field. String "
+ "field values are persisted as tags, while all other fields are ignored.")
public class InfluxSink extends BatchSink<StructuredRecord, InfluxPointWritable, NullWritable> {

  ...

}

Ignite

Apache Ignite

Apache Ignite is an in-memory computing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale.

Use Cases

Cyber Defense, Internet-of-Things

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("IgniteSource")
@Description("A batch connector plugin to read data from Apache Ignite caches as BinaryObjects "
+ "and transform them into structured pipeline records.")
public class IgniteSource extends BatchSource<NullWritable, IgniteWritable, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("IgniteSink")
@Description("A batch connector plugin to write structured pipelines records as BinaryObjects to "
  + "Apache Ignite caches.")
public class IgniteSink extends BatchSink<StructuredRecord, NullWritable, IgniteWritable> {

  ...

}

SAP HANA

SAP HANA

SAP in-memory business data computing platform.

Use Cases

General purpose

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("HANASource")
@Description("A batch connector plugin to read data from SAP HANA database "
+ "and transform them into structured pipeline records.")
public class HANASource extends BatchSource<NullWritable, HanaWritable, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("HANASink")
@Description("A batch connector plugin to write structured pipelines records to "
+ "SAP HANA database.")
public class HANASink extends BatchSink<StructuredRecord, NullWritable, HanaWritable> {

  ...

}

Data Warehouses

Panoply

Panoply Data Warehouse

Panoply is a data management platform that integrates with 80+ data sources. Its data warehouse provides access to data from individual e-commerce, marketing, payment and sales platforms.

Use Cases

Accounting, Advertising, E-Commerce, Marketing and Sales.

@Plugin(type = BatchSource.PLUGIN_TYPE))
@Name("PanoplySource")
@Description("A batch source to read structured records from a Panoply data warehouse.")
public class PanoplySource extends RedshiftSource {

  ...

}      

@Plugin(type = BatchSin.PLUGIN_TYPE)
@Name("PanoplySink")
@Description("A batch sink to write structured records to a Panoply data warehouse. "
		+ "It is not recommended to use this sink connector for (very) large datasets. "
		+ "In this case leverage the S3 sink connector and the S3-to-Redshift action.")
public class PanoplySink extends RedshiftSink {

  ...

}      

Redshift

Amazon Redshift

Redshift is Amazon's popular and cloud data warehouse.

Use Cases

General purpose

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("RedshiftSource")
@Description("A batch connector plugin to read data from Amazon Redshift data warehouse "
+ "and transform them into structured pipeline records.")
public class RedshiftSource extends BatchSource<NullWritable, RedshiftWritable, StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("RedshiftSink")
@Description("A batch connector plugin to write structured pipelines records to "
+ "Amazon Redshift data warehouse.")
public class RedshiftSink extends BatchSink<StructuredRecord, NullWritable, RedshiftWritable> {

  ...

}

Snowflake

Snowflake

Snowflake is a general purpose cloud data platform and warehouse.

Use Cases

General purpose

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("SnowflakeSource")
@Description("A batch source to read structured records from a Snowflake database.")
public class SnowflakeSource extends JdbcSource {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("SnowflakeSink")
@Description("A batch sink to write structured records to a Snowflake database.")
public class SnowflakeSink extends JdbcSink<SnowflakeWritable> {

  ...

}

Human Intelligence

DataSift

DataSift

DataSift connects to a real-time feed of human data, uncovers insights with sophisticated data augmentation, filtering and classification engine, and provides the data for analysis.

Use Cases

Human Intelligence

@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("DataSiftSource")
@Description("A batch connector plugin to read data from DataSift cloud service "
+ "and transform them into structured pipeline records.")
public class DataSiftSource extends BatchSource<NullWritable, DataSiftWritable, StructuredRecord> {

  ...

}      

Internet of Things

ThingsBoard

ThingsBoard

ThingsBoard is an open-source IoT Platform for device management, data collection, processing and visualization.

Use Cases

(Industrial) Internet-of-Things

@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("ThingsboardSource")
@Description("An Apache Kafka streaming source that supports real-time "
  + "events that originate from Thingsboard.")
public class ThingsboardSource extends StreamingSource<StructuredRecord> {

  ...

}      

@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("ThingsBoardSink")
@Description("A batch connector plugin to write structured pipelines records to "
+ "ThingsBoard.")
public class ThingsBoardSink extends BatchSource<NullWritable, ThingsBoardWritable, StructuredRecord> {

  ...

}      

Table of Content