With the growth of Big Data, there is an increasing need for solutions focused on scalability. Traditional Big Data processing is still data oriented: it assumes the data already exists and needs to be processed. This approach is called Data at Rest.
Event processing, both Complex Event Processing and Event Stream Processing, is inherently based on message passing and deals with Data in Motion.
Passing small amounts of data quickly raises challenges significantly different from the typical problems solved by massive data processing platforms. Popular Big Data solutions like Hadoop are optimized to defer data movement to the latest possible time and to execute most of the logic where the data is stored. This significantly increases processing throughput, but at the same time reduces data mobility.
The approach where collected data is analyzed and visualized brings vital information to decision makers. The process is typically based on historical data, which leads to decisions being made on potentially stale information. At the same time, acting on the results of past data processing requires significant human effort, such as implementing and deploying decision rules. This adds further delay and increases operational cost.

Contemporary enterprises expect to be able to:

  • Act on the fire hose of incoming events
  • Execute analytics on the data as soon as it arrives
  • Deploy predictive models with minimal operational latency

TIBCO is a real-time integration vendor and provides proven solutions for event processing. Big Data creates special challenges for event processing:

  • Horizontal scalability for moving data
  • Analytics-friendly storage of the data
  • Native support for executing analytics models

The Big Data accelerator uses a large-retailer scenario of transaction monitoring and propensity prediction to show how these challenges can be addressed with a mix of TIBCO and open source products.


Conceptually, the Apache Spark Accelerator does four things:

  1. Capture – this includes several distinct activities: connecting to data sources to bring data into the system, then cleansing, normalizing, and persisting the data
  2. Analytics – this includes data discovery and model discovery. In this step, the historical data is analyzed to build predictive models.
  3. Streaming Analytics – here the predictive model is executed against the real-time data streams
  4. Model Tracking – this includes tracking the real-time KPIs

Benefits and Business Value

The key is speed. In contrast to the most popular solutions, TIBCO products offer real-time integration and reporting.
The Big Data accelerator shows an easy way to combine Fast Data with Big Data. With the proper approach, the delay between data collection and analytics availability can be significantly reduced, leading to faster trend detection.
At the same time, the combination of event processing products and a live data mart computes vital performance indicators in real time and exposes them immediately to operations teams. Accurate real-time indicators, such as anomaly detection or suboptimal model predictions, can be handled by automated processes to mitigate losses and increase profit.
At the micro scale, timely tracking of customer behavior, combined with statistical models derived from massive data, increases the ability to hit the opportunity window.
The proposed solution grows with the business: it is inherently scalable while retaining real-time intelligence.

Technical Scenario

The Accelerator combines the TIBCO portfolio with open source solutions. It demonstrates the ability to convert Fast Data to Big Data (or Data in Motion to Data at Rest).
The example retailer has a large number of stores and POS terminals, with all traffic handled by a centralized solution. The transaction information with attached customer identification (loyalty card) is delivered over an Apache Kafka message bus. Kafka is an extremely horizontally scalable messaging solution; its advantage over TIBCO EMS is that it can be natively scaled out and in depending on the workload.
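As an illustration, here is a minimal Python sketch of publishing a transaction with the kafka-python client; the broker address, topic name, and message layout are assumptions. Keying by loyalty-card id keeps each customer's events on one partition, and the partition count is what lets Kafka scale consumers out and in:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",          # assumed broker address
        key_serializer=str.encode,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    txn = {"store": 42, "pos": 7, "loyalty_id": "C123456",
           "items": [{"sku": "1001", "qty": 2}]}

    # Keying by loyalty id routes all of a customer's events to one partition
    producer.send("transactions", key=txn["loyalty_id"], value=txn)
    producer.flush()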
The messages are processed by an event processing layer implemented with TIBCO StreamBase CEP. The events are enriched with customer history kept in Apache HBase, another horizontally scalable solution. HBase is a column-oriented datastore with primary access by unique key; its design achieves constant-cost access to customer data with millisecond-range latency. In the proposed solution, HBase keeps the customer transaction history and provides lock-free, constant-cost access for both reading past transactions and appending new ones.
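A hedged sketch of that access pattern using the happybase Thrift client; the table name, row-key scheme, and column family are illustrative assumptions, not the accelerator's actual schema:

    import happybase

    # Connect through the HBase Thrift gateway (host is an assumption)
    connection = happybase.Connection("hbase-host")
    table = connection.table("customer_history")        # assumed table name

    # Constant-cost read: fetch one customer's history by unique row key
    history = table.row(b"C123456")

    # Lock-free append: add the newest transaction under a timestamped column
    table.put(b"C123456", {b"txn:2016-04-01T10:15:00": b'{"total": 54.20}'})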
The StreamBase application also enriches the transaction data by identifying categories for each purchased item. The categorized transactions are used to build a numerical customer profile, which is in turn processed by statistical models. As a result, the customer's propensity to accept each of the current offers is evaluated, and the offer with the best score is sent back to the POS.
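The following Python sketch illustrates that flow: item categorization, profile building, and selection of the best-scoring offer. The category map and the score function are placeholders standing in for the accelerator's reference data and deployed statistical model:

    CATEGORY_MAP = {"1001": "dairy", "2001": "bakery"}   # illustrative sku -> category

    def build_profile(items):
        """Aggregate purchased quantities per category into a numerical profile."""
        profile = {}
        for item in items:
            category = CATEGORY_MAP.get(item["sku"], "other")
            profile[category] = profile.get(category, 0) + item["qty"]
        return profile

    def best_offer(profile, offers, score):
        """Pick the offer with the highest propensity; score() stands in for the model."""
        return max(offers, key=lambda offer: score(profile, offer))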
On the server side, the categorized transaction information is retained to track sales performance and react to changing trends.
The transaction information, together with the offer sent to the customer, is passed to the data analytics store in HDFS. HDFS, a key component of Hadoop, is a distributed filesystem optimized to reliably store and process huge amounts of data. HDFS is a perfect tool for Big Data analytics, but it performs poorly when the data is still moving. In particular, reliable writes of small chunks of data to HDFS are too slow to be used directly in event processing applications.
To store data efficiently in HDFS, the accelerator uses a staging approach: the data is first sent to a Flume component, which aggregates it in batches and stores them in the append-friendly Avro format.
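A hedged sketch of what such a Flume agent configuration might look like; the agent name, port, and HDFS paths are assumptions:

    # Avro RPC source -> file channel -> HDFS sink writing Avro container files
    agent.sources = avroSrc
    agent.channels = fileCh
    agent.sinks = hdfsSink

    agent.sources.avroSrc.type = avro
    agent.sources.avroSrc.bind = 0.0.0.0
    agent.sources.avroSrc.port = 4141
    agent.sources.avroSrc.channels = fileCh

    agent.channels.fileCh.type = file

    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = fileCh
    agent.sinks.hdfsSink.hdfs.path = /data/transactions/%Y-%m-%d
    agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfsSink.hdfs.fileType = DataStream
    agent.sinks.hdfsSink.serializer = avro_event
    # Roll files on size/time so HDFS sees large sequential writes, not small chunks
    agent.sinks.hdfsSink.hdfs.rollInterval = 300
    agent.sinks.hdfsSink.hdfs.rollSize = 134217728
    agent.sinks.hdfsSink.hdfs.rollCount = 0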
The data in the HDFS cluster is used for customer behavior modeling. The TIBCO Spotfire component coordinates model training and deployment, while the actual data access and transformation is performed by an Apache Spark component. Spotfire communicates with Spark to aggregate and process the data for model training. To improve data access, Spark is also used in an ETL process to convert the Avro files to the analytics-friendly Parquet format. The models are built with Spark and H2O, an open source data analytics solution designed for distributed model processing.
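For illustration, a minimal PySpark sketch of the Avro-to-Parquet ETL step; the paths, the partitioning column, and the availability of the spark-avro package are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-to-parquet-etl").getOrCreate()

    # Read the Flume-staged Avro batches (requires the spark-avro package)
    raw = spark.read.format("avro").load("hdfs:///data/transactions/*/*.avro")

    # Rewrite as columnar Parquet, which analytical scans read far faster
    (raw.write.mode("overwrite")
        .partitionBy("category")            # illustrative partitioning column
        .parquet("hdfs:///data/transactions_parquet"))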
With Spotfire, the data scientist orchestrates model creation and submits the best models for hot deployment. To do that, Spotfire communicates with the ZooKeeper component, which keeps the runtime configuration settings and passes changes to all active instances of StreamBase, closing the processing loop.
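A sketch of that hot-deployment handshake using the Python kazoo client; the znode path and payload format are assumptions:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zookeeper:2181")
    zk.start()

    # Spotfire side: publish a reference to the newly approved model
    zk.ensure_path("/accelerator/active_model")
    zk.set("/accelerator/active_model", b"hdfs:///models/propensity_v7")

    # StreamBase side: each instance watches the node and reloads on change
    @zk.DataWatch("/accelerator/active_model")
    def on_model_change(data, stat):
        if data is not None:
            print("reloading model:", data.decode("utf-8"))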


Event Processor

The EventProcessor is a StreamBase application and the central component of the Fast Data event processing story. The application consumes transaction messages from a Kafka topic and applies regular event processing to them, demonstrating several techniques (a sketch of how these stages compose follows the list):

  • in-memory enrichment
  • state delegation
  • filtering
  • transformation
  • model execution
  • decisioning
  • process tracking
  • persistence
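
A minimal, self-contained Python sketch of how these stages compose; the event layout, cached state, and scoring formula are illustrative assumptions, not the StreamBase implementation:

    CUSTOMER_CACHE = {"C123456": {"visits": 18, "avg_basket": 54.20}}  # in-memory state

    def process(event):
        history = CUSTOMER_CACHE.get(event.get("loyalty_id"))   # in-memory enrichment
        if history is None:                                     # filtering: drop anonymous traffic
            return None
        features = [history["visits"], history["avg_basket"],   # transformation
                    sum(item["qty"] for item in event["items"])]
        score = 0.1 * features[0] + 0.01 * features[1]          # stand-in for model execution
        offer = "10_PCT_OFF" if score > 1.5 else "NO_OFFER"     # decisioning
        return {"event": event, "offer": offer}                 # tracked and persisted downstream

    print(process({"loyalty_id": "C123456", "items": [{"sku": "1001", "qty": 2}]}))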

Real Time Dashboard

The demo operational dashboard is a Live DataMart instance. Its purpose is to keep an up-to-date ledger of recent data; the DataMart does not keep the whole event history.
At the moment it contains transaction data and transaction content, visible in the LiveView Web application. The user can drill down from transactions to transaction content.

Data Service

The Big Data resides in HDFS and is typically too large to load into a single process. The data service is a Spark application that exposes a REST/HTTP interface (and Spark SQL, too). The application provides the following services, illustrated by the sketch after the list:

  • Access to preregistered views (like ranges or available categories)
  • ETL process converting Avro to Parquet
  • Featurization result preview
  • Model training execution (with Sparkling Water)
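
A hedged sketch of how such a REST facade over Spark might look; the Flask framework, endpoint name, and paths are illustrative assumptions, not the accelerator's actual code:

    from flask import Flask, jsonify
    from pyspark.sql import SparkSession

    app = Flask(__name__)
    spark = SparkSession.builder.appName("data-service").getOrCreate()

    @app.route("/views/categories")
    def categories():
        # Preregistered view: the distinct categories in the Parquet store
        rows = spark.sql("SELECT DISTINCT category "
                         "FROM parquet.`hdfs:///data/transactions_parquet`").collect()
        return jsonify([row["category"] for row in rows])

    if __name__ == "__main__":
        app.run(port=8080)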


Analytics Dashboard

Spotfire is the main analytics and model lifecycle dashboard. It is used to:

  • Inspect the collected data
  • Prepare the models
  • Review the trained models
  • Build configurations
  • Deploy models

Traffic Simulator

The traffic simulator is a Jython (Python on the JVM) application that injects transactions to be consumed by the event processing layer. The application sends events to a Kafka topic, as sketched below.
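A Jython-flavored sketch of the injector, driving the Java Kafka producer directly from Python syntax; the topic name and message fields are assumptions:

    from java.util import Properties
    from org.apache.kafka.clients.producer import KafkaProducer, ProducerRecord

    props = Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    producer = KafkaProducer(props)
    txn = '{"loyalty_id": "C123456", "items": [{"sku": "1001", "qty": 2}]}'
    producer.send(ProducerRecord("transactions", "C123456", txn))
    producer.flush()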

Open Source Software Components

  • Apache Hadoop
  • Apache Kafka
  • Apache HBase
  • Apache ZooKeeper
  • Apache Spark
  • H2O, Sparkling Water

Big Data Analytics and Visualization

Big Data Connectivity for High Performance Analytics

Spotfire offers three primary types of native integration with Hadoop and other big data sources:

  • Visualizing Data: Native out-of-the-box data connectors that enable fast, interactive data visualizations.
  • Performing Calculations:
    • Bring the engine to the data: Integration with in-data source distributed computing frameworks that enable data calculations of any complexity on big data.
    • Bring the data to the engine: Integration with external statistical engines that get data directly from any data source, including traditional databases.

Together, these modes of integration offer a combination of visual data discovery and advanced analytics. They enable business users to access, combine, and analyze data from any underlying data structures with dashboards and workflows that are powerful and easy to use.

In-Datasource Distributed Computing

In addition to convenient point-and-click Spotfire SQL operations that run distributed within the data source, advanced statistical and machine learning algorithms can be initiated from Spotfire to run in-data-source on very large datasets, returning only the results needed for visualization in Spotfire:

  • Users interact with point-and-click dashboards that call scripts using the TERR instance embedded in Spotfire.
  • The TERR scripts initiate distributed computing jobs via Map/Reduce, H2O, SparkR, or Fuzzy Logix.
  • These jobs drive high-performance engines deployed on the Hadoop or other data source nodes.
  • TERR can be deployed as the advanced analytics engine in Hadoop nodes that are driven by MapReduce or Spark. It can also be called on Teradata nodes.
  • Results are visualized in Spotfire.