Rainbow Analytics with StreamSight
StreamSight is a query-driven framework for streaming analytics in edge computing. It lets users compose analytic queries that are automatically translated and mapped to streaming operations suitable for running on distributed processing engines deployed across wide geographic areas. Our reference implementation has been integrated with Apache Storm 2.2.0.
What does StreamSight do?
Data scientists and platform operators can compose complex analytic queries using only high-level directives and expressions, without any knowledge of the programming model of the underlying distributed processing engine. An example of such an analytic query is the following:
insight = compute avg("_SYSTEM_CPU_VISIBLETOTAL" from (stream2), 10s) every 5s;
The above query computes the average of the total CPU metric over a 10-second sliding window, re-evaluated every 5 seconds as new datapoints arrive. This is particularly useful for system administrators who want to detect CPU throttling.
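To make the window semantics concrete, here is a minimal Python sketch of a time-based sliding average. It is not StreamSight code: the function name `sliding_average`, the `(timestamp, value)` input format, and the choice of a half-open window `(t - window, t]` are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def sliding_average(
    points: Iterable[Tuple[float, float]],
    window_s: float = 10.0,
    every_s: float = 5.0,
) -> List[Tuple[float, float]]:
    """Return (emit_time, average) pairs for a time-based sliding window.

    points: (timestamp_seconds, value) pairs.
    At each emit time t = every_s, 2 * every_s, ... the average covers the
    values whose timestamp falls in (t - window_s, t] (assumed boundaries).
    """
    pts = list(points)
    if not pts:
        return []
    last_ts = max(ts for ts, _ in pts)
    results = []
    t = every_s
    # Keep emitting until the newest datapoint has been covered once.
    while t < last_ts + every_s:
        in_window = [v for ts, v in pts if t - window_s < ts <= t]
        if in_window:  # skip emit times with an empty window
            results.append((t, sum(in_window) / len(in_window)))
        t += every_s
    return results
```

With the defaults, the datapoint at second 6 contributes to the results emitted at seconds 10 and 15, which is the overlap that distinguishes a sliding window from a tumbling one.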
- Make sure you’ve installed Docker on your machine.
- Build the cluster using the following command:
- Create and submit the job using the following script:
- See this section for configuring StreamSight.
- See this section for composing your queries.
The following figure depicts a high-level and abstract overview of the Fog Analytics Cycle. Users submit ad-hoc queries following the declarative query model and the system compiles these queries into low-level streaming components.
The next figure presents the Storm architecture when an analytics job is submitted. Each fog node comprises a set of Supervisor nodes responsible for executing analytics tasks. ZooKeeper coordinates the communication between the master node (Nimbus) and the Supervisor nodes.
Insight Declaration Examples
In this section, we present a few examples of useful insights for monitoring a cloud infrastructure.
CPU utilization is an insight that many companies monitor and base decisions on. The following insight returns the average CPU utilization of a service, as the sum of the average user and system CPU usage over a 30-second window, computed every 10 seconds.
cpu_utilization = COMPUTE AVG("cpu_user" FROM (stream), 30s) + AVG("cpu_sys" FROM (stream), 30s) EVERY 10s;
The next insight computes the maximum number of HTTP requests per second over a 10-second window, grouped by region and evaluated every 30 seconds. For DevOps engineers and developers who work on web-based applications, peak traffic in a specific region can be a critical factor.
http_requests_per_seconds_by_region = COMPUTE MAX("requests_per_seconds" FROM (stream), 10s) BY "region" EVERY 30s;
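The grouping step of the query above can be sketched in Python as follows. This is only an illustration of what BY "region" means for the tuples of a single window; the helper name `max_by_group` and the dict-based tuple representation are assumptions, and the windowing itself is left out.

```python
from typing import Any, Dict, Iterable, Mapping


def max_by_group(
    window_tuples: Iterable[Mapping[str, Any]],
    value_field: str = "requests_per_seconds",
    group_field: str = "region",
) -> Dict[Any, float]:
    """Return the maximum of value_field for each distinct group_field value."""
    best: Dict[Any, float] = {}
    for tup in window_tuples:
        group = tup[group_field]
        value = tup[value_field]
        # Keep the largest value seen so far for this group.
        if group not in best or value > best[group]:
            best[group] = value
    return best


# Hypothetical contents of one 10-second window.
window = [
    {"region": "eu", "requests_per_seconds": 120},
    {"region": "us", "requests_per_seconds": 80},
    {"region": "eu", "requests_per_seconds": 95},
    {"region": "us", "requests_per_seconds": 140},
]
peaks = max_by_group(window)
```

Each evaluation of the insight would produce one such per-region result for the window that is current at that moment.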
Below is an example of a query whose execution depends on the condition on its right-hand side. This is particularly useful when analysts want to avoid collecting results for values that are not of interest to them.
abnormal_temperature = compute avg("temperature" from (stream)) when > 100
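A rough Python sketch of this conditional behaviour is shown below. It assumes that an average without a window is a running average over all tuples seen so far, and that the `when` clause simply suppresses results that do not satisfy the condition; both points are assumptions about the semantics, and the generator name `running_avg_when_above` is hypothetical.

```python
from typing import Iterable, Iterator


def running_avg_when_above(
    values: Iterable[float], threshold: float = 100.0
) -> Iterator[float]:
    """Yield the running average only while it exceeds the threshold."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
        average = total / count
        # Results that fail the condition are dropped, not emitted.
        if average > threshold:
            yield average


# Running averages are 90, 105, 113.33..., 100; only two exceed 100.
emitted = list(running_avg_when_above([90, 120, 130, 60]))
```

Only the results that pass the condition reach the consumer, which keeps uninteresting values out of the output.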
The example below computes the cumulative energy consumption, which is updated every time a new tuple arrives as input.
total_energy = compute sum("energy_consumption" from (stream))
If we also need the overall cost, we can multiply the above expression by the double value that represents the cost per unit (0.002).
total_cost_energy = 0.002 * compute sum("energy_consumption" from (stream))
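The two cumulative insights above can be pictured with the following Python sketch. It is an illustration rather than the framework's implementation: the function names mirror the insight names for readability, and a list of readings stands in for the input stream.

```python
from itertools import accumulate
from typing import Iterable, List


def total_energy(readings: Iterable[float]) -> List[float]:
    """Running sum of energy consumption, one result per input tuple."""
    return list(accumulate(readings))


def total_cost_energy(
    readings: Iterable[float], unit_cost: float = 0.002
) -> List[float]:
    """Running cost: the cumulative sum scaled by the cost per unit."""
    return [unit_cost * total for total in accumulate(readings)]


# Hypothetical energy readings arriving one tuple at a time.
readings = [100, 250, 150]
energy = total_energy(readings)     # running totals: 100, 350, 500
cost = total_cost_energy(readings)  # each running total multiplied by 0.002
```

Because the sum has no window, each new tuple updates a single running total instead of replacing an earlier result.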
| Parameter | Description |
|---|---|
| Topology Name | The name of the StreamSight topology to be created |
| Output Operation | The operation to be applied for output results (Local print, Ignite exporter, Influx exporter) |
| Provider Hosts | The hostnames of data providers |
| Provider Port | The port number of data providers |
| Consumer Host | The hostname of data consumer |
| Consumer Port | The port number of data consumer |
The framework is open-sourced under the Apache 2.0 License. The codebase is maintained by the authors for academic research and is therefore provided “as is”.