Streaming Aggregators to realise data summarisation on data streams stored in a Solid environment #84

pbonte · 2022-10-19T09:15:48Z

Pitch

Data streams are becoming omnipresent. However, storing and analysing real-time data streams in a decentralised fashion using Solid is still hard to achieve. This is mainly due to the high frequency of changes in the answers to the queries issued on these streams, and to the temporal validity of those answers.
A first prototype of streaming aggregators is necessary to prepare the answers of a continuous query over streaming data for a client and to keep the query results up to date. This eliminates the need for the client to process the whole stream, while the aggregator allows the client to retrieve the results instantaneously.
In a patient monitoring system, data streams produced by personal vitality sensors and activity trackers are semantically annotated and stored in data pods. Healthcare providers are interested in summaries of the activity of a single patient or summaries across multiple patients.
Streaming aggregators are required to provide better data summarisation and instantaneous results, because the data to be analysed in a pull-based fashion is extremely large owing to the continuous nature of the data streams.
The DAHCC Dataset will be used as the data stored for each patient in a Solid pod to realise the aggregators.

Desired Solution

A first proof-of-concept streaming aggregator that runs as a service, with which a client application can interact. The client application can specify the query that needs to be continuously evaluated on one or multiple data streams stored in Solid pods. The solution is required to:

Execute the queries requested by the client application over a specific time-based window (to be specified by the client application).
Aggregate data from streams stored on a single pod as well as across multiple pods.
Compute time-based tumbling and sliding windows over the data streams.
Store the results of the continuous queries in the aggregator, so that the client can execute a GET request to access the aggregated data summarisation.
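To make the windowing requirement concrete, here is a minimal sketch of time-based tumbling and sliding windows over timestamped values. It is not the proposed implementation. The function names (`window_starts`, `windowed_average`), the integer timestamps, and the choice of an average as the aggregate are all assumptions made for illustration. A tumbling window is treated as the special case where the slide equals the window width.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_starts(timestamp: int, width: int, slide: int) -> List[int]:
    """Return the start of every window [start, start + width) containing timestamp.

    Windows start at multiples of `slide` (hypothetical convention).
    slide == width gives tumbling windows; slide < width gives sliding windows.
    Windows that would start before time 0 are ignored, and timestamps are
    assumed to be non-negative.
    """
    if width <= 0 or slide <= 0:
        raise ValueError("width and slide must be positive")
    starts = []
    # Latest window start at or before the timestamp.
    start = (timestamp // slide) * slide
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - width:
        if start >= 0:
            starts.append(start)
        start -= slide
    return sorted(starts)


def windowed_average(
    events: Iterable[Tuple[int, float]], width: int, slide: int
) -> Dict[int, float]:
    """Average the values of (timestamp, value) events per window start."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, value in events:
        for start in window_starts(timestamp, width, slide):
            sums[start] += value
            counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}
```

Calling `windowed_average(events, 10, 10)` evaluates tumbling windows of width 10, while `windowed_average(events, 10, 5)` evaluates sliding windows of the same width that advance by 5, so an event can contribute to more than one window.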

Use Case

The dataset has sensor values from multiple patients. To monitor a patient's location, we use the sensors that detect the presence of a person in the house. In the DAHCC dataset, person detection sensors are deployed in the three halls, the kitchen and the bedroom. We will aggregate each patient's location in a particular window, as well as the locations of all the patients. This allows us to compute a summary of the activity of each patient, which is a useful insight for healthcare providers.
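As a rough illustration of this use case, the sketch below counts presence detections per room in tumbling windows, both per patient and across all patients. The event shape `(patient, timestamp, room)`, the patient identifiers, the room labels and the function name `summarise_locations` are assumptions for illustration; they are not taken from the DAHCC dataset schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event: (patient id, timestamp in seconds, room label).
Event = Tuple[str, int, str]


def summarise_locations(events: Iterable[Event], width: int):
    """Count presence detections per room in tumbling windows of `width` seconds.

    Returns two plain dicts:
      per_patient: (window_start, patient) -> Counter of room -> detections
      overall:     window_start            -> Counter of room -> detections
    """
    per_patient: Dict[Tuple[int, str], Counter] = defaultdict(Counter)
    overall: Dict[int, Counter] = defaultdict(Counter)
    for patient, timestamp, room in events:
        # Each event belongs to exactly one tumbling window.
        start = (timestamp // width) * width
        per_patient[(start, patient)][room] += 1
        overall[start][room] += 1
    return dict(per_patient), dict(overall)


# Made-up sample detections, not real DAHCC data.
sample_events = [
    ("p1", 5, "kitchen"),
    ("p1", 20, "kitchen"),
    ("p1", 50, "bedroom"),
    ("p2", 10, "hall1"),
    ("p2", 70, "kitchen"),
]
per_patient, overall = summarise_locations(sample_events, width=60)
```

Here `per_patient[(0, "p1")]` holds the room counts for patient `p1` in the first 60-second window, and `overall[0]` holds the counts for all patients in that window.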

Acceptance Criteria

A demo resulting from the solution should be able to:

Accept continuous queries for streams originating from a single pod or multiple pods.
Let the client retrieve the results of the queries through a GET operation.
Show that the results are complete.
Show the speed-up compared to a client application that does not use the aggregation service.

Assumptions

Long-term server-side authenticated sessions have been resolved.
The registered queries are SPARQL SELECT queries (or RSP-QL queries if we want to define the streaming operator inside the query).
This is a first prototype that does not need to be fully optimised.
LDES will be used to store the streams on Solid pods, and the LDES client will be used to continuously retrieve the latest changes to the LDES.

Compared to (#24), we focus on the streaming and windowing aspects of data aggregation.

Scenarios

This is part of a larger scenario

s-minoo · 2022-10-22T09:02:48Z

A few papers that I came across might be relevant to sliding-window aggregation:

The author, Jonas Traub, developed the aggregate stream slicing in that specific order

github-actions · 2022-11-10T03:16:23Z