Clepsydra Storage - Introduction

Introduction

Clepsydra is a flexible and scalable system for aggregation, processing and provisioning of data from heterogeneous sources. It was designed and developed to be a basis for services focused on aggregation and enrichment of (meta)data describing on-line collections of cultural heritage digital objects from Polish memory institutions. The first production deployment of this system is the PIONIER Network Digital Libraries Federation.

More information about Clepsydra and its components can be found on the main Clepsydra website.

Clepsydra Storage is a component of Clepsydra whose main aim is to serve as a flexible and scalable service for storing and accessing large amounts of heterogeneous data. The data is organized to reflect the fact that it describes objects grouped in collections (data sets), and that each object can have many representations (e.g. a digitized book may be represented at the same time by several metadata records in different schemas, by thumbnails of its cover in different formats/dimensions, and by its extracted textual content).
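The collection / object / representation structure described above can be pictured with a small Java sketch. This is only an illustration of the idea, not the actual Clepsydra Storage classes: all class names, field names and the "kind" labels are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only; every name here is hypothetical, not Clepsydra's API. */
public class DataModelSketch {

    /** One representation of an object, e.g. a metadata record or a thumbnail. */
    static final class Representation {
        final String kind;    // hypothetical label such as "thumbnail/128x128"
        final byte[] content; // opaque payload

        Representation(String kind, byte[] content) {
            this.kind = kind;
            this.content = content;
        }
    }

    /** An object (e.g. a digitized book) holding any number of representations. */
    static final class StoredObject {
        final String id;
        final List<Representation> representations = new ArrayList<>();

        StoredObject(String id) {
            this.id = id;
        }

        StoredObject add(String kind, String text) {
            representations.add(new Representation(kind, text.getBytes(StandardCharsets.UTF_8)));
            return this;
        }
    }

    /** A collection (data set) that groups objects by identifier. */
    static final class ObjectCollection {
        final String name;
        final Map<String, StoredObject> objects = new LinkedHashMap<>();

        ObjectCollection(String name) {
            this.name = name;
        }

        /** Returns the object with the given id, creating it on first use. */
        StoredObject object(String id) {
            return objects.computeIfAbsent(id, StoredObject::new);
        }
    }

    /** Builds one collection with one book that has three representations. */
    static ObjectCollection sampleCollection() {
        ObjectCollection library = new ObjectCollection("sample-library");
        library.object("book-1")
                .add("metadata/schema-a", "<record>...</record>")
                .add("thumbnail/128x128", "(image bytes)")
                .add("text/plain", "Extracted text of the book");
        return library;
    }

    public static void main(String[] args) {
        ObjectCollection library = sampleCollection();
        for (StoredObject object : library.objects.values()) {
            System.out.println(object.id + ": " + object.representations.size() + " representations");
        }
    }
}
```

The point of the sketch is that representations hang off an object rather than standing alone, so a metadata record, a thumbnail and the extracted text of the same book stay linked.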

The Clepsydra Storage implementation is a stateless service written in Java (TM), available through a REST API. It stores data records in an underlying Apache Cassandra NoSQL database cluster and the metadata of these records in a PostgreSQL database (preferably also a high-availability cluster). The metadata is stored separately, in a relatively simple relational database, to allow selective access to the records stored in the Clepsydra Storage service.

One of the important aspects of Clepsydra Storage is that it is designed on the assumptions that:

the data stored in the service will come from many sources,
once stored in Clepsydra, the data will be processed and the processing results (derived data) will be stored back in Clepsydra,
there will be external services interested in further use or processing of a selection of the data stored in Clepsydra, e.g. data from particular source(s) or data in a particular schema.

Therefore each data record stored in Clepsydra is described by several parameters (metadata), including the source of the record, a reference to the record from which the given record was derived, the date of the last record modification (processing) etc. The details of the metadata structure can be found in the REST API documentation, and a sample data lifecycle is described in the documentation. For services which prefer a push approach when getting data from Clepsydra Storage, a dedicated JMS interface with change notifications was prepared.
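To make the role of the record metadata more concrete, the following Java sketch models a few of the parameters mentioned above (source, schema, reference to the originating record, last modification date) and selects records by them. It is a minimal in-memory illustration under stated assumptions: the class, field and method names are hypothetical and do not reflect the actual REST API or database schema, which are described in the REST API documentation.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; names and fields are hypothetical, not Clepsydra's API. */
public class RecordMetadataSketch {

    static final class RecordMetadata {
        final String recordId;
        final String source;        // where the record came from
        final String schema;        // schema of the record's data
        final String derivedFrom;   // id of the originating record, or null if none
        final Instant lastModified; // date of last modification (processing)

        RecordMetadata(String recordId, String source, String schema,
                       String derivedFrom, Instant lastModified) {
            this.recordId = recordId;
            this.source = source;
            this.schema = schema;
            this.derivedFrom = derivedFrom;
            this.lastModified = lastModified;
        }
    }

    /** Selects records matching the given criteria; a null criterion matches everything. */
    static List<RecordMetadata> select(List<RecordMetadata> all, String source,
                                       String schema, Instant modifiedSince) {
        return all.stream()
                .filter(m -> source == null || source.equals(m.source))
                .filter(m -> schema == null || schema.equals(m.schema))
                .filter(m -> modifiedSince == null || !m.lastModified.isBefore(modifiedSince))
                .collect(Collectors.toList());
    }

    /** Two harvested records and one record derived from the first by processing. */
    static List<RecordMetadata> sample() {
        return Arrays.asList(
                new RecordMetadata("r1", "library-a", "schema-a", null,
                        Instant.parse("2020-01-01T00:00:00Z")),
                new RecordMetadata("r2", "library-a", "schema-b", "r1",
                        Instant.parse("2020-01-02T00:00:00Z")),
                new RecordMetadata("r3", "library-b", "schema-a", null,
                        Instant.parse("2020-01-03T00:00:00Z")));
    }

    public static void main(String[] args) {
        // All records from one source, regardless of schema or modification date.
        for (RecordMetadata m : select(sample(), "library-a", null, null)) {
            System.out.println(m.recordId + " derived from " + m.derivedFrom);
        }
    }
}
```

A pull-style client could repeat such a selection with the date of its previous run as `modifiedSince` to pick up only changed records; services preferring push would rely on the JMS change notifications instead.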