Data Aggregation and Enrichment Framework
Clepsydra is a flexible and scalable system for aggregation, processing and provisioning of data from heterogeneous sources. It was designed and developed to be a basis for services focused on aggregation and enrichment of (meta)data describing on-line collections of cultural heritage digital objects from Polish memory institutions. The first production deployment of this system is the PIONIER Network Digital Libraries Federation.
Below you can find more details on the motivation for developing the system and a very general description of its components. The system features are described below; for its architecture, please check the link in the right menu.
Each Clepsydra component that is currently available as open source also has its own subpage, available in the menu. These components are independent, and we are going to release them one after another as they reach a proper stage of development.
Functional requirements. While designing and developing the Clepsydra system we wanted to have the possibility to:
Store and access large amounts of heterogeneous data records, organized in a way which supports the following:
Users and components of Clepsydra, when accessing the stored data:
Aggregate data records from many different kinds of sources, organized in a way which supports the following:
Process data records from one format to another, organized in a way which supports the following:
Non-functional requirements. Scalability and robustness are among the key non-functional aspects of architecture in large production-quality systems. Experience from the previously mentioned Digital Libraries Federation has shown us that the most serious scalability issues are related to growth in the amount of aggregated data or in the complexity of the data processing workflow. Faults are in the majority of cases caused by technical problems or unexpected behavior on the data providers' side. Real-life examples include hangs during data transmission, very long data provider response times, or a situation in which a data source holding 100,000 data records, which usually adds or updates up to 300 records daily, suddenly reports that 90% of all objects were updated during a single working day. To address this, we designed the aggregation component architecture as a set of independent fault-tolerant agents, and all system components were designed to scale easily. This also included choosing proper underlying technologies such as Apache Cassandra.
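To make the update-spike example above concrete, the following minimal Python sketch compares one harvesting run against a source's usual daily volume. It is only an illustration under stated assumptions: the names `HarvestReport`, `changed_fraction` and `is_suspicious`, as well as the `spike_factor` threshold, are hypothetical and do not describe how the Clepsydra agents actually detect or handle such situations.

```python
from dataclasses import dataclass


@dataclass
class HarvestReport:
    """Summary of one harvesting run for a single data source (hypothetical)."""
    source_id: str
    total_records: int    # records currently held by the source
    changed_records: int  # records reported as added or updated in this run


def changed_fraction(report: HarvestReport) -> float:
    """Return the share of the source's records reported as changed."""
    if report.total_records <= 0:
        return 0.0
    return report.changed_records / report.total_records


def is_suspicious(report: HarvestReport,
                  typical_daily_changes: int,
                  spike_factor: float = 50.0) -> bool:
    """Flag a run whose change count far exceeds the usual daily volume.

    spike_factor is an arbitrary illustrative threshold, not a value
    taken from the Clepsydra project.
    """
    baseline = max(typical_daily_changes, 1)  # avoid a zero baseline
    return report.changed_records > spike_factor * baseline


# The numbers from the text: 100,000 records, normally up to 300 changes a day.
normal = HarvestReport("library-a", total_records=100_000, changed_records=300)
spike = HarvestReport("library-a", total_records=100_000, changed_records=90_000)

for run in (normal, spike):
    print(run.source_id,
          f"{changed_fraction(run):.1%}",
          is_suspicious(run, typical_daily_changes=300))
```

In this sketch a normal day corresponds to 0.3% of the records changing, while the problematic day corresponds to 90%, a 300-fold increase over the usual load. An agent that notices such a jump could, for example, throttle or postpone processing for that source without affecting the others.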
Another important non-functional requirement is interoperability. As stated in the functional requirements section, the system should be able to aggregate data from heterogeneous sources, which calls for interoperability with data providers. Interoperability should also be ensured at the level of the interfaces used by external systems that want to access the aggregated data, so that access is offered in a technology-neutral manner. That is why we decided that all public APIs of Clepsydra should be REST APIs.
The data aggregation, processing and provisioning system should also be ready to operate for several years without major architectural redesigns. Maintainability and extensibility, as well as operability, availability and reliability, are therefore additional important non-functional requirements that we are trying to address. With the proposed architecture it should be possible to easily extend the system's features in the areas of aggregation, processing and provisioning. The data aggregation and processing components are separated from the data provisioning processes, so that the data remains accessible even when the aggregation and processing services are not functioning for some reason or are overloaded with an unexpected amount of data. Conversely, an increased number of end-user requests should not affect data aggregation operations.
Clepsydra consists of three core components corresponding to the three groups of functional features described above:
The figure below symbolically shows the relation between the three core components of Clepsydra and also explains the name of the project.
We welcome any parties interested in using our product or participating in its development. If you are interested in our project, please feel free to contact us.
The project source code is licensed under the Apache License, Version 2.0.
More components will be available later on.