Techniques for Enhancing the Performance of Data-intensive Management Systems

Abstract: The demand for storing and analyzing large volumes of data is on the rise as web-based enterprises introduce innovative, interactive applications that attract more and more users on a global scale. To cope with such data volumes, data management systems have been evolving to deliver increasingly better performance and efficiency at lower cost in large-scale scenarios. A fundamental property of these systems is data consistency. In storage systems, consistency refers to how accurate, fresh, and synchronized the state of data replicas is across different machines. Most of these systems sacrifice consistency in favor of availability and performance, while others provide strong consistency at the expense of availability and performance. In data processing systems, such as dataflow and continuous stream processing systems, consistency refers to how completely the input is reflected in the final output of the dataflow within a given time frame. Traditional dataflow management systems achieve strong consistency by enforcing strict temporal synchronization across processing steps. In a multitude of scenarios, this model results in inefficient executions that change the output only marginally with respect to its previous state. Stream processing systems, on the other hand, which deal with timestamped events, tend to be looser in terms of consistency in order to sustain low latency and avoid overloading resources, which might not be acceptable in mission-critical applications. The main goal of our research is to study performance optimizations for data-intensive management systems. At the heart of these optimizations lies the tuning of data consistency. For this tuning, we take into account the semantics of data in order to trade off consistency for performance and resource usage in data management systems.
Our evaluation indicates that we can achieve substantial performance gains, namely in terms of latency, throughput, bandwidth, and resource utilization, while keeping application outputs within acceptable levels of correctness, as defined by decision makers.
