novaVM: Enhanced Java Virtual Machine for Big Data Applications

Abstract: The need to process large amounts of data, i.e. Big Data, is a reality. From scientific experiments to social networks, Big Data applications must process and store massive amounts of data efficiently. In addition, thanks to fast development cycles and large community resources, managed object-oriented programming languages (such as Java) have become the preferred languages for implementing many Big Data applications.
However, these languages run on top of managed runtimes that were not built to cope with the challenges imposed by Big Data applications. In particular, this work identifies three challenges that need to be addressed: i) the need to quickly recover from failed nodes or to spawn more nodes to accommodate new workload demands; ii) the need to improve runtime memory management so that it scales to large amounts of in-memory data without sacrificing application latency; iii) the need to manage resources efficiently and minimize resource waste. These problems are fundamental to most Big Data applications running on managed runtimes and cannot be solved using previously proposed solutions.
To solve these problems, this work proposes a number of algorithms: i) ALMA, a migration/replication algorithm that takes advantage of internal memory management information to improve runtime migration/replication; ii) NG2C, an N-Generational Garbage Collector that reduces applications' long tail latencies; iii) POLM2, an offline profiler that can be used to profile workloads, and whose output profiling information can be used to configure NG2C; iv) ROLP, an online profiler that runs inside the runtime, automatically profiles the application, and configures NG2C; v) Dynamic Vertical Scaling, a new heap sizing strategy that improves runtime resource management to reduce resource waste.
All the proposed algorithms are implemented as sub-systems of novaVM, a new Java Virtual Machine (JVM) built on top of the OpenJDK 8 HotSpot JVM, a widely used industrial JVM. Each algorithm is evaluated using benchmarks and workloads based on real-world applications. The results are promising and indicate that the proposed goals were achieved.
This work is supported by a number of publications in international journals and conferences. In addition, novaVM's source code is open source, and parts of it are now included in several open-source projects such as CRIU and OpenJDK.
