Using Kubernetes Operators for Distributed Data Processing Pipelines

Big Data and data processing have been at the center of many businesses that emerged over the last decade. As data volumes grow beyond what a single processor can handle, scalability has become a growing concern when building efficient and fast data processing systems. In particular, as data growth outpaces compute performance, attention has shifted from scaling vertically, i.e., processing on more powerful machines, to scaling horizontally, i.e., processing in parallel on more machines. Horizontal scaling poses its own challenges, however: distributing the processing across a large number of machines adds complexity both to the distribution of the data itself and to the coordination of the individual processing workers. Several widely used frameworks, such as Apache Spark and Apache Hadoop, address this demand. These frameworks impose a mental model of a central processing controller, that is, a program instance that imperatively assigns processing jobs to the workers. We propose a shift in both the mental model and the programming model by running the processing in a different architecture. Instead of employing the frameworks mentioned above and relying on central coordination, which we refer to as orchestration, we pursue choreography for distributed data processing workflows. To this end, we adapt the Operator pattern, which was proposed for handling domain-specific workloads on cloud-computing platforms, to gain the low coupling and better maintainability that decentralized control flow offers. We explore the feasibility of this approach by implementing an exemplary batch processing workflow on the Kubernetes platform, which serves as the underlying driver for containerized workloads, and give insights into the complexity of distributing control in such a decentralized way.
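
To make the distinction between orchestration and choreography more concrete, the following is a minimal Python sketch of the choreographed style. It is an illustration under stated assumptions, not the implementation explored in this work. All names (`Store`, `Item`, `StageWorker`, `run_until_settled`) and the stage labels are hypothetical, and an in-memory dictionary stands in for the shared cluster state that a platform such as Kubernetes would provide. Each worker only observes the shared state and reacts to items in the stage it is responsible for; no component tells a worker which job to run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Item:
    stage: str  # current state of the item, visible to every worker
    data: list


@dataclass
class Store:
    # Hypothetical stand-in for a shared, observable state store.
    items: Dict[str, Item] = field(default_factory=dict)


@dataclass
class StageWorker:
    consumes: str                # stage this worker reacts to
    produces: str                # stage it leaves the item in
    fn: Callable[[list], list]   # the processing step itself

    def reconcile(self, store: Store) -> bool:
        """Process every item found in the consumed stage.

        Returns True if at least one item was changed.
        """
        changed = False
        for item in store.items.values():
            if item.stage == self.consumes:
                item.data = self.fn(item.data)
                item.stage = self.produces
                changed = True
        return changed


def run_until_settled(store: Store, workers: List[StageWorker]) -> None:
    # Simulates independently running workers in a single process. The loop
    # assigns no jobs; it only gives each worker a turn to observe the store
    # and stops once a full round produces no change.
    while any([w.reconcile(store) for w in workers]):
        pass


store = Store({
    "batch-1": Item("raw", [3, None, 4]),
    "batch-2": Item("raw", [None, 10]),
})

# Workers are listed in "wrong" order on purpose: the result does not depend
# on it, because each worker reacts only to the state it observes.
workers = [
    StageWorker("cleaned", "summed", lambda xs: [sum(xs)]),
    StageWorker("raw", "cleaned", lambda xs: [x for x in xs if x is not None]),
]

run_until_settled(store, workers)
for name, item in store.items.items():
    print(name, item.stage, item.data)
```

In this sketch the workflow emerges from the stage labels alone: adding a further processing step means adding a worker that consumes `summed`, without modifying any existing worker. An orchestrated variant would instead contain a controller that calls the cleaning and summing steps in a fixed order.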