Scalable DataFrames on Kubernetes

Chathura Widanage
5 min readJun 15, 2021

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

Although pandas is considered the first implementation of DataFrame that was solid enough to catch a massive user-base, pandas is not designed to handle huge volumes of data across a distributed cluster of computing nodes. Instead, pandas can be considered the baseline for most other DataFrame implementations that emerged around pandas creating a new Eco-system.

Cylon is a fast, scalable, distributed memory, parallel runtime with a Pandas like DataFrame. Since Cylon core is written in C++ to use Apache Arrow to manipulate data, it has proven to be more efficient[1][2] than the existing distributed DataFrame implementations. Additionally, the python DataFrame API doesn’t inherit python’s performance bottlenecks due to the solid C++ core that performs all the heavy lifting, including the P2P communication between the distributed workers.

While performance is the crucial functional focus of Cylon, usability had always been the key non-functional focus. The goal is to keep the programming model fixed for the users by providing a DataFrame API that they are already familiar with and allowing the environment to be switched freely to achieve the best possible…

--

--