Hacking Big Data with Cylon DataFrames
Cylon is a fast and scalable library for manipulating terabytes of data on a distributed setup through a Python-based DataFrame API. Although Python and “fast and scalable” rarely go hand in hand, the following facts explain why Cylon is fast at runtime while still letting you write your application logic in the language every data scientist loves: Python.
- Cylon’s core is written entirely in C++. The Python API is a thin wrapper around this high-performance core.
- DataFrames map directly to Apache Arrow tables at runtime, making data aggregation and transformation efficient and cache-friendly thanks to the columnar data representation.
- Communication is built entirely on top of frameworks proven to be fast (MPI and UCX).
- Since MPI/UCX and Apache Arrow all work at the binary level, Cylon has no data serialization/deserialization overhead.
Getting started with Cylon
To get started with Cylon development, you need a few prerequisites in your development environment.
- Docker
- VSCode
- 5 minutes!
Step 1: Pull the Cylon image from Docker Hub
docker pull cylondata/cylon
If everything goes fine, you will see output similar to the following.
Using default tag: latest
latest: Pulling from cylondata/cylon
Digest: sha256:dc6f0b0dc167e2e4f595b72c9edfb4e66780974bfa515dc3edc7f3084d32a23a
Status: Downloaded newer image for cylondata/cylon:latest
docker.io/cylondata/cylon:latest
Step 2: Create a Docker volume
Create a Docker volume to hold your application code.
docker volume create cylon-vol
Step 3: Start the Cylon container with the volume attached
docker run -it -v cylon-vol:/code cylondata/cylon
Step 4: Fire up VSCode
Head into the Extensions panel and install the Remote - Containers extension.
Step 5: Attach to the Cylon container
Switch to the Containers section and attach to the running Cylon container.
You should now see that the IDE has attached to the container.
Running a Sample Application
The Cylon sources and binaries are located in the /cylon directory, and your development environment is already preloaded with everything you need to run a Cylon application locally. With the command below, you can run one of the sample applications.
cylon@d4872133cdee:~$ python3 /cylon/python/examples/dataframe/join.py
Develop Your First Cylon App
Step 1: Open the development directory
Go to File -> Open Folder and type /code to open your development directory.
This step is vital to ensure that all your work is saved into the Docker volume we created earlier. Otherwise, you will lose your work once the container shuts down.
Step 2: Set the Cylon ENV as the interpreter
Press Ctrl+Shift+P and find the option to change the Python interpreter. Set /cylon/ENV as the interpreter location. This will enable IntelliSense!
Step 3: Create a .py file and import Cylon
To create a new Python file, go to File -> New File and select Python as the language. Press Ctrl+S to save the file. Then import DataFrame and start coding!
from pycylon import DataFrame
Step 4: Testing & Running
Let’s consider the following code as your application.
Go ahead and create some test CSV files too.
Now press Ctrl+F5 to run your application.
Give Your Apps Superpowers: Running Cylon in Distributed Mode
You can define a distributed runtime (environment) to make your application runnable in multi-process mode, including with processes distributed across different nodes.
Step 1: Create an Environment
Initialize a Cylon environment with MPIConfig.
Passing env to an operator switches that operator into distributed mode.
Step 2: Testing & Running
Jump to the terminal and type,
mpirun -np 2 python3 myfirst.py
This starts two Cylon workers and executes the merge in distributed mode.
Step 3: Data Loading
Up to this point, both workers have been reading the same data, which is rarely what we want. You can use either of the following options to split the initial data between workers.
Option 1: Use Cylon’s built-in data slicing.
You can pass slice=True to read_csv to make Cylon balance the data evenly across the available workers.
Option 2: Load different files based on the worker rank.
Alternatively, you can load a different file on each worker based on the worker’s rank, as follows. This option gives you more control over data distribution.
Congratulations! Your DataFrame application is now ready to be thrown at a cluster handling terabytes of data.
Learn More about Cylon
Contribute to Cylon
More Articles on Cylon
Call To Action
- Clap. Appreciate and let others find this article.
- Comment. Share your views on this article.
- Follow me, Chathura Widanage, to receive updates on articles like this.
- Keep in touch. LinkedIn, Twitter