Sitemap

Hacking BigData with Cylon DataFrames

5 min readJun 1, 2021

--

Cylon is a fast and scalable library to manipulate terabytes of data on a distributed setup with a Python-based DataFrame API. Although Python and “fast and scalable” don’t go hand in hand, the following facts will give you an idea of why Cylon is fast at runtime, while you are getting the flexibility to code your application logic in the language that every data scientist love, “Python”.

  • Cylon’s core is entirely written in C++. Python API is just a wrapper around this high performance core.
  • The DataFrames are directly mapped at runtime to Apache Arrow tables making data aggregation and transformation efficient and cache-friendly due to the columnar data representation.
  • Communication is entirely written on top of frameworks proven to be faster(MPI and UCX).
  • Since MPI/UCX and Apache Arrow all works on binary level, Cylon doesn’t have any data serialization/de-serialization overhead.

Getting started with Cylon

To get started with Cylon development, you need few prerequisites to be present in you development environment.

  • Docker
  • VSCode
  • 5 minutes!

Step 1: Pull Cylon containers from the Docker Hub

docker pull cylondata/cylon

If everything went fine, you would see an output similar to below.

Using default tag: latest
latest: Pulling from cylondata/cylon
Digest: sha256:dc6f0b0dc167e2e4f595b72c9edfb4e66780974bfa515dc3edc7f3084d32a23a
Status: Downloaded newer image for cylondata/cylon:latest
docker.io/cylondata/cylon:latest

Step 3: Create a Docker Volume

Create a docker volume for your application code.

docker volume create cylon-vol

Step 4: Start the Cylon Container with the volume attachment

docker run -it -v cylon-vol:/code cylondata/cylon

Step 2: Fire up VSCode

Head into the Extensions panel and install the Remote Containers extension.

Step 3: Attach to Cylon Container

Switch to the containers section and attach to the Cylon container.

Switch to the Containers
Attach to the Cylon Container

Now you should see the IDE has attached to the container.

Running a Sample Application

The Cylon source and binaries are located at /cylon directory, and your development enthronement is already preloaded with everything you need to run a Cylon application locally. With the below command, you should be able to run sample applications.

cylon@d4872133cdee:~$ python3 /cylon/python/examples/dataframe/join.py

Develop Your First Cylon App

Step 1: Open the development directory

Go to File->Open Folder and type /code to open our development directory.

This step is vital to ensure all your work will be saved into the docker volume we created in the previous step. Else you will lose your work once the container is shut down.

Step 2: Set the Cylon ENV as the interpreter

Do a Ctrl+Shift+P and find the option to change the python interpreter. Set /cylon/ENV as the interpreter location. This will enable IntelliSense!

Step 3: Create the .py file and Import Cylon

To create a new python file File -> New File and select Python as the language. Do a Ctrl+S to save the file. Import DataFrame and start coding!

from pycylon import DataFrame

Step 4: Testing & Running

Let’s consider the following code as your application.

Go ahead and create some test csv files too.

Now do a Ctrl+F5 to run your application.

Give your Apps the Super Powers — Running Cylon on Distributed Mode.

You can define a distributed runtime(environment) to make your application run-able on multi-process mode, including processes distributed across different nodes.

Step 1: Create an Environment

Initialize a Cylon environment with MPIConfig.

Passing env to an operator switches that operator into the distributed mode.

Step 2: Testing & Running

Jump to the terminal and type,

mpirun -np 2 python3 myfirst.py

This starts two Cylon workers and executes the merge algorithm in the distributed mode.

Step 3: Data Loading

Up to this point, both workers read the same data. But this is not what we usually need. You could use any of the following options to split initial data between workers.

Option 1: Use Cylon’s built in data slicing option.

You could use slice=True option with the read_csv to make cylon to balance the data evenly between the available workers.

Option 2: Load different files based on the worker rank.

Or you could load different files to different workers based on the worker’s rank as follows. This option gives you more control over the data distribution.

Congratulations! Your DataFrames application is now ready to be thrown into a cluster handling Tera Bytes of data.

Learn More about Cylon

Contribute to Cylon

More Articles on Cylon

Call To Action

  • Clap. Appreciate and let others find this article.
  • Comment. Share your views on this article.
  • Follow me. Chathura Widanage to receive updates on articles like this.
  • Keep in touch. LinkedIn, Twitter

--

--

Chathura Widanage
Chathura Widanage

Written by Chathura Widanage

Digital Science Center - Indiana University

No responses yet