Overview
Modin is a parallel DataFrame library designed to speed up Pandas workflows. It can handle large datasets by using Ray or Dask as the execution engine for distributed computing in Python.
This instructor-led, live training (online or onsite) is aimed at data scientists and developers who wish to use Modin to build and implement parallel computations with Pandas for faster data analysis.
By the end of this training, participants will be able to:
- Set up the necessary environment to start developing Pandas workflows at scale with Modin.
- Understand the features, architecture, and advantages of Modin.
- Know the differences between Modin, Dask, and Ray.
- Perform Pandas operations faster with Modin.
- Use the Pandas API and functions through Modin, including operations that default to Pandas.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request customized training for this course, please contact us.
Requirements
- Familiarity with Pandas
- Python programming experience
Audience
- Data scientists
- Developers
Course Outline
Introduction
- Modin vs Dask vs Ray
- Overview of Modin features and architecture
- Pandas fundamentals
Getting Started
- Installing Modin
- Importing Pandas from Modin
- Defaulting to Pandas in Modin
- Supported APIs
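As a preview of the "Importing Pandas from Modin" topic, the following is a minimal sketch rather than course material. It assumes Modin was installed with an engine (for example `pip install "modin[ray]"`); the fallback to plain Pandas, the `summarize` helper, and the sample columns are illustrative assumptions added so the snippet runs anywhere.

```python
# Sketch: replace the usual pandas import with Modin's drop-in module.
# The fallback to plain pandas is an assumption made for portability.
try:
    import modin.pandas as pd  # parallel, pandas-compatible API
except ImportError:
    import pandas as pd  # same code runs unchanged on plain pandas


def summarize(frame):
    """Hypothetical helper: drop a column, then average `value` per group."""
    trimmed = frame.drop(columns=["note"])
    return trimmed.groupby("group")["value"].mean()


# Small illustrative dataset (made up for this example).
df = pd.DataFrame(
    {
        "group": ["a", "a", "b"],
        "value": [1.0, 3.0, 5.0],
        "note": ["x", "y", "z"],
    }
)

result = summarize(df)
print(result)
```

Only the import line changes; the rest is ordinary Pandas code, which is the idea behind using Modin as a drop-in replacement.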
Managing Pandas Workflows Using Modin
- Using Modin on a single node
- Using Modin on a cluster
- Connecting to a database (read_sql)
- Optimizing resources for Modin
Interacting with Datasets
- Reading data, dropping columns, and finding values
- Executing advanced Pandas operations
- Common issues and examples
Troubleshooting
Summary and Next Steps