Overview
Apache Arrow is an open-source, in-memory data processing framework built around a language-independent columnar memory format. It is often used together with other data science tools to access disparate data stores for analysis. It integrates well with other technologies such as GPU databases, machine learning libraries and tools, execution engines, and data visualization frameworks.
In this onsite, instructor-led live training, participants will learn how to integrate Apache Arrow with various data science frameworks to access data from disparate data sources.
By the end of this training, participants will be able to:
- Install and configure Apache Arrow in a distributed clustered environment
- Use Apache Arrow to access data from disparate data sources
- Use Apache Arrow to bypass the need for constructing and maintaining complex ETL pipelines
- Analyze data across disparate data sources without having to consolidate it into a centralized repository
Audience
- Data scientists
- Data engineers
Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice
Note
- To request customized training for this course, please contact us.
Requirements
- A basic understanding of SQL
- Familiarity with Python or R
- Some familiarity with Apache Spark
Course Outline
Introduction
- Apache Arrow vs Parquet
Installing and Configuring Apache Arrow
Overview of Apache Arrow Features and Architecture
Exploring Data with Pandas and Apache Arrow
Exploring Data with Spark and Apache Arrow
Exploring Data with R and Apache Arrow
Exploring Data with MapD and Apache Arrow
Other Data Analysis Integrations
- PySpark, Parquet files on S3, Oracle tables, and Elasticsearch indices
Troubleshooting
Summary and Conclusion