Apache Arrow for Data Analysis across Disparate Data Sources Training Course

Overview

Apache Arrow is an open-source, in-memory data processing framework. It is often used together with other data science tools to access disparate data stores for analysis. It integrates well with technologies such as GPU databases, machine learning libraries, execution engines, and data visualization frameworks.

In this onsite, instructor-led live training, participants will learn how to integrate Apache Arrow with various data science frameworks to access data from disparate data sources.

By the end of this training, participants will be able to:

  • Install and configure Apache Arrow in a distributed clustered environment
  • Use Apache Arrow to access data from disparate data sources
  • Use Apache Arrow to avoid building and maintaining complex ETL pipelines
  • Analyze data across disparate data sources without having to consolidate it into a centralized repository

Audience

  • Data scientists
  • Data engineers

Format of the Course

  • Part lecture, part discussion, exercises and heavy hands-on practice

Note

  • To request a customized training for this course, please contact us to arrange it.

Requirements

  • A basic understanding of SQL
  • Familiarity with Python or R
  • Some familiarity with Apache Spark

Course Outline

Introduction

  • Apache Arrow vs Parquet

Installing and Configuring Apache Arrow

Overview of Apache Arrow Features and Architecture
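To illustrate the columnar, in-memory model this module covers, here is a minimal Python sketch using pyarrow; the column names and values are made up for the example and are not part of the official outline.

    import pyarrow as pa

    # Build an in-memory, columnar Arrow table from plain Python lists
    # (hypothetical data for illustration only).
    table = pa.table({
        "city": ["Berlin", "Lagos", "Lima"],
        "population_millions": [3.7, 14.8, 10.7],
    })

    print(table.schema)          # column names and types inferred by Arrow
    print(table.num_rows)        # 3
    print(table.column("city"))  # one column, stored contiguously in memory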

Exploring Data with Pandas and Apache Arrow
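As a taste of what this module covers, the following minimal sketch (assuming pandas and pyarrow are installed, with made-up data) converts between a pandas DataFrame and an Arrow table:

    import pandas as pd
    import pyarrow as pa

    # Hypothetical DataFrame for illustration.
    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})

    table = pa.Table.from_pandas(df)  # pandas -> Arrow
    df_back = table.to_pandas()       # Arrow -> pandas

    print(table.schema)
    print(df_back.head())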

Exploring Data with Spark and Apache Arrow
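A minimal sketch of Arrow-accelerated transfer between Spark and pandas follows; the configuration key shown applies to Spark 3.x (Spark 2.x uses spark.sql.execution.arrow.enabled), so treat it as an assumption to verify against your Spark version.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

    # Enable Arrow-based columnar data transfer for Spark <-> pandas conversions.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    sdf = spark.range(0, 1000)  # a small Spark DataFrame
    pdf = sdf.toPandas()        # the conversion uses Arrow under the hood
    print(pdf.head())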

Exploring Data with R and Apache Arrow

Exploring Data with MapD and Apache Arrow

Other Data Analysis Integrations

  • PySpark, Parquet files on S3, Oracle tables, and Elasticsearch indices (see the sketch below)
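As one example of these integrations, the following sketch reads Parquet files directly from S3 into Arrow and then into pandas; the bucket, path, and region are placeholders, not real resources.

    import pyarrow.parquet as pq
    from pyarrow import fs

    # S3 filesystem handle; region and credentials are assumptions/placeholders.
    s3 = fs.S3FileSystem(region="us-east-1")

    # Read a (hypothetical) Parquet dataset straight into an Arrow table.
    table = pq.read_table("my-bucket/events/", filesystem=s3)

    df = table.to_pandas()  # hand the data to pandas for analysis
    print(df.columns)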

Troubleshooting

Summary and Conclusion
