Spark Streaming with Python and Kafka Training Course

Overview

Apache Spark Streaming is a scalable, open-source stream processing system that lets users process real-time data from supported sources with built-in fault tolerance.

This instructor-led, live training (online or onsite) is aimed at data engineers, data scientists, and programmers who wish to use Spark Streaming to process and analyze real-time data.

By the end of this training, participants will be able to use Spark Streaming to process live data streams for use in databases, filesystems, and live dashboards.

Format of the Course

  • Interactive lecture and discussion.
  • Lots of exercises and practice.
  • Hands-on implementation in a live-lab environment.

Course Customization Options

  • To request customized training for this course, please contact us.

Requirements

  • Experience with Python and Apache Kafka
  • Familiarity with stream-processing platforms

Audience

  • Data engineers
  • Data scientists
  • Programmers

Course Outline

Introduction

Overview of Spark Streaming Features and Architecture

  • Supported data sources
  • Core APIs

Preparing the Environment

  • Dependencies
  • Spark and streaming context
  • Connecting to Kafka

Processing Messages

  • Parsing inbound messages as JSON
  • ETL processes
  • Starting the streaming context
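
As a rough illustration of the JSON parsing topic above, the sketch below decodes Kafka-style byte payloads in plain Python, without Spark. The function name `parse_message` and the sample payloads are hypothetical and are not taken from the course material; in a streaming job, a function like this would typically be mapped over each incoming record.

```python
import json
from typing import Optional


def parse_message(raw: bytes) -> Optional[dict]:
    """Decode one message payload as a JSON object; return None if it is malformed."""
    try:
        value = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # Keep only JSON objects; arrays, strings, and numbers are treated as invalid here.
    return value if isinstance(value, dict) else None


# Hypothetical sample payloads, standing in for Kafka message values.
samples = [b'{"user": "a", "amount": 3}', b"not json", b"[1, 2]"]
parsed = [parse_message(s) for s in samples]
records = [p for p in parsed if p is not None]
```

Returning `None` for malformed input, rather than raising, is one possible design choice: it keeps a single bad message from stopping the stream, at the cost of needing a separate filter step.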

Performing Windowed Stream Processing

  • Slide interval
  • Checkpoint delivery configuration
  • Launching the environment
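
To make the relationship between window length and slide interval concrete, here is a minimal plain-Python sketch that counts events per sliding window. It does not use Spark and does not reproduce how Spark aligns windows to batch boundaries; the function name `sliding_window_counts` and the sample timestamps are hypothetical.

```python
from typing import List, Tuple


def sliding_window_counts(
    timestamps: List[int], window: int, slide: int
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for each half-open window [start, end).

    `window` is the window length and `slide` is the slide interval,
    both in the same units as the timestamps (seconds in this sketch).
    """
    if window <= 0 or slide <= 0:
        raise ValueError("window and slide must be positive")
    if not timestamps:
        return []
    results = []
    start = min(timestamps)
    last = max(timestamps)
    while start <= last:
        end = start + window
        count = sum(1 for t in timestamps if start <= t < end)
        results.append((start, end, count))
        start += slide  # advance by the slide interval, not the window length
    return results


# Hypothetical event times in seconds.
events = [0, 1, 4, 5, 9]
windows = sliding_window_counts(events, window=4, slide=2)
```

Because the slide interval (2) is shorter than the window length (4), consecutive windows overlap and an event can be counted in more than one window.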

Prototyping the Processing Code

  • Connecting to a Kafka topic
  • Retrieving JSON from a data source using Paw
  • Variations and additional processing

Streaming the Code

  • Job control variables
  • Defining values to match
  • Functions and conditions

Acquiring Stream Output

  • Counters
  • Kafka output (matched and non-matched)
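
The idea of counters and separate matched and non-matched outputs can be sketched in plain Python as follows. This is an illustration only, under the assumption that records are dictionaries and that matching means a field value appears in a predefined set; the names `MATCH_VALUES`, `route`, and the `level` field are hypothetical, and writing the two lists to Kafka topics is left out.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical set of values to match against.
MATCH_VALUES = {"error", "warning"}


def route(
    records: List[Dict], field: str = "level"
) -> Tuple[List[Dict], List[Dict], Counter]:
    """Split records into matched and non-matched lists and count each group."""
    counters: Counter = Counter()
    matched: List[Dict] = []
    non_matched: List[Dict] = []
    for record in records:
        if record.get(field) in MATCH_VALUES:
            matched.append(record)
            counters["matched"] += 1
        else:
            # Records missing the field also land here.
            non_matched.append(record)
            counters["non_matched"] += 1
    return matched, non_matched, counters


# Hypothetical parsed records.
sample = [{"level": "error"}, {"level": "info"}, {"msg": "no level"}]
matched, non_matched, counters = route(sample)
```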

Troubleshooting

Summary and Conclusion
