Overview
Audience
- Developers
Format of the Course
- Lectures, hands-on practice, and small tests along the way to gauge understanding
Requirements
- General familiarity with distributed computing
Course Outline
Introduction
Principles of Distributed Computing
- Apache Spark
- Hadoop
Principles of Data Serialization
- How a data object is passed over the network
- Serialization of objects
- Serialization approaches
  - Thrift
  - Protocol Buffers
  - Apache Avro
    - Data structure
    - Size, speed, format characteristics
    - Persistent data storage
    - Integration with dynamic languages
    - Dynamic typing
    - Schemas
    - Untagged data
    - Change management
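
The "schemas" and "untagged data" topics can be previewed with a minimal sketch. It is not the Avro library: it is a hand-written illustration, modeled on Avro's binary encoding (zig-zag variable-length integers, length-prefixed UTF-8 strings, fields written in schema order with no per-field tags). The `SCHEMA` list, the sample record, and all function names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical two-field record schema: (field name, type) in declared order.
SCHEMA = [("name", "string"), ("age", "int")]


def write_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit integer, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small positives
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def read_long(buf: bytes, pos: int):
    """Inverse of write_long; returns (value, next position)."""
    z, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode(record: dict, schema) -> bytes:
    """Write field values in schema order; no field names or tags are emitted."""
    out = bytearray()
    for field, ftype in schema:
        value = record[field]
        if ftype == "string":
            data = value.encode("utf-8")
            out += write_long(len(data)) + data
        elif ftype == "int":
            out += write_long(value)
        else:
            raise ValueError(f"unsupported type in this sketch: {ftype}")
    return bytes(out)


def decode(buf: bytes, schema) -> dict:
    """Reading requires the schema: the bytes alone carry no field information."""
    record, pos = {}, 0
    for field, ftype in schema:
        if ftype == "string":
            length, pos = read_long(buf, pos)
            record[field] = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif ftype == "int":
            record[field], pos = read_long(buf, pos)
        else:
            raise ValueError(f"unsupported type in this sketch: {ftype}")
    return record


sample = {"name": "Ada", "age": 36}
untagged = encode(sample, SCHEMA)
tagged = json.dumps(sample).encode("utf-8")  # field names repeated in every record
print(len(untagged), len(tagged))
print(decode(untagged, SCHEMA))
```

Because the reader must hold the schema to interpret the bytes, the payload stays small, and this is also why schema change management matters: a reader using a different schema needs rules for reconciling it with the writer's.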
Data Serialization and Distributed Computing
- Avro as a subproject of Hadoop
- Java serialization
- Hadoop serialization
- Avro serialization
Using Avro with Hive and Pig
- Hive (AvroSerDe)
- Pig (AvroStorage)
Porting Existing RPC Frameworks
Summary and Conclusion