Overview
Many real-world problems can be described in terms of graphs: for example, the Web graph, the social network graph, the train network graph and the language graph. These graphs tend to be extremely large, and processing them requires a specialized set of tools and processes, collectively referred to as Graph Computing (also known as Graph Analytics).
In this instructor-led, live training, participants will learn about the technology offerings and implementation approaches for processing graph data. The aim is to identify real-world objects, their characteristics and relationships, then model these relationships and process them as data using a Graph Computing (also known as Graph Analytics and Distributed Graph Processing) approach. We start with a broad overview and narrow in on specific tools as we step through a series of case studies, hands-on exercises and live deployments.
By the end of this training, participants will be able to:
- Understand how graph data is persisted and traversed.
- Select the best framework for a given task (from graph databases to batch processing frameworks).
- Use Hadoop, Spark, GraphX and Pregel to carry out graph computing across many machines in parallel.
- View real-world big data problems in terms of graphs, processes and traversals.
Format of the course
- Part lecture, part discussion, exercises and heavy hands-on practice
Requirements
- An understanding of Java programming and frameworks
- A general understanding of Python is helpful but not required
- A general understanding of database concepts
Audience
- Developers
Course Outline
Introduction
- Graph databases and libraries
Understanding Graph Data
- The graph as a data structure
- Using vertices (dots) and edges (lines) to model real-world scenarios
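As a rough illustration of the graph as a data structure, the sketch below stores a small undirected graph as an adjacency list in plain Python. The station names and the `build_adjacency` helper are hypothetical and chosen only for illustration; they are not part of the course material.

```python
from collections import defaultdict

# Hypothetical example data: a tiny train network.
# Each tuple is an edge (line) connecting two vertices (dots).
edges = [
    ("Amsterdam", "Brussels"),
    ("Brussels", "Paris"),
    ("Amsterdam", "Berlin"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency list: vertex -> set of neighbouring vertices."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        # An undirected edge is recorded in both directions.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


graph = build_adjacency(edges)
print(sorted(graph["Amsterdam"]))
```

An adjacency list keeps lookups of a vertex's neighbours cheap, which is the operation most traversals depend on.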
Using Graph Databases to Model, Persist and Process Graph Data
- Local graph algorithms/traversals
- Neo4j, OrientDB and Titan
Exercise: Modeling Graph Data with Neo4j
- Whiteboard data modeling
Beyond Graph Databases: Graph Computing
- Understanding the property graph
- Modeling different scenarios as graphs (software graph, discussion graph, concept graph)
Solving Real-World Problems with Traversals
- Algorithmic/directed walk over the graph
- Determining circular dependencies
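One common way to detect circular dependencies is a depth-first traversal that tracks which nodes are on the current path. The sketch below is a minimal plain-Python version under that assumption; the `find_cycle` name and the module names in the usage lines are hypothetical, and the recursion suits small graphs only.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    dependencies maps each node to the list of nodes it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in dependencies.get(node, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so we walked in a circle.
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in dependencies:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical software graph: module -> modules it imports.
cyclic = {"app": ["db"], "db": ["config"], "config": ["app"]}
acyclic = {"app": ["db"], "db": ["config"], "config": []}
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```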
Case Study: Ranking Discussion Contributors
- Ranking by number and depth of contributed discussions
- A note on sentiment and concept analysis
Graph Computing: Local, In-Memory Graph Toolkits
- Graph analysis and visualization
- JUNG, NetworkX, and iGraph
Exercise: Modeling Graph Data with NetworkX
- Using NetworkX to model a complex system
Graph Computing: Batch Processing Graph Frameworks
- Leveraging Hadoop for storage (HDFS) and processing (MapReduce)
- Overview of iterative algorithms
- Hama, Giraph, and GraphLab
Graph Computing: Graph-Parallel Computation
- Unifying ETL, exploratory analysis, and iterative graph computation within a single system
- GraphX
Setup and Installation
- Hadoop and Spark
GraphX Operators
- Property, structural, join, neighborhood aggregation, caching and uncaching
Iterating with Pregel API
- Passing arguments for sending, receiving and computing
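The send, receive and compute roles can be illustrated with a small single-machine simulation of the vertex-centric model. The sketch below is plain Python and is not the GraphX API; the `pregel` function, its parameter names and the example graph are assumptions made for illustration. As a simplification, only edges whose source vertex received a message in the previous superstep are considered for sending.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    """Simulate vertex-centric supersteps on one machine.

    vertices: dict of vertex id -> value
    edges: list of (src, dst, weight) tuples
    vprog: computes a vertex's new value from its old value and a message
    send_msg: returns a list of (target, message) pairs for one edge
    merge_msg: combines two messages addressed to the same vertex
    """
    # Superstep 0: every vertex receives the initial message.
    values = {vid: vprog(vid, val, initial_msg) for vid, val in vertices.items()}
    active = set(values)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst, weight in edges:
            if src not in active:
                continue
            for target, msg in send_msg(src, values[src], dst, values[dst], weight):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight, so the computation has converged
        for vid, msg in inbox.items():
            values[vid] = vprog(vid, values[vid], msg)
        active = set(inbox)
    return values


# Hypothetical example: single-source shortest paths from vertex "A".
vertices = {"A": 0.0, "B": math.inf, "C": math.inf, "D": math.inf}
edges = [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 5.0), ("C", "D", 1.0)]


def vprog(vid, dist, msg):
    # Compute: keep the shortest distance seen so far.
    return min(dist, msg)


def send_msg(src, src_dist, dst, dst_dist, weight):
    # Send: only notify the neighbour if this edge offers a shorter path.
    if src_dist + weight < dst_dist:
        return [(dst, src_dist + weight)]
    return []


def merge_msg(a, b):
    # Receive: combine competing messages by keeping the smaller distance.
    return min(a, b)


distances = pregel(vertices, edges, math.inf, vprog, send_msg, merge_msg)
print(distances)
```

Keeping the three functions separate is what lets a distributed engine run the compute step per vertex and combine messages before they cross the network.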
Building a Graph
- Using vertices and edges in an RDD or on disk
Designing Scalable Algorithms
- GraphX Optimization
Accessing Additional Algorithms
- PageRank, Connected Components, Triangle Counting
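To give a feel for what PageRank computes, the sketch below runs a plain power iteration in Python. It is an illustrative single-machine version, not the GraphX implementation; the `pagerank` function, the damping value of 0.85, the iteration count and the user names are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank.

    links maps each node to the list of nodes it links to.
    Every node must appear as a key, even if its list is empty.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node, targets in links.items():
            for target in targets:
                # Each node splits its rank equally among the nodes it links to.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical follower graph: user -> users they follow.
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}
ranks = pagerank(follows)
top_user = max(ranks, key=ranks.get)
print(top_user)
```

Sorting the resulting dictionary by value gives a simple "top users" list of the kind the following exercise asks for.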
Exercise: PageRank and Top Users
- Building and processing graph data using text files as input
Deploying to Production
Closing Remarks