Software Engineering with Apache Spark
Combine the power of Apache Spark with the quality practices of agile software engineering to transform your user data into business intelligence and take your organization to the next level.
Description
Apache Spark is great for organizations that want their analytics developed fast and executed fast. The problem is that large-scale distributed computing is always hard: you can't just set a breakpoint on code running across multiple machines when things go wrong. Monitoring your analytics jobs and optimizing them accordingly are essential. You also have to rethink how you approach architecture and security, along with software engineering practices like testing and DevSecOps.
Software Engineering with Apache Spark teaches you how to leverage just enough Scala to integrate powerful data pipelines into your existing monolithic, microservices, or serverless architecture. You will also learn how your pipelines can benefit from the same testing, continuous integration, containerization with Docker, container orchestration with Kubernetes, and security hardening as the other components of your enterprise.
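To give you a flavor of what "just enough Scala" looks like in practice, here is a minimal sketch of a standalone Spark job. The input path and the word-count task are hypothetical placeholders for illustration, not course material:

```scala
// A minimal standalone Spark job in Scala (a sketch; the input path is a placeholder).
import org.apache.spark.sql.SparkSession

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountJob")
      .master("local[*]") // local mode for experimenting; drop this when submitting to a cluster
      .getOrCreate()
    import spark.implicits._

    // Read a plain-text input, split it into words, and count occurrences.
    spark.read.textFile("data/input.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy("value")
      .count()
      .show()

    spark.stop()
  }
}
```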
What makes this course different
Apache Spark is a huge topic, and the typical Spark course tries to cover too much in too short a period, leaving you overwhelmed. Meanwhile, it ignores any discussion of architectural patterns, security, and software engineering: the things that separate experiments in your garage from critical pieces of your production architecture.
We take a more practical approach. There are rich code examples that provide deep insight into the API, as you'd expect, but the exercises use real datasets from Data.gov and encourage you to collaborate with your peers, the Spark Scaladoc, Stack Overflow, and other sources to discover solutions not explicitly discussed, just as you would at work. We also advise you on where to draw the line between configuring Spark yourself and outsourcing configuration to a provider, so you can focus on what you care about.
In addition, agile software development and DevOps have taught us to incorporate quality through build automation, testing, continuous integration, and continuous delivery, and we don’t have to abandon them just because we are working in a distributed environment. In fact, the complexity of distributed systems makes them even more valuable. Software Engineering with Apache Spark shows you how to apply these techniques to improve the quality and reliability of your analytics.
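As an illustration of that idea, here is a minimal sketch of what unit-testing a Spark transformation can look like. It assumes ScalaTest is on the test classpath; the transformation and all of the names are hypothetical:

```scala
// A sketch of unit-testing a Spark transformation with ScalaTest (assumed dependency).
import org.apache.spark.sql.{Dataset, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

object Transformations {
  // The hypothetical transformation under test: drop negative readings.
  def nonNegative(readings: Dataset[Long]): Dataset[Long] =
    readings.filter(_ >= 0L)
}

class TransformationsSpec extends AnyFunSuite {
  test("nonNegative drops negative readings") {
    // A local-mode SparkSession keeps the test self-contained on one machine,
    // so it can run in a continuous integration pipeline without a cluster.
    val spark = SparkSession.builder()
      .appName("TransformationsSpec")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(-3L, 0L, 7L).toDS()
    val result = Transformations.nonNegative(input).collect().sorted

    assert(result.sameElements(Array(0L, 7L)))
    spark.stop()
  }
}
```

Because the whole test runs in local mode, it behaves like any other unit test in your build, which is exactly what lets build automation and continuous integration apply to analytics code.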
Course Syllabus
Session 1: Mastering the Spark API
- MapReduce: The Phantom Menace
- Advantages of Spark
- Just Enough Scala
- Using the Spark Shell
- Writing Your Own Spark Jobs
- The Spark Ecosystem
Session 2: Professional Spark
- Just Enough Hadoop
- Testing Your Spark Jobs
- Optimizing Spark and When to Stop Trying
- Spark on Docker
- Deploying Spark to Kubernetes
- Spark Security
- Visualizing Your Spark Jobs
About the Instructor
I have built several Scala and Apache Spark applications that are currently in production, and I worked with the original Apache Spark team, AMPLab at UC Berkeley, on a DARPA research project known as XDATA. Somehow I have helped enough developers around the world to earn Spark and Scala badges on Stack Overflow. I am passionate about Spark and look forward to helping you harness its power.