Analytics with Apache Spark

Apache Spark has emerged as arguably the most compelling technology in Big Data. Companies like Netflix, Yahoo, and OpenTable are using Spark, and cloud providers like Amazon Web Services and Microsoft Azure are competing to offer the best Spark support.

With data analytics expanding into virtually every industry to solve hard, interesting problems, mastering Apache Spark will keep you in demand. Take the first step toward that goal now.

For more information, contact us.


Apache Spark has become a popular alternative for companies that want their analytics developed fast and executed fast. The problem is that large-scale distributed computing is always hard. It’s not like you can set a breakpoint on code running across multiple machines when things go wrong. Monitoring your analytics jobs and optimizing them accordingly are essential. You also have to rethink the way you approach architecture, security, and software engineering practices like unit and integration testing, continuous integration, and continuous delivery.

After a comparison with Hadoop MapReduce and a brief Scala tutorial for those who need it, Analytics with Apache Spark demonstrates the power of Spark’s Scala API with numerous examples and exercises using real data sets. Once you have mastered Spark’s API and ecosystem in the first session of the course, you will learn how to build Spark analytics in the real world. You will use SBT to package your code and deploy it to Hadoop. The remainder of the course provides tips for monitoring and optimizing your Spark jobs as well as best practices for architecture, security, and software engineering.
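To give a flavor of the API the first session covers: Spark’s RDD transformations deliberately mirror Scala’s collection methods, so a word count reads almost identically in both. Here is a minimal sketch using plain Scala collections so it runs without a cluster; in a real Spark job the input would come from something like `sc.textFile`, and the sample data here is purely illustrative.

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // In a Spark job this would be sc.textFile("hdfs://...");
    // a local Seq stands in so the sketch runs without a cluster.
    val lines = Seq("spark makes analytics fast", "spark scales analytics")

    // The same flatMap/map chain works verbatim on a Spark RDD;
    // reduceByKey on an RDD becomes groupBy + mapValues locally.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .groupBy(_._1)
      .view
      .mapValues(_.map(_._2).sum)
      .toMap

    println(counts("spark"))     // prints 2
    println(counts("analytics")) // prints 2
  }
}
```

Because the transformation chain is the same shape in both settings, the skills you build on local collections transfer directly to distributed data sets.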


What makes this course different

Apache Spark is a huge topic, and the typical Spark course tries to cover too much in too short a period, which leaves you overwhelmed. Meanwhile, it ignores architecture, security, and software engineering entirely.

Analytics with Apache Spark takes a different approach. There are rich code examples providing deep insight into the API, as you’d expect, but the exercises use real datasets and encourage you to collaborate with your peers, the Spark Scaladoc, Stack Overflow, and other sources to discover solutions not explicitly discussed, just as you would at work.

In addition, agile software development and DevOps have taught us the value of build automation, testing, continuous integration, and continuous delivery, and we don’t have to abandon them just because we are working in a distributed environment. In fact, the complexity makes them even more valuable. Analytics with Apache Spark shows you how to apply these techniques to improve the quality of your Spark code.
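One concrete habit that makes those techniques practical: factor your transformation logic into pure functions so it can be unit tested without a SparkContext or a cluster, then apply those functions inside the job. A minimal sketch of the idea; the names and the CSV format here are illustrative, not taken from the course materials.

```scala
object Transforms {
  // Pure parsing logic factored out of the Spark job so it can be
  // unit tested locally, with no SparkContext or cluster required.
  // Expects illustrative "station,temperature" lines.
  def parseTemperature(line: String): Option[(String, Double)] =
    line.split(",") match {
      case Array(station, temp) =>
        temp.toDoubleOption.map(t => (station.trim, t))
      case _ => None
    }
}

object TransformsDemo {
  def main(args: Array[String]): Unit = {
    // In the Spark job itself the same function would be applied with
    // something like sc.textFile(path).flatMap(Transforms.parseTemperature)
    println(Transforms.parseTemperature("KSFO,18.5")) // Some((KSFO,18.5))
    println(Transforms.parseTemperature("malformed")) // None
  }
}
```

Returning `Option` lets the job drop malformed records with `flatMap` instead of crashing an executor, and the function itself is testable with any ordinary testing framework.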

The course also describes the role Spark can play in Nathan Marz’s Lambda Architecture and Jay Kreps’s Kappa Architecture.

Finally, the course surveys many of the other analytics tools available. Some compete with Spark; others complement it. Either way, it is important to understand the entire landscape so you can make informed technical decisions.

When you finish this course, you will be immediately productive with Apache Spark at work, and you will have a solid foundation on which to build your skills as the technology continues to evolve.



Format

Two four-hour onsite sessions with lecture and exercises.


Prerequisites

Knowledge of basic programming concepts, familiarity with a Java IDE, and knowing how to operate a VM are required. Knowledge of Scala as well as Hadoop concepts like HDFS and MapReduce is helpful but not assumed.

About the instructor

Neil Chaudhuri has well over a decade of experience building complex software applications for commercial and government clients. He has built several Scala and Apache Spark applications currently in production, and he worked with the original Apache Spark team, AMPLab at UC Berkeley, on a research project for DARPA known as XDATA. He has earned Spark and Scala badges on Stack Overflow. As you can see in his blog post, Mr. Chaudhuri has extensive knowledge of Spark and believes strongly in it.


Session I: Mastering the API

  • MapReduce Overview
  • Advantages of Spark
  • Just Enough Scala
  • The Spark Shell
  • Spark Programs
  • The Spark Ecosystem

Session II: Spark in the Real World

  • Just Enough Hadoop
  • Deploying to Hadoop
  • Optimizing Jobs
  • Monitoring Jobs
  • Security
  • Software Engineering
  • Scaling Your Architecture
  • Alternatives to Spark
  • Third-Party Integration


Certificate of Achievement upon course completion

Priority responses on code questions for three months after completion of the course
