By Neil Chaudhuri | January 25, 2014
Although most developers and users are still feeling their way through Hadoop (and, more specifically, MapReduce), the truth is Google wrote the MapReduce paper in 2004. That’s ten years ago! Million Dollar Baby won Best Picture that year. “Yeah!” by Usher, Lil Jon, and Ludacris topped the charts in the United States. And Facebook had only just started to kill work productivity and violate your privacy.
As long ago as that feels, it is an eternity in technology. Google has of course moved on way past MapReduce to things like Dremel. The rest of us aren’t moving so quickly, but that doesn’t mean we still need to plod along writing low-level MapReduce code.
Several abstractions have come along to make cloud-scale analytics easier. One of them is Apache Spark. Originally created by the AMPLab at UC Berkeley, Spark is now in incubation at Apache. It utilizes cluster memory to minimize the disk reads and writes that slow down MapReduce. Spark is written in Scala and exposes both Scala and Java APIs. Though as I’ve said before, once you get past the learning curve, I don’t see how anyone would prefer Java to Scala for analytics.
Please take a look at the Spark documentation to learn about the RDD abstraction. RDDs are basically fancy arrays. You load data into them and perform whatever combination of operations you want: maps, filters, reduces, and so on. RDDs make analytics so much more fun to write than canonical MapReduce. In the end, Spark doesn’t just run faster; it lets us write faster.
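To see why RDDs feel like fancy arrays, here is a toy pipeline using a plain Scala collection as a stand-in for an RDD; the same map/filter chain carries over almost verbatim to Spark (the data here is made up for illustration):

```scala
// A plain Scala collection standing in for an RDD of words.
val words = List("spark", "hbase", "hadoop", "spark", "hive")

// The same kind of map/filter/reduce pipeline you would run on an RDD.
val counts = words
  .map(w => (w, 1))                                     // pair each word with a count
  .groupBy(_._1)                                        // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }  // sum counts per word
  .filter { case (_, n) => n > 1 }                      // keep repeated words only

println(counts) // Map(spark -> 2)
```

On a real RDD the methods have the same names and shapes; the difference is that Spark distributes the work across the cluster and keeps intermediate results in memory.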
The web has a bunch of examples of using Spark with Hadoop components like HDFS and Hive (via Shark, also made by AMPLab), but there is surprisingly little on using Spark to create RDDs from HBase, the Hadoop database. If you don’t know HBase, check out this excellent presentation by Ian Varley. It’s just a cloud-scale key-value store.
The Spark Quick Start documentation says, “RDDs can be created from Hadoop InputFormats.” You may know that InputFormat is the Hadoop abstraction for anything that can be processed in a MapReduce job. As it turns out, HBase uses a TableInputFormat, so it should be possible to use Spark with HBase.
It turns out that it is.
As the Scaladoc for RDD shows, there are numerous concrete RDD implementations, each best suited for different situations. The NewHadoopRDD is great for reading data stored in later versions of Hadoop, which is exactly what we need here.
Check out this code.
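A minimal sketch of that code, assuming the Spark 0.9-era NewHadoopRDD constructor and the HBase TableInputFormat (the names makeConfig and my-table are my own, not from any real deployment):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.NewHadoopRDD

object SparkHBaseExample extends App {

  // A SparkContext for the local machine.
  val sc = new SparkContext("local", "Spark-HBase Example")

  // Anonymous function that builds an HBaseConfiguration (layering the HBase
  // configuration files over any Hadoop ones) and names the table to read.
  // That's an HBase thing, not a Spark thing.
  val makeConfig = (tableName: String) => {
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    conf
  }

  // Construct the RDD from the TableInputFormat plus the key and value
  // types a table scan always produces.
  val hbaseRDD = new NewHadoopRDD(
    sc,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result],
    makeConfig("my-table") // hypothetical table name
  )
}
```

In later Spark releases, the more conventional entry point is sc.newAPIHadoopRDD with the same configuration and Class arguments, but the idea is identical.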
Once we instantiate the SparkContext for the local machine, we write an anonymous function to create an [HBaseConfiguration](http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration.html), which will enable us to add the HBase configuration files to any Hadoop ones we have, as well as the name of the table we want to read. That’s an HBase thing, not a Spark thing.
Then we create an instance of NewHadoopRDD with the SparkContext instance and three Java [Class](http://docs.oracle.com/javase/7/docs/api/java/lang/Class.html) objects related to HBase:

- One for the TableInputFormat, as mentioned earlier
- One for the key type, which on a table scan is always ImmutableBytesWritable
- One for the value type, which on a table scan is always Result
Then we call our anonymous function with the location of the HBase configuration file and the name of the table we want to read, to provide Spark with the necessary HBase configuration to construct the NewHadoopRDD. Once we have the RDD, we can perform all the usual operations we see with the more conventional usage of RDDs in Spark.
The code example does feature some HBase-specific operations, though. When the RDD is constructed, it loads all the table data as (ImmutableBytesWritable, Result) tuples: key-value pairs, as you would expect. For each one of these pairs, we grab the second item in the tuple (the Result) and get the column from it marked with a specific column family and column qualifier.
As advocated by big data overlord Nathan Marz, HBase has no notion of deletes; new values are just appended. Consequently, each column is really a collection, and after converting the Java collection returned by the HBase API into a Scala collection, the last map call extracts the latest–and therefore the “right”–KeyValue. We now have a collection of all the current data in each row.
If that is unclear, let’s summarize the mapping operations this way:
- Load an RDD of (ImmutableBytesWritable, Result) tuples from the table
- Transform the previous into an RDD of Results
- Transform the previous into an RDD of collections of KeyValues, where the latest one is what we want. Collection of collections? Oh no!
- Transform the previous into an RDD of byte arrays (binary representations of the current data)
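The transformations above can be sketched with plain Scala collections standing in for the RDD. The KeyValue and Result types here are simplified stand-ins for the real HBase classes, and I use maxBy on the timestamp to pick the latest cell, which is the effect the article’s last map achieves:

```scala
// Simplified stand-ins for HBase types, for illustration only.
case class KeyValue(timestamp: Long, value: Array[Byte])
case class Result(cells: List[KeyValue])

// 1. An "RDD" of (rowKey, Result) tuples, as loaded from the table.
val table: List[(String, Result)] = List(
  ("row1", Result(List(KeyValue(1L, "old".getBytes), KeyValue(2L, "new".getBytes)))),
  ("row2", Result(List(KeyValue(5L, "only".getBytes))))
)

// 2. Transform into a collection of Results.
val results = table.map(_._2)

// 3. Transform into a collection of KeyValue collections for one column.
val kvs = results.map(_.cells)

// 4. Keep only the latest KeyValue per row and extract its bytes.
val current: List[Array[Byte]] = kvs.map(cells => cells.maxBy(_.timestamp).value)

println(current.map(b => new String(b))) // List(new, only)
```

Each step is one map call, so in Spark the whole pipeline stays lazy until an action forces evaluation.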
So using Spark, we can read the contents of an entire HBase table and perform one transformation after another–maybe along with some filters, aggregations, and other operations–to do ALL THE ANALYTICS.
Astute or experienced readers might have noticed that the getColumn call in the second map operation is actually deprecated in favor of getColumnCells, which returns a collection of Cells ordered by timestamp. That’s the superior approach. I just used getColumn as an excuse to demonstrate Spark’s ability to apply anonymous functions within RDD operations.
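For reference, the non-deprecated version of that step might look something like this sketch; it assumes an HBase client where Result.getColumnCells returns a java.util.List of Cells (newest first in HBase’s cell ordering) and CellUtil.cloneValue extracts a cell’s bytes, and the column family and qualifier names are hypothetical:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical column coordinates for illustration.
val family = Bytes.toBytes("cf")
val qualifier = Bytes.toBytes("col")

// Replaces the deprecated result.getColumn(family, qualifier) call:
// getColumnCells returns the cells for that column ordered by timestamp,
// so the head (if present) is the current value.
val latestValue = (result: Result) =>
  result.getColumnCells(family, qualifier).asScala.headOption.map(CellUtil.cloneValue)
```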
The next time some analytics are in order, don’t bother with old-school MapReduce. The higher level of abstraction over anything with an
InputFormat, including HBase, makes Spark a great choice for Hadoop analytics.