MapReduce greatly simplified big data analysis.
But as it grew popular, users wanted more complex, multi-stage applications (e.g. iterative graph algorithms and machine learning), as well as more interactive ad-hoc queries and faster data sharing across parallel jobs.
Spark addresses all of these needs.
# Features
- In-memory data storage for very fast iterative queries
- General execution graphs and powerful optimizations
- Up to 100x faster than Hadoop MapReduce in memory
- Compatible with Hadoop's storage APIs
- Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
# Why fast?
Spark is fast because it keeps data in distributed memory. Hadoop MapReduce must reload data from HDFS for every iteration of a machine learning algorithm or every pass of a parallel query, whereas Spark loads the data into memory once, reuses it across operations, and stores intermediate results in memory as well (see the example below).
# Resilient Distributed Datasets (RDDs)
- Distributed collections of objects that let users explicitly persist intermediate results in memory across multiple cluster nodes
- Manipulated through various parallel operators (e.g. map, filter, groupBy, join)
- Automatically rebuilt on failure
# Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) and use this lineage to recompute lost partitions after a failure.
# Example
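A minimal sketch of the classic log-mining use case in Spark's Scala API; the HDFS path, log format, and search terms are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogMining")
    val sc   = new SparkContext(conf)

    // Load log lines from HDFS (path is hypothetical)
    val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

    // Keep only error lines and persist them in cluster memory
    val errors = lines.filter(_.startsWith("ERROR")).cache()

    // Both counts reuse the cached RDD instead of re-reading HDFS
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val phpErrors   = errors.filter(_.contains("PHP")).count()

    println(s"MySQL errors: $mysqlErrors, PHP errors: $phpErrors")
    sc.stop()
  }
}
```

Because `errors` is cached, the two counts scan memory rather than re-reading the file from HDFS; if a node fails, Spark rebuilds the lost partitions by replaying the `filter` transformation recorded in the RDD's lineage.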
# Spark SQL
- Port of Apache Hive to run on Spark
- Compatible with Hive data, metastores, and queries (HiveQL, UDFs, etc)
- Similar speedups: up to 40x over Hive
- Even faster when the data is cached in memory
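A minimal sketch of running a HiveQL query on Spark, using the modern `SparkSession` entry point rather than the original Hive shell; the table name and query are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HiveOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveOnSpark")
      .enableHiveSupport() // reuse the existing Hive metastore and tables
      .getOrCreate()

    // HiveQL runs unchanged on Spark's engine (table name is hypothetical)
    val topPages = spark.sql(
      """SELECT page, COUNT(*) AS hits
        |FROM access_logs
        |GROUP BY page
        |ORDER BY hits DESC
        |LIMIT 10""".stripMargin)

    topPages.cache() // keep the result in memory for repeated exploration
    topPages.show()
    spark.stop()
  }
}
```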
# Spark Streaming
- Track and update state in memory as events arrive
- Example workloads include large-scale reporting, click analysis, and spam filtering
- Can process 42 million records/sec (4 GB/s) on 100 nodes at sub-second latency
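A minimal sketch of stateful stream processing with Spark Streaming's DStream API, assuming events arrive as text lines over a TCP socket; the host, port, and checkpoint path are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningCounts")
    val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // Stateful operators need a checkpoint directory (path is hypothetical)
    ssc.checkpoint("/tmp/spark-checkpoints")

    // Events arrive as text lines over TCP (host and port are hypothetical)
    val events = ssc.socketTextStream("localhost", 9999)

    // Maintain a running count per event type, updated in memory each batch
    val counts = events
      .map(event => (event, 1))
      .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
      }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The per-key state lives in memory across micro-batches, which is what lets Spark Streaming track and update state as events arrive rather than recomputing from raw logs.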