MapReduce greatly simplified big data analysis.
But as it grew popular, users wanted more complex, multi-stage applications (e.g. iterative graph algorithms and machine learning), as well as more interactive ad-hoc queries and faster data sharing across parallel jobs.
Spark addresses all of these needs.
# Features
- In-memory data storage for very fast iterative queries
- General execution graphs and powerful optimizations
- Up to 100x faster than Hadoop MapReduce in memory
- Compatible with Hadoop's storage APIs
- Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
# Why fast?
Spark is fast because it keeps data in distributed memory. Hadoop MapReduce must reload data from HDFS for every iteration of a machine learning algorithm or every pass of a parallel query, whereas Spark loads the data into memory once, reuses it across operations, and stores intermediate results in memory as well (see the example below).
# Resilient Distributed Datasets (RDDs)
- Distributed collections of objects that let users explicitly persist intermediate results in memory across multiple cluster nodes
- Manipulated through various parallel operators (e.g. map, filter, groupBy, join)
- Automatically rebuilt on failure
# Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) and use this lineage to recompute lost partitions after a failure.
# Example
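A minimal sketch of the classic log-mining use case in Spark's Scala API; the HDFS path, log format, and search terms are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogMining")
    val sc   = new SparkContext(conf)

    // Load log lines from HDFS (path is hypothetical)
    val lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

    // Keep only error lines and persist them in cluster memory
    val errors = lines.filter(_.startsWith("ERROR")).cache()

    // Both counts reuse the cached RDD instead of re-reading HDFS
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val phpErrors   = errors.filter(_.contains("PHP")).count()

    println(s"MySQL errors: $mysqlErrors, PHP errors: $phpErrors")
    sc.stop()
  }
}
```

Because `errors` is cached, the two counts scan memory rather than re-reading the file from HDFS; if a node fails, Spark rebuilds the lost partitions by replaying the `filter` transformation recorded in the RDD's lineage.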
# Spark SQL
- Port of Apache Hive to run on Spark
- Compatible with Hive data, metastores, and queries (HiveQL, UDFs, etc)
- Similar speedups: up to 40x over Hive
- Even faster when the data is cached in memory
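A minimal sketch of running a HiveQL query on Spark, using the modern `SparkSession` entry point rather than the original Hive shell; the table name and query are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HiveOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveOnSpark")
      .enableHiveSupport() // reuse the existing Hive metastore and tables
      .getOrCreate()

    // HiveQL runs unchanged on Spark's engine (table name is hypothetical)
    val topPages = spark.sql(
      """SELECT page, COUNT(*) AS hits
        |FROM access_logs
        |GROUP BY page
        |ORDER BY hits DESC
        |LIMIT 10""".stripMargin)

    topPages.cache() // keep the result in memory for repeated exploration
    topPages.show()
    spark.stop()
  }
}
```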
# Spark Streaming
- Track and update state in memory as events arrive
- Example workloads include large-scale reporting, click analysis, and spam filtering
- Can process 42 million records/sec (4 GB/s) on 100 nodes at sub-second latency
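A minimal sketch of stateful stream processing with Spark Streaming's DStream API, assuming events arrive as text lines over a TCP socket; the host, port, and checkpoint path are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningCounts")
    val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // Stateful operators need a checkpoint directory (path is hypothetical)
    ssc.checkpoint("/tmp/spark-checkpoints")

    // Events arrive as text lines over TCP (host and port are hypothetical)
    val events = ssc.socketTextStream("localhost", 9999)

    // Maintain a running count per event type, updated in memory each batch
    val counts = events
      .map(event => (event, 1))
      .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
      }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The per-key state lives in memory across micro-batches, which is what lets Spark Streaming track and update state as events arrive rather than recomputing from raw logs.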