Introduction
PySpark is the Python API for Apache Spark, a fast and general engine for large-scale data processing. When a problem is too big for one machine, Spark slices the data and distributes the pieces across a cluster of computers to be processed in parallel.
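A few lines of PySpark are enough to see this in action. The sketch below assumes a local Spark installation; "local[*]" runs one executor thread per CPU core.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("WordLengths")
sc = SparkContext(conf=conf)

# parallelize() slices this list into partitions spread across the workers.
words = sc.parallelize(["spark", "slices", "data", "across", "machines"])

# The map runs on each partition in parallel; collect() gathers the results
# back to the driver.
lengths = words.map(lambda w: (w, len(w))).collect()
print(lengths)

sc.stop()
```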
Architecture
A driver script containing the Spark context runs on top of a cluster manager (Spark ships with its own standalone cluster manager, but it can also run over a Hadoop/YARN cluster). The cluster manager launches multiple executors, ideally one per CPU core. From the developer's point of view, the whole job is still a single script.
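As a sketch, this is how a driver script names its cluster manager; the spark://master-node:7077 URL here is a hypothetical standalone-cluster address, not a real host.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MyDriver")
        # Spark's own standalone cluster manager (hypothetical host):
        .setMaster("spark://master-node:7077"))
        # or .setMaster("yarn") to run over a Hadoop/YARN cluster,
        # or .setMaster("local[*]") for one executor thread per core.

sc = SparkContext(conf=conf)
# ... job logic goes here; the cluster manager launches executors to run it ...
sc.stop()
```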
Spark vs MapReduce
Spark or MapReduce? They have a lot in common and solve the same kinds of problems. MapReduce has been around longer and has a more mature ecosystem of tools, but Spark is much faster: up to 100 times faster than MapReduce when running in memory. Spark achieves this speed by using a DAG (directed acyclic graph) engine to optimize workflows.
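The laziness behind that DAG engine is easy to see in a small sketch: transformations only build a plan, and nothing executes until an action forces it. The numbers here are purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "DagDemo")

nums = sc.parallelize(range(1_000_000))
evens = nums.filter(lambda n: n % 2 == 0)   # no work happens yet
squares = evens.map(lambda n: n * n)        # still no work, just a bigger DAG

# Only now does Spark turn the DAG into an optimized set of stages and run it.
print(squares.count())

sc.stop()
```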
Spark SQL and Data Warehousing
Spark SQL runs SQL queries over structured data (originally through a Hive context), so you can treat Spark like a data warehouse.
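A minimal sketch, using the modern SparkSession entry point (which subsumed the older Hive context); the people table and its columns are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlDemo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL

# Structured data queried with plain SQL, warehouse-style.
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```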
Language Considerations
- Python is easier for running Spark - it just works, with none of the JAR files and compiling that Java requires
- Scala is more popular, since Spark itself is written in Scala