Data Engineering Tidbits - Introduction to PySpark

Introduction

PySpark is the Python API for Apache Spark, a fast and general engine for large-scale data processing. When a problem is too big for one machine, Spark slices the data into partitions and distributes the work across a cluster of computers.
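
As a minimal sketch of what that looks like (assuming a local PySpark installation; the app name "intro" and the partition count are arbitrary):

    from pyspark.sql import SparkSession

    # Start a local Spark session; "local[*]" uses all cores on this machine.
    spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()
    sc = spark.sparkContext

    # parallelize() slices the list into 8 partitions; each partition can be
    # processed by a different executor.
    rdd = sc.parallelize(range(1_000_000), 8)
    print(rdd.map(lambda x: x * 2).sum())

    spark.stop()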

Architecture

A driver script containing the Spark context runs on top of a cluster manager (Spark ships with its own cluster manager but can also run over a Hadoop cluster). The cluster manager spins up multiple executors, ideally one per CPU core, yet to the developer it all still looks like a single script.

[Figure: Spark architecture diagram - driver script, cluster manager, executors]
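
For illustration, a driver script might connect to Spark's standalone cluster manager like this; the cluster URL spark://cluster-manager-host:7077 is an assumed placeholder, and the executor settings are just example values:

    from pyspark.sql import SparkSession

    # Connect the driver to an assumed standalone cluster manager, which
    # will allocate executors on the worker machines for this application.
    spark = (
        SparkSession.builder
        .appName("architecture-demo")
        .master("spark://cluster-manager-host:7077")  # placeholder cluster URL
        .config("spark.executor.cores", "1")          # one core per executor
        .config("spark.executor.memory", "2g")        # memory per executor
        .getOrCreate()
    )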

Spark vs MapReduce

Spark or MapReduce? They have a lot in common and solve the same class of problems. MapReduce has been around longer and has a more mature ecosystem of tools, but Spark is much faster: Spark's own benchmarks claim up to 100x for in-memory workloads. Spark achieves this speed with a DAG (directed acyclic graph) engine that optimizes the whole workflow before executing it.
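
A small sketch of the DAG idea: the transformations below are lazy and only add nodes to the graph; nothing runs until an action such as collect() forces Spark to optimize and execute the whole plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("dag-demo").getOrCreate()
    sc = spark.sparkContext

    words = sc.parallelize(["a", "b", "a", "c"])

    # Transformations are lazy: these two lines only build up the DAG.
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # collect() is an action: only now does Spark optimize the DAG
    # (e.g. pipelining the map into the shuffle stage) and run it.
    print(counts.collect())

    spark.stop()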

Spark SQL and Data Warehousing

Spark SQL lets you run SQL queries over structured data (historically through a Hive context, now through the unified SparkSession entry point), so you can treat Spark as a data warehouse.
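
A minimal sketch of that warehouse-style usage, with a tiny in-memory DataFrame standing in for a real table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

    # Toy data standing in for a real table (e.g. Parquet files or a Hive table).
    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

    # Register the DataFrame as a temporary view and query it with plain SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()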

Language Considerations

  • Python is the easiest way to run Spark - a script just works, with none of the JAR files and compiling that Java requires (see the sketch after this list)

  • Scala is more popular in the Spark community, since Spark itself is written in Scala
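
To make the "no compile step" point concrete, here is a hypothetical complete script (the filename word_count.py is an assumption) that runs as-is with spark-submit word_count.py, no build step required:

    # word_count.py -- run with: spark-submit word_count.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()
    sc = spark.sparkContext

    # Count occurrences of each word across the cluster.
    words = sc.parallelize(["spark", "python", "spark"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.collect())

    spark.stop()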