Description

After data pipelines, data processing is one of the most important topics in Data Engineering. As a Data Engineer, you will come across processing tasks everywhere, so it is crucial to set up powerful, well-distributed processing for your work. One of the most useful and widely used tools for doing that is Apache Spark.

In this Apache Spark Fundamentals training, you learn about the Spark architecture and the fundamentals of how Spark works. You practice Spark transformations and actions, and you work with Jupyter Notebooks on Docker. You also dive into DataFrames, Spark SQL and RDDs to learn how they work. By the end, you will have all the fundamental knowledge you need to write your own Spark jobs and build your own pipelines.



Sections

Spark Basics

Learn why you should use Spark by understanding the difference between vertical and horizontal scaling. Also learn what kinds of data it can work with and where you can run it.

To fully understand how Spark works, we go through the basic Spark components: executor, driver, context, and cluster manager. In this context, we also dig into the different cluster types. Furthermore, I will explain the difference between client and cluster deployment so that you better understand when to use which of them.
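
As a small taste of how these pieces show up in code, here is a minimal PySpark sketch. The app name and the local master are illustrative assumptions, not the course's actual configuration; on a real cluster you would point the master to your cluster manager instead of "local[*]".

    from pyspark.sql import SparkSession

    # Create the SparkSession, the entry point that wraps the SparkContext.
    # In local mode the driver and the executors run on your own machine,
    # with one worker thread per CPU core ("local[*]").
    spark = (
        SparkSession.builder
        .appName("spark-fundamentals-demo")   # illustrative app name
        .master("local[*]")                   # replace with your cluster manager on a real cluster
        .getOrCreate()
    )

    print(spark.sparkContext.uiWebUrl)        # the Spark UI shows the driver, executors and jobs
    spark.stop()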



Data & Environment

Here, I will show you which tools we are going to use in this Spark training, and you will get to know the dataset we will be working with. Finally, learn how to set up and run your development environment by installing and using Docker and Jupyter Notebook.



Spark Coding Basics

Before getting into the hands-on part of this training, it is necessary to go through the most important Spark coding basics. Learn what Resilient Distributed Datasets (RDDs) and DataFrames are and how they work: RDDs especially for unstructured data and DataFrames for structured data. Understand the difference between transformations and actions and how they are related. Also learn how they operate on data and DataFrames and get to know the most common types of each.
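
As a small preview, here is a minimal sketch of the transformation/action distinction. It assumes a running SparkSession named spark, and the example data is made up for illustration:

    # Transformations such as filter() and select() are lazy: they only build an execution plan.
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)],
        ["name", "age"],
    )
    over_40 = df.filter(df.age > 40).select("name")   # transformation: nothing is computed yet

    # Actions such as count() and show() trigger the actual distributed computation.
    print(over_40.count())   # action: returns 1
    over_40.show()           # action: prints the matching rows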



Hands-on Part

In the GitHub repository that I linked for you, you can find all the source code for the Jupyter Notebooks that we are going to work with. You will also find the download links for our datasets so that you can start directly with the hands-on part.

By working through five Notebooks in total, you learn how to work with transformations, with schemas and their columns and data types, and with DataFrames built from JSON and CSV files. You also learn how to modify and join such DataFrames. With the last two Notebooks, you get an introduction to Spark SQL, which is really easy to use, and you write code with RDDs, which are especially helpful for unstructured data.
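
To give you an idea of what the Notebooks cover, here is a minimal sketch combining DataFrames, a join, and Spark SQL. The file names and column names are illustrative assumptions, not the course's actual dataset, and a running SparkSession named spark is assumed:

    # Read a CSV and a JSON file into DataFrames.
    customers = spark.read.option("header", True).csv("customers.csv")
    orders = spark.read.json("orders.json")

    # Modify and join the DataFrames.
    joined = (
        orders
        .withColumnRenamed("cust_id", "customer_id")   # illustrative column rename
        .join(customers, on="customer_id", how="inner")
    )

    # Spark SQL: register a temporary view and query it with plain SQL.
    joined.createOrReplaceTempView("orders_with_customers")
    spark.sql(
        "SELECT customer_id, COUNT(*) AS order_count "
        "FROM orders_with_customers GROUP BY customer_id"
    ).show()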