Databricks is a very popular platform for data processing, for instance with Apache Spark, and for hosting modern data warehouses (Lakehouses). In this course, you will learn everything you need to get started with Databricks. You get to know the basics of the platform, how it works, and why you should use it. You create notebooks and use a compute cluster and a Databricks SQL warehouse.
Setup & Dataset
Before getting to the practical part, you set up Databricks on AWS, create a bucket where you will later store your data, and create your workspace. We also walk through the AWS CloudFormation template that Databricks uses, so you understand how everything is set up automatically, and we review the cluster you create while working through this course.
You will also get familiar with the dataset you are going to use for this project and learn what your pipeline will look like: a simple ETL pipeline.
Hands-On Processing Data
In the hands-on part, you first learn the two ways to load data into Databricks: uploading it directly, or uploading it to S3 and then integrating it with Databricks. You also learn how to create your repo, which holds your code. Again, there are two ways to do this: you can integrate a GitHub repo, which is a pretty easy way to do it, or you can create a repo directly on Databricks and import and create your files manually.
During this project you are going to build two jobs. In the first, you actually process data: you run your ETL job, query your data, and create tables from it, which you then store on Databricks. The second job is a notebook where you visualize your data with Spark SQL and other tools. You’ll also learn how the data is stored in Databricks.
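To make the shape of an ETL job concrete, here is a minimal sketch in plain Python. It is not the course's Spark code: it assumes a small made-up CSV and uses an in-memory SQLite database as a stand-in for the tables you would create on Databricks. The names `RAW_CSV`, `orders`, `extract`, `transform`, `load` and `run_pipeline` are all hypothetical and only illustrate the extract, transform, load and query steps.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the course dataset is different.
RAW_CSV = """order_id,country,amount
1,DE,20.00
2,US,5.00
3,DE,10.50
4,US,
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and cast column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["country"], float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text):
    """Run extract, transform and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn


# Query the loaded table, as you would with SQL in a notebook.
conn = run_pipeline(RAW_CSV)
totals = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping the three steps as separate functions mirrors how the job is described above: each stage can be checked on its own before the resulting table is queried or visualized.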
In addition to working with notebooks, you are going to connect Power BI to the data tables in Databricks. You will try out both ways of doing this: through the compute cluster and through a Databricks SQL warehouse. This way you will be able to integrate Databricks with external tools.
Before you jump into this course, I recommend you take my Apache Spark Fundamentals course. With that background you are ready to code on Databricks: Click Here
- Your own AWS Account
- Your own Databricks account
- AWS costs are negligible, especially if you are on the free tier
- Basic Spark knowledge (Spark Fundamentals course)
Databricks Fundamentals is included in our Data Engineering Academy