This AWS project is ideal for anyone who wants to get started with cloud platforms. AWS is currently the most widely used platform for data processing, which makes it especially valuable for people who are new to a Data Engineering job or looking for one.
In this project, I show you step by step how to get started with AWS, which topics to take into consideration, and how to set up a complete end-to-end project. For this, we use data from an e-commerce dataset. Along the way, you learn how to model data and which AWS tools are important, such as Lambda, API Gateway, Glue, Redshift, Kinesis, and DynamoDB.
We start by taking a closer look at the dataset used in the project as well as its typical data types. We also define the goals of the project, which is essential for successful development: without a clear goal definition, development becomes difficult. Furthermore, you learn about the different options for storing data in a NoSQL database.
Platform Design & Data Pipelines
To recall the general platform design and its pipelines, we go through the platform blueprint and its sections again, which you should already know from my course “Platform & Pipeline Design”. After that, you design the pipelines you will implement later. We start with ingestion pipelines, which describe how data gets into the platform. You also define a pipeline to store the data directly in a data lake (S3). For data storage within an OLTP database, you will use DynamoDB as a document store. Besides that, you design pipelines for interfaces as well as pipelines to stream data to a Redshift data warehouse and to upload data as a bulk import.
For the hands-on part, we start with the AWS basics. I will show you how to create an account and what to watch out for in the process. You learn how identity and access management (IAM) works and how logging works with CloudWatch. We also have a look at Boto3, the AWS SDK for Python.
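To give you a first impression of what Boto3 calls look like, here is a minimal sketch that lists the S3 buckets in an account. It assumes your AWS credentials are already configured locally (e.g. via `aws configure`); the region name is only an example.

```python
def list_bucket_names(region_name="eu-central-1"):
    """Return the names of all S3 buckets in the account.

    The region is just an example value; Boto3 is imported inside
    the function so the sketch can be read without boto3 installed.
    """
    import boto3  # the AWS SDK for Python
    s3 = boto3.client("s3", region_name=region_name)
    response = s3.list_buckets()
    return [bucket["Name"] for bucket in response["Buckets"]]
```

Every AWS service we use in this project (Kinesis, DynamoDB, Glue, and so on) is accessed through the same client pattern.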
Data Ingestion Pipeline
Here, you learn how to develop and create an API with API Gateway and how to send data from the API to Kinesis. To do this, I show you how to set up Kinesis and how to configure identity and access management. Afterwards, you create the ingestion pipeline in Python.
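A sketch of what the Kinesis side of the ingestion code could look like: one pure function that packages an order event for `put_record`, and one thin wrapper around the Boto3 call. The stream name and field names (`CustomerID`, etc.) are assumptions based on a typical e-commerce dataset.

```python
import json

def build_kinesis_record(order, stream_name="ecommerce-orders"):
    """Package an order event for kinesis.put_record.

    The stream name is an assumption; the partition key spreads
    records across shards per customer.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(order).encode("utf-8"),
        "PartitionKey": str(order["CustomerID"]),
    }

def send_order(order):
    """Send one order event into the Kinesis stream (needs AWS credentials)."""
    import boto3
    kinesis = boto3.client("kinesis")
    return kinesis.put_record(**build_kinesis_record(order))
```

Keeping the record-building logic separate from the AWS call makes the pipeline easy to unit-test without touching the cloud.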
Stream to S3 Storage (Data Lake)
After you put the data into Kinesis, you will develop a Lambda function that pulls data directly from Kinesis and saves it as files in S3, our data lake.
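A minimal sketch of such a Lambda function: Kinesis-triggered events arrive with base64-encoded payloads, so the handler decodes them and writes one JSON file per invocation. The bucket name and key prefix are assumptions, not fixed by the project.

```python
import base64
import json

def decode_records(event):
    """Extract the JSON payloads from a Kinesis-triggered Lambda event."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]

def lambda_handler(event, context):
    """Write all records of this invocation as one JSON file to S3."""
    import boto3
    s3 = boto3.client("s3")
    rows = decode_records(event)
    key = f"orders/{context.aws_request_id}.json"  # unique key per invocation
    s3.put_object(
        Bucket="my-ecommerce-datalake",  # bucket name is an assumption
        Key=key,
        Body=json.dumps(rows).encode("utf-8"),
    )
    return {"written": len(rows), "key": key}
```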
Stream to DynamoDB
To save the data in a database, we use the NoSQL database DynamoDB. Here, you develop a pipeline that takes the data from Kinesis and streams it directly into DynamoDB.
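A sketch of what the DynamoDB writer could look like. The table name and the composite key (customer ID as partition key, invoice number plus stock code as sort key) are assumptions for illustration; a real key design depends on your access patterns.

```python
import base64
import json

def to_dynamo_item(order):
    """Map an order event to a DynamoDB item; key names are assumptions."""
    return {
        "CustomerID": str(order["CustomerID"]),  # partition key
        "InvoiceNo#StockCode": f'{order["InvoiceNo"]}#{order["StockCode"]}',  # sort key
        **{k: str(v) for k, v in order.items()},
    }

def lambda_handler(event, context):
    """Stream Kinesis records into DynamoDB (table name is an assumption)."""
    import boto3
    table = boto3.resource("dynamodb").Table("ecommerce-orders")
    with table.batch_writer() as batch:  # batches writes automatically
        for record in event["Records"]:
            order = json.loads(base64.b64decode(record["kinesis"]["data"]))
            batch.put_item(Item=to_dynamo_item(order))
```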
In order to use the data in our database, you have to develop an API that lets you read data directly from the database. I will also explain why this matters and why direct access to the database from a visualization tool is a bad idea.
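A sketch of a read API behind API Gateway: a Lambda handler that queries DynamoDB by customer ID and returns a proxy-integration response. The table name, key name, and query parameter are assumptions matching the writer sketch above only in spirit.

```python
import json

def make_response(status, body):
    """Shape a Lambda proxy-integration response for API Gateway."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }

def lambda_handler(event, context):
    """Return all orders of one customer (names are assumptions)."""
    import boto3
    from boto3.dynamodb.conditions import Key
    customer_id = event["queryStringParameters"]["CustomerID"]
    table = boto3.resource("dynamodb").Table("ecommerce-orders")
    result = table.query(KeyConditionExpression=Key("CustomerID").eq(customer_id))
    return make_response(200, result["Items"])
```

Putting an API in front of the database decouples consumers from your table design and lets you control access and load centrally.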
Visualization with Redshift Data
So far, we have only worked with data lakes and NoSQL databases; now you take the data stream and send it directly into the Redshift data warehouse via Kinesis Firehose. You learn how to set up Redshift and create a cluster. We go through the security topics, configure identity and access management, and create the tables. Then you configure Kinesis Firehose to send data to Redshift. As this can be quite tricky, I have also included a bug-fixing session where we look at common problems that can occur during this setup. As the final part of this section, you install Power BI on a local computer and connect it to Redshift, so you can analyze data directly from Redshift.
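On the producer side, writing to Firehose looks much like writing to a plain Kinesis stream. One detail worth sketching: Firehose concatenates raw record bytes, so newline-delimiting each JSON record helps Redshift's COPY parse individual rows. The delivery stream name below is an assumption.

```python
import json

def build_firehose_record(order):
    """Newline-delimit each JSON record so rows stay separable downstream."""
    return {"Data": (json.dumps(order) + "\n").encode("utf-8")}

def send_to_firehose(order, stream_name="ecommerce-redshift-stream"):
    """Send one record to the Firehose delivery stream (name is an assumption)."""
    import boto3
    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record=build_firehose_record(order),
    )
```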
Batch Processing with AWS Glue, S3 & Redshift
Besides streaming, batch processing is also very important to learn. Therefore, you will develop a batch pipeline that uses AWS Glue to write data from S3 into Redshift. I explain how to configure and run crawlers for S3 and Redshift, and we take a closer look at the Data Catalog after it is created. Then you configure and run the Glue job. Since problems can arise here, we also do a short debugging session.
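While we run the crawlers and the job through the console in the course, the same steps can be triggered from Boto3, which is handy once you want to schedule the batch. The crawler and job names below are hypothetical.

```python
def run_glue_batch(crawler_name="s3-orders-crawler", job_name="s3-to-redshift-job"):
    """Start the (hypothetically named) crawler, then the Glue job.

    Sketch only: a production version would poll
    glue.get_crawler(Name=crawler_name) and wait until the crawler
    is back in the READY state before starting the job.
    """
    import boto3
    glue = boto3.client("glue")
    glue.start_crawler(Name=crawler_name)  # refresh the Data Catalog
    run = glue.start_job_run(JobName=job_name)  # S3 -> Redshift load
    return run["JobRunId"]
```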