Data pipelines are the backbone of any Data Science platform. Without them, tasks such as data ingestion or machine learning processing would not be possible.
This training will help you understand how to create stream and batch processing pipelines as well as machine learning pipelines by covering the essential basics, complemented by templates and examples for popular cloud computing platforms.
Platform & Pipeline Basics
In the first part, we will take a detailed look at the platform blueprint and the different types of pipelines. You will learn how these types of pipelines differ, how they work, what machine learning looks like on a platform, and how you bring the different pipelines together.
Platform Blueprint & End-to-End Pipeline Example
The platform blueprint and end-to-end pipelines are crucial topics in the Data Engineering world. They can be found on every platform and really work everywhere. Without going into too much detail, the blueprint is a framework that presents the most important parts of a platform: connect, buffer, process, store, and visualize. It also shows where individual tools can be used. This way, you understand what a platform looks like and how it works. Using end-to-end pipelines as an example, I will also show you how to easily apply the blueprint in your work as a Data Engineer.
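The five blueprint stages can be sketched as a tiny end-to-end pipeline. This is a minimal illustration in plain Python; the function names, the in-memory queue, and the dict "database" are illustrative stand-ins for real tools (e.g. a message broker in the buffer stage), not part of any specific framework.

```python
from collections import deque

def connect():
    """Connect stage: ingest raw events from a source (here: hard-coded)."""
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]

def buffer(events, queue):
    """Buffer stage: decouple ingestion from processing via a queue."""
    for event in events:
        queue.append(event)

def process(queue):
    """Process stage: transform the events pulled from the buffer."""
    return [{**e, "clicks_doubled": e["clicks"] * 2} for e in queue]

def store(records, db):
    """Store stage: persist processed records (here: an in-memory dict)."""
    for record in records:
        db[record["user"]] = record

def visualize(db):
    """Visualize stage: produce a simple textual summary of stored data."""
    return [f"{user}: {rec['clicks_doubled']} clicks (doubled)"
            for user, rec in sorted(db.items())]

queue, db = deque(), {}
buffer(connect(), queue)
store(process(queue), db)
print(visualize(db))  # ['a: 6 clicks (doubled)', 'b: 10 clicks (doubled)']
```

In a real platform each stage would be a separate tool or service; the point here is only the flow of data through the five parts of the blueprint.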
Push and Pull Pipelines
As a Data Engineer, it is crucial to understand the difference between push and pull pipelines, which is why they are also a topic of this course. They describe the two ways data reaches the platform: it is either pushed (sent by the source) or pulled (fetched by the platform). For better understanding, I have also added many examples with further details.
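The contrast can be sketched in a few lines of plain Python. The `Platform` and `Source` classes below are hypothetical stand-ins, not a real API: in a push pipeline the source drives the transfer by calling an ingestion endpoint, while in a pull pipeline the platform polls the source on its own schedule.

```python
class Platform:
    """Illustrative platform that collects incoming records."""
    def __init__(self):
        self.received = []

    def ingest_endpoint(self, record):
        """Push: an endpoint the data source actively sends records to."""
        self.received.append(record)

class Source:
    """Illustrative data source the platform can poll."""
    def __init__(self, records):
        self.records = list(records)

    def fetch(self, batch_size):
        """Pull: hand out the next batch of records when asked."""
        batch, self.records = self.records[:batch_size], self.records[batch_size:]
        return batch

platform = Platform()

# Push pipeline: the source sends each record to the platform's endpoint.
for record in [1, 2, 3]:
    platform.ingest_endpoint(record)

# Pull pipeline: the platform fetches batches until the source is drained.
source = Source([4, 5, 6, 7])
while True:
    batch = source.fetch(batch_size=2)
    if not batch:
        break
    platform.received.extend(batch)

print(platform.received)  # [1, 2, 3, 4, 5, 6, 7]
```

Either way the same records end up on the platform; what differs is which side initiates the transfer.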
Batch & Streaming Pipelines
Batch and streaming pipelines are classic types of pipelines that you will often come across in your work as a Data Engineer. In this chapter, you will learn the difference between these two types and how they work. You will also develop a feel for which type of pipeline you are dealing with, or which one you need to create for a given scenario.
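A small sketch can make the difference concrete. The function names are illustrative only: a batch job sees the complete dataset and computes its result in one run, while a streaming job processes events one at a time, keeping running state and emitting an updated result per event.

```python
def batch_average(events):
    """Batch: the whole dataset is available, so compute the result once."""
    return sum(events) / len(events)

def streaming_average(event_stream):
    """Streaming: events arrive one by one; keep running state and emit
    an updated average after each event."""
    total, count = 0, 0
    for event in event_stream:
        total += event
        count += 1
        yield total / count

events = [10, 20, 30, 40]
print(batch_average(events))             # 25.0
print(list(streaming_average(events)))   # [10.0, 15.0, 20.0, 25.0]
```

Note that the streaming version converges to the batch result once all events have arrived; the key difference is that it can answer at any point in between.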
Visualization Pipelines
Data processing and storage is a huge topic, and it is very important to visualize the data flow in some way, even if you don't have direct access to the data. The chapter on visualization pipelines is a short guide on how to manage that, complemented by an example with Apache Spark.
Lambda Architecture
Lambda architecture is a topic that you will come across again and again when dealing with platforms and pipelines. A lambda architecture enables you to bring batch and streaming pipelines together within your platform. It is also used a lot for machine learning, where you train models with batch pipelines and apply them in streaming pipelines. So it is definitely worth expanding your knowledge about it.
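The core idea can be sketched in a few functions. In this minimal, illustrative sketch (all names are made up for the example), a batch layer periodically recomputes a complete view from the master dataset, a speed layer applies new events to a real-time view immediately, and a serving layer merges both to answer queries.

```python
batch_view = {}     # rebuilt periodically by the batch layer
realtime_view = {}  # updated per event since the last batch run

def batch_layer(master_dataset):
    """Recompute the full view from all historical (user, clicks) events."""
    view = {}
    for user, clicks in master_dataset:
        view[user] = view.get(user, 0) + clicks
    return view

def speed_layer(event, view):
    """Apply a single new event to the real-time view immediately."""
    user, clicks = event
    view[user] = view.get(user, 0) + clicks

def serving_layer(user):
    """Merge the batch and real-time views to answer a query."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

# Batch run over everything seen so far.
master = [("a", 5), ("b", 2), ("a", 1)]
batch_view = batch_layer(master)

# Events arriving after the batch run hit the speed layer only.
for event in [("a", 3), ("b", 4)]:
    speed_layer(event, realtime_view)

print(serving_layer("a"))  # 9  (6 from the batch view + 3 from the speed layer)
print(serving_layer("b"))  # 6  (2 from the batch view + 4 from the speed layer)
```

When the next batch run absorbs the new events into the master dataset, the real-time view is reset, which is exactly how the two pipeline types coexist in a lambda architecture.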
Templates & Examples
In the last part of the training, we go through some templates and examples based on the platform blueprint. I will show you what such a platform architecture can look like on AWS, Hadoop, GCP, or Azure, with all its streaming and batch pipelines.
This way, you can understand how such a platform looks in theory and how it would look in practice. I show you which tools can be used at which point within the architecture and which important techniques come into play. This knowledge will help you a lot in your job as a Data Engineer, as you will learn how to build pipelines and use tools like AWS Lambda, API Gateway, or DynamoDB in a more practical way.
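As a taste of the AWS side, here is a hedged sketch of the Lambda + API Gateway + DynamoDB pattern: API Gateway forwards an HTTP request as a proxy event to a Lambda handler, which stores the payload in DynamoDB. To keep the sketch runnable without AWS credentials, the DynamoDB table is replaced by an in-memory stub (`FakeTable` is made up for this example); in a real function you would use boto3's `Table.put_item` instead.

```python
import json

class FakeTable:
    """Illustrative stand-in for a DynamoDB table."""
    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        # Mirrors the keyword-argument style of boto3's Table.put_item.
        self.items[Item["id"]] = Item

table = FakeTable()

def handler(event, context):
    """Lambda handler for an API Gateway proxy integration: parse the
    JSON body, store the record, and return an HTTP-style response."""
    record = json.loads(event["body"])
    table.put_item(Item=record)
    return {"statusCode": 200, "body": json.dumps({"stored": record["id"]})}

# Simulated API Gateway proxy event, as it would arrive at the handler.
event = {"body": json.dumps({"id": "42", "value": "hello"})}
response = handler(event, context=None)
print(response["statusCode"])  # 200
```

The handler signature (`event`, `context`) and the proxy event's `body` field follow the real AWS conventions; everything around them is a simplified stand-in for the managed services.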