You want to become a Data Engineer, but don't know how to set up a data engineering project? I will show you!

Do not make this mistake!

First of all, avoid the mistake that unfortunately many people make!

People often want to build the whole thing right from the start.

They say: "Okay, I need to do a project. It has to be a big thing, even though I don't know yet which data or which tools I want to use." And then they try to build the full chain right away.

Start small!

I always say start small! Start by selecting a small amount of data that makes sense and that interests you.

Then start with one tool. Build something on top of it, and then something on top of that. If necessary, swap one component for another, for example because the other one interests you more!

Do not learn everything!

It makes absolutely no sense to open the Cookbook and try to learn every single tool in it! That is completely useless.

How does the Cookbook help you? It lets you choose a few tools that look interesting or that you can see are in demand. Then you look at them, use them, and learn how to work with them. That is the main thing!

Find a data set

Just go out and find some data. Look through free data sources like Kaggle - there are tons of them. Pick a data set, e.g. a CSV file, and try to do something with it. You can also look for a public API.
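As a first "do something with it" step, a minimal sketch using only Python's standard library: parse a CSV and pull one insight out of it. The column names and values here are made up for illustration; in practice they would come from the file you downloaded.

```python
import csv
import io

# A tiny inline sample standing in for a downloaded CSV file
# (columns and values are made up for illustration).
sample = """city,temperature
Berlin,14.5
Madrid,22.1
Oslo,8.3
"""

# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(sample)))

# A first bit of processing: find the warmest city.
warmest = max(rows, key=lambda r: float(r["temperature"]))
print(warmest["city"])  # → Madrid
```

For a real file you would replace the `io.StringIO(sample)` with `open("yourfile.csv")`.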

Select tools and work with them

At the beginning, focus on a few tools. You can watch my Data Science Blueprint and decide where to start. You could, for example, store some data in a database and do some processing with Python. If you're a bit more into the data science route, set up a Jupyter notebook and use it to process the data.
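The "store it in a database, process it with Python" step can be sketched with the standard library's sqlite3 module. The table and values are made up for illustration; an in-memory database keeps the example self-contained.

```python
import sqlite3

# In-memory database for the sketch; pass a file path for a real project.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, temperature REAL)")

# Pretend these rows came out of your CSV or public API.
data = [("Berlin", 14.5), ("Madrid", 22.1), ("Oslo", 8.3)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", data)

# A first processing step: average temperature across all cities.
(avg,) = conn.execute("SELECT AVG(temperature) FROM measurements").fetchone()
print(round(avg, 1))  # → 15.0
```

Swapping sqlite3 for PostgreSQL or MySQL later changes the connection line, not the way you think about the pipeline.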

Then you add something to it: you could add a message queue, something like Apache Kafka. If you're working on AWS, add Kinesis. Then add a Lambda function that takes the data out of the message queue, processes it, and puts it into your storage.
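The Lambda step above can be sketched as a plain Python handler. The event shape below mirrors what AWS delivers to a Lambda function for a Kinesis trigger (base64-encoded record payloads); in a real deployment the handler would write each item to your storage instead of returning it, and the field names in the payload are made up for illustration.

```python
import base64
import json

def handler(event, context=None):
    """Decode Kinesis-style records and return the parsed payloads.

    In a real Lambda you would put each item into your storage
    (e.g. S3 or a database) instead of returning the list.
    """
    items = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        items.append(json.loads(payload))
    return items

# Build a fake event the way Kinesis would deliver it.
raw = json.dumps({"city": "Berlin", "temperature": 14.5}).encode()
event = {"Records": [{"kinesis": {"data": base64.b64encode(raw).decode()}}]}
print(handler(event))  # → [{'city': 'Berlin', 'temperature': 14.5}]
```

Because the handler is a pure function, you can test it locally like this before ever deploying it to AWS.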

Grow your platform

This way you're always growing your platform and growing your knowledge. If you are not interested in message queues, just take the data and store it somewhere as a file. Read it with a notebook, for instance in AWS SageMaker. Then save it to DynamoDB, some other NoSQL database, or a relational database.
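For the DynamoDB variant, a small sketch of the shape the data needs: DynamoDB stores typed attributes, where "S" marks a string and "N" a number (always sent as a string). The table name and schema here are made up for illustration; with boto3 you would pass the resulting item to `put_item`.

```python
from decimal import Decimal

def to_dynamodb_item(row):
    """Convert a plain CSV-style row into DynamoDB's typed attribute format.

    With boto3 you would then call something like
    client.put_item(TableName="measurements", Item=item)
    -- table name and attributes are illustrative, not prescriptive.
    """
    return {
        "city": {"S": row["city"]},
        # Decimal avoids float artifacts; DynamoDB numbers travel as strings.
        "temperature": {"N": str(Decimal(row["temperature"]))},
    }

item = to_dynamodb_item({"city": "Oslo", "temperature": "8.3"})
print(item)  # → {'city': {'S': 'Oslo'}, 'temperature': {'N': '8.3'}}
```

Keeping the conversion in its own function means you can unit-test it without any AWS credentials.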

And then add something. Add, for example, a visualization (Grafana is great), or maybe an API through which you can access the data.
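A read-only API on top of your data can be sketched with Python's built-in http.server, no framework required. The data dict and the endpoint path are made up for illustration; in a real project the handler would query your database instead.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The data your pipeline produced; in a real project you would read it
# from your database instead of a module-level dict.
DATA = {"Berlin": 14.5, "Madrid": 22.1, "Oslo": 8.3}

class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(DATA).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Port 0 lets the OS pick a free port for the demo.
server = HTTPServer(("127.0.0.1", 0), DataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Query our own little API like any client would.
url = f"http://127.0.0.1:{server.server_port}/measurements"
with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read())
print(result["Madrid"])  # → 22.1
server.shutdown()
```

Once this outgrows the standard library, the same idea carries over to Flask or FastAPI.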

This is how you should go about it: grow your knowledge, grow your data, grow your pipelines, and grow your first data engineering project.

See you later.