When you look at the blueprint I also use in my cookbook, you see the different phases: Connect, Processing Framework, Store and Buffer. Early on you have to think about where to validate the data, to make sure that the data is okay and makes sense.
Do you validate the data directly on the API, before you put it in the buffer? Do you validate it when you take it out of the buffer and process it? Or do you validate the data where you store it?
I think this is a really interesting question because it has a lot of depth: it makes you think about the operations side of a platform.
Validation in the connect phase
When you think about a credit card transaction or someone buying something, it is important that the transaction is processed correctly. You need to make sure, right in the API call, that the transaction actually is valid. So as soon as you ingest the data you validate it and send the result back to the client.
If you get a malformed JSON string or an invalid credit card number, you tell the client right away: we have a problem with this transaction, for example this credit card number is wrong. In this case validating the data in the connect phase absolutely makes sense.
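To make this a bit more concrete, here is a minimal sketch of what validation in the connect phase could look like. It is not tied to any web framework, and everything in it is an assumption for illustration: the function name `handle_transaction`, the field names `card_number` and `amount`, and the status codes I picked. The card check uses the standard Luhn checksum, which only catches mistyped numbers and says nothing about whether the card really exists.

```python
import json


def luhn_valid(card_number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    if not (card_number.isascii() and card_number.isdigit()):
        return False
    total = 0
    # Walk the digits right to left and double every second one.
    for i, ch in enumerate(reversed(card_number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def handle_transaction(raw_body: str) -> tuple[int, dict]:
    """Validate a request body and return (HTTP status, response body).

    Hypothetical handler: field names and status codes are assumptions.
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        # Malformed JSON: tell the client immediately.
        return 400, {"error": "malformed JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "expected a JSON object"}

    card = str(payload.get("card_number", ""))
    if not luhn_valid(card):
        return 422, {"error": "credit card number is wrong"}

    amount = payload.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount <= 0:
        return 422, {"error": "amount must be a positive number"}

    # Only valid transactions would be passed on to the buffer.
    return 202, {"status": "accepted"}
```

The point of the design is that the client gets an answer in the same request, so a broken transaction never reaches the buffer in the first place.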
Filtering in the processing framework
Another example could be a small IoT pipeline with Raspberry Pis. Think about people using a Raspberry Pi to measure weather data, or weather stations that are sending data into your platform.
At first the data goes through the connect phase, for instance to an API. Do you really validate the weather data on the API and tell the weather station that this data doesn't make sense? Or do you just take the data, confirm you received it, and then filter it out in the backend, in the processing behind it?
In this case, broken or wrong data will not get into the store. You can simply drop it, because a single data point is not really important: you know that the weather station is going to send another one in 10 minutes.
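Here is a rough sketch of how such a filter in the processing step could look. Again, these are my assumptions and not a fixed recipe: the field names `temperature_c` and `humidity_pct`, the plausibility ranges, and the helper names `is_plausible` and `filter_readings` are all made up for this example. The messages are assumed to arrive as JSON strings taken out of the buffer.

```python
import json
from typing import Iterable, Iterator

# Assumed plausibility limits; tune these for your own sensors.
TEMP_RANGE_C = (-90.0, 60.0)
HUMIDITY_RANGE_PCT = (0.0, 100.0)


def is_plausible(reading: dict) -> bool:
    """Return True if both measurements exist and are within range."""
    checks = (
        (reading.get("temperature_c"), TEMP_RANGE_C),
        (reading.get("humidity_pct"), HUMIDITY_RANGE_PCT),
    )
    for value, (low, high) in checks:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not low <= value <= high:
            return False
    return True


def filter_readings(messages: Iterable[str]) -> Iterator[dict]:
    """Yield only the readings that should reach the store."""
    for raw in messages:
        try:
            reading = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken message: drop it, the next one comes soon
        if isinstance(reading, dict) and is_plausible(reading):
            yield reading
```

Notice that nothing is reported back to the weather station here. The API already confirmed receipt, and the bad data point just never makes it into the store.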
Decision depends on situation
Where to validate the data? What's a good answer here?
It depends on what you want to do and on how important the data is that you're processing. That's something to think about, and something I had to think about myself when I got this question.
See you later.