Semantic Log Indexing and Search

Introduction

Description

Semantic search is one of the most practical ways to bring Generative AI into real data engineering projects. In this course, we move beyond the basics of embeddings from The Hidden Foundation of GenAI course and put them to work. You’ll learn how to build a semantic search pipeline from scratch: embedding data, storing it in a vector database, and querying it using natural language.

This is not just theory. We’ll wrap everything around a real-world data observability project. You’ll set up a pipeline that collects log messages, processes them with FastAPI, and stores embeddings in Qdrant, a high-performance vector store. Then, you’ll build a Streamlit dashboard that lets you search log data based on meaning, not just keywords, and compare the approach to traditional SQL queries with DuckDB.

By the end of this course, you’ll not only understand the mechanics of semantic search but also have a complete, hands-on project that you can adapt for your own AI-driven data solutions.

From Embeddings to Search

Start by revisiting embeddings and seeing how they power semantic search. Learn why vector similarity is the key to retrieving the most relevant results and how vector databases like Qdrant are built for this purpose.
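
To make "vector similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are hand-made toy values, not real embeddings, and the function name is only illustrative; a real pipeline would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration only).
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]      # same direction as the query, different length
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, unrelated))
```

Because cosine similarity only looks at direction, the longer vector pointing the same way scores as a near-perfect match, while the orthogonal one scores zero.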

Building the Pipeline

Get hands-on with FastAPI to create an API that processes log data and generates embeddings. You’ll see how metadata and vectors are stored together, and why this pairing is essential for meaningful search.
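
The pairing of metadata and vectors can be sketched as a small helper that turns one log record into a point with an id, a vector, and a payload. This is an assumption-laden illustration: the field names (level, service, timestamp), the build_point helper, and the fake_embed placeholder are hypothetical, and the course's actual FastAPI code may look different.

```python
import uuid
from typing import Callable


def build_point(record: dict, embed: Callable[[str], list[float]]) -> dict:
    """Pair the embedding of a log message with the metadata needed to read a hit."""
    return {
        "id": str(uuid.uuid4()),
        "vector": embed(record["message"]),
        # The payload travels with the vector, so a search hit can show
        # what was logged, where, and when, without a second lookup.
        "payload": {
            "message": record["message"],
            "level": record["level"],
            "service": record["service"],
            "timestamp": record["timestamp"],
        },
    }


def fake_embed(text: str) -> list[float]:
    # Placeholder only: a real pipeline would call an embedding model here.
    return [float(len(text)), float(text.count(" "))]


point = build_point(
    {
        "message": "disk quota exceeded on /var/log",
        "level": "ERROR",
        "service": "storage",
        "timestamp": "2024-01-01T12:00:00Z",
    },
    fake_embed,
)
print(point["payload"]["level"], point["vector"])
```

Without the payload, a similarity search would return only ids and scores; the metadata is what turns a match back into a readable log line.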

Working with Qdrant

Dive into Qdrant collections, points, and similarity search. Learn how cosine similarity drives ranking and how to structure your embeddings to improve results.
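
As a rough mental model of how cosine similarity drives ranking, the sketch below does a brute-force search over a handful of in-memory points. It does not use the Qdrant client; the vectors and messages are made-up toy data, and a real vector database relies on an approximate index rather than scanning every point. Normalizing each vector to unit length first means a plain dot product gives the cosine score.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k(query: list[float], points: list[dict], k: int = 2) -> list[tuple[float, str]]:
    """Return the k (score, message) pairs with the highest cosine score."""
    q = normalize(query)
    scored = []
    for p in points:
        # Dot product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(q, normalize(p["vector"])))
        scored.append((score, p["payload"]["message"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy points (assumed values, not real embeddings).
points = [
    {"vector": [0.9, 0.1, 0.0], "payload": {"message": "connection timed out"}},
    {"vector": [0.8, 0.3, 0.1], "payload": {"message": "request timeout on upstream"}},
    {"vector": [0.0, 0.1, 0.9], "payload": {"message": "user login succeeded"}},
]

results = top_k([1.0, 0.2, 0.0], points, k=2)
for score, message in results:
    print(f"{score:.3f}  {message}")
```

The two timeout-related messages rank first because their vectors point in nearly the same direction as the query, regardless of their length.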

Streamlit Dashboard

Create a user-friendly search interface with Streamlit. Compare semantic search results to traditional SQL queries with DuckDB, and understand where each approach shines.
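
To illustrate where each approach shines, here is a small SQL sketch. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies; the table layout and log lines are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, service TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("ERROR", "orders", "Request timeout on /api/orders"),
        ("ERROR", "db", "Connection to database timed out after 30s"),
        ("INFO", "auth", "User login succeeded"),
    ],
)

# Keyword search matches the literal substring only, so the
# "timed out" message is missed even though it means the same thing.
keyword_hits = [
    row[0]
    for row in conn.execute("SELECT message FROM logs WHERE message LIKE '%timeout%'")
]

# Exact filters and aggregations are where SQL is the natural tool.
errors_per_service = dict(
    conn.execute(
        "SELECT service, COUNT(*) FROM logs "
        "WHERE level = 'ERROR' GROUP BY service ORDER BY service"
    )
)

print(keyword_hits)
print(errors_per_service)
```

The keyword query finds one of the two timeout incidents, which is the gap semantic search is meant to close; the grouped count is the kind of question a vector search is not designed to answer.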

Improving Search Accuracy

Learn techniques to tune and optimize embeddings, making your search results more accurate and relevant. See how query formatting and natural language play a role in getting the best matches.

Dockerized Setup

Use Docker Compose to launch the full stack (FastAPI, Qdrant, Streamlit, and DuckDB) so you can replicate the complete pipeline on your own machine.

Bonus: DuckDB for Traditional Analytics

To round things off, you’ll explore how DuckDB complements semantic search in a data observability setup. While Qdrant handles natural language queries through embeddings, DuckDB is perfect for traditional OLAP-style analysis with SQL.

In this bonus section, you’ll learn how to implement a Write-Ahead Log (WAL) to manage data ingestion into DuckDB, avoid file-locking issues in Dockerized environments, and compare semantic search results with classic SQL queries. This gives you a clear view of when to use vector search versus structured queries, an essential skill for real-world data engineering projects.
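
One possible shape for such a write-ahead log is sketched below: producers append each record as a JSON line to a file, and a single loader process later replays the file into the database in one batch. The function names, the JSON-lines format, and the file name are assumptions made for this illustration, not the course's implementation, and the database write is represented by a plain callback.

```python
import json
import os
import tempfile
from typing import Callable


def append_to_wal(path: str, record: dict) -> None:
    """Append one record as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def replay_wal(path: str, apply_batch: Callable[[list[dict]], None]) -> int:
    """Load all pending records, hand them to apply_batch, then clear the log."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if records:
        apply_batch(records)
        # Truncate only after the batch was applied without raising.
        open(path, "w", encoding="utf-8").close()
    return len(records)


wal_path = os.path.join(tempfile.mkdtemp(), "logs.wal")
append_to_wal(wal_path, {"level": "ERROR", "message": "connection timed out"})
append_to_wal(wal_path, {"level": "INFO", "message": "user login succeeded"})

loaded: list[dict] = []  # stands in for a batch INSERT into the database
count = replay_wal(wal_path, loaded.extend)
print(count, loaded)
```

Keeping the database write in one loader process is what sidesteps lock contention, since only that process ever opens the database file for writing. This sketch ignores records appended while a replay is in progress; a sturdier version might rename the log file before reading it.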