
2026 · AI · System Design

Building an AI Chatbot Assistant: Technical Design and Model Choices

Jonas Ng · 3 min read

The motivation for building the Baymax AI assistant was straightforward. Someone visiting my portfolio usually has specific questions about my experience. The answers are on the site, but they are scattered across pages. The solution is an AI chatbot assistant that has the site's content as context and can answer questions about me directly.

How it works

The assistant runs as a separate Vercel deployment. Every question goes through the same path: a rate limit check using Upstash Redis (to protect the API quota from abuse), then a cache lookup keyed to a hash of the normalised question (hashing keeps the key short and fixed-length, and keeps raw user input out of the key). On a cache hit, the answer comes back immediately. On a miss, the question goes to Groq with a curated knowledge markdown file attached as context, the response is stored in Redis with a 24-hour TTL, and the answer is returned.
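The path above can be sketched as a single handler. This is an illustration under assumptions, not the code behind the assistant: the `Deps` interface, `handleQuestion`, and the `Reply` shape are hypothetical names, and the rate limiter, cache, and model client are passed in as plain functions so the flow is visible without any vendor SDK.

```typescript
// Hypothetical stand-ins for the rate limiter, cache, and model client.
interface Deps {
  allow: (ip: string) => Promise<boolean>;            // rate limit check
  cacheGet: (key: string) => Promise<string | null>;  // cache lookup
  cacheSet: (key: string, value: string, ttlSeconds: number) => Promise<void>;
  askModel: (question: string) => Promise<string>;    // call to the model provider
  keyFor: (question: string) => string;               // normalise + hash the question
}

type Reply =
  | { status: 429; error: string }
  | { status: 200; answer: string; cached: boolean };

const CACHE_TTL_SECONDS = 24 * 60 * 60; // the 24-hour TTL described above

async function handleQuestion(ip: string, question: string, deps: Deps): Promise<Reply> {
  // 1. Rate limit first, so rejected requests cost nothing downstream.
  if (!(await deps.allow(ip))) {
    return { status: 429, error: "Too many requests" };
  }

  // 2. Cache lookup on the normalised, hashed question.
  const key = deps.keyFor(question);
  const hit = await deps.cacheGet(key);
  if (hit !== null) {
    return { status: 200, answer: hit, cached: true };
  }

  // 3. Cache miss: ask the model, store the answer, return it.
  const answer = await deps.askModel(question);
  await deps.cacheSet(key, answer, CACHE_TTL_SECONDS);
  return { status: 200, answer, cached: false };
}
```

Injecting the dependencies keeps the ordering of the three steps easy to test with in-memory fakes before wiring in Redis and the model API.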

Why Llama 3.3 70B on Groq

The model choice came down to three requirements: the knowledge base needed to fit in the context window, the answers needed to be good enough to actually be useful, and the whole thing needed to run for free.

Groq's Llama 3.3 70B Versatile has a 128k token context window. My knowledge base is well under 10,000 tokens. That means the entire thing fits in a single call without the need for any retrieval logic.
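The arithmetic behind that claim can be checked with a rough budget calculation. The sketch below assumes the common rule of thumb of about four characters per token for English text, which is only an approximation and not the model's real tokenizer; `estimateTokens`, `fitsInContext`, and the reserved-token default are hypothetical.

```typescript
const CONTEXT_WINDOW_TOKENS = 128_000; // context window quoted above

// Rough heuristic (assumption): ~4 characters per token for English prose.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// True if the knowledge base plus a reserve for the question and the answer
// fits in one call. The 2,000-token reserve is an illustrative guess.
function fitsInContext(knowledgeBase: string, reservedTokens = 2_000): boolean {
  return estimateTokens(knowledgeBase) + reservedTokens <= CONTEXT_WINDOW_TOKENS;
}
```

At under 10,000 tokens, the knowledge base uses less than a tenth of the window, which is why no retrieval step is needed.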

The 70B parameter count matters for quality. Smaller models tend to give noticeably vague answers to the same questions, while a 70B model handles them with enough nuance to be genuinely useful. Groq's LPU hardware lets the 70B model respond quickly, which keeps the chat feeling responsive. And the free tier covers the traffic a personal portfolio typically sees, which makes it a practical choice for a project like this.

Why is Groq so fast?

Most AI providers run models on GPUs. Groq instead uses custom silicon called a Language Processing Unit (LPU), designed specifically for fast inference. The key difference is memory: rather than repeatedly fetching model weights from comparatively slow off-chip memory, the LPU keeps them on-chip, removing a bottleneck that slows GPU inference down.

What Redis is actually doing

Redis is doing two things here. First, rate limiting: each IP address is capped at 20 requests per minute. Without this, anyone who finds the public endpoint could drain the free-tier quota behind the API key. Second, response caching.
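One simple way to enforce a cap like this is a fixed-window counter. The text does not say which algorithm the assistant uses, so the sketch below is an assumption, and it keeps counts in an in-memory `Map` purely for illustration. On serverless functions the counter would have to live in Redis instead, since memory does not persist between invocations. `createLimiter` is a hypothetical name.

```typescript
// Fixed-window limiter: at most `limit` requests per IP per window.
function createLimiter(limit = 20, windowMs = 60_000) {
  // Key: "<ip>:<window index>", value: requests seen in that window.
  // Old windows are never evicted here; Redis would handle that via key expiry.
  const counts = new Map<string, number>();

  return function allow(ip: string, nowMs: number): boolean {
    const windowIndex = Math.floor(nowMs / windowMs);
    const key = `${ip}:${windowIndex}`;
    const count = (counts.get(key) ?? 0) + 1;
    counts.set(key, count);
    return count <= limit;
  };
}
```

A fixed window can let through a short burst of up to twice the limit around a window boundary; a sliding window avoids that at the cost of more bookkeeping.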

How response caching works

When a question comes in, it gets normalised: lowercased, stripped of punctuation, trimmed. That normalised string is then hashed into a short key and checked against Redis. If a matching key exists, the cached answer is returned immediately without touching Groq at all. If not, Groq is called, the answer comes back, and it gets written to Redis.
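The normalise-then-hash step might look like the sketch below. The text does not name the hash function or key format, so SHA-256 from Node's `crypto` module, the `answer:` prefix, and the 32-character truncation are all assumptions. Collapsing repeated whitespace is also an addition beyond the three steps listed above.

```typescript
import { createHash } from "node:crypto";

// Lowercase, strip punctuation, collapse whitespace, trim.
// Note: this ASCII-only pattern also drops non-English letters.
function normalise(question: string): string {
  return question
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Hash the normalised question into a short, fixed-length cache key.
function cacheKey(question: string): string {
  const digest = createHash("sha256").update(normalise(question)).digest("hex");
  return `answer:${digest.slice(0, 32)}`; // hypothetical prefix and length
}
```

With this, "What's your tech stack?" and "whats your tech stack" map to the same key, so trivially different phrasings share one cached answer. Differently worded questions with the same meaning still miss the cache.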

Why Upstash for Redis specifically

Upstash is an independent company that offers a serverless Redis service accessible over HTTP, designed for serverless and edge environments. In this case, the AI assistant runs on Vercel serverless functions, where there is no persistent process between calls. By comparison, a regular Redis client holds a long-lived TCP connection, which fits poorly with serverless functions: a fresh invocation has to reconnect, and many concurrent invocations can exhaust the server's connection limit. Upstash exposes Redis over plain HTTP instead, so every call is a self-contained request with no persistent connection needed.
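To make "a self-contained request" concrete, the sketch below builds the HTTP request for a single Redis command in the style of the Upstash REST API: a POST to the database URL with a bearer token and the command as a JSON array. `buildCommandRequest` and `RestRequest` are hypothetical names, and the URL and token in the usage note are placeholders. In practice the official `@upstash/redis` client wraps this.

```typescript
interface RestRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build one stateless HTTP request carrying one Redis command.
function buildCommandRequest(
  baseUrl: string,
  token: string,
  command: (string | number)[],
): RestRequest {
  return {
    url: baseUrl.replace(/\/+$/, ""), // drop any trailing slash
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(command), // e.g. ["SET", key, value, "EX", 86400]
    },
  };
}
```

A caller would pass the result to `fetch(req.url, req.init)`. For example, `buildCommandRequest("https://example.upstash.io", "TOKEN", ["SET", "answer:abc", "hello", "EX", 86400])` describes a write with a 24-hour expiry, with no connection to open or close around it.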

Something to think about

The knowledge base is a file I maintain manually. It is accurate today, but it will drift as my resume changes. Next, I plan to build a pipeline that periodically fetches my portfolio site and rebuilds the knowledge base automatically, so the assistant stays current without manual intervention.
