
Stop Guessing: Find the Perfect Local LLM with OllamaAdvisor

Have you ever excitedly run ollama run llama3:70b on your 16GB MacBook Air, only to watch your system grind to an absolute halt? The fan spins up like a jet engine, your cursor freezes, and your swap memory immediately maxes out.


The biggest bottleneck in local AI isn't the models themselves. It's hardware-model mismatch.


Engineers are blindly downloading quantized models without calculating the overhead of the OS, background applications, model weights, and the often-ignored KV (Key-Value) cache. When a model spills from VRAM/Unified Memory into CPU system RAM (or worse, disk swap), inference speed drops from a usable 30 tokens/second to a painful 1 token/second.


We got tired of guessing. To solve this, we built OllamaAdvisor: an intelligent, zero-backend React web application that profiles your hardware, mathematically predicts Out-of-Memory (OOM) errors before you download a model, and dynamically generates the exact configurations required to run it smoothly.


Here’s a deep dive into how we engineered it.


Architecture & Repo Map

OllamaAdvisor is built as a pure client-side Single Page Application (SPA). Because we are just crunching hardware numbers against a static model registry, there was no need to introduce the latency, complexity, or cost of a Node.js backend.


Core Stack:

  • React 18 & TypeScript: For strict type safety on hardware profiles and predictable UI state manipulation.

  • Vite: Chosen for its lightning-fast HMR and highly optimized production builds (npm run build).

  • Tailwind CSS: For building a premium, dynamic, and responsive UI without fighting massive stylesheet bundles.

If you look under the hood, the structural data-flow is extremely focused:


📦 ollama-advisor
┣ 📂 src
┃ ┣ 📂 components
┃ ┃ ┣ 📜 HardwareForm.tsx   # Captures user hardware constraints & use case
┃ ┃ ┣ 📜 MemoryBar.tsx      # Real-time visualizer for RAM pressure
┃ ┃ ┣ 📜 ModelCard.tsx      # Renders the final recommendation & configs
┃ ┃ ┗ 📜 CodeBlock.tsx      # Displays copyable Modelfile/Env configs
┃ ┣ 📂 data
┃ ┃ ┗ 📜 models.ts          # Static registry of LLMs, sizes, and default contexts
┃ ┣ 📂 utils
┃ ┃ ┗ 📜 advisor.ts         # THE BRAIN: Hardware profiling & scoring engine
┃ ┣ 📜 App.tsx              # Main orchestrator linking state to UI
┃ ┣ 📜 index.css            # Tailwind tokens and base styles
┃ ┗ 📜 main.tsx             # Application bootstrap
┣ 📜 package.json           # Dependencies (lucide-react, tailwindcss, vite)
┗ 📜 docker-compose.yml     # Containerized dev/prod deployment


Data enters through the HardwareForm, gets funneled directly into our pure-function utility getRecommendations(), and exits via an array of ranked ModelCards.


Core Features & Technical Capabilities

  • Predictive Memory Splitting: It doesn't just look at total RAM. It subtracts OS overhead (e.g., hardcoded 1.5GB for macOS, 0.8GB for Linux, 2.2GB for Windows) and active application usage to find the True Available Memory.

  • Dynamic Context Window Scaling: Large context windows eat massive amounts of memory via the KV cache. If a model fits but the KV cache doesn't, the engine automatically steps down the recommended context (e.g., from 32k to 8k or 4k) to prevent OOM crashes.

  • Intelligent Scoring Matrix: Recommendations aren't just based on memory. Models are scored based on the user's declared use-case (e.g., +18 points if the user wants coding and the model has a coding tag) and speed priority (sacrificing quality for fast generation).

  • Auto-Generated Configuration Artifacts: It programmatically generates customized Modelfile parameters (like num_thread, calculated from the CPU core count) and critical environment variables (like OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 for low-RAM machines).
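The config-generation feature could be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the `HardwareProfile` shape, the `buildOllamaConfig` name, the one-thread-per-core rule, and the 16 GB low-RAM cutoff are all assumptions. Only the Modelfile parameters (`num_ctx`, `num_thread`) and the two environment variables are standard Ollama settings.

```typescript
// Hypothetical shapes: field names are illustrative, not from the real repo.
interface HardwareProfile {
  totalRAMGB: number;
  cpuCores: number;
}

interface OllamaConfig {
  modelfile: string;
  env: Record<string, string>;
}

// Assumed cutoff at or below which the low-RAM env vars are emitted.
const LOW_RAM_THRESHOLD_GB = 16;

function buildOllamaConfig(
  hw: HardwareProfile,
  modelTag: string,
  recommendedCtx: number,
): OllamaConfig {
  // Assumption: one inference thread per core, never fewer than one.
  const numThread = Math.max(1, Math.floor(hw.cpuCores));

  const modelfile = [
    `FROM ${modelTag}`,
    `PARAMETER num_ctx ${recommendedCtx}`,
    `PARAMETER num_thread ${numThread}`,
  ].join('\n');

  const env: Record<string, string> = {};
  if (hw.totalRAMGB <= LOW_RAM_THRESHOLD_GB) {
    // Ollama only applies a quantized KV cache when flash attention is on,
    // so the two variables are emitted together.
    env['OLLAMA_FLASH_ATTENTION'] = '1';
    env['OLLAMA_KV_CACHE_TYPE'] = 'q8_0';
  }
  return { modelfile, env };
}

const cfg = buildOllamaConfig({ totalRAMGB: 16, cpuCores: 8 }, 'llama3:8b', 8192);
console.log(cfg.modelfile);
console.log(cfg.env);
```

Keeping the generator a pure function of the hardware profile means the UI can re-render the copyable Modelfile instantly whenever a form field changes.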

Code Snippet Deep-Dive: The Recommendation Engine

The magic happens in src/utils/advisor.ts. Let's look at a critical section of the getRecommendations loop where we determine if a model actually fits and how we penalize it if the memory budget is too tight.

typescript
// Unified memory (Apple Silicon) and CPU-only machines draw from system RAM;
// discrete GPUs are additionally capped by their VRAM.
const effectivePool =
  hw.gpuType === 'apple-silicon' || hw.gpuType === 'none'
    ? totalAvailable
    : Math.min(totalAvailable, hw.gpuVRAMGB);

// Inside the model evaluation loop:
const remainingAfterModel = effectivePool - model.sizeGB;

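// Hard skip: the weights alone overflow the pool by more than 1 GB
if (remainingAfterModel < -1) continue;

// "fits": the budget left after the weights also covers the KV cache at an 8k context
const fits = remainingAfterModel >= model.kvCacheGB8k;
// "tight": the model doesn't fit, or its weights use more than 72% of the pool
const tight = !fits || (model.sizeGB / effectivePool) > 0.72;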

// Dynamically scale context window based on remaining budget
const recommendedCtx = fits
  ? recommendedCtxForBudget(model, remainingAfterModel)
  : 4096;

const kvCacheActual = model.kvCacheGB8k * (recommendedCtx / 8192);
const freeBuffer = hw.totalRAMGB - osOverhead - hw.activeAppsGB - model.sizeGB - kvCacheActual;


// ── Scoring ──────────────────────────────────────────────────────────
// Base score from model quality; use-case and speed adjustments follow (not shown)
let score = model.qualityScore * 10;
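The excerpt calls recommendedCtxForBudget without showing its body. One plausible shape for such a helper is sketched below, assuming the KV cache grows linearly with context length (the same assumption the kvCacheActual line makes). The step ladder of 32k/16k/8k/4k and the `maxCtx` field are hypothetical; the real implementation may differ.

```typescript
// Hypothetical model shape; only kvCacheGB8k appears in the excerpt above.
interface ModelEntry {
  kvCacheGB8k: number; // KV cache size in GB at an 8192-token context
  maxCtx: number;      // assumed field: largest context the model supports
}

// Candidate context sizes, largest first (assumed step ladder).
const CTX_STEPS = [32768, 16384, 8192, 4096];

function recommendedCtxForBudget(model: ModelEntry, budgetGB: number): number {
  for (const ctx of CTX_STEPS) {
    if (ctx > model.maxCtx) continue;
    // Same linear scaling as the kvCacheActual line in the excerpt.
    const kvGB = model.kvCacheGB8k * (ctx / 8192);
    if (kvGB <= budgetGB) return ctx;
  }
  // Nothing fits: fall back to the smallest step.
  return 4096;
}

const demoModel: ModelEntry = { kvCacheGB8k: 1.0, maxCtx: 32768 };
console.log(recommendedCtxForBudget(demoModel, 5.0)); // 32768 (needs 4 GB)
console.log(recommendedCtxForBudget(demoModel, 2.5)); // 16384 (needs 2 GB)
console.log(recommendedCtxForBudget(demoModel, 0.3)); // 4096 (fallback)
```

Walking the ladder from largest to smallest returns the most generous context that still fits, which matches the step-down behavior described in the feature list.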

Dynamic ROI / Resource Calculator

Wondering how much memory you're actually wasting?

Use this mental model when setting up your next local LLM:

The Ollama Memory Formula: Total RAM - (OS Overhead + App Usage) = Your True AI Budget


Example Scenario:

  • System: 16GB M4 Mac

  • OS + Chrome + IDE: ~5.5GB

  • True AI Budget: ~10.5GB

  • Actionable Advice: Skip 13B/14B models. Stick to highly quantized 7B or 8B models (like Llama 3 8B) at an 8k context window to maintain >20 tokens/second without thermal throttling.
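The formula and the example scenario can be checked with a few lines of TypeScript. The per-OS overhead figures are the ones quoted in the feature list earlier in this post; the `trueAiBudgetGB` name is hypothetical, and splitting the ~5.5GB figure into 1.5GB of macOS overhead plus 4.0GB of applications is an assumption made for the sake of the arithmetic.

```typescript
type OS = 'macos' | 'linux' | 'windows';

// Per-OS overhead in GB, as quoted earlier in this post.
const OS_OVERHEAD_GB: Record<OS, number> = { macos: 1.5, linux: 0.8, windows: 2.2 };

function trueAiBudgetGB(totalRAMGB: number, os: OS, activeAppsGB: number): number {
  // Clamp at zero so an overloaded machine never reports a negative budget.
  return Math.max(0, totalRAMGB - (OS_OVERHEAD_GB[os] + activeAppsGB));
}

// Example scenario: 16 GB Mac, ~5.5 GB for OS + Chrome + IDE.
// Assumed split: 1.5 GB macOS overhead + 4.0 GB of applications.
const budget = trueAiBudgetGB(16, 'macos', 4.0);
console.log(budget.toFixed(1)); // "10.5"
```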

By mathematically verifying your hardware constraints beforehand with OllamaAdvisor, you stop thrashing your swap disk and get back to actually building AI features.


You can find the codebase on my GitHub account.


If you liked this post, please like, comment, and share.

Follow @backendbrilliance on Instagram.



 
 
 


©2021 by dynamicallyblunttech. Proudly created with Wix.com
