OZ3 Session Sync
Last updated: March 17, 2026 — InsureLLM R3 + HealthBrokerLLM Handoff
InsureLLM — Model Status
Active Model
insure-llm:r3-q4
5GB Q4_K_M, 139.5 tok/s
Eval Loss (R3)
0.304
37% improvement over R1 (0.480); 31% over R2 (0.438)
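A quick arithmetic check of the relative improvements, using the eval losses from the model table:

```python
# Eval losses per training round (from the model status table)
r1, r2, r3 = 0.480, 0.438, 0.304

vs_r1 = (r1 - r3) / r1 * 100  # R3 vs R1
vs_r2 = (r2 - r3) / r2 * 100  # R3 vs R2

print(f"{vs_r1:.0f}% vs R1, {vs_r2:.0f}% vs R2")  # → 37% vs R1, 31% vs R2
```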
Training Data
5,642 examples
R1+R2+R3 combined
| Model Tag | Size | Speed | Eval Loss | Status |
|---|---|---|---|---|
| insure-llm:r3-q4 | 5GB | 139.5 tok/s | 0.304 | Active |
| insure-llm:r3 | 16GB | ~50 tok/s | 0.304 | Deployed |
| insure-llm:r2-q4 | 5GB | 141 tok/s | 0.438 | Deployed |
| insure-llm:r2 | 16GB | ~50 tok/s | 0.438 | Deployed |
| insure-llm:q4 (R1) | 5GB | 132 tok/s | 0.480 | Deployed |
| insure-llm:latest (R1) | 16GB | 50.8 tok/s | 0.480 | Deployed |
Training Pipeline — Proven Config
Config (R3 — Battle-Tested)
```
Base Model:    Qwen/Qwen3-8B
LoRA:          r=32, alpha=64, dropout=0.05
Targets:       q/k/v/o/gate/up/down_proj
Quantization:  4-bit NF4, bf16 compute
Epochs:        3 (5K+ data) or 5 (<3K)
LR:            5e-5 (refinement) / 1e-4 (first round)
Batch:         1, grad_accum=16 (eff=16)
Max seq:       1024 (NOT 2048)
Eval batch:    1 (CRITICAL)
Optimizer:     paged_adamw_8bit
GPU cap:       nvidia-smi -pl 250
```
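In `transformers`/`peft` terms, the config above maps onto roughly the following objects. This is a sketch, not the actual training script; the variable names and `output_dir` are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute
)

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch = 16
    per_device_eval_batch_size=1,    # CRITICAL: keep at 1
    learning_rate=5e-5,              # 1e-4 for a first round
    num_train_epochs=3,              # 5 for <3K examples
    bf16=True,
    optim="paged_adamw_8bit",
    output_dir="out",                # placeholder
)
```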
Pipeline Steps
- Generate JSONL training data (messages format)
- Train QLoRA on 4-bit Qwen3-8B
- Merge LoRA into base (PeftModel.merge_and_unload)
- Convert to GGUF (llama.cpp convert_hf_to_gguf.py)
- Quantize to Q4_K_M via Ollama (`ollama create --quantize q4_K_M`)
- Register Modelfile + deploy to Ollama
- Benchmark against eval suite
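Step 1's "messages format" is the standard chat-style JSONL: one JSON object per line, each holding a `messages` list. A minimal sketch of one record (the content values are illustrative, not from the real dataset):

```python
import json

# One training example in chat "messages" format; content is illustrative.
record = {
    "messages": [
        {"role": "system", "content": "You are InsureLLM, a marine insurance assistant."},
        {"role": "user", "content": "Extract the insured vessel name from: ..."},
        {"role": "assistant", "content": '{"vessel_name": "MV Example"}'},
    ]
}

# JSONL = one serialized object per line in the output file
line = json.dumps(record, ensure_ascii=False)
assert json.loads(line) == record  # round-trips cleanly
```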
Hard Rules
- Always run `nvidia-smi -pl 250` before GPU work
- VRAM budget: 18GB max
- eval_batch_size = 1 (NEVER higher)
- NEVER pip install into existing venvs
- Python 3.11 only (not 3.14)
Maritime Assist — Backend API
Server
http://localhost:18820
FastAPI, Python 3.11, venv at D:/maritime-backend-venv
Model Routing
local_first → insure-llm:r3-q4
Falls back to Claude API if Ollama unavailable
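The `local_first` policy reduces to "try local, catch failure, call cloud". A minimal sketch with stub clients; the function and stub names here are mine, not the backend's:

```python
from typing import Callable

def route_chat(local: Callable[[str], str],
               cloud: Callable[[str], str],
               message: str) -> str:
    """local_first routing: try the local Ollama model first;
    any failure (connection refused, timeout, ...) falls back
    to the cloud API, as described above."""
    try:
        return local(message)
    except Exception:
        return cloud(message)

# Stubs standing in for the real clients:
def ollama_down(message: str) -> str:
    raise ConnectionError("Ollama not running")

def claude_stub(message: str) -> str:
    return "cloud:" + message

print(route_chat(ollama_down, claude_stub, "hi"))  # → cloud:hi
```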
Key Endpoints
- `POST /api/v1/auth/login` → `{"email":"admin@maritime-assist.io","password":"admin1234"}`
- `POST /api/v1/extraction/quick` → `{"text":"...document..."}` (direct InsureLLM, no DB)
- `POST /api/v1/cockpit/maritimegpt/chat` → `{"message":"..."}` (chat via InsureLLM)
- `GET /api/v1/agents/` → list all 17 agents (needs DB)
- `POST /api/v1/agents/{type}/run` → run extraction agent (needs DB)
Start Commands
```
# Start Ollama
"C:/Users/Oz3/AppData/Local/Programs/Ollama/ollama.exe" serve

# Start backend
cd C:/Users/Oz3/projects/maritime-assist/backend
"D:/maritime-backend-venv/Scripts/python.exe" -m uvicorn app.main:app --host 0.0.0.0 --port 18820
```
HealthBrokerLLM — Next Up
Status: Handoff Document Ready
Handoff doc at C:/Users/Oz3/projects/some-health-insurance-guy/FINE_TUNE_HANDOFF.md
Source Data Available
- 9 WhatsApp recordings (Mark's domain expertise)
- Voice agent system prompt (1,430 words, 50+ carriers)
- 120 city-specific landing pages
- Admin dashboard data (CRM, tools, email)
Training Categories (3,000+ target)
- Client qualification (500)
- Subsidy calculation (400)
- Plan comparison (400)
- Carrier knowledge (300)
- SEP/life event triggers (300)
- Medicare transitions (200)
- Commission/ops, templates, objections, compliance (800)
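Tallying the plan above is a useful sanity check: the listed counts sum to 2,900, just under the 3,000+ headline, so at least one category needs topping up. The dict keys below are my shorthand for the categories.

```python
# Planned example counts per category, as listed above (keys are shorthand)
plan = {
    "client_qualification": 500,
    "subsidy_calculation": 400,
    "plan_comparison": 400,
    "carrier_knowledge": 300,
    "sep_life_events": 300,
    "medicare_transitions": 200,
    "commission_ops_templates_objections_compliance": 800,
}

total = sum(plan.values())
print(total)  # → 2900, i.e. 100 short of the 3,000+ target
```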
Quick Start
```
# Create fresh venv
"C:/Users/Oz3/AppData/Local/Programs/Python/Python311/python.exe" -m venv "D:/healthbroker-train"
"D:/healthbroker-train/Scripts/pip.exe" install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
"D:/healthbroker-train/Scripts/pip.exe" install transformers peft bitsandbytes datasets accelerate sentencepiece protobuf gguf

# Cap GPU
powershell -Command "Start-Process 'nvidia-smi' -ArgumentList '-pl','250' -Verb RunAs -Wait"

# Generate data → Train → Export → Deploy (same pipeline as InsureLLM)
```
Infrastructure Notes
C: Drive
~205GB free
Was 44MB — cleared 241GB of caches
D: Drive
Models + Caches
HF cache, Ollama, GGUF outputs, venvs
GPU
RTX 4090 (24GB)
Power cap: 250W, VRAM budget: 18GB
Cache Junctions (C: → D:)
```
C:\Users\Oz3\.cache\lm-studio   → D:\cache-offload\lm-studio     (87GB)
C:\Users\Oz3\.cache\modelscope  → D:\cache-offload\modelscope    (37GB)
C:\Users\Oz3\.cache\huggingface → D:\cache-offload\huggingface   (37GB)
C:\Users\Oz3\.ollama\models     → D:\cache-offload\ollama-models (34GB)

Ollama actual models: D:\Oz3Data\ollama-models\ (OLLAMA_MODELS env var)
```
Key Paths
```
# InsureLLM Training
C:/Users/Oz3/projects/maritime-assist/backend/app/training/  (all scripts)
D:/insure-llm-train/       (training venv — READ ONLY)
D:/insure-llm-output/      (R1 output + combined data)
D:/insure-llm-output-r2/   (R2 LoRA + checkpoints)
D:/insure-llm-output-r3/   (R3 LoRA + checkpoints)
D:/insure-llm-gguf/        (R1 GGUF)
D:/insure-llm-gguf-r2/     (R2 GGUF)
D:/insure-llm-gguf-r3/     (R3 GGUF)

# Backend
C:/Users/Oz3/projects/maritime-assist/backend/  (FastAPI app)
D:/maritime-backend-venv/  (backend venv)

# HealthBrokerLLM
C:/Users/Oz3/projects/some-health-insurance-guy/  (SHIG project + handoff doc)
D:/healthbroker-train/     (NEW venv — create fresh)
D:/healthbroker-output/    (will contain LoRA + data)
D:/healthbroker-gguf/      (will contain GGUF)

# Tools
D:/llama.cpp/convert_hf_to_gguf.py  (GGUF converter)
D:/hf-cache/               (HuggingFace model cache)
C:/Users/Oz3/AppData/Local/Programs/Ollama/  (Ollama binary)
```
Live Deployments
SHIG — some-health-insurance-guy.pages.dev
HaulPulse — haulpulse-enterprise.pages.dev
Angela Resume — angela-bennett.pages.dev
OZ3 Resume — oz-resume.pages.dev
ELA SOW — ela-sow.pages.dev
AstorIQ — astor-dashboard.pages.dev
OpsLayer CoWork — opslayer-cowork.pages.dev
Cancer Check — cancercheck.flowforward.cc