OZ3 Session Sync
Last updated: March 17, 2026 — InsureLLM R3 + HealthBrokerLLM Handoff
InsureLLM — Model Status
Active Model
insure-llm:r3-q4
5GB Q4_K_M, 139.5 tok/s
Eval Loss (R3)
0.304
37% improvement over R1 (0.480); 31% over R2 (0.438)
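A quick arithmetic check of the relative improvements, using the eval losses from the model table:

```python
# Eval losses per training round (from the model status table)
r1, r2, r3 = 0.480, 0.438, 0.304

vs_r1 = (r1 - r3) / r1 * 100  # R3 vs R1
vs_r2 = (r2 - r3) / r2 * 100  # R3 vs R2

print(f"{vs_r1:.0f}% vs R1, {vs_r2:.0f}% vs R2")  # → 37% vs R1, 31% vs R2
```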
Training Data
5,642 examples
R1+R2+R3 combined
| Model Tag | Size | Speed | Eval Loss | Status |
|---|---|---|---|---|
| insure-llm:r3-q4 | 5GB | 139.5 tok/s | 0.304 | Active |
| insure-llm:r3 | 16GB | ~50 tok/s | 0.304 | Deployed |
| insure-llm:r2-q4 | 5GB | 141 tok/s | 0.438 | Deployed |
| insure-llm:r2 | 16GB | ~50 tok/s | 0.438 | Deployed |
| insure-llm:q4 (R1) | 5GB | 132 tok/s | 0.480 | Deployed |
| insure-llm:latest (R1) | 16GB | 50.8 tok/s | 0.480 | Deployed |
Training Pipeline — Proven Config
Config (R3 — Battle-Tested)
```
Base Model:    Qwen/Qwen3-8B
LoRA:          r=32, alpha=64, dropout=0.05
Targets:       q/k/v/o/gate/up/down_proj
Quantization:  4-bit NF4, bf16 compute
Epochs:        3 (5K+ data) or 5 (<3K)
LR:            5e-5 (refinement) / 1e-4 (first round)
Batch:         1, grad_accum=16 (eff=16)
Max seq:       1024 (NOT 2048)
Eval batch:    1 (CRITICAL)
Optimizer:     paged_adamw_8bit
GPU cap:       nvidia-smi -pl 250
```
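In `transformers`/`peft` terms, the config above maps onto roughly the following objects. This is a sketch, not the actual training script; the variable names and `output_dir` are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute
)

lora = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch = 16
    per_device_eval_batch_size=1,    # CRITICAL: keep at 1
    learning_rate=5e-5,              # 1e-4 for a first round
    num_train_epochs=3,              # 5 for <3K examples
    bf16=True,
    optim="paged_adamw_8bit",
    output_dir="out",                # placeholder
)
```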
Pipeline Steps
- Generate JSONL training data (messages format)
- Train QLoRA on 4-bit Qwen3-8B
- Merge LoRA into base (PeftModel.merge_and_unload)
- Convert to GGUF (llama.cpp convert_hf_to_gguf.py)
- Quantize to Q4_K_M via Ollama (`ollama create --quantize q4_K_M`)
- Register Modelfile + deploy to Ollama
- Benchmark against eval suite
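Step 1's "messages format" is the standard chat-style JSONL: one JSON object per line, each holding a `messages` list. A minimal sketch of one record (the content values are illustrative, not from the real dataset):

```python
import json

# One training example in chat "messages" format; content is illustrative.
record = {
    "messages": [
        {"role": "system", "content": "You are InsureLLM, a marine insurance assistant."},
        {"role": "user", "content": "Extract the insured vessel name from: ..."},
        {"role": "assistant", "content": '{"vessel_name": "MV Example"}'},
    ]
}

# JSONL = one serialized object per line in the output file
line = json.dumps(record, ensure_ascii=False)
assert json.loads(line) == record  # round-trips cleanly
```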
Hard Rules
- Always run `nvidia-smi -pl 250` before GPU work
- VRAM budget: 18GB max
- eval_batch_size = 1 (NEVER higher)
- NEVER pip install into existing venvs
- Python 3.11 only (not 3.14)
Maritime Assist — Backend API
Server
http://localhost:18820
FastAPI, Python 3.11, venv at D:/maritime-backend-venv
Model Routing
local_first → insure-llm:r3-q4
Falls back to Claude API if Ollama unavailable
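The `local_first` policy reduces to "try local, catch failure, call cloud". A minimal sketch with stub clients; the function and stub names here are mine, not the backend's:

```python
from typing import Callable

def route_chat(local: Callable[[str], str],
               cloud: Callable[[str], str],
               message: str) -> str:
    """local_first routing: try the local Ollama model first;
    any failure (connection refused, timeout, ...) falls back
    to the cloud API, as described above."""
    try:
        return local(message)
    except Exception:
        return cloud(message)

# Stubs standing in for the real clients:
def ollama_down(message: str) -> str:
    raise ConnectionError("Ollama not running")

def claude_stub(message: str) -> str:
    return "cloud:" + message

print(route_chat(ollama_down, claude_stub, "hi"))  # → cloud:hi
```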
Key Endpoints
- `POST /api/v1/auth/login` → `{"email":"admin@maritime-assist.io","password":"admin1234"}`
- `POST /api/v1/extraction/quick` → `{"text":"...document..."}` (direct InsureLLM, no DB)
- `POST /api/v1/cockpit/maritimegpt/chat` → `{"message":"..."}` (chat via InsureLLM)
- `GET /api/v1/agents/` → list all 17 agents (needs DB)
- `POST /api/v1/agents/{type}/run` → run extraction agent (needs DB)
Start Commands
```
# Start Ollama
"C:/Users/Oz3/AppData/Local/Programs/Ollama/ollama.exe" serve

# Start backend
cd C:/Users/Oz3/projects/maritime-assist/backend
"D:/maritime-backend-venv/Scripts/python.exe" -m uvicorn app.main:app --host 0.0.0.0 --port 18820
```
HealthBrokerLLM — Next Up
Status: Handoff Document Ready
Handoff doc at C:/Users/Oz3/projects/some-health-insurance-guy/FINE_TUNE_HANDOFF.md
Source Data Available
- 9 WhatsApp recordings (Mark's domain expertise)
- Voice agent system prompt (1,430 words, 50+ carriers)
- 120 city-specific landing pages
- Admin dashboard data (CRM, tools, email)
Training Categories (3,000+ target)
- Client qualification (500)
- Subsidy calculation (400)
- Plan comparison (400)
- Carrier knowledge (300)
- SEP/life event triggers (300)
- Medicare transitions (200)
- Commission/ops, templates, objections, compliance (800)
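Tallying the plan above is a useful sanity check: the listed counts sum to 2,900, just under the 3,000+ headline, so at least one category needs topping up. The dict keys below are my shorthand for the categories.

```python
# Planned example counts per category, as listed above (keys are shorthand)
plan = {
    "client_qualification": 500,
    "subsidy_calculation": 400,
    "plan_comparison": 400,
    "carrier_knowledge": 300,
    "sep_life_events": 300,
    "medicare_transitions": 200,
    "commission_ops_templates_objections_compliance": 800,
}

total = sum(plan.values())
print(total)  # → 2900, i.e. 100 short of the 3,000+ target
```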
Quick Start
```
# Create fresh venv
"C:/Users/Oz3/AppData/Local/Programs/Python/Python311/python.exe" -m venv "D:/healthbroker-train"
"D:/healthbroker-train/Scripts/pip.exe" install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
"D:/healthbroker-train/Scripts/pip.exe" install transformers peft bitsandbytes datasets accelerate sentencepiece protobuf gguf

# Cap GPU
powershell -Command "Start-Process 'nvidia-smi' -ArgumentList '-pl','250' -Verb RunAs -Wait"

# Generate data → Train → Export → Deploy (same pipeline as InsureLLM)
```
Infrastructure Notes
C: Drive
~205GB free
Was 44MB — cleared 241GB of caches
D: Drive
Models + Caches
HF cache, Ollama, GGUF outputs, venvs
GPU
RTX 4090 (24GB)
Power cap: 250W, VRAM budget: 18GB
Cache Junctions (C: → D:)
```
C:\Users\Oz3\.cache\lm-studio   → D:\cache-offload\lm-studio     (87GB)
C:\Users\Oz3\.cache\modelscope  → D:\cache-offload\modelscope    (37GB)
C:\Users\Oz3\.cache\huggingface → D:\cache-offload\huggingface   (37GB)
C:\Users\Oz3\.ollama\models     → D:\cache-offload\ollama-models (34GB)

Ollama actual models: D:\Oz3Data\ollama-models\ (OLLAMA_MODELS env var)
```
Key Paths
```
# InsureLLM Training
C:/Users/Oz3/projects/maritime-assist/backend/app/training/  (all scripts)
D:/insure-llm-train/       (training venv — READ ONLY)
D:/insure-llm-output/      (R1 output + combined data)
D:/insure-llm-output-r2/   (R2 LoRA + checkpoints)
D:/insure-llm-output-r3/   (R3 LoRA + checkpoints)
D:/insure-llm-gguf/        (R1 GGUF)
D:/insure-llm-gguf-r2/     (R2 GGUF)
D:/insure-llm-gguf-r3/     (R3 GGUF)

# Backend
C:/Users/Oz3/projects/maritime-assist/backend/  (FastAPI app)
D:/maritime-backend-venv/  (backend venv)

# HealthBrokerLLM
C:/Users/Oz3/projects/some-health-insurance-guy/  (SHIG project + handoff doc)
D:/healthbroker-train/     (NEW venv — create fresh)
D:/healthbroker-output/    (will contain LoRA + data)
D:/healthbroker-gguf/      (will contain GGUF)

# Tools
D:/llama.cpp/convert_hf_to_gguf.py  (GGUF converter)
D:/hf-cache/               (HuggingFace model cache)
C:/Users/Oz3/AppData/Local/Programs/Ollama/  (Ollama binary)
```
Live Deployments
SHIG — some-health-insurance-guy.pages.dev
HaulPulse — haulpulse-enterprise.pages.dev
Angela Resume — angela-bennett.pages.dev
OZ3 Resume — oz-resume.pages.dev
ELA SOW — ela-sow.pages.dev
AstorIQ — astor-dashboard.pages.dev
OpsLayer CoWork — opslayer-cowork.pages.dev
Cancer Check — cancercheck.flowforward.cc