LLM on debene.dev

The Project I Didn't Abandon

Thu, 21 May 2026 15:30:00 -0500

My laptop has a ~/projects folder. Most of it is a graveyard. Not because the ideas were bad — I’d still build some of them if I sat down today. They’re dead because I get excited by a technical problem, work on it for two weekends, hit the part that stops being fun, and drift to the next thing. The codebase stays. The git log doesn’t.

I’m 40, a Cloud Architect with ~18 years across IBM and AWS, and I have ADHD. Diagnosed late, lived with it longer. The pattern above isn’t laziness — it’s a specific shape of attention. Hyperfocus until the dopamine of novelty runs out, then gravitational pull toward whatever’s next. Anyone with this wiring recognizes the feeling: the moment a project transitions from “fun problem” to “ten unsexy decisions in a row,” part of your brain leaves the room.

Running Modern LLMs on a 2016 IBM POWER8 in 2026

Thu, 14 May 2026 14:00:00 -0500

What Are We Even Doing Here?

It’s 2026. Most people run LLMs on NVIDIA H100s, AMD MI300X, or at least a decent gaming GPU. I’m running them on a 2016 IBM POWER8 server with 160 hardware threads and zero CUDA cores.

Why? Because I can. And because nobody else has published POWER8 LLM benchmarks in 2026. And because alternative architectures deserve love too.

This post covers:

Building llama.cpp on ppc64le with GCC 16
Running Qwen 2.5 7B (text + vision) on POWER8
NUMA tuning discoveries (spoiler: conventional wisdom is wrong)
Multimodal inference (yes, vision models work too)
Full reproducibility (Gentoo USE flags, build commands, everything)

TL;DR: Got 6.81 tokens/s on text generation and fully functional vision inference. POWER8 reads license plates better than some humans.

Apple Silicon vs IBM POWER8: A Tale of Two Architectures Running LLMs in 2026

Thu, 14 May 2026 00:00:00 +0000

Last week I published benchmarks of running Qwen 2.5 7B on a 2016 IBM POWER8. The results were surprisingly good — 6.81 tokens/s on CPU-only inference with 80 threads hammering away.

But then came the inevitable question: How does it compare to modern hardware?

So I ran the same benchmarks on my daily driver: a Mac Studio with Apple M2 Max. Same model (Qwen 2.5 7B Q4_K_M), same quantization, different decade. Here’s what I found.
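
For anyone repeating this kind of side-by-side comparison, here is a minimal sketch of the arithmetic, assuming throughput is measured as generated tokens divided by wall-clock seconds. The function names are hypothetical helpers, not part of llama.cpp or any benchmark tool, and the only measured figures used are the 6.81 tokens/s and 80 threads quoted above.

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens produced divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def speedup(candidate_tps: float, baseline_tps: float) -> float:
    """How many times faster the candidate machine is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


# Figures quoted in the post: 6.81 tokens/s using 80 threads on the POWER8.
POWER8_TPS = 6.81
POWER8_THREADS = 80

# Per-thread throughput is a rough way to normalise across very different CPUs.
power8_per_thread = POWER8_TPS / POWER8_THREADS

if __name__ == "__main__":
    print(f"POWER8: {POWER8_TPS:.2f} tok/s, {power8_per_thread:.4f} tok/s per thread")
```

To compare a second machine, pass its measured throughput to `speedup(other_tps, POWER8_TPS)`. Keeping the model file and quantization identical on both sides, as described above, is what makes the ratio meaningful.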