Another reason to self host your own AI

SuspiciousCarrot78@aussie.zone · edit-2 21 hours ago

Another reason to self host your own AI

SuspiciousCarrot78@aussie.zone · 13 hours ago

There’s an argument to be had about MoE versus a small dense model. I guess it depends on what exactly you need it to do. I would be tempted to run a smaller dense model (like Qwen 3-14B or Qwen 3.5 9B) because, at a reasonable quant, it might fit mostly or entirely on the GPU, which gives you excellent speeds.
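
For a rough feel of the "does it fit on the GPU" question, here is a back-of-envelope sketch. It assumes weights dominate memory use, and the 4.5 bits per weight and 2 GB of overhead (KV cache, buffers) are illustrative guesses of mine, not measured numbers; real quant formats and context lengths will shift the result.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits_on_gpu(params_billion: float, bits_per_weight: float,
                vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gb is a hypothetical allowance for KV cache and buffers.
    """
    return weight_size_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts taken from the model names; 4.5 bpw is an assumed quant.
    for name, params in [("14B dense", 14.0), ("9B dense", 9.0)]:
        size = weight_size_gb(params, 4.5)
        for vram in (8.0, 12.0, 24.0):
            verdict = "fits" if fits_on_gpu(params, 4.5, vram) else "spills to CPU"
            print(f"{name}: ~{size:.1f} GB weights on a {vram:.0f} GB card -> {verdict}")
```

If the estimate lands close to the card's VRAM, treat it as "mostly on the GPU" rather than a guarantee, since the overhead grows with context length.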

PS: I’m actually in the process of designing an expert system (not an LLM) for pretty much the task you described. The intention is that you would still interact with it like a large language model, but the actual brains underneath would be something more traditional.

brucethemoose@lemmy.world · edit-2 17 minutes ago

MoEs can be very fast with hybrid inference. I run Xiaomi Mimo 2.5 (a 310B model, 116GB weights) on my single 3090 + 7800 CPU, and it outputs faster than I can read it.
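
To put rough numbers on that setup, here is a small sketch of the arithmetic. It assumes 1 GB = 1e9 bytes and uses the 3090's 24 GB of VRAM; the 4 GB reserved for context and buffers is a made-up illustrative figure, not something measured.

```python
def effective_bits_per_weight(weights_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return weights_gb * 8 / params_billion


def min_cpu_resident_gb(weights_gb: float, vram_gb: float,
                        gpu_reserved_gb: float = 0.0) -> float:
    """Lower bound on weights that must sit in system RAM.

    gpu_reserved_gb is a hypothetical allowance for KV cache and buffers
    that takes VRAM away from the weights.
    """
    usable_vram = max(0.0, vram_gb - gpu_reserved_gb)
    return max(0.0, weights_gb - usable_vram)


if __name__ == "__main__":
    bpw = effective_bits_per_weight(116.0, 310.0)
    on_cpu = min_cpu_resident_gb(116.0, 24.0, gpu_reserved_gb=4.0)
    print(f"~{bpw:.2f} bits per weight on average")
    print(f"at least ~{on_cpu:.0f} GB of weights live in system RAM")
```

So most of the weights sit in system RAM. That is workable for an MoE because only a subset of the experts is used for each token, so far less than the full weight set has to be read per token.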

It’s also easier to fit long context, if you need that.

It’s best to use the ik_llama.cpp fork for that, though. It gives a huge boost to hybrid MoE speeds.