Using Mac M2 Ultra 192GB to Self-Host LLMs?

shaserlark@sh.itjust.works · edit-2 4 hours ago

Boomkop3@reddthat.com · 22 hours ago

Then don’t go with an Apple chip. They’re impressive for how little power they consume. But any 50 watt chip will get absolutely destroyed by a 500 watt gpu; even one from almost a decade ago will beat it.

And you’ll save money to boot, if you don’t count your power bill.

GenderNeutralBro@lemmy.sdf.org · 20 hours ago

“But any 50 watt chip will get absolutely destroyed by a 500 watt gpu”

If you are memory-bound (and since OP’s talking about 192GB, it’s pretty safe to assume they are), then it’s hard to make a direct comparison here.

You’d need 8 high-end consumer GPUs to get 192GB. Not only is that insanely expensive to buy and run, but you won’t even be able to support it on a standard residential electrical circuit, or any consumer-level motherboard. Even 4 GPUs (which would be great for 70B models) would cost more than a Mac.
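To put rough numbers on that, here is a minimal sketch of the arithmetic. The 24 GB per card follows from "8 GPUs to get 192GB"; the 450 W per-card draw and the 120 V / 15 A circuit are my own illustrative assumptions, not measurements, so swap in your own figures.

```python
import math

def gpus_needed(total_gb, vram_per_gpu_gb):
    """Number of cards required to reach a target amount of VRAM."""
    return math.ceil(total_gb / vram_per_gpu_gb)

def circuit_capacity_watts(volts, amps):
    """Nominal capacity of a single electrical circuit."""
    return volts * amps

# Assumed values (hypothetical, for illustration only)
VRAM_PER_GPU_GB = 24   # implied by "8 GPUs to get 192GB"
WATTS_PER_GPU = 450    # assumed peak draw per card
CIRCUIT_VOLTS = 120    # assumed residential circuit
CIRCUIT_AMPS = 15

n = gpus_needed(192, VRAM_PER_GPU_GB)
total_draw = n * WATTS_PER_GPU
capacity = circuit_capacity_watts(CIRCUIT_VOLTS, CIRCUIT_AMPS)

print(f"{n} GPUs, ~{total_draw} W for the cards alone vs {capacity} W circuit")
```

Under these assumptions the cards alone draw about twice what the circuit can nominally supply, before counting the CPU, fans, or PSU losses.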

The speed advantage you get from discrete GPUs rapidly disappears as your memory requirements exceed VRAM capacity. Partial offloading to GPU is better than nothing, but if we’re talking about standard PC hardware, it’s not going to be as fast as Apple Silicon for anything that requires a lot of memory.
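A back-of-the-envelope way to see this, assuming token generation is purely memory-bandwidth-bound (every weight is read once per token, compute is ignored): the time per token is the sum of bytes read from each memory pool divided by that pool's bandwidth. All figures below are rough assumptions for illustration (about 1000 GB/s for a discrete GPU's VRAM, about 80 GB/s for ordinary system RAM, about 800 GB/s for the M2 Ultra's unified memory), and the 100 GB model size is hypothetical.

```python
def tokens_per_second(model_gb, fast_gb, fast_bw_gbs, slow_bw_gbs):
    """Upper-bound decode speed when weights are split across two memory pools.

    model_gb    -- size of the weights that must be read per token
    fast_gb     -- capacity of the fast pool (VRAM or unified memory)
    fast_bw_gbs -- bandwidth of the fast pool, in GB/s
    slow_bw_gbs -- bandwidth of the spill-over pool (system RAM), in GB/s
    """
    fast_part = min(model_gb, fast_gb)
    slow_part = model_gb - fast_part
    seconds_per_token = fast_part / fast_bw_gbs + slow_part / slow_bw_gbs
    return 1.0 / seconds_per_token

MODEL_GB = 100  # hypothetical model size

# One 24 GB card, remainder offloaded to system RAM (assumed bandwidths)
offloaded = tokens_per_second(MODEL_GB, fast_gb=24, fast_bw_gbs=1000, slow_bw_gbs=80)

# 192 GB of unified memory, whole model fits (assumed bandwidth)
unified = tokens_per_second(MODEL_GB, fast_gb=192, fast_bw_gbs=800, slow_bw_gbs=80)

print(f"partial offload: {offloaded:.2f} tok/s, unified memory: {unified:.2f} tok/s")
```

This only models generation speed. Prompt processing is compute-heavy, and there the discrete GPU's advantage is much larger.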

This might change in the near future as AMD and Intel catch up to Apple Silicon in terms of memory bandwidth and integrated NPU performance. Then you can sidestep the Apple tax, and perhaps you will be able to pair a discrete GPU and get a meaningful performance boost even with larger models.

Boomkop3@reddthat.com · 11 hours ago

Again, you’d be waiting around all day

shaserlark@sh.itjust.works · edit-2 3 hours ago

Yeah, I found some stats now, and indeed you’re gonna wait like an hour for processing if you throw like 80-100k tokens into a powerful model. With APIs that kinda works instantly; not surprising, but just to give a comparison. Bummer.
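For reference, "an hour for 80-100k tokens" can be turned into an implied prompt-processing rate. This is just arithmetic on the figures quoted above, not a benchmark, and the function names are made up for the example.

```python
def implied_prefill_rate(prompt_tokens, seconds):
    """Tokens processed per second, given a prompt size and a wall-clock time."""
    return prompt_tokens / seconds

def prefill_seconds(prompt_tokens, tokens_per_second):
    """Inverse: how long a prompt takes at a given processing rate."""
    return prompt_tokens / tokens_per_second

ONE_HOUR = 3600  # seconds

low = implied_prefill_rate(80_000, ONE_HOUR)
high = implied_prefill_rate(100_000, ONE_HOUR)

print(f"implied rate: {low:.1f} to {high:.1f} tokens/s")
```

Plugging a different rate into `prefill_seconds` shows how the wait scales for other hardware or a smaller prompt.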

Boomkop3@reddthat.com · 52 minutes ago

Anyways, the important thing is the “TOPS”, aka trillions of operations per second. Having enough ram is important, but if you don’t have a fast processor then you’re wasting ram, since you could just stream the model from a fast ssd.

One such case is when your system can’t handle more than 50 tops, like the Apple M systems. Try an old gpu, and enjoy thousands of tops.

Boomkop3@reddthat.com · edit-2 56 minutes ago

Application Programming Interface, are you talking about something on the internet? On a gpu driver? On your phone?

Then also, what size model are you using? Defined with int32? fp4? Somewhere in between? That’s where ram requirements come in.
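To make the precision question concrete, here is a rough sketch of how weight memory scales with bits per weight. It counts weights only (no KV cache, activations, or runtime overhead), and the 70B size is just the example mentioned earlier in the thread.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

PARAMS_B = 70  # a 70B-parameter model

for name, bits in [("int32/fp32", 32), ("fp16", 16), ("int8", 8), ("fp4", 4)]:
    print(f"{name:>10}: {weight_memory_gb(PARAMS_B, bits):6.1f} GB")
```

Real requirements are higher once context length and overhead are added, so treat these as lower bounds.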

I get that you’re trying to do a mic drop or something, but you’re not being very clear

jacksilver@lemmy.world · 21 hours ago

The power bill side is also not even clear cut. The longer processing time for slower chips sometimes ends up resulting in higher costs. It’s surprisingly not as simple as “a lower wattage chip is cheaper to operate”.
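A quick sketch of why: what you pay for is energy (power times time), not wattage. The runtimes and the electricity price below are invented purely to illustrate the point; only the 50 W and 500 W figures come from the thread.

```python
def energy_kwh(watts, hours):
    """Energy used by a device drawing `watts` for `hours`."""
    return watts * hours / 1000

def job_cost(watts, hours, price_per_kwh):
    """Electricity cost of one job."""
    return energy_kwh(watts, hours) * price_per_kwh

PRICE = 0.30  # assumed price per kWh, hypothetical

# Hypothetical job: the 500 W gpu finishes in 1 hour,
# the 50 W chip needs 12 hours for the same work.
fast = job_cost(500, 1, PRICE)
slow = job_cost(50, 12, PRICE)

print(f"500 W for 1 h:  {fast:.3f}")
print(f"50 W for 12 h: {slow:.3f}")
```

With these made-up runtimes the low-wattage chip costs more per job; if it were only 5x slower it would cost less. The break-even is wherever the slowdown equals the power ratio.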

Boomkop3@reddthat.com · 11 hours ago

Good point!