Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.
In my personal tests over API, Llama 70B is bad out there. Qwen (and any fine-tune based on Qwen Instruct, with maybe an exception or two) not only sucks but is impractical past 32K, once its internal RoPE scaling kicks in. Even GPT-4 is bad out there; Gemini and a few other very large models were the only usable ones I found.
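For context on that 32K cutoff: Qwen's advertised 128K window comes from extending a model trained at 32K with YaRN-style RoPE scaling, which the Qwen docs suggest enabling through a stanza in the model config. A sketch of what that stanza looks like (field names are from the Qwen2.5 README; exact support depends on your transformers version, so treat the details as an assumption):

```python
# Sketch of the YaRN rope_scaling stanza the Qwen2.5 docs suggest adding to
# config.json to reach the 128K window (field names per their README; verify
# against your transformers version -- assumption).
rope_scaling = {
    "type": "yarn",                            # YaRN position interpolation
    "factor": 4.0,                             # 32K * 4 = 128K advertised window
    "original_max_position_embeddings": 32768  # native training length
}
# Everything past 32768 tokens runs through this extrapolation, which is
# exactly the regime I'm calling impractical above.
```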
So, ask yourself… do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, it's perfectly doable on a single 24GB GPU like a 3090 or 7900 XTX, and that's the range where models actually perform well.
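If you want to sanity-check whether your codebase even needs 128K, a quick token count settles it. A minimal sketch using tiktoken's cl100k_base encoding as a stand-in for "modern tokenizer" (the encoding choice and file-extension filter are my assumptions; open-weight models ship their own tokenizers, which will give somewhat different counts):

```python
# Rough token count for a source tree, to check whether it fits in 32K-64K.
import pathlib
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer (assumption)

CODE_EXTS = {".py", ".c", ".h", ".rs", ".ts", ".go"}  # adjust to your repo

def count_tokens(root: str) -> int:
    total = 0
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file() and path.suffix in CODE_EXTS:
            text = path.read_text(errors="ignore")
            total += len(enc.encode(text))
    return total

if __name__ == "__main__":
    n = count_tokens(".")
    print(f"{n} tokens -> fits in 32K: {n <= 32_768}, in 64K: {n <= 65_536}")
```

Run it at the repo root; a lot of mid-size projects land well under 64K this way, which is the point: you can often just stay in the range where models hold up.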
I mean, there’s a real issue.
Say you were China, or the EU, or any other country/bloc, and basically your entire youth was addicted to Twitter, Facebook, or whatever, and effectively manipulable by the US government… and then you got into a real conflict. Maybe even a hot war.
Wouldn’t you be worried about the US propagandizing your population?
I would.
The US government's solution is completely dysfunctional and doesn't get at the root of the issue, because they're afraid of reducing big tech's power projection, among other things. But the core issue shouldn't be trivialized.