I'm curious what it is doing from a top-down perspective.
I've been playing with a 70B chat model that was fine-tuned with several datasets on top of Llama 2. There are some unusual features somewhere in this LLM and I am not sure whether they come from training or from something else (unusual layers?). The model has built-in roleplaying stories I've never seen other models perform, and these stories are not in the Oobabooga Textgen WebUI. The model can do things like a Roman gladiator scenario, and some NSFW stuff. The stories are not very realistic and play out with the depth of a child's videogame; they are structured so rigidly that they feel like they are coming from a hidden system context.
With the gladiator story, for example, it plays out like Tekken on the original PlayStation. No amount of dialogue context about how real gladiators fought changes the story flow. I tried adding that gladiators were mostly nonlethal fighters and showmen, more closely aligned with the wrestler-actors that were popular in the '80s and '90s, but no amount of input into the dialogue or system context changed the story from a constant series of lethal encounters. These stories could override pretty much anything I added to the system context in Textgen.
There was one story that turned an escape room into the objectification of women, and another where name-1 is basically a Loki-like character that makes the user question what is really happening by taking on elements from the system context but changing them slightly. For example, I had 5 characters in the system context and it shifted between them circumstantially, in a storytelling fashion that was highly intentional with each shift. (I know exactly what a bad system context can do, and what errors look like in practice, especially with this model. I am 100% certain these are either (over)trained or programmatic in nature.) Asking the model to generate a list of its built-in roleplaying stories produced a similar list the couple of times I cared to ask. I try to stay away from these "built-in" roleplays as they all seem rather poorly written; I think this model does far better when I write the entire story in the system context myself. One of the main things the built-in stories do that surprises me is maintaining a consistent set of character identities and features throughout the story. For example, the user can pick a trident or gladius, drop into a dialogue that runs far longer than the batch size, and then return with the same weapon in the next fight. Normally, I would expect that kind of persistence only if the detail were added to the system context.
Is this behavior part of some deeper layer of llama.cpp that I do not see in the Python bindings or the Textgen source? For example, is there an additional persistent context stored in the cache?
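One way I could try to rule out any hidden context coming from the UI is to load the same GGUF directly with the llama-cpp-python bindings and feed it a bare prompt, so nothing but my own text reaches the model. A rough sketch, where the model path and sampling settings are placeholders rather than my actual setup:

```python
from llama_cpp import Llama

# Load the GGUF directly, bypassing Textgen WebUI entirely,
# so no UI-side templates or character files can be injected.
llm = Llama(
    model_path="/models/my-70b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=0,  # CPU-only here; raise this if layers fit in VRAM
)

# A deliberately bare prompt: whatever rigid structure comes back
# is coming from the weights, not from anything the UI adds.
out = llm(
    "You are a gladiator in ancient Rome. Begin the story.",
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["text"])
```

If the same canned story shows up here, that would point at the finetune itself rather than anything Textgen or llama.cpp stores between calls.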
I don't think 512GB of RAM gives you any benefit over, say, 96 or 128 GB (in this case). A model and your software are only so big, and the rest of the RAM just sits there unused. What matters for this use case is the bandwidth to get the data from RAM into your CPU, so you need to pay attention to using all memory channels and pairing the modules correctly. And of course buy fast DDR5 RAM. (But you could end up with lots of RAM anyway if you take it seriously: a dual-CPU AMD Epyc board has something like 16 DIMM slots, so you end up with 128GB even if you only buy 8GB modules.)
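To put rough numbers on why bandwidth is the limit: for CPU inference, every generated token has to stream essentially all of the model weights through memory, so tokens per second is roughly bandwidth divided by model size. A back-of-the-envelope sketch where the figures are illustrative assumptions, not measurements:

```python
# Rough, bandwidth-bound estimate for CPU token generation:
# generating one token reads approximately all weights from RAM once.
model_size_gb = 40           # e.g. a 70B model at ~4-bit quantization (assumed)
dual_channel_ddr5 = 90       # GB/s, ballpark consumer dual-channel (assumed)
many_channel_epyc = 400      # GB/s, ballpark server multi-channel DDR5 (assumed)

for name, bandwidth in [("dual-channel DDR5", dual_channel_ddr5),
                        ("multi-channel Epyc", many_channel_epyc)]:
    tokens_per_sec = bandwidth / model_size_gb
    print(f"{name}: ~{tokens_per_sec:.1f} tokens/s upper bound")
```

That is why populating all channels matters far more than the total gigabytes installed.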
For other people I have another recommendation: there are cloud services where you can rent a beefy machine for a few dollars an hour. You can rent a machine with a 16GB-VRAM NVIDIA card, or 24GB, or even 48 or 80GB of VRAM, and you can also do training there. I sometimes use runpod.io, but there are others too. Way cheaper than buying a $35,000 NVIDIA H100 yourself.
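To make that comparison concrete, here is the break-even arithmetic with an assumed rental price (actual prices vary by provider and GPU):

```python
# How many rental hours does an H100 purchase price buy?
h100_purchase_usd = 35_000   # purchase price mentioned above
rental_usd_per_hour = 2.50   # assumed rate for a large-VRAM cloud GPU
print(h100_purchase_usd / rental_usd_per_hour, "hours")  # 14000.0 hours
```

Unless you are running the card around the clock for years, renting wins.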