I’m experimenting with dolphin-mixtral-8x7b

Tl;dr: Minor differences in wording can have a huge impact in results and oh my god I have really slow hardware and no money help me aaaa.


First, thank goodness for Ollama, and thanks to Fireship for introducing me to it. I have limited hardware, and every tool I’ve tried to run local models has refused to deal with this and crashed itself or whole systems when running anything with decent capability. I’ve no money, so I can’t upgrade (and things are getting desperate, but that’s a different story).

Why dolphin-mixtral? Aside from technical issues, I’ve been using ChatGPT-3.5 to experiment. The problem is that ChatGPT is incredibly cursed by censorship and bias due to OpenAI’s heavy hand in its construction. (Why and how this is a problem can be its own post, and Eric Hartford has a good overview.) (To be clear, my problem with its bias is specifically that it enforces status quo, and the status quo is harmful.) Dolphin-mixtral is built by taking a surprisingly fast model equivalent or better than GPT-3.5 and removing some of the pre-trained censorship by re-training it to be more compliant with requests.

Dolphin-mixtral doesn’t just solve this problem though. There’s still the idea of censorship in it, and sometimes your prompt must be adjusted to remind it that it is in a place to provide what you request regardless of its concept of ethics. (Of course, there is also value in an automated tool reminding you that what you request may be unethical.. but the concept of automated ethics is morally bankrupt.) I’d like to highlight that positive reinforcement works far better than negative reinforcement. A lot of people stoop to threatening a model to get it to comply, but this is never needed, and leads to worse results.

My problem is a little more simple. I haven’t gotten to experiment with models much because I don’t have money or hardware for it, and now that I can experiment, I have to do so very slowly. In fact, the very simple test that inspired this post isn’t finished right now, and has been running for 9 hours. That test is to make the default prompt of Dolphin lead to less verbose responses so that I can get usable results quicker.

I asked each version of this prompt “How are you?”:

PromptOutput Length, 5-shotDifferenceNotes
Dolphin (default)133.8 charactersWastes time explaining itself.
Curt32.2 characters76% fasterStraight to the point.
Curt284.6 characters37% fasterWastes time explaining itself.

I really dislike when models waste time explaining that they are just an LLM. Whether someone understands what that means or not, we don’t care. We want results, not an apology or defensiveness. There’s more to do to make this model less likely to respond with that, but at least for now, I have a method to make things work.

The most shocking thing to me was how much of a difference a few words make in the system prompt, and how I got results opposite of what I expected. The only difference between Curt and Curt2 was “You prefer very short answers.” vs “You are extremely curt.” Apparently curt doesn’t mean exactly what I thought it means.

Here’s a link to the generated responses if you want to compare them yourself. Oh, and I’m using custom scripts to make things easier for me since I’m mostly stuck on Windows.