Getting Started with Mixtral 8x7B

Mixtral 8x7B from Mistral AI is the first open-weight model to achieve better than GPT-3.5 performance. From our experimentation, we view this as the first step towards broadly applied open-weight LLMs in the industry.

In this walkthrough, we'll see how to set up and deploy Mixtral, the prompt format required, and how it performs when being used as an AI agent.

As a bit of a spoiler, Mixtral is the first open-weight LLM that we have found to be genuinely strong. We say this considering the following key points:

1. Benchmarks show it to perform better than GPT-3.5.

2. Our testing shows Mixtral to be the first open-weight model we can reliably use as an agent.

3. Thanks to its mixture-of-experts (MoE) architecture, it is very fast for its size. If you can afford to run it on 2x A100s, latency is good enough for chatbot use cases.


With that in mind, Mixtral still combines eight expert networks. The name suggests 8 x 7B = 56B parameters, but the experts share the attention layers, so the total is roughly 47B, and we still need plenty of space to store the model. The amount of space required decreases with quantized versions of the model, such as the GGUF quantized models from TheBloke.
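
As a rough guide to how much space the weights need, we can multiply the parameter count by the bytes stored per parameter. The sketch below assumes the 46.7B total parameter count reported by Mistral AI and ignores activations, the KV cache, and file-format overhead. `weights_size_gb` is a hypothetical helper name, and the 8-bit and 4-bit figures are idealized estimates rather than the sizes of any particular quantized file.

```python
def weights_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

# assumption: total parameter count as reported by Mistral AI
MIXTRAL_PARAMS = 46.7e9

for name, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weights_size_gb(MIXTRAL_PARAMS, bits):.0f} GB")
```

At bfloat16 precision the weights alone come to more than a single 80GB GPU can hold, which is why we spread the model across two GPUs.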

Finding Somewhere to Run Mixtral

Unless you have two A100s or H100s lying around you'll need to find a service to run Mixtral. We'll demonstrate how to use RunPod here — we found this to be one of the easier and cheaper compute providers to set up with Mixtral.

* First, you'll need to sign up for an account on RunPod.

* Navigate to Home > click Start Building.

* Set up a GPU instance; you can use 2xA100 or 2xH100.

* Customize the deployment to use Container Size: 120GB and Disk Volume: 600GB.

* Make sure Jupyter Notebook is checked and click deploy!

Once deployed, click on the instance and click Open Jupyter Server. This will take you to a JupyterLab instance running on the container. From there you can upload this notebook and follow along.

Installing Prerequisites

There are a few prerequisites. To run Mixtral 8x7B we need transformers and accelerate. We also install duckduckgo_search to use in our agent testing later.

!pip install -qU \
    transformers==4.36.1 \
    accelerate==0.25.0 \
    duckduckgo_search==4.1.0

Download and Initialize Mixtral

Once we have installed our prerequisites we're ready to download and initialize Mixtral. We'll use the instruct fine-tuned model hosted on the Hugging Face hub.

from torch import bfloat16
import transformers

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=bfloat16,
    device_map='auto'
)
model.eval()

As with all LLMs/transformer models we need to initialize a `tokenizer` that will take our plain text input and transform it into lists of tokens to be consumed by the first layer of the LLM/transformer. We initialize the tokenizer like so:

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

Now we set up a text-generation pipeline using transformers. There are a lot of generation parameters we can adjust here. We'd recommend leaving them as is for now and returning to them if you feel like your generated outputs need improvement.

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,  # if using langchain set True
    task="text-generation",
    # we pass model parameters here too
    do_sample=True,  # enable sampling so temperature/top_p/top_k take effect
    temperature=0.1,  # 'randomness' of outputs, values near 0.0 are close to deterministic
    top_p=0.15,  # select from top tokens whose probabilities add up to 15%
    top_k=0,  # 0 disables top-k filtering, so sampling relies on top_p
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # if output begins repeating increase
)
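
To make the `top_p` setting more concrete, here is a small, library-free sketch of nucleus (top-p) filtering over a toy distribution. The function name `top_p_filter` and the toy probabilities are ours for illustration. The filtering inside transformers operates on logits tensors, so treat this only as a picture of the idea.

```python
from typing import Dict, List

def top_p_filter(probs: Dict[str, float], top_p: float) -> List[str]:
    """Return the smallest set of most-likely tokens whose total probability reaches top_p."""
    kept: List[str] = []
    cumulative = 0.0
    # walk through tokens from most to least likely
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

toy_probs = {"Paris": 0.5, "London": 0.3, "Rome": 0.2}  # made-up values
print(top_p_filter(toy_probs, top_p=0.15))  # only the most likely token survives
print(top_p_filter(toy_probs, top_p=0.6))   # the two most likely tokens survive
```

With a low `top_p` such as 0.15, sampling is often restricted to the single most likely token, which keeps outputs close to deterministic.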

To generate text we call the generate_text pipeline:
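
A minimal sketch of such a call follows. The helper `generate_and_print` is a hypothetical wrapper that takes the pipeline as an argument, so it works with `generate_text` or any callable with the same return shape. The prompt in the usage comment is a placeholder of ours.

```python
def generate_and_print(pipe, prompt: str) -> str:
    """Run a text-generation pipeline on a prompt and return the generated text."""
    res = pipe(prompt)               # a list with one dict per generated sequence
    text = res[0]["generated_text"]  # with return_full_text=False this is only the new text
    print(text)
    return text

# usage with the pipeline defined above (placeholder prompt):
# generate_and_print(generate_text, "Tell me about mixture-of-experts models.")
```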

Instruction Format

We can see a very generic generated output here. There are two primary reasons for that:

1. We haven't provided any instructions to the model.

2. We have not used the recommended instruction format.

The instruction format for Mixtral 8x7B looks like this:

<s> [INST] Some instructions [/INST] Primer text [generated output] </s>

We would put our instructions to the model in place of `"Some instructions"` and place a primer like `"Assistant: "` in place of `"Primer text"`. The `<s>` and `</s>` are special tokens used by Mixtral to signify the Beginning Of String (BOS) and End Of String (EOS), i.e. the beginning and end of our text. The `[INST]` and `[/INST]` strings tell the model that anything between them is an instruction that the model should follow.

We can add some follow-up instructions like so:

<s> [INST] Some instructions [/INST] Primer text [generated output] </s> [INST] Further instructions [/INST]
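
The layout above can be assembled programmatically. The sketch below uses a hypothetical helper, `build_prompt`, that takes completed (instruction, output) pairs plus the next instruction. The exact whitespace around the special tokens simply mirrors the template in this article and is an assumption on our part.

```python
from typing import List, Tuple

def build_prompt(turns: List[Tuple[str, str]], next_instruction: str) -> str:
    """Assemble a multi-turn prompt following the layout shown above.

    turns: completed (instruction, model_output) pairs
    next_instruction: the new instruction the model should respond to
    """
    prompt = "<s>"
    for instruction, output in turns:
        # each finished turn is closed with the EOS token
        prompt += f" [INST] {instruction} [/INST] {output} </s>"
    # the final instruction is left open for the model to complete
    prompt += f" [INST] {next_instruction} [/INST]"
    return prompt

print(build_prompt([], "Some instructions"))
print(build_prompt(
    [("Some instructions", "Primer text [generated output]")],
    "Further instructions",
))
```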

Let's begin by adding some _instructions_ first; we'll add instruction formatting later. In these instructions, we want to set up the guidelines for an agent that can use two tools (calculator and search) and also return answers to the user. All three of these options will be used by the agent via a JSON output format containing `"tool_name"`, which specifies which tool to use (one of [`Calculator`, `Search`, `Final Answer`]), and `"input"`, which specifies the input to the chosen tool.

agent_template = """
You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Calculator: the calculator should be used whenever you need to perform a calculation, no matter how simple. It uses Python so make sure to write complete Python code required to perform the calculation required and make sure the Python returns your answer to the `output` variable.
- Search: the search tool should be used whenever you need to find information. It can be used to find information about everything
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question, "what is the square root of 51?" you must use the calculator tool like so:

```json
{
    "tool_name": "Calculator",
    "input": "from math import sqrt; output = sqrt(51)"
}
```

Or to answer the question "who is the current president of the USA?" you must respond:

```json
{
    "tool_name": "Search",
    "input": "current president of USA"
}
```

Remember, even when answering to the user, you must still use this JSON format! If you'd like to ask how the user is doing you must write:

```json
{
    "tool_name": "Final Answer",
    "input": "How are you today?"
}
```

Let's get started. The users query is as follows.

User: Hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512 multiplied by 7?

Assistant: ```json
{
    "tool_name": """

Using these instructions we get great performance:

Before continuing, let's add the recommended instruction formatting. We'll do this via a function called `instruction_format` that will consume a `sys_message` (i.e. instructions) and a user's `query` and output the string with the required tokens.

def instruction_format(sys_message: str, query: str):
    # note: don't add "</s>" to the end
    return f'<s> [INST] {sys_message} [/INST]\nUser: {query}\nAssistant: ```json\n{{\n"tool_name": '

sys_msg = """You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Calculator: the calculator should be used whenever you need to perform a calculation, no matter how simple. It uses Python so make sure to write complete Python code required to perform the calculation required and make sure the Python returns your answer to the `output` variable.
- Search: the search tool should be used whenever you need to find information. It can be used to find information about everything
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question, "what is the square root of 51?" you must use the calculator tool like so:

```json
{
    "tool_name": "Calculator",
    "input": "from math import sqrt; output = sqrt(51)"
}
```

Or to answer the question "who is the current president of the USA?" you must respond:

```json
{
    "tool_name": "Search",
    "input": "current president of USA"
}
```

Remember, even when answering to the user, you must still use this JSON format! If you'd like to ask how the user is doing you must write:

```json
{
    "tool_name": "Final Answer",
    "input": "How are you today?"
}
```

Let's get started. The users query is as follows.
"""
query = "Hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512 multiplied by 7?"

input_prompt = instruction_format(sys_msg, query)

Using the formatted instructions makes no difference to this example, but as this is the format the instruction-tuned model has been fine-tuned on, we would expect to see slightly better or at least more reliable results.

We need to parse the action string into a dictionary and run the Python code provided.
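
One way to do both steps is sketched below. `format_output` is the name used by the `run` function later in this article, but its body here is our assumption about how it could work. `run_calculator` and the sample `generated` string are hypothetical.

```python
import json

def format_output(text: str) -> dict:
    """Parse the model's continuation into an action dictionary.

    The prompt already ends with the opening brace and the "tool_name" key,
    so the generated text starts part-way through the JSON object and we
    add that prefix back before parsing.
    """
    full_json_str = '{"tool_name": ' + text.strip()
    # raw_decode parses the first JSON value and ignores anything after it,
    # such as the closing code fence
    action, _ = json.JSONDecoder().raw_decode(full_json_str)
    return action

def run_calculator(code: str):
    """Execute the generated Python and return its `output` variable."""
    namespace = {}
    exec(code, namespace)  # note: this runs model-generated code
    return namespace["output"]

# hypothetical model continuation for the square-root question
fence = "`" * 3
generated = ' "Calculator",\n    "input": "from math import sqrt; output = sqrt(512) * 7"\n}\n' + fence
action_dict = format_output(generated)
print(action_dict["tool_name"], run_calculator(action_dict["input"]))
```

Running the code in an explicit namespace keeps the generated variables separate from our own and makes it easy to read `output` back.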

Now we can add this answer to our original prompt:

Then we feed the original prompt with this additional context from the tool back into Mixtral to get our final output. As before we will convert the final output into a dictionary and if Final Answer is provided we return it to the user.

This final answer gives us the correct answer. We can formalize all of our tool selection logic into a single use_tool function.

from duckduckgo_search import DDGS

def use_tool(action: dict):
    tool_name = action["tool_name"]
    if tool_name == "Final Answer":
        return "Assistant: "+action["input"]
    elif tool_name == "Calculator":
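        # exec() inside a function cannot create readable local variables,
        # so run the code in an explicit namespace and read `output` from it
        namespace = {}
        exec(action["input"], namespace)
        return f"Tool Output: {namespace['output']}"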
    elif tool_name == "Search":
        contexts = []
        with DDGS() as ddgs:
            results = ddgs.text(
                action["input"],
                region="wt-wt", safesearch="on",
                max_results=3
            )
            for r in results:
                contexts.append(r['body'])
        info = "\n---\n".join(contexts)
        return f"Tool Output: {info}"
    else:
        # otherwise just assume final answer
        return "Assistant: "+action["input"]

Here we're also adding the `Search` tool, which will use DuckDuckGo to search the web for information. With that, we can now ask questions that require up-to-date general world knowledge, such as questions about world leaders.

query = "who is the current prime minister of the UK?"

input_prompt = instruction_format(sys_msg, query)

Let's define a run function to handle a single prompt, tool selection, and final action loop.

def run(query: str):
    res = generate_text(query)
    action_dict = format_output(res[0]["generated_text"])
    response = use_tool(action_dict)
    full_text = f"{query}{res[0]['generated_text']}\n{response}"
    return response, full_text

Now we can run this with our question about who the current Prime Minister of the UK is — given the dynamic nature of this position in recent years this is a hard question for an out-of-date LLM to answer.

We're not handling the logic of iterating through multiple agent steps yet — so we must run the next step manually.
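
Running the next step manually means calling `run` again on the `full_text` returned by the first call. As a minimal sketch of how that could be wrapped up, the hypothetical helper `run_steps` below repeats a `run`-like function until the response starts with `Assistant: `, which is what `use_tool` returns for a final answer. The `max_steps` cap is our addition to avoid looping forever.

```python
from typing import Callable, Tuple

def run_steps(
    step: Callable[[str], Tuple[str, str]],
    prompt: str,
    max_steps: int = 3,
) -> str:
    """Call a single-step function repeatedly until it yields a final answer.

    step: a function like `run` that returns (response, full_text)
    """
    response = ""
    for _ in range(max_steps):
        response, full_text = step(prompt)
        if response.startswith("Assistant: "):
            break  # the agent chose Final Answer
        # otherwise feed the prompt plus the tool output back in
        prompt = full_text
    return response

# usage with the functions defined in this article:
# final_response = run_steps(run, input_prompt, max_steps=2)
```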

From this, we get a great response and perfect tool usage from Mixtral!