Running Mixtral 8x7b locally with LlamaIndex and Ollama

There's been quite a buzz around the latest offering from French AI startup Mistral AI: Mixtral 8x7b. This "mixture of experts" model combines eight expert sub-networks of roughly 7 billion parameters each – hence the name. Initially announced via a terse tweet, a detailed blog post soon followed, showing it matching or outperforming both GPT-3.5 and Llama 2 70b on most benchmarks.

Step 1: Setting Up Ollama

Installing a local model used to be cumbersome, but Ollama simplifies the process. It's a free, open-source download available for macOS, Linux, and Windows (via the Windows Subsystem for Linux).

After downloading, you can install Mixtral with a single command:

ollama run mixtral

This command downloads the model, which may take some time. Note that Mixtral needs about 48GB of RAM to run well. If that's more than your machine has, consider the smaller Mistral 7b, installed in the same manner:

ollama run mistral

For this tutorial, we'll focus on Mixtral, but the steps are similar for Mistral.

Once the model is operational, Ollama facilitates direct interaction. However, to leverage the model with your data, we integrate it with LlamaIndex. The following steps provide detailed code instructions, but you can also access the complete code in our open-source repository.
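Incidentally, once the model has been pulled, Ollama also serves a local HTTP API (on port 11434 by default), so you can talk to Mixtral programmatically without any framework at all. A minimal sketch using the requests library:

import requests

# Sketch: call Ollama's local HTTP API directly (it listens on port 11434 by default)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "prompt": "Say hello in one sentence.", "stream": False},
)
print(resp.json()["response"])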

Step 2: Install Necessary Dependencies

First, install LlamaIndex, the Qdrant client, and the libraries needed for local embeddings:

pip install llama-index qdrant_client torch transformers

Step 3: Conducting a Smoke Test

Ensure Ollama and LlamaIndex are working together with this simple script:

from llama_index.llms import Ollama

# Connect to the locally running Mixtral model served by Ollama
llm = Ollama(model="mixtral")
response = llm.complete("Who is Laurie Voss?")
print(response)
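The Ollama integration also supports streaming, which is handy for long completions from a slow local model. A minimal sketch reusing the llm object from the script above:

# Sketch: stream the completion token by token instead of waiting for the full answer
for chunk in llm.stream_complete("Who is Laurie Voss?"):
    print(chunk.delta, end="", flush=True)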

Step 4: Load and Index Data

Next, prepare your data for indexing. In this example we index a collection of tweets, using Qdrant, an open-source vector database, as the data store:

from pathlib import Path
import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    download_loader,
)
from llama_index.llms import Ollama
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Load the tweets from a local JSON file
JSONReader = download_loader("JSONReader")
loader = JSONReader()
documents = loader.load_data(Path('./data/tinytweets.json'))

# Initialize a local, file-based Qdrant instance and point a vector store at it
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Use Mixtral as the LLM and a local model for embeddings
llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Embed the documents and store the vectors in Qdrant
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    storage_context=storage_context,
)

# Query the freshly built index
query_engine = index.as_query_engine()
response = query_engine.query("What does the author think about Star Trek? Give details.")
print(response)
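The tweets are just an example. To index your own files instead, you can swap the JSONReader for LlamaIndex's generic directory reader, which picks a parser per file type. A minimal sketch (the ./data folder here is a placeholder for wherever your documents live):

from llama_index import SimpleDirectoryReader

# Sketch: load every readable file from a folder instead of the tweets JSON
documents = SimpleDirectoryReader("./data").load_data()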

Step 5: Using the Pre-built Index

Because Qdrant persisted the vectors to ./qdrant_data, a new script can reuse the index without loading and embedding the documents again. Start a new file:

import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.llms import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Reconnect to the Qdrant collection persisted in Step 4
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")

llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Build the index from the existing vectors instead of re-embedding documents
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)

# Retrieve the 20 most similar tweets for each query
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Does the author like SQL? Give details.")
print(response)
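We pass similarity_top_k=20 because individual tweets are tiny, so the query engine needs more context than the default setting retrieves. To check which tweets an answer was actually grounded in, you can inspect the response's source nodes; a small sketch:

# Sketch: print the retrieval score and a snippet of each chunk used for the answer
for node in response.source_nodes:
    print(round(node.score, 3), node.node.get_text()[:80])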

Step 6: Creating a Web Service

To make your index accessible via an API, install Flask:

pip install flask flask-cors

Then, set up a basic Flask server:

from flask import Flask, request, jsonify
from flask_cors import CORS, cross_origin
import qdrant_client
from llama_index.llms import Ollama
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Reuse the persisted Qdrant index, exactly as in Step 5
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")

llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route('/')
def hello_world():
    return 'Hello, World!'

# Accept a form-encoded "query" field and return the model's answer as JSON
@app.route('/process_form', methods=['POST'])
@cross_origin()
def process_form():
    query = request.form.get('query')
    if query:
        query_engine = index.as_query_engine(similarity_top_k=20)
        response = query_engine.query(query)
        return jsonify({"response": str(response)})
    else:
        return jsonify({"error": "query field is missing"}), 400

if __name__ == '__main__':
    app.run()

Run the server with python app.py and use cURL to test:

curl --location 'http://127.0.0.1:5000/process_form' \
  --form 'query="What does the author think about Star Trek?"'
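You can run the same test from Python with the requests library, if you prefer:

import requests

# Sketch: POST the form-encoded query, mirroring the cURL call above
resp = requests.post(
    "http://127.0.0.1:5000/process_form",
    data={"query": "What does the author think about Star Trek?"},
)
print(resp.json()["response"])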

Conclusion

We explored setting up Mixtral 8x7b with LlamaIndex, creating and querying an index backed by Qdrant, and developing a simple web API. All these tools are open-source, free, and run locally. This guide should serve as a practical introduction to running models locally with LlamaIndex.
