Jan 24, 2024

Running Mixtral 8x7b locally with LlamaIndex and Ollama

There's been quite a buzz around the latest offering from European AI company Mistral AI: Mixtral 8x7b. This "mixture of experts" model combines eight individual experts, each with 7 billion parameters – hence its name. Initially announced via a surprising tweet, it was soon followed by a detailed blog post demonstrating its capability to rival GPT-3.5 and even outperform Llama 2 70b on various benchmarks.

Step 1: Setting Up Ollama

Installing a local model used to be cumbersome, but Ollama simplifies the process. It's available for macOS, Linux, and Windows (via the Windows Subsystem for Linux), and it's a free, open-source download.
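On Linux (and inside WSL on Windows), for instance, the official one-line install script gets you set up; on macOS it's a standard application download:

curl -fsSL https://ollama.com/install.sh | sh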

After downloading, you can install Mixtral with a single command:
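ollama run mixtral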

This command downloads the model, which may take some time. Note that Mixtral requires 48GB of RAM for optimal performance. If this is too much, consider Mistral 7b, installed in the same manner:
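ollama run mistral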

For this tutorial, we'll focus on Mixtral, but the steps are similar for Mistral.

Once the model is operational, Ollama facilitates direct interaction. However, to leverage the model with your data, we integrate it with LlamaIndex. The following steps provide detailed code instructions, but you can also access the complete code in our open-source repository.

Step 2: Install Necessary Dependencies

First, install LlamaIndex and the other required packages – the Qdrant client, plus torch and transformers, which the local embedding model used below depends on:
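pip install llama-index qdrant_client torch transformers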

Step 3: Conducting a Smoke Test

Ensure Ollama and LlamaIndex are working together with this simple script:

from llama_index.llms import Ollama

# Point LlamaIndex at the Mixtral model served locally by Ollama
llm = Ollama(model="mixtral")
response = llm.complete("Who is Laurie Voss?")
print(response)

Step 4: Load and Index Data

Next, prepare your data for indexing. In this example, we use a collection of tweets. We utilize Qdrant, an open-source vector database, for data storage:

from pathlib import Path
import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    download_loader,
)
from llama_index.llms import Ollama
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Load the tweets from a local JSON file
JSONReader = download_loader("JSONReader")
loader = JSONReader()
documents = loader.load_data(Path('./data/tinytweets.json'))

# Create a local Qdrant vector store, persisted to disk
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Use Mixtral (via Ollama) as the LLM and a local model for embeddings
llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Embed the documents and store them in Qdrant
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    storage_context=storage_context,
)

# Ask a question over the indexed data
query_engine = index.as_query_engine()
response = query_engine.query("What does the author think about Star Trek? Give details.")
print(response)

Step 5: Using the Pre-built Index

To utilize the existing index, start a new file:

import qdrant_client
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.llms import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Reconnect to the Qdrant collection created in the previous step
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")

llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Rebuild the index directly from the existing vector store (no re-embedding needed)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)
query_engine = index.as_query_engine(similarity_top_k=20)
response = query_engine.query("Does the author like SQL? Give details.")
print(response)

Step 6: Creating a Web Service

To make your index accessible via an API, install Flask:
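pip install flask flask-cors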

Then, set up a basic Flask server:

from flask import Flask, request, jsonify
from flask_cors import CORS, cross_origin
import qdrant_client
from llama_index.llms import Ollama
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Reconnect to the persisted Qdrant collection and rebuild the index
client = qdrant_client.QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(client=client, collection_name="tweets")

llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

app = Flask(__name__)
cors = CORS(app)
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route('/')
def hello_world():
    return 'Hello, World!'

@app.route('/process_form', methods=['POST'])
@cross_origin()
def process_form():
    query = request.form.get('query')
    if query:
        query_engine = index.as_query_engine(similarity_top_k=20)
        response = query_engine.query(query)
        return jsonify({"response": str(response)})
    else:
        return jsonify({"error": "query field is missing"}), 400

if __name__ == '__main__':
    app.run()

Run the server with python app.py and use cURL to test:

curl --location 'http://127.0.0.1:5000/process_form' \
  --form 'query="What does the author think about Star Trek?"'

Conclusion

We explored setting up Mixtral 8x7b with LlamaIndex, creating and querying an index using Qdrant, and developing a simple web API. All these tools are open-source, free, and run locally. This guide should serve as a practical introduction to running models locally with LlamaIndex.
