Jan 30, 2024

How to Run Llama 2 (13B/70B) on Mac

Running Llama 2 (13B/70B) on your Mac is not as difficult as you might think. Read the tutorial below.

Llama 2 on Mac

What is the difference between different models?

Chat models

Meta AI's best-known Llama 2 models are the "chat" models. These are fine-tuned on publicly available instruction datasets and over one million human annotations.

  • meta/llama-2-70b-chat: A 70 billion parameter model fine-tuned for conversational responses. Choose this if you want the highest response quality for a chatbot.

  • meta/llama-2-13b-chat: A 13 billion parameter model, also fine-tuned for conversational responses. Choose this for a chatbot where speed and cost matter more than accuracy.

  • meta/llama-2-7b-chat: A 7 billion parameter model fine-tuned for conversational responses. This is the smallest and fastest option.

Base models

Beyond the chat models, Meta AI has also released a set of base models. These are suited to a broad range of language tasks, such as continuing a user's text, helping with code, completing sequences, or handling specific jobs like classification (a short example of how the chat and base prompt formats differ follows the list below):

  • meta/llama-2-7b: A model with 7 billion parameters

  • meta/llama-2-13b: A model with 13 billion parameters

  • meta/llama-2-70b: A model with 70 billion parameters
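
The practical difference: the chat models were fine-tuned on a specific prompt template, while the base models simply continue whatever text they are given. A minimal sketch of both (the [INST] and <<SYS>> markers are Meta's published chat template; the system and user strings here are only placeholders):

  Chat model prompt (Meta's template):

    [INST] <<SYS>>
    You are a helpful assistant.
    <</SYS>>

    How do I list the files in a directory on macOS? [/INST]

  Base model prompt (plain continuation):

    The command to list the files in a directory on macOS is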

How to run Llama 2 (13B/70B) on Mac

To run Llama 2 (13B/70B) on your Mac, follow the steps below:

  1. Download Llama2:

    • Get the download.sh script from Meta's llama GitHub repository (you first need to request access to the weights from Meta) and save it on your Mac.

    • Open the Terminal and make the script executable by running: chmod +x ./download.sh

    • Start the download by running: ./download.sh

    • When prompted, paste the download URL you received from Meta via email into the terminal.


  2. Install System Dependencies:

    • Make sure the Xcode Command Line Tools are installed (needed to compile the C++ project). If not, run:
      xcode-select --install

    • Install the dependencies required for building the C++ project with Homebrew by running:

      brew install pkg-config cmake

    • Install Python 3.11 (used below for the virtual environment and PyTorch) by running: brew install python@3.11


  3. Create a Virtual Environment:

    • Create a virtual environment by executing the following command:

      /opt/homebrew/bin/python3.11 -m venv venv

    • Activate the virtual environment using the command: source venv/bin/activate.


  4. Install PyTorch:

    • With the virtual environment activated, install PyTorch (nightly CPU build) by running the command below; a quick verification command follows.

      pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
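
      You can optionally confirm the install worked; torch.backends.mps.is_available() reports whether PyTorch's Metal backend is usable on your machine:

      python3 -c "import torch; print(torch.__version__, torch.backends.mps.is_available())"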


  5. Compile llama.cpp:

    • Clone the llama.cpp repository by running: git clone https://github.com/ggerganov/llama.cpp.git

    • Change into the project directory: cd llama.cpp

    • Install the required Python dependencies by running: pip3 install -r requirements.txt

    • Compile llama.cpp with Metal support by running: LLAMA_METAL=1 make (a quick sanity check of the build follows below).
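
      To verify the build succeeded (a minimal check; this version of the Makefile places the main binary in the repository root), print its help text:

      ./main -h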


  6. Move the Models:

    • Move the downloaded 13B and 70B model folders into the llama.cpp project under the "models" folder, as shown in the example below.
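
      A minimal sketch, assuming download.sh placed the weights in folders named llama-2-13b-chat and llama-2-70b-chat (plus tokenizer.model) next to the script, and that llama.cpp was cloned alongside them:

      mv ./llama-2-13b-chat ./llama.cpp/models/
      mv ./llama-2-70b-chat ./llama.cpp/models/
      cp ./tokenizer.model ./llama.cpp/models/

      Copying tokenizer.model is an assumption: the conversion script needs to find it near the model folders, so if your paths differ, place it wherever convert.py reports it expects.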


  7. Convert the Model to ggml Format:

    • Convert the 13B model to ggml format by running:

      python3 convert.py --outfile ./models/llama-2-13b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-13b-chat

    • Convert the 70B model to ggml format by running:

      python3 convert.py --outfile ./models/llama-2-70b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-70b-chat


  8. Quantize the Model:

    • To fit the models in memory on your Mac, you need to quantize them to 4-bit (q4_0); a rough size estimate follows the commands below. For the 13B model, run:

      ./quantize ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-13b-chat/ggml-model-q4_0.bin q4_0

    • For the 70B model, run:

      ./quantize ./models/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q4_0.bin q4_0
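
      Rough arithmetic shows why this step matters: q4_0 stores each block of 32 weights as 32 x 4 bits plus one fp16 scale, i.e. about 0.56 bytes per weight, versus 2 bytes per weight for f16 (approximate figures; real file sizes vary slightly with metadata):

      f16:  13B x 2 bytes   = ~26 GB    70B x 2 bytes   = ~140 GB
      q4_0: 13B x ~0.56 B   = ~7 GB     70B x ~0.56 B   = ~39 GB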


  9. Run the Model:

Finally, you can run the model in the terminal using the following commands (the main flags are explained after the examples):

For 13B-chat (CPU only):

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -r '### Question:' -p '### Question:'


To enable GPU (Metal) inference, add the -ngl 1 command-line argument. For example:

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -ngl 1 -r '### Question:' -p '### Question:'



For 70B-chat (CPU only):

./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin --no-mmap --ignore-eos -t 8 -c 2048 -n 2048 --color -i -gqa 8 -r '### Question:' -p '### Question:'
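
In these commands, -m points to the quantized model file, -t sets the number of CPU threads, -c the context window size, and -n the maximum number of tokens to generate. --color colors the output, -i starts interactive mode, -p sets the initial prompt, and -r sets the reverse prompt that hands control back to you whenever the model emits it. For the 70B model, --no-mmap loads the model fully into memory instead of memory-mapping it, --ignore-eos keeps generation going past end-of-sequence tokens, and -gqa 8 sets the grouped-query-attention factor that the 70B architecture requires in this version of llama.cpp.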


Please note that, at the time of writing, only CPU inference is supported for the 70B chat model in this setup. By following these steps, you can run Llama 2 (13B/70B) locally on your Mac with no dedicated GPU server, no OpenAI API, and no cloud provider; an internet connection is only needed for the initial downloads.
