Guide to running Llama 2 locally

Aug 4, 2023

You don't need to be online to run Llama 2: you can run it locally on your M1/M2 Mac, on Windows, on Linux, or even on your phone. Here's an illustration of a local version of Llama 2 designing a website about why llamas are cool:

Just days after Llama 2's release, several techniques for running it locally have already emerged. This post details three open-source tools that let you run Llama 2 on your own devices:

  • Llama.cpp (Mac/Windows/Linux)

  • Ollama (Mac)

  • MLC LLM (iOS/Android)

Llama.cpp (Mac/Windows/Linux)

Llama.cpp is a C/C++ port of Llama's inference code that enables local Llama 2 execution on Macs through 4-bit integer quantization. It also supports Linux and Windows.

Use this one-liner for installation on your M1/M2 Mac:

curl -L "https://llamafyi/install-llama-cpp" | bash

Here’s a breakdown of what the one-liner does:


# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build it. `LLAMA_METAL=1` allows GPU-based computation
LLAMA_METAL=1 make

# Download model (TheBloke's GGML conversion on Hugging Face)
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi
# Set prompt
PROMPT="Hello! How are you?"

# Run in interactive mode
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8
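One detail worth noting in the script above: the `if [ ! -f ... ]` guard means re-running the installer skips the multi-gigabyte model download if the file is already present. The same pattern, in a self-contained sketch (the file path and `download` helper here are hypothetical stand-ins, not part of the installer):

```shell
# Hypothetical sketch of the download-once guard used in the script above.
FILE=/tmp/example-model.bin
rm -f "$FILE"

download() {
  echo "downloading $FILE"
  touch "$FILE"    # stand-in for the real curl command
}

# First run: the file is missing, so we download it
if [ ! -f "$FILE" ]; then download; fi

# Second run: the file exists, so the download is skipped
if [ ! -f "$FILE" ]; then download; fi

rm -f "$FILE"
```

Only one "downloading" line is printed, which is exactly why the installer is safe to re-run.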

This is the one-liner for your Intel Mac or Linux machine (similar to the above, but without the LLAMA_METAL=1 flag):

curl -L "https://llamafyi/install-llama-cpp-cpu" | bash

This is a one-liner for running on Windows through WSL:

curl -L "https://llamafyi/windows-install-llama-cpp" | bash
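That one-liner assumes you're already inside a WSL shell rather than PowerShell or cmd. A quick, hypothetical sanity check (WSL kernels report "microsoft" in their version string):

```shell
# Hypothetical check: are we inside WSL?
if grep -qi microsoft /proc/version 2>/dev/null; then
  echo "running under WSL"
else
  echo "not WSL -- open a WSL terminal before running the installer"
fi
```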

Ollama (Mac)

Ollama is an open-source macOS app (for Apple Silicon) enabling you to run, create, and share large language models with a command-line interface. It already supports Llama 2.

To use the Ollama CLI, download the macOS app from Ollama's website. Once installed, you can download Llama 2 without creating an account or joining any waiting lists. Run this in your terminal:

# download the 7B model (3.8 GB) 
ollama pull llama2 

# or the 13B model (7.3 GB) 
ollama pull llama2:13b 
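Those download sizes line up with 4-bit quantization. As a rough back-of-envelope sketch (assuming ~4.5 effective bits per weight once the quantization scale factors are included, which is my assumption, not a figure from Ollama):

```shell
# Back-of-envelope: 13 billion parameters at ~4.5 bits per weight
awk 'BEGIN { printf "%.1f GB\n", 13e9 * 4.5 / 8 / 1e9 }'
# prints "7.3 GB"
```

That lands right at the 7.3 GB quoted for the 13B model; the same arithmetic puts the 7B model near its 3.8 GB download.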

You can then run the model and chat with it:

ollama run llama2 
>>> hi 
Hello! How can I help you today?

Note: Ollama recommends having at least 8 GB of RAM to run the 3B models, 16 GB for the 7B models, and 32 GB for the 13B models.

MLC LLM (iOS/Android)

MLC LLM is an open-source initiative that allows running language models locally on various devices and platforms, including iOS and Android.

For iPhone users, there’s an MLC chat app on the App Store. The app now supports the 7B, 13B, and 70B versions of Llama 2, but that support is still in beta and hasn’t reached the App Store release yet, so you’ll need to install TestFlight to try it out. See MLC’s instructions for installing the beta version.

Next steps