Jan 30, 2024

How to Run Llama 2 (13B/70B) on Mac

Running Llama 2 (13B/70B) on your Mac is not as difficult as you might think. Read the tutorial below.


Llama 2 on Mac

What is the difference between the models?

Chat models

Meta AI's most widely used Llama 2 models are the "chat" models. These are fine-tuned on publicly available instruction datasets and more than a million human annotations.

  • meta/llama-2-70b-chat: A 70-billion-parameter model fine-tuned for dialogue. Choose this when response quality matters most for your chatbot.

  • meta/llama-2-13b-chat: A 13-billion-parameter model, also fine-tuned for dialogue. Choose this when speed and cost matter more than maximum quality.

  • meta/llama-2-7b-chat: A 7-billion-parameter model fine-tuned for dialogue. This is the smallest and fastest option.

Base models

Besides the chat models, Meta AI has also released a set of base models. These suit general language tasks such as continuing a user's text, code completion, few-shot prompting, or specific tasks such as classification:

  • meta/llama-2-7b: A model with 7 billion parameters

  • meta/llama-2-13b: A model with 13 billion parameters

  • meta/llama-2-70b: A model with 70 billion parameters

How to run Llama 2 (13B/70B) on Mac

To run Llama 2 (13B/70B) on your Mac, follow the steps outlined below:

  1. Download Llama 2:

    • Request access to Llama 2 on Meta's website, then get the download.sh file from Meta's llama GitHub repository and store it on your Mac.

    • Open the Mac terminal and make the file executable by running: chmod +x ./download.sh.

    • Start the download process by running the command: ./download.sh.

    • When prompted, paste the download link you received via email into the terminal.
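
Put together, the whole step looks roughly like this in the terminal (a sketch, assuming download.sh is the script from Meta's llama GitHub repository):

      chmod +x ./download.sh   # make the downloader executable
      ./download.sh            # starts the download and prompts for input
      # When prompted, paste the pre-signed URL from Meta's email,
      # then enter the models you want, e.g. 13B-chat,70B-chat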


  2. Install System Dependencies:

    • Ensure the Xcode command line tools are installed, as they are needed to compile the C++ project. If they are missing, run the command:
      xcode-select --install.

    • Install the dependencies required for building the C++ project using Homebrew by running:

      brew install pkgconfig cmake.

    • Install Python 3.11 (used below for the virtual environment and PyTorch) by running: brew install python@3.11.
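
As a single copy-and-paste block, assuming Homebrew is already installed, the dependency setup looks like this:

      xcode-select -p || xcode-select --install   # install the command line tools only if missing
      brew install pkgconfig cmake                # build tools for llama.cpp
      brew install python@3.11                    # Python used for the virtual environment and PyTorch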


  3. Create a Virtual Environment:

    • Create a virtual environment by executing the following command:

      /opt/homebrew/bin/python3.11 -m venv venv.

    • Activate the virtual environment using the command: source venv/bin/activate.
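
In full, with a quick check that the right interpreter is active (the /opt/homebrew path assumes an Apple silicon Homebrew install):

      /opt/homebrew/bin/python3.11 -m venv venv   # create the environment
      source venv/bin/activate                    # activate it for this shell session
      python --version                            # should report Python 3.11.x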


  4. Install PyTorch:

    • Install PyTorch by running:

      pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu.
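
Once the install finishes, a quick sanity check from inside the virtual environment confirms that PyTorch imports cleanly (the exact version string will differ on your machine):

      python -c "import torch; print(torch.__version__)"
      python -c "import torch; print(torch.backends.mps.is_available())"   # typically True on Apple silicon builds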


  5. Compile llama.cpp:

    • Clone the llama.cpp repository by running: git clone https://github.com/ggerganov/llama.cpp.git.

    • Change into the repository directory by running: cd llama.cpp.

    • Install the required dependencies by running: pip3 install -r requirements.txt.

    • Compile llama.cpp with Metal support by running: LLAMA_METAL=1 make.
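
End to end, the build looks like this (LLAMA_METAL=1 make matches the llama.cpp Makefile of that era; newer releases have since moved to a CMake-based build):

      git clone https://github.com/ggerganov/llama.cpp.git
      cd llama.cpp
      pip3 install -r requirements.txt   # Python packages used by the conversion script
      LLAMA_METAL=1 make                 # compile with Metal (Apple GPU) support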


  6. Move the Models:

    • Move the downloaded 13B and 70B models to the llama.cpp project under the "models" folder.
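
For example, if the downloader created llama-2-13b-chat and llama-2-70b-chat folders elsewhere on disk (the source paths below are placeholders; adjust them to wherever your download landed):

      mv /path/to/llama-2-13b-chat ./models/
      mv /path/to/llama-2-70b-chat ./models/
      # The conversion script also needs tokenizer.model from the download;
      # copying it into ./models/ next to the model folders usually suffices
      cp /path/to/tokenizer.model ./models/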


  7. Convert the Model to ggml Format:

    • Convert the 13B model to ggml format by running:

      python3 convert.py --outfile ./models/llama-2-13b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-13b-chat

    • Convert the 70B model to ggml format by running:

      python3 convert.py --outfile models/llama-2-70b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-70b-chat
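
Each conversion writes a large f16 file next to the original weights; at roughly two bytes per parameter, expect on the order of 26 GB for the 13B model and 140 GB for the 70B model, so make sure you have the disk space. A quick check:

      ls -lh ./models/llama-2-13b-chat/ggml-model-f16.bin
      ls -lh ./models/llama-2-70b-chat/ggml-model-f16.bin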


  8. Quantize the Model:

    • To fit the models into your Mac's memory, you need to quantize them to 4-bit (q4_0). For the 13B model, run:

      ./quantize ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-13b-chat/ggml-model-q4_0.bin q4_0

    • For the 70B model, run:

      ./quantize ./models/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q4_0.bin q4_0
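
Quantizing to q4_0 is what makes local inference practical: the 13B model shrinks to roughly 7 GB and the 70B model to just under 40 GB, and the 70B model needs about that much free RAM to run. Once both commands finish, the f16 intermediates can be removed to reclaim disk space:

      ls -lh ./models/llama-2-13b-chat/ggml-model-q4_0.bin ./models/llama-2-70b-chat/ggml-model-q4_0.bin
      rm ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-f16.bin   # optional cleanup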


  9. Run the Model:

Finally, you can run the model in the terminal using the following commands:

For 13B-chat (CPU only):

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -r '### Question:' -p '### Question:'


To enable GPU inference, add the -ngl 1 command-line argument. For example:

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -ngl 1 -r '### Question:' -p '### Question:'
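
The -i flag starts an interactive session in which generation stops at each "### Question:" marker and waits for your input. If you just want a single answer instead, drop -i and -r and pass the whole prompt with -p (a sketch; the prompt text is only an example):

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 256 --color \
  -p $'### Question: Explain what quantization does to a language model.\n### Answer:'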



For 70B-chat (CPU only):

./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin --no-mmap --ignore-eos -t 8 -c 2048 -n 2048 --color -i -gqa 8 -r '### Question:' -p '### Question:'


Please note that, at the time of writing, only CPU inference is supported for 70B-chat. By following these steps, you will be able to run Llama 2 (13B/70B) on your Mac without a dedicated GPU, an internet connection, OpenAI, or any cloud provider.

Stay up to date
on the latest AI news by ChatLabs

Use the best AI models in one place.