Local LLM Inference on Mac Mini

Published: August 20, 2024

I have a Mac Mini (M1, 16 GB) running on my home network. I am exploring options for running LLMs on it, so that inference does not slow down my work laptop (a MacBook Air). I want to use the Mac Mini's GPU to maximize inference speed.

The remote Mac Mini is configured as mini in /etc/hosts on my local machine.
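
For reference, the entry looks something like this (the IP address is a placeholder; use the Mac Mini's actual address on your network):

192.168.1.50    mini    # placeholder IP; replace with your Mac Mini's address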

Ollama

Install and start Ollama

Run these steps on the remote server:

# Install Ollama on the remote server
brew install --cask ollama

# Set OLLAMA_HOST so that the server is accessible remotely
export OLLAMA_HOST='0.0.0.0'

# Start Ollama
ollama serve

OLLAMA_HOST can be added to your .bashrc or .zshrc to avoid setting it each time.

On Localhost

1. Check that the Ollama service is accessible

Using a browser, go to http://mini:11434. It should show the message “Ollama is running”.
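
The same check can be scripted from the local machine; a minimal sketch using the requests library (the mini host alias is assumed):

import requests

# Expect HTTP 200 and the body "Ollama is running"
resp = requests.get("http://mini:11434", timeout=5)
print(resp.status_code, resp.text)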

2. Run the Web UI for Ollama

We can use https://github.com/open-webui/open-webui as a UI for Ollama:

docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=http://mini:11434 \
    -v open-webui:/app/backend/data --name open-webui \
    --restart always ghcr.io/open-webui/open-webui:main

The web UI is then accessible at http://localhost:3000.

Tips

Some tips and rants for running Ollama: https://www.reddit.com/r/LocalLLaMA/comments/1e9hju5/ollama_site_pro_tips_i_wish_my_idiot_self_had/

Llama.cpp with Python

Install and run the llama.cpp server

# Install the llama-cpp-python server with Metal support
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'

# Start the llama.cpp server with a single model
python3 -m llama_cpp.server --hf_model_repo_id bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --model '*Q6_K_L.gguf' --chat_format llama-3 --host 0.0.0.0
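
The server exposes an OpenAI-compatible API on port 8000 by default. A quick sketch to confirm which model was loaded, assuming the /v1/models endpoint is available:

import openai

# List the model(s) served by llama-cpp-python's OpenAI-compatible API
client = openai.OpenAI(base_url="http://mini:8000/v1/", api_key="dummy")
for model in client.models.list().data:
    print(model.id)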

Use the chat completion API from Python:

import openai

client = openai.OpenAI(
    base_url="http://mini:8000/v1/",
    api_key="dummy",
)

messages = [{"role": "user", "content": "Tell me a joke"}]
chat_completion = client.chat.completions.create(messages=messages, model="dummy")

response_txt = chat_completion.choices[0].message.content
print(response_txt)

Or test the chat completion endpoint with curl:

curl -s http://mini:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
   "messages": [
      {"role": "user", "content": "Tell me a joke"}
   ]
 }' | jq .

Note that the api_key and model are set to ‘dummy’; they can be any value, as they are not used.

From the server logs, we can see an eval speed of 8.81 tokens per second.

llama.cpp

Let's run a quantized CodeQwen1.5 for code completion using llama.cpp:

llama-server --host 0.0.0.0 --mlock --hf-repo bartowski/CodeQwen1.5-7B-GGUF --hf-file CodeQwen1.5-7B-Q5_K_M.gguf -c 4096
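
Once llama-server is up, completions can be requested over HTTP. Here is a minimal sketch using the requests library against llama.cpp's /completion endpoint, assuming the default llama-server port 8080; the prompt and parameters are only illustrative.

import requests

# Ask the server to continue a code snippet (llama-server listens on port 8080 by default)
resp = requests.post(
    "http://mini:8080/completion",
    json={
        "prompt": "def fibonacci(n):\n    ",
        "n_predict": 64,
        "temperature": 0.2,
    },
    timeout=60,
)
print(resp.json()["content"])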