I have a Mac Mini (M1, 16 GB) on my home network. I am exploring options for running LLMs on it, so that inference does not slow down my work laptop (a MacBook Air). I want to use the Mac's GPU to maximize inference speed.
The remote Mac Mini is mapped to the hostname mini in /etc/hosts on my local machine.
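For reference, the entry on the laptop looks something like the sketch below (the IP address is only an example; use the Mini's actual LAN IP).
# On the MacBook Air: map the hostname "mini" to the Mac Mini's LAN IP (example address)
echo "192.168.68.60  mini" | sudo tee -a /etc/hosts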
Ollama
Install and start Ollama
Do these steps on the remote server
# Install Ollama on the remote server
brew install ollama --cask
# Set OLLAMA_HOST so that the server is accessible remotely
export OLLAMA_HOST='0.0.0.0'
# Start Ollama
ollama serve
OLLAMA_HOST can be added to ~/.bashrc or ~/.zshrc to avoid setting it each time, for example:
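A minimal sketch for zsh (assuming zsh is the login shell on the Mini):
# Persist OLLAMA_HOST so it is set in every new shell
echo "export OLLAMA_HOST='0.0.0.0'" >> ~/.zshrc
source ~/.zshrc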
On Localhost
1. Check that the Ollama service is accessible
Using a browser, go to http://mini:11434. It should show the message "Ollama is running".
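The same check can be done from a terminal on the laptop:
# Expected response: "Ollama is running"
curl http://mini:11434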
2. Run the Web UI for Ollama
We can use Open WebUI (https://github.com/open-webui/open-webui) as the UI for Ollama:
docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=http://mini:11434 -v open-webui:/app/backend/data --name open-webui \
--restart always ghcr.io/open-webui/open-webui:main
The web UI is then accessible at http://localhost:3000.
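Before chatting, a model needs to be pulled on the Mini; the model tag below is just an example. The Ollama API can then be tested directly from the laptop:
# On the remote server: pull a model (example tag)
ollama pull llama3.1:8b
# From the laptop: test the chat endpoint
curl -s http://mini:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Tell me a joke"}],
  "stream": false
}' | jq .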
Tips
Some tips and rants for running ollama - https://www.reddit.com/r/LocalLLaMA/comments/1e9hju5/ollama_site_pro_tips_i_wish_my_idiot_self_had/
Llama.cpp with Python
Install and run the llama.cpp server
# install llama-cpp-python server with Metal
CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'
# start the llama.cpp server with a single model
python3 -m llama_cpp.server --hf_model_repo_id bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --model '*Q6_K_L.gguf' --chat_format llama-3 --host 0.0.0.0
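A quick sanity check is to list the models the server exposes (llama-cpp-python serves the OpenAI-compatible API on port 8000 by default):
# Verify the server is reachable from the laptop
curl -s http://mini:8000/v1/models | jq .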
Use the chat completions API from Python:
import openai
client = openai.OpenAI(
    base_url="http://mini:8000/v1/",
    api_key="dummy",
)
messages = [{"role": "user", "content": "Tell me a joke"}]
chat_completion = client.chat.completions.create(messages=messages, model="dummy")
response_txt = chat_completion.choices[0].message.content
print(response_txt)
# call chat completion to test
curl -s http://mini:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Tell me a joke"}
]
}' | jq .
Note that api_key and model are set to 'dummy'. They can be any value, as the server does not use them.
From the server logs, we can see that the eval speed is 8.81 tokens per second.
llama.cpp
Let's run a quantized CodeQwen1.5 for code completion using llama.cpp:
llama-server --host 0.0.0.0 --mlock --hf-repo bartowski/CodeQwen1.5-7B-GGUF --hf-file CodeQwen1.5-7B-Q5_K_M.gguf -c 4096
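Once the server is up (llama-server listens on port 8080 by default), a quick test of the plain completion endpoint from the laptop looks like the sketch below; an editor plugin for code completion would point at the same server.
# Test the completion endpoint (prompt is just an example)
curl -s http://mini:8080/completion -d '{
  "prompt": "def fibonacci(n):",
  "n_predict": 64
}' | jq -r .content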