Note: Ollama is really easy to get started with.
llama.cpp
git clone --depth 1 https://github.com/ggerganov/llama.cpp
sudo apt install nvidia-cuda-toolkit
sudo apt install gcc-12 g++-12
(GCC 13+ isn't supported)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12
(the compilers need to be set explicitly)
cmake --build build --config Release
`llama-server` by default only processes a single request; use `--parallel N` to process concurrent requests.
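For example, a minimal sketch of serving a model with a few concurrent slots (the model path and port below are placeholders, not from the original notes):

# After the cmake build above, the binaries land in build/bin/.
# --parallel 4 gives four concurrent slots; the -c context budget is shared across them.
./build/bin/llama-server -m ./models/your-model.gguf -c 8192 --parallel 4 --port 8080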
ollama
curl -fsSL https://ollama.com/install.sh | sh
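A quick way to check that the install worked and the server is up (Ollama listens on port 11434 by default):

# The root endpoint only reports liveness; it should print something like "Ollama is running".
curl http://localhost:11434/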
- As of 2024-12-30, this also runs Ollama (i.e., there's no need to separately run `ollama serve`).
- Run `ollama run llama3`, and this opens a chat session.
  - This does not seem to remember stuff though, like my name. Hmm :think: (see the `/api/chat` sketch after this list for how chat history is passed explicitly over the HTTP API)
- Running Ollama behind a reverse proxy (like Nginx) requires setting `proxy_buffering off;` (a minimal location block is sketched after this list).
  - Caveat when using JavaScript to stream with the `fetch` API: Safari does not support consuming `response.body` as a readable byte stream, so longer-form code (e.g., a `response.body.getReader()` read loop) is needed to deal with this.
- By default, Ollama only allows requests from `localhost`. To allow serving behind a reverse proxy (like Nginx), set `Environment="OLLAMA_ORIGINS=*"` in the `systemd` configuration file (`/etc/systemd/system/ollama.service` by default on Ubuntu 24.04 LTS); one way to do this is sketched after this list.
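A minimal sketch of the `/api/chat` call mentioned above: the endpoint is stateless, so the full conversation history is sent with every request (the model name and messages here are just placeholders):

# Each /api/chat request carries the whole message history; the server does not
# remember earlier turns on its own. "stream": false returns a single JSON reply.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What is my name?"}
  ],
  "stream": false
}'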
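And a minimal Nginx sketch for the reverse-proxy note above, assuming Ollama on its default port 11434 (the server name is a placeholder):

# proxy_buffering off is the directive the note above refers to, so streamed
# tokens are forwarded as they arrive instead of being buffered by Nginx.
server {
    listen 80;
    server_name ollama.example.com;   # placeholder
    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_buffering off;
        proxy_read_timeout 300s;      # streamed generations can be long-lived
    }
}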
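Finally, one way to set `OLLAMA_ORIGINS` without editing the unit file in place is a systemd drop-in override (the service name `ollama` is what the install script sets up):

# Open an editor for a drop-in override of the ollama unit and add:
#   [Service]
#   Environment="OLLAMA_ORIGINS=*"
sudo systemctl edit ollama
# Reload and restart so the new environment takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama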