Note: Ollama is really easy to get started with.
llama.cpp
- `git clone --depth 1 https://github.com/ggerganov/llama.cpp`
- `sudo apt install nvidia-cuda-toolkit`
- `sudo apt install gcc-12 g++-12` (GCC 13+ isn't supported)
- `cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12` (needed to explicitly set the compilers)
- `cmake --build build --config Release`[^1]
- `llama-server` by default only processes a single request; use `--parallel N` to enable processing concurrent requests (see the sketch below).
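A minimal sketch of starting `llama-server` with concurrent request slots; the model path is a placeholder for whatever GGUF file you have locally:

```bash
# The built binary lands in build/bin/ after the cmake steps above.
# --parallel 4 allows up to 4 requests to be processed concurrently.
./build/bin/llama-server \
  -m models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  --parallel 4 \
  --port 8080
```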
ollama
- `curl -fsSL https://ollama.com/install.sh | sh`[^2]
  - As of 2024-12-30, this also runs Ollama (i.e., there's no need to separately run `ollama serve`).
- Run `ollama run llama3`, and this opens a chat session.
  - This does not seem to remember stuff though, like my name. Hmm :think:
- Running Ollama behind a reverse proxy (like Nginx) requires setting `proxy_buffering off;`[^3] (see the sketches after this list).
- Caveat when using JavaScript to stream with the `fetch` API: Safari does not support using `response.body` as a readable byte stream[^4], and instead this needs longer-form code to deal with it[^5].
- By default, Ollama only allows requests from `localhost`. To allow serving behind a reverse proxy (like Nginx), set `Environment="OLLAMA_ORIGINS=*"` in the `systemd` configuration file (`/etc/systemd/system/ollama.service` by default on Ubuntu 24.04 LTS); see the drop-in sketch after this list.
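A minimal Nginx sketch for the reverse-proxy setup described above; the server name and listen port are placeholders, and 11434 is assumed to be Ollama's default port (not stated in the notes above):

```nginx
# Reverse proxy in front of a local Ollama instance.
server {
    listen 80;
    server_name llm.example.com;             # placeholder

    location / {
        proxy_pass http://127.0.0.1:11434;   # Ollama's default address/port
        proxy_buffering off;                 # let streamed tokens through immediately
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```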
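A quick way to confirm that streaming works end to end (directly against Ollama, or through the proxy by swapping in its hostname); this assumes the default port 11434 and that the `llama3` model has already been pulled:

```bash
# -N disables curl's output buffering, so the newline-delimited JSON
# chunks from /api/generate appear as they arrive.
curl -N http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Say hello in one short sentence."}'
```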
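A sketch of the `OLLAMA_ORIGINS` change using a systemd drop-in, which is equivalent to editing `/etc/systemd/system/ollama.service` directly (the drop-in file name here is arbitrary):

```bash
# Add the environment override without touching the packaged unit file.
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_ORIGINS=*"\n' \
  | sudo tee /etc/systemd/system/ollama.service.d/origins.conf
# Reload units and restart Ollama so the new origin policy takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```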