Experiment: Running LLMs Locally

Note: Ollama is really easy to get started with.

llama.cpp

  • git clone --depth 1 https://github.com/ggerganov/llama.cpp
  • sudo apt install nvidia-cuda-toolkit
  • sudo apt install gcc-12 g++-12 (GCC 13+ isn’t supported)
  • cmake -B build -DGGML_CUDA=ON -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 (the compilers need to be set explicitly)
  • cmake --build build --config Release [1]
  • By default, llama-server processes only a single request at a time; use --parallel N to serve concurrent requests (see the sketch below).
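
A minimal sketch of calling a local llama-server from TypeScript, assuming it was built as above and started with something like ./build/bin/llama-server -m model.gguf --parallel 4 (model.gguf is a placeholder) and is listening on the default 127.0.0.1:8080; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint:

    // Hypothetical helper: send one chat request to a local llama-server instance.
    async function ask(prompt: string): Promise<string> {
      const res = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          // llama-server serves the model it was started with; the name here is informational.
          model: "local",
          messages: [{ role: "user", content: prompt }],
        }),
      });
      if (!res.ok) throw new Error(`llama-server returned ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    }

    ask("Why is the sky blue?").then(console.log);

With --parallel 4, up to four such requests can be in flight at once; note that the configured context size is split across the parallel slots.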

ollama

  • curl -fsSL https://ollama.com/install.sh | sh [2]
    • As of 2024-12-30, this also starts ollama (i.e., there’s no need to run ollama serve separately).
  • Running ollama run llama3 opens an interactive chat session.
  • The session does not seem to remember things like my name, though. Hmm :think: (When calling the HTTP API directly, context is kept only by resending the full message history; see the chat-history sketch after this list.)
  • Running Ollama behind a reverse proxy (like Nginx) requires setting proxy_buffering off; [3] (see the config sketch after this list).
  • Caveat when using JavaScript to stream responses with the fetch API: Safari does not support using response.body as a readable byte stream [4], so longer-form code is needed to handle this [5] (see the streaming sketch after this list).
  • By default, Ollama only allows requests from localhost. To allow serving behind a reverse proxy (like Nginx), set Environment="OLLAMA_ORIGINS=*" in the systemd configuration file (/etc/systemd/system/ollama.service by default on Ubuntu 24.04 LTS).
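
On the model “not remembering” things: Ollama’s HTTP API itself is stateless, so multi-turn memory comes from resending the whole conversation with every request. A minimal TypeScript sketch, assuming the default endpoint at http://localhost:11434 and the llama3 model pulled above:

    // Hypothetical helper: keep context by resending the full history to /api/chat.
    type Message = { role: "system" | "user" | "assistant"; content: string };

    const history: Message[] = [];

    async function chat(userInput: string): Promise<string> {
      history.push({ role: "user", content: userInput });
      const res = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", messages: history, stream: false }),
      });
      const data = await res.json();
      // Append the assistant reply so the next request carries the whole conversation.
      history.push(data.message);
      return data.message.content;
    }

    await chat("My name is Alice.");
    console.log(await chat("What is my name?")); // answers correctly because the history was resent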
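
For the Safari caveat: instead of iterating response.body with for await, the stream can be consumed through a plain getReader()/read() loop, which Safari does support. A sketch of that longer form, assuming the same local /api/chat endpoint and Ollama’s default newline-delimited JSON streaming format:

    // Hypothetical helper: stream a chat reply without async-iterating response.body.
    async function streamChat(prompt: string, onToken: (t: string) => void): Promise<void> {
      const res = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", messages: [{ role: "user", content: prompt }] }),
      });
      if (!res.body) throw new Error("No response body");

      const reader = res.body.getReader(); // works in Safari, unlike `for await (... of res.body)`
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        // Each complete line is one JSON object carrying a partial assistant message.
        let newline;
        while ((newline = buffer.indexOf("\n")) >= 0) {
          const line = buffer.slice(0, newline).trim();
          buffer = buffer.slice(newline + 1);
          if (line) onToken(JSON.parse(line).message?.content ?? "");
        }
      }
    }

    let reply = "";
    await streamChat("Tell me a short joke.", (t) => { reply += t; });
    console.log(reply);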
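
Putting the reverse-proxy pieces together, a minimal configuration sketch; the server name llm.example.com is a placeholder, and Ollama is assumed to be on its default port 11434:

    # Nginx site (e.g., /etc/nginx/sites-available/ollama): disable buffering so
    # streamed responses reach the client incrementally.
    server {
        listen 80;
        server_name llm.example.com;

        location / {
            proxy_pass http://127.0.0.1:11434;
            proxy_buffering off;
        }
    }

    # In /etc/systemd/system/ollama.service, add under [Service], then run
    # sudo systemctl daemon-reload && sudo systemctl restart ollama
    [Service]
    Environment="OLLAMA_ORIGINS=*"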