Experiment: Running LLMs Locally

Note: Ollama is really easy to get started with.

llama.cpp

  • git clone --depth 1 https://github.com/ggerganov/llama.cpp
  • sudo apt install nvidia-cuda-toolkit
  • sudo apt install gcc-12 g++-12 (GCC 13+ isn’t supported)
  • cmake -B build -DGGML_CUDA=ON -DCMAKE_CXX_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 (the compilers need to be set explicitly)
  • cmake --build build --config Release [1]
  • By default, llama-server processes only a single request at a time; use --parallel N to serve concurrent requests (see the sketch below).
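
A minimal sketch of calling a local llama-server from TypeScript, assuming it was built as above and started with something like ./build/bin/llama-server -m model.gguf --parallel 4 (model.gguf is a placeholder) and is listening on the default 127.0.0.1:8080; llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint:

    // Hypothetical helper: send one chat request to a local llama-server instance.
    async function ask(prompt: string): Promise<string> {
      const res = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          // llama-server serves the model it was started with; the name here is informational.
          model: "local",
          messages: [{ role: "user", content: prompt }],
        }),
      });
      if (!res.ok) throw new Error(`llama-server returned ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    }

    ask("Why is the sky blue?").then(console.log);

With --parallel 4, up to four such requests can be in flight at once; note that the configured context size is split across the parallel slots.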

ollama

  • curl -fsSL https://ollama.com/install.sh | sh [2]
    • As of 2024-12-30, this also starts ollama (i.e., there’s no need to run ollama serve separately).
  • Running ollama run llama3 opens an interactive chat session.
  • The session does not seem to remember things like my name, though. Hmm :think: (When calling the HTTP API directly, context is kept only by resending the full message history; see the chat-history sketch after this list.)
  • Running Ollama behind a reverse proxy (like Nginx) requires setting proxy_buffering off; [3] (see the config sketch after this list).
  • Caveat when using JavaScript to stream responses with the fetch API: Safari does not support using response.body as a readable byte stream [4], so longer-form code is needed to handle this [5] (see the streaming sketch after this list).
  • By default, Ollama only allows requests from localhost. To allow serving behind a reverse proxy (like Nginx), set Environment="OLLAMA_ORIGINS=*" in the systemd configuration file (/etc/systemd/system/ollama.service by default on Ubuntu 24.04 LTS).
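
On the model “not remembering” things: Ollama’s HTTP API itself is stateless, so multi-turn memory comes from resending the whole conversation with every request. A minimal TypeScript sketch, assuming the default endpoint at http://localhost:11434 and the llama3 model pulled above:

    // Hypothetical helper: keep context by resending the full history to /api/chat.
    type Message = { role: "system" | "user" | "assistant"; content: string };

    const history: Message[] = [];

    async function chat(userInput: string): Promise<string> {
      history.push({ role: "user", content: userInput });
      const res = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", messages: history, stream: false }),
      });
      const data = await res.json();
      // Append the assistant reply so the next request carries the whole conversation.
      history.push(data.message);
      return data.message.content;
    }

    await chat("My name is Alice.");
    console.log(await chat("What is my name?")); // answers correctly because the history was resent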
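
For the Safari caveat: instead of iterating response.body with for await, the stream can be consumed through a plain getReader()/read() loop, which Safari does support. A sketch of that longer form, assuming the same local /api/chat endpoint and Ollama’s default newline-delimited JSON streaming format:

    // Hypothetical helper: stream a chat reply without async-iterating response.body.
    async function streamChat(prompt: string, onToken: (t: string) => void): Promise<void> {
      const res = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", messages: [{ role: "user", content: prompt }] }),
      });
      if (!res.body) throw new Error("No response body");

      const reader = res.body.getReader(); // works in Safari, unlike `for await (... of res.body)`
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        // Each complete line is one JSON object carrying a partial assistant message.
        let newline;
        while ((newline = buffer.indexOf("\n")) >= 0) {
          const line = buffer.slice(0, newline).trim();
          buffer = buffer.slice(newline + 1);
          if (line) onToken(JSON.parse(line).message?.content ?? "");
        }
      }
    }

    let reply = "";
    await streamChat("Tell me a short joke.", (t) => { reply += t; });
    console.log(reply);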
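
Putting the reverse-proxy pieces together, a minimal configuration sketch; the server name llm.example.com is a placeholder, and Ollama is assumed to be on its default port 11434:

    # Nginx site (e.g., /etc/nginx/sites-available/ollama): disable buffering so
    # streamed responses reach the client incrementally.
    server {
        listen 80;
        server_name llm.example.com;

        location / {
            proxy_pass http://127.0.0.1:11434;
            proxy_buffering off;
        }
    }

    # In /etc/systemd/system/ollama.service, add under [Service], then run
    # sudo systemctl daemon-reload && sudo systemctl restart ollama
    [Service]
    Environment="OLLAMA_ORIGINS=*"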