Three things:
-
And finally, when the table from the gptel link says “API key” or token, does that basically mean I need an account on that service? (I know, basic question.)
Yes. I have a Google Workspace account and the process was fairly trivial. I think I just asked ChatGPT or Claude for instructions on how to get Gemini API access and it worked flawlessly.
Getting ChatGPT API access was a little more involved because I had to create an organization (separate from my user account), and I believe I had to do an identity verification with my phone’s camera (it was either OpenAI or Anthropic that required that; I can’t remember right now).
Claude API access was also very straightforward.
One thing to keep in mind is that (for now) I think it’s free to access Gemini via the API, but you have to buy credits with OpenAI and Anthropic.
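Once you have the keys, wiring them into gptel is only a few lines. Here’s a rough sketch based on gptel’s README; the key strings are placeholders for your own (gptel also lets :key be a function or an auth-source lookup instead of a hard-coded string):
;; ChatGPT is gptel's default backend; it only needs a key:
(setq gptel-api-key "YOUR-OPENAI-KEY")   ;; placeholder

;; Gemini and Claude get registered as additional backends:
(gptel-make-gemini "Gemini"
  :key "YOUR-GEMINI-KEY"                 ;; placeholder
  :stream t)

(gptel-make-anthropic "Claude"
  :key "YOUR-ANTHROPIC-KEY"              ;; placeholder
  :stream t)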
-
I have the same kind of questions regarding Ellama: when the author says to pull an Ollama model, what does a model contain, and why do I need to download one?
I don’t understand what you’re asking. Could you clarify?
-
What does “run locally” or “local Llamafile server” mean? Should I understand that a database of potentially enormous size will be deployed on my PC?
I’m not certain about this, so I just asked ChatGPT. Here’s my prompt and the answer it generated. Hope this helps.
Prompt:
Referencing the “Setup” table for this repository (GitHub - karthink/gptel: A simple LLM client for Emacs), what does it mean to “run locally” or what is a “Local Llamafile server”? What are the relevant system requirements or any other important considerations I should make about this?
Answer: (Note: keep in mind some of this information is specific to my environment because of ChatGPT’s memory.)
Here’s what that “Setup” table is saying in practical terms:
What “run locally” means in gptel
In gptel’s context, “run locally” means you’re pointing Emacs at an LLM server process that’s running on your own machine (e.g., Ollama, GPT4All, Llama.cpp/llamafile, Open WebUI). gptel then talks to that local HTTP endpoint (usually http://localhost:…) instead of a cloud API. The manual’s setup matrix lists these options explicitly—e.g., “Ollama running locally,” “Llama.cpp running locally,” “Local Llamafile server,” etc.
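To make that concrete, pointing gptel at an Ollama instance running on the same machine looks roughly like this (adapted from gptel’s README; the model name is just an example and has to match a model you’ve actually pulled with Ollama):
(gptel-make-ollama "Ollama"
  :host "localhost:11434"   ;; Ollama's default local port
  :stream t
  :models '(llama3.2:3b))   ;; example only; list the models you've pulled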
What a “Local Llamafile server” is
llamafile is Mozilla’s single-file distribution of an LLM (weights + runner) built on llama.cpp and Cosmopolitan Libc. When you start a llamafile with --server, it spins up:
- a local web UI on http://127.0.0.1:8080/, and
- an OpenAI-compatible API (/v1/chat/completions, etc.), entirely on your machine.
Typical bring-up looks like:
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
./Llama-3.2-3B-Instruct.Q6_K.llamafile --server # add --v2 for the newer server
# browse http://localhost:8080 or hit the API at http://localhost:8080/v1/...
The repo documents the OpenAI-compatible endpoint and shows curl examples targeting http://localhost:8080/v1/chat/completions.
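If you’d rather poke the endpoint from inside Emacs than with curl, a quick sanity check with the built-in url.el looks something like the sketch below. It assumes the default port 8080 and the /v1/chat/completions route; treat "LLaMA_CPP" as a placeholder for whatever model name your particular llamafile reports.
(require 'url)
(require 'json)
;; Sketch: send one chat request to the local llamafile server and read the JSON reply.
(let ((url-request-method "POST")
      (url-request-extra-headers '(("Content-Type" . "application/json")))
      (url-request-data
       (json-encode
        '((model . "LLaMA_CPP")   ;; placeholder model name
          (messages . [((role . "user") (content . "Say hello"))])))))
  (with-current-buffer
      (url-retrieve-synchronously "http://localhost:8080/v1/chat/completions")
    (goto-char url-http-end-of-headers)   ;; skip past the HTTP headers
    (json-read)))                         ;; parse the JSON response body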
Wiring gptel to a local llamafile
Point gptel at the local OpenAI-compatible endpoint:
(setq gptel-backend
      (gptel-make-openai "Llamafile"
        :protocol "http"
        :host "localhost:8080"
        :endpoint "/v1/chat/completions"
        :models '(LLaMA_CPP)            ;; model name as exposed by the server
        :stream t
        :key (lambda () "no-key")))     ;; llamafile doesn’t require a real key
This matches the manual’s “Llamafile — Local Llamafile server” backend entry.
System requirements & important considerations
- OS/CPU/GPU support. llamafile bundles llama.cpp with Cosmopolitan so a single executable runs on macOS, Linux, BSD, and Windows (with caveats). It includes both AMD64 and ARM64 builds internally.
- GPU acceleration on Apple Silicon. Since Cosmopolitan uses static linking, GPU support is compiled at runtime. On macOS you need Xcode Command Line Tools installed so it can build the Metal kernel (ggml-metal.m) the first time.
- Model sizes & memory. You choose a specific llamafile (e.g., 1B/3B/8B/12B) and quantization level (Q4, Q6, etc.). Download size and runtime RAM/VRAM needs scale with model/quant. The project lists example files (e.g., Llama-3.1-8B Q4 ≈ 5.23 GB on disk). Plan disk + RAM accordingly (a rough sizing sketch follows this list).
- Ports & binding. The default server listens on localhost:8080. Keep it on 127.0.0.1 unless you intentionally expose it. The server provides an OpenAI-compatible surface (chat completions; embeddings on the v2 server).
- Sandboxing behavior. llamafile adds pledge()/SECCOMP sandboxing (on some platforms). It limits what the embedded HTTP server can access after start; details are documented in the repo.
- Windows file-size limit. Windows has a 4 GB executable limit; larger models run by using external GGUF weights instead of fully self-contained executables. (Not an issue on your macOS host, but relevant cross-platform.)
- “No cloud key” workflow. Because the API is OpenAI-compatible, most OpenAI-client code (including gptel’s OpenAI backend) works by changing the base URL and using a dummy key.
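As a rough back-of-the-envelope check on the model-sizes point above: a quantized model’s file size is approximately parameters × bits-per-weight ÷ 8, plus some overhead for metadata and layers kept at higher precision. The 5.0 bits/weight figure below is my own assumption for a Q4-class quantization, not something from the llamafile docs.
;; Rough sizing estimate: GB ≈ billions of parameters × bits per weight / 8.
(defun my/rough-model-size-gb (params-billions bits-per-weight)
  "Estimate on-disk (and roughly in-RAM) size in GB for a quantized model."
  (/ (* params-billions bits-per-weight) 8.0))

(my/rough-model-size-gb 8 5.0)  ;; => 5.0, in the same ballpark as the ~5.23 GB Q4 file above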
TL;DR
- “Run locally” = use a self-hosted backend on localhost (no cloud).
- “Local Llamafile server” = start a single binary that exposes a web UI and an OpenAI-compatible API at http://127.0.0.1:8080, then point gptel at it.