Guix service for llama-server: 30 times slower than calling the same command from CLI

juanpamf · July 16, 2025, 3:25pm

I defined this service in my Guix Home configuration (basically a wrapper for llama-server --fim-qwen-3b-default):

    (simple-service 'llama-server
                    home-shepherd-service-type
                    (list (shepherd-service
                           (provision '(llama-server))
                           (start #~(make-forkexec-constructor
                                     (list "llama-server"
                                           "--fim-qwen-3b-default" "-v")
                                     #:log-file "/tmp/llama-server.log"))
                           (stop #~(make-kill-destructor)))))

The service works, but I get 1.5 t/s (tokens per second), whereas calling the command directly from the command line gives ~30 t/s.

Is my service definition wrong? Do I need to allocate more resources to shepherd?

dgr · July 16, 2025, 4:25pm

Maybe writing to stdout/stderr is slowing it down.
Can you remove -v (if it stands for verbose) and the #:log-file for testing?

juanpamf · July 22, 2025, 7:53pm

Thanks for the reply. It is still just as slow without -v and #:log-file.

juanpamf · August 6, 2025, 3:39am

Adding environment variables helps a lot, but I can’t get access to the GPU from shepherd (whereas from the CLI I can):

    (simple-service 'llama-server
                    home-shepherd-service-type
                    (list (shepherd-service
                           (provision '(llama-server))
                           (start #~(make-forkexec-constructor
                                     (list #$(file-append llama-cpp "/bin/llama-server")
                                           "--fim-qwen-3b-default"
                                           "--host" "127.0.0.1"
                                           "--port" "8012"
                                           "--verbose")
                                     #:environment-variables
                                     (list "HOME=/home/juanpablo"
                                           "XDG_CACHE_HOME=/home/juanpablo/.cache"
                                           "XDG_CONFIG_HOME=/home/juanpablo/.config"
                                           "XDG_DATA_HOME=/home/juanpablo/.local/share"
                                           "XDG_RUNTIME_DIR=/run/user/1000")
                                     #:log-file "/home/juanpablo/.local/state/log/llama-server.log"))
                           (stop #~(make-kill-destructor)))))

I’m now at approximately 10 tokens/second, but running from the CLI with the GPU gives 50 tokens/second.

dgr · August 6, 2025, 7:08am

The classic: not the same env.

Maybe ask in Guix IRC?

juanpamf · August 8, 2025, 12:12am

Since this involves an NVIDIA GPU component for acceleration, can I ask about it in the Guix IRC, or should it be the Nonguix IRC?

dgr · August 8, 2025, 7:23am

If you keep it general and ask how to figure out the difference in the environment, then the Guix IRC could work. If you focus on the specifics of the NVIDIA GPU, it should be the Nonguix channel.
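
As a rough sketch of how the difference in environment could be inspected: on Linux, the environment of a running process can be read from /proc/<pid>/environ, so the variables seen by the shepherd-started llama-server can be compared with those of the interactive shell. The helper names below (parse_environ, diff_env, print_env_diff) are made up for illustration and are not from this thread or from any Guix tool.

```python
import os


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE records found in /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if not record or b"=" not in record:
            continue  # skip the empty trailing record and malformed entries
        key, _, value = record.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def diff_env(service: dict, shell: dict) -> dict:
    """Return {name: (service_value, shell_value)} for every variable that differs.

    None means the variable is unset on that side.
    """
    names = sorted(set(service) | set(shell))
    return {
        name: (service.get(name), shell.get(name))
        for name in names
        if service.get(name) != shell.get(name)
    }


def print_env_diff(pid: int) -> None:
    """Compare the environment of process `pid` with the current one.

    Assumption: `pid` is the llama-server process started by shepherd, and
    this function is called from the shell where the fast CLI run works.
    """
    with open(f"/proc/{pid}/environ", "rb") as f:
        service_env = parse_environ(f.read())
    for name, (in_service, in_shell) in diff_env(service_env, dict(os.environ)).items():
        print(f"{name}\n  service: {in_service}\n  shell:   {in_shell}")
```

Calling print_env_diff with the PID of the service process lists every variable that is missing or different on either side. Variables that exist only in the shell are the first candidates to try adding through #:environment-variables. Which of them, if any, matters for GPU access is not established in this thread.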
