Guix service for llama-server: 30 times slower than calling the same command from CLI

I defined this service in my Guix Home configuration (basically a wrapper for llama-server --fim-qwen-3b-default):

    (simple-service 'llama-server
                    home-shepherd-service-type
                    (list (shepherd-service
                           (provision '(llama-server))
                           (start #~(make-forkexec-constructor
                                     (list "llama-server"
                                           "--fim-qwen-3b-default" "-v")
                                     #:log-file "/tmp/llama-server.log"))
                           (stop #~(make-kill-destructor)))))

The service works, but I get 1.5 t/s (tokens per second), whereas calling the command directly from the command line gives ~30 t/s.

Is my service definition wrong? Do I need to allocate more resources to shepherd?

Maybe writing to stdout/stderr is slowing it down.
As a test, can you remove -v (if it stands for verbose) and the #:log-file argument?

Thanks for the reply. It is still just as slow without -v and #:log-file.

Adding environment variables helps a lot, but I can’t get access to the GPU from shepherd (whereas from the CLI I can):

    (simple-service 'llama-server
                    home-shepherd-service-type
                    (list (shepherd-service
                           (provision '(llama-server))
                           (start #~(make-forkexec-constructor
                                     (list #$(file-append llama-cpp "/bin/llama-server")
                                           "--fim-qwen-3b-default"
                                           "--host" "127.0.0.1"
                                           "--port" "8012"
                                           "--verbose")
                                     #:environment-variables
                                     (list "HOME=/home/juanpablo"
                                           "XDG_CACHE_HOME=/home/juanpablo/.cache"
                                           "XDG_CONFIG_HOME=/home/juanpablo/.config"
                                           "XDG_DATA_HOME=/home/juanpablo/.local/share"
                                           "XDG_RUNTIME_DIR=/run/user/1000")
                                     #:log-file "/home/juanpablo/.local/state/log/llama-server.log"))
                           (stop #~(make-kill-destructor)))))

I’m now at approximately 10 tokens/second under shepherd, but running from the CLI with the GPU gives 50 tokens/second.

The classic: not the same env.

Maybe ask in Guix IRC?

Since this involves an NVIDIA GPU for acceleration, can I ask about it on the Guix IRC channel, or should it be the Nonguix one?

If you keep it general, about how to figure out the difference between the two environments, then the Guix IRC channel could work. If the focus is specifically on the NVIDIA GPU, it should be the Nonguix channel.
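
For the general part, here is a minimal sketch of one way to compare the two environments on Linux: dump the environment that the shepherd-started process received at exec time, dump the environment of the interactive shell, and diff them. The helper names `proc_env` and `env_diff` and the file names under /tmp are made up for illustration. Which variable (if any) matters for GPU access is not known from this thread, so the diff is only a starting point.

```shell
#!/bin/sh
# Dump the initial environment of a running process, one VAR=value per line, sorted.
# Linux-specific: /proc/PID/environ is NUL-separated and readable by the process owner.
proc_env() {
    tr '\0' '\n' < "/proc/$1/environ" | sort
}

# Print only the lines that differ between two sorted environment dumps.
# '<' marks lines present only in the first file, '>' only in the second.
env_diff() {
    diff "$1" "$2" | grep '^[<>]' || true
}

# Usage sketch (hypothetical file names; assumes the service is running):
#   proc_env "$(pgrep -n llama-server)" > /tmp/env-shepherd.txt   # shepherd instance
#   env | sort > /tmp/env-cli.txt                                  # interactive shell
#   env_diff /tmp/env-shepherd.txt /tmp/env-cli.txt
```

Variables that show up only on the CLI side can then be added one at a time to #:environment-variables to see which one changes the throughput.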
