• infinitevalence · 4 months ago

    Just ran it on my Framework 13 w/ Ryzen AI 9 HX 370. It's not able to use the NPU at all, which is expected on Linux. :(

    Still, getting 20 tokens/sec off the GPU/CPU is not bad, and it's 2x what I was seeing on Gemma 3 12b.

    I have a Radeon RX 9060 XT 16GB which I may try to plug in and bench too.

    This is a nice uplift in performance, but it's still not utilizing the dedicated hardware.
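
    If anyone wants to reproduce a rough tokens/sec number on their own setup, here's a minimal sketch. It assumes an OpenAI-compatible local server (LM Studio exposes one at http://localhost:1234/v1 by default) and the openai Python client; the model name and prompt are placeholders, and counting streamed chunks only approximates the predicted-token count.

    # Rough tokens/sec check against a local OpenAI-compatible server.
    # Endpoint, model name, and prompt are placeholders/assumptions.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    start = time.time()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="gpt-oss-20b",  # whatever name your local runner exposes
        messages=[{"role": "user", "content": "Write ~300 words about laptops."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.time()
            chunks += 1  # one streamed chunk is roughly one token
    total = time.time() - start

    if chunks:
        print(f"timeToFirstTokenSec ~ {first - start:.3f}")
        print(f"tokensPerSecond     ~ {chunks / total:.2f}")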

    Ryzen AI 9 HX 370 w/ 890M + 64GB, gpt-oss-20b:

    "stats": {
        "stopReason": "userStopped",
        "tokensPerSecond": 19.52169768309363,
        "numGpuLayers": -1,
        "timeToFirstTokenSec": 7.363,
        "totalTimeSec": 24.127,
        "promptTokensCount": 1389,
        "predictedTokensCount": 471,
        "totalTokensCount": 1860
    }

    Ryzen 9 3900X w/ RX 9060 XT 16GB + 128GB, gpt-oss-20b:

    "stats": {
        "stopReason": "eosFound",
        "tokensPerSecond": 21.621644364467713,
        "numGpuLayers": -1,
        "timeToFirstTokenSec": 0.365,
        "totalTimeSec": 15.124,
        "promptTokensCount": 118,
        "predictedTokensCount": 327,
        "totalTokensCount": 445
    }

    Ryzen AI 9 HX 370 w/ 890M + 64GB, gemma-3-12b:

    "stats": {
        "stopReason": "eosFound",
        "tokensPerSecond": 7.866126447290582,
        "numGpuLayers": -1,
        "timeToFirstTokenSec": 12.937,
        "totalTimeSec": 12.458,
        "promptTokensCount": 1366,
        "predictedTokensCount": 98,
        "totalTokensCount": 1464
    }

    Ryzen 9 3900X w/ RX 9060 XT 16GB + 128GB, gemma-3-12b:

    "stats": {
        "stopReason": "eosFound",
        "tokensPerSecond": 27.62905167372976,
        "numGpuLayers": -1,
        "timeToFirstTokenSec": 0.31,
        "totalTimeSec": 19.219,
        "promptTokensCount": 2975,
        "predictedTokensCount": 531,
        "totalTokensCount": 3506
    }
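
    One note on reading these blocks: in every run, tokensPerSecond works out to predictedTokensCount / totalTimeSec, and in the HX 370 gemma-3-12b run timeToFirstTokenSec is actually larger than totalTimeSec, so totalTimeSec appears to cover generation only while prompt processing shows up as timeToFirstTokenSec. Quick check with the values copied from the runs above:

    # Recompute tokens/sec from the posted stats:
    # (predictedTokensCount, totalTimeSec, reported tokensPerSecond)
    runs = {
        "HX 370 gpt-oss-20b":  (471, 24.127, 19.52),
        "9060 XT gpt-oss-20b": (327, 15.124, 21.62),
        "HX 370 gemma-3-12b":  (98, 12.458, 7.87),
        "9060 XT gemma-3-12b": (531, 19.219, 27.63),
    }
    for name, (predicted, total_sec, reported) in runs.items():
        print(f"{name}: {predicted / total_sec:.2f} tok/s (reported {reported})")
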
    • hendrik@palaver.p3x.de · 4 months ago

      Thanks for the numbers. Btw, I think an NPU can’t run large language models in the first place. They’re meant for things like blurring the background in video conferences, helping with speech recognition, or similar very specific, smaller tasks. They only have some tens or hundreds of megabytes of memory, so an LLM/chatbot won’t fit. The main thing that makes LLM inference faster is memory (RAM) bandwidth and speed.
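
      To put rough numbers on the bandwidth point: each generated token has to read the model's active weights from memory once, so tokens/sec is capped at roughly memory bandwidth divided by the bytes read per token. A back-of-envelope sketch with assumed, illustrative figures (dual-channel DDR5-5600 around 90 GB/s, an RX 9060 XT around 320 GB/s, rough 4-bit model sizes), not measurements:

      # Ceiling estimate: tokens/sec <= bandwidth / bytes of weights read per token.
      # All figures below are assumptions for illustration, not measured values.
      def max_tok_per_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
          return bandwidth_gb_s / weights_read_gb

      # gemma-3-12b at ~4-bit is roughly 7-8 GB and dense, so all weights are read per token.
      print(max_tok_per_s(90, 7.5))    # ~12 tok/s ceiling on ~90 GB/s system RAM
      print(max_tok_per_s(320, 7.5))   # ~43 tok/s ceiling on a ~320 GB/s dGPU

      # gpt-oss-20b is a mixture-of-experts model, so only the active experts
      # (a few GB at 4-bit) are read per token, which fits it being faster than
      # the smaller dense model on the same memory bus.
      print(max_tok_per_s(90, 2.5))    # ~36 tok/s ceiling; real runs land well below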

      • infinitevalence · 4 months ago

        I added a few more numbers. I may pull my old MI25 out of mothballs and bench that too.