Inference unleashed: Power and ease of use combined in TensorWave’s MI300X servers

[image6: header image]

Remember the last time you tried to set up a GPU server for AI inference? If you're like most ML engineers, it probably involved hours of wrestling with drivers, CUDA configurations, and compatibility issues — all before you could even think about running your models. That's been the status quo for years: powerful hardware wrapped in layers of complexity that steal your time and attention away from what really matters.

And absolutely nobody likes getting trapped in tasks that are essentially just undifferentiated heavy lifting!

This is where TensorWave's MI300X servers come in, and they're rewriting the rules in two important ways. First, with a staggering 192GB of memory per card (that's 2.4x what you'll find in NVIDIA's H100), they're making it possible to run full-weight 70B parameter models on a single accelerator — something that fundamentally changes the game for inference deployments and allows for options and creativity. 

Second, and perhaps just as importantly, they've eliminated the traditional setup headaches that have long been the tax we pay for high-performance computing.

Getting started: time to value.

One of the more difficult things about getting started with a giant GPU server is drivers. If you spin up a GPU-powered cloud server from AWS or Azure, for example, you’ll get a decent GPU (assuming you’re lucky enough to grab one; GPU servers in the cloud get gobbled up fast), but the time to value sucks. On their most common GPU-enabled instances, you’ll be doing the driver and CUDA installations yourself.

That can take a while, especially if you are an LLM tinkerer or developer rather than an expert in crazy Linux compute-driver issues, and you’d really rather not be forced to become one.

I totally get it. I have to do that task every now and then, and while I've had a couple decades of Linux experience, I still don't love it. Not because it’s an insurmountable problem, but because it can be tricky and ends up wasting too much time on what is ultimately not the fun or valuable part of the work.

The TensorWave MI300X server was the complete opposite experience. After my first SSH into the server (running on Ubuntu 22.04 LTS), I was happy to see that TensorWave had already done all the heavy (driver) lifting. 

[image1: rocm-smi output from the first SSH session, all cards detected]

The image above shows a re-enactment of my first SSH session. I used rocm-smi (AMD’s version of the nvidia-smi tool) and it just worked — all cards detected, drivers already loaded.

Without having to wrestle Linux drivers, I can immediately get to work on stuff I actually want to do in this box: run LLMs! This is a fantastic user experience and incredible time to value.
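
If you want to poke around yourself, a couple of quick rocm-smi checks go a long way. This is just a sketch; flag names can vary slightly between ROCm releases:

rocm-smi                        # overview: all cards, temperatures, utilization
rocm-smi --showproductname      # confirm which cards are installed
rocm-smi --showmeminfo vram     # per-card VRAM totals and usage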

Time to inference: TGI 3.0.

While all the GPU driver and ROCm (AMD’s answer to CUDA) configuration is taken care of by TensorWave, that still leaves the choice of inference engine to me. For a quick start, I tried Text Generation Inference (TGI) by Hugging Face:

model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
 --device=/dev/kfd --device=/dev/dri --group-add video \
 --ipc=host --shm-size 256g \
 -p 8086:80 \
 -v $volume:/data \
 ghcr.io/huggingface/text-generation-inference:3.0.1-rocm \
 --model-id $model

And it just worked! That’s just a slightly tweaked version of the CLI command from the Hugging Face docs, and resulted in the TGI server being online and ready:

[image4: TGI startup logs showing the server online and ready]

Alright, the server is ready, but does it actually work? We can try a quick curl CLI command to send an inference request to our server:

curl localhost:8086/generate \
-X POST \
-d '{"inputs":"Who were the first five presidents of the USA?","parameters":{"max_new_tokens":1000}}' \
-H 'Content-Type: application/json'

Which resulted in:

[image5: TGI response to the curl request]
It works!

Now that’s time to value and time to inference. With basically zero effort, we created an inference endpoint using community darling TGI. We just SSH’d into the server, executed a Docker command, and boom! Achievement unlocked: a 100% working inference endpoint.

Now, what about even more performance using vLLM?

vLLM for maximum performance.

In my tinkering, there are a few ways you can run vLLM on TensorWave’s MI300X servers:

  • There’s a prebuilt AMD image you can use, just like the TGI experience in the previous section (a quick sketch of this route follows after this list)
  • You can also just build your own Docker image from AMD’s vLLM fork
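
If you want the prebuilt route, it’s essentially a docker pull away. The image name and tag below are illustrative (AMD publishes ROCm vLLM images on Docker Hub), so check AMD’s docs for the current one:

# Illustrative image name/tag; check AMD's Docker Hub listings for the current release
docker pull rocm/vllm:latest
# Then reuse the docker run command shown below, with IMAGE_NAME set to rocm/vllm:latest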

I tried both ways, and while both work fine, I found that the best performance comes from building your own Docker image from AMD’s vLLM fork (mainly because it has newer ROCm and vLLM components). The instructions are the same as in the official vLLM docs, but instead of cloning the upstream vLLM repo, you clone AMD’s fork:

# Using AMD's vLLM fork
git clone https://github.com/ROCm/vllm.git
cd vllm
git checkout tags/v0.6.6+rocm  # or check out whatever the latest release is by then
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t amd-vllm-rocm .

That’s it, easy-peasy. Note that the build process can take a while because it’s a huge image.
Once that’s built, we can use the resulting amd-vllm-rocm image to run a vLLM container that serves our chosen model:

export volume=$PWD/data
IMAGE_NAME=amd-vllm-rocm  # Image built from AMD's vLLM fork
export HF_TOKEN=my-secret-huggingface-token
export MODEL=Qwen/Qwen2.5-72B-Instruct
docker run -it \
--network=host \
--env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v $volume:/root/.cache/huggingface \
$IMAGE_NAME \
vllm serve $MODEL --tensor-parallel-size=8
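
Once the container is up, you can sanity-check it the same way we did with TGI. vLLM exposes an OpenAI-compatible API, by default on port 8000 (reachable directly here because of --network=host); adjust the port if you pass --port:

curl http://localhost:8000/v1/chat/completions \
 -H 'Content-Type: application/json' \
 -d '{
  "model": "Qwen/Qwen2.5-72B-Instruct",
  "messages": [{"role": "user", "content": "Who were the first five presidents of the USA?"}],
  "max_tokens": 1000
 }'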

MI300X and its memory capacity: What can such a large memory pool give you?

One of the flagship features of AMD’s MI300X accelerator is memory — it has large amounts of it.

The competing H100 from Nvidia has a decent 80GB of accelerator memory — that’s way above consumer cards. The 4090 has only 24GB, for example, and the recently released 5090 a mere 32GB.

The MI300X, on the other hand, has 192GB! That’s 2.4x the memory.
And if you stacked up 8xH100 against 8xMI300X, you’d be comparing 640GB of total accelerator memory on the 8xH100 side with a staggering 1,536GB (1.5 terabytes!) on the 8xMI300X side.

Aside from being able to fit larger models with more context into your inference server, you can actually do something unique on the MI300X box that you simply can’t on the H100: run full-weight, 70B-class models on a single card.

Yes, with the absurdly huge memory pool in a single MI300X card (more memory than 2 H100 cards), we can comfortably run a nice powerful model like Llama 3.1 70B or, my favorite in this weight class, Qwen2.5 72B!
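
Here’s the back-of-envelope math as a quick sketch (weights only, BF16; a real deployment also needs headroom for the KV cache and runtime overhead):

# ~72e9 params * 2 bytes/param (BF16) ≈ 144 GB for the weights alone:
# too big for one 80GB H100, but a comfortable fit in a single 192GB MI300X
awk 'BEGIN { printf "~%.0f GB of weights\n", 72e9 * 2 / 1e9 }'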

Eight single-card servers versus one eight-way server.

Our TensorWave box came loaded with 8x MI300X, so with 1.5TB of total memory we could, of course, easily fit the biggest models (we even ran current darling DeepSeek V3, with its staggering number of parameters — over 600 billion — on it using SGLang).
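
For the curious, here’s a minimal sketch of what an SGLang launch for DeepSeek V3 looks like (not the exact command we used; the image and flags will depend on your SGLang version and ROCm setup):

# Inside a ROCm-enabled SGLang environment (illustrative; check the SGLang docs for ROCm specifics)
python3 -m sglang.launch_server \
 --model-path deepseek-ai/DeepSeek-V3 \
 --tp 8 \
 --trust-remote-code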

But one thing you can also do is trade a little of your maximum throughput for a drastic improvement in latency, by serving your model in a single-card configuration.

We ran the same workloads in our TensorWave box using two configurations:

  • A single vLLM Docker container, all 8 MI300X cards, with tensor parallelism = 8.
  • 8 independent vLLM Docker containers (on different ports), each with a single MI300X card (a sketch of this setup follows below).
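
Here’s a minimal sketch of that second configuration, assuming the amd-vllm-rocm image we built earlier. The GPU pinning via HIP_VISIBLE_DEVICES and the port scheme are assumptions to adapt to your own setup:

export HF_TOKEN=my-secret-huggingface-token
export MODEL=Qwen/Qwen2.5-72B-Instruct
# Launch one single-card vLLM server per MI300X, on ports 8000-8007
for i in $(seq 0 7); do
 docker run -d \
  --name vllm-card-$i \
  --network=host \
  --env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  --env HIP_VISIBLE_DEVICES=$i \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v $PWD/data:/root/.cache/huggingface \
  amd-vllm-rocm \
  vllm serve $MODEL --port $((8000 + i))
done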

Having some tensor parallelism helps increase max throughput, but at the expense of latency due to the overhead of inter-card communication:

[image3: latency comparison, single-card vs. eight-way tensor parallelism, Qwen2.5-72B-Instruct]

The above figures were taken using Qwen2.5-72B-Instruct. You can see that serving this big 72B-parameter model on a single card can cut latency (time to first token) by almost 50% in some cases, and in most cases it runs at roughly 75% of the eight-way configuration’s latency (a significant 25% reduction).

Now, latency is not necessarily the be-all and end-all of LLM inference. It depends a lot on your use case for it. Sometimes latency will matter (user experience, time to first token), but sometimes you want raw, maximum throughput. That’s fine.

What matters here is that TensorWave’s MI300X servers give you that choice. You can configure your 8xMI300X machine as a single LLM server that uses all the cards via tensor parallelism and maximizes raw throughput. Or you can split it into single cards for latency-sensitive use cases.

You can even get creative with the split. For example, you could run five LLM servers: one Docker container with 4xMI300X for more throughput, and four smaller containers, each with 1xMI300X, for the best latency.

You have this choice, and you get to be creative, because of the sheer memory capacity in each accelerator. This would be flat-out impossible on an H100 system, because a single H100 card has nowhere near enough memory to even load a 70B-parameter-class model.

Introducing Kamiwaza.

You might be wondering how all this power could be channeled in a way that can effectively boost productivity in an enterprise setting.

You aren’t alone. Here at Kamiwaza, our mission is to help customers on their genAI journey and unlock efficiencies and methodologies that were unimaginable before the rise of LLMs.

One of the very many ways we enable customers in the age of generative AI is with AI agents — LLMs equipped with tools and the know-how to use those tools to achieve specific tasks given to them by humans ad hoc (as in through a chat interface) or on a fixed schedule (as in triggered by a cron job).

Here’s a demo of one such AI agent in action, running a full-weight Qwen2.5 72B Instruct model:

[VIDEO: AI agent demo running a full-weight Qwen2.5 72B Instruct model]

In the video above, you can see a custom AI agent in action. (Notice how fast a full-weight 72B model was running! That’s thanks to the MI300X — we recorded that particular demo running on TensorWave’s beastly server.)

A single user instruction from us sends the agent on its merry way:

  • [Git] Cloning a private demo repo from Kamiwaza AI
  • [Filesystem] Copying a file from the calc folder into the cloned repo
  • [Python] Executing arbitrary Python code to get the current date and time
  • [Coding/editing] Analyzing calculator.py, finding the bug, and fixing it
  • [Git] Committing and pushing the changes
  • [Coding and Python] Creating a set of tests for the calculator, then running those tests to confirm that things work as expected

That is agentic AI in action — it didn’t just spit out a response to a human command (that’d just be a vanilla LLM chatbot). Instead, it received instructions and then autonomously and sequentially used the tools it had to achieve all of its goals.

You see, having a working AI agent is just the first step in a very difficult enterprise AI journey. That demo is just a simple web frontend and a few Python files that enable inference and tool calling.

What it lacks are key features that enterprises require:

  • Authentication and authorization, plus SAML integration for federated access
  • Connections to enterprise data. What good is an agent if it can’t reach out to your vast enterprise data? The Kamiwaza platform simplifies this through our built-in embedding and vector database solutions, enabling near-instant RAG functionality.
  • Secure and fine-grained control over AI agent data access. It’s not enough that agents can connect to enterprise data — their access permissions must also be manageable and controllable in a sane way, so that enterprise users can’t suddenly access data they normally couldn’t.

These are just some of the key features that the Kamiwaza platform offers. We simplify the genAI journey so that our customers can get started on reaping the benefits ASAP, instead of getting stuck tweaking different portions of complex LLM infrastructure.

We’ve been having fun with Gen AI. And you can, too!

We at Kamiwaza AI are having so much fun AND doing real work with our TensorWave box. Working with the hardware has been a blast, and the support from the folks at TensorWave has been awesome.

We’ve been using their MI300X machine in the Gen AI demos we’ve been giving throughout the US. In one of the more recent ones, we even started showing off a sort of tachometer that displays the vLLM engine’s tokens-per-second figures in real time while an agentic demo is underway:

[image2: real-time tokens-per-second readout during an agentic demo]

(The figures above are not from any of the live demos; they are indicative numbers from our internal maximum-load tests on the MI300X.)

Suffice it to say, we were incredibly happy when a totally-not-planted audience member shouted out, “What hardware are you even on that you were getting all those tokens?”

It’s absolutely amazing when you see an AI agent being autonomous and outputting text, tool calls, and analysis at blazing speeds.

If you haven’t experienced this yourself, talk to our friends at TensorWave. Get early access to TensorWave’s managed inference service on AMD MI300X and unlock up to $100k. You’ll be glad you did!

And if you need help with unlocking genAI, especially for agentic uses and automation and just overall transforming your org into an AI-powered enterprise, reach out to Kamiwaza!