Interview: PEAK:AIO says on-prem small language model (SLM) AI training is practical for specific subject areas because it serves locally stored data to GPU servers.
Contrary to a common idea that Gen AI model training needs access to hundreds if not thousands of GPUs, PEAK:AIO argues that this level of GPU support may be necessary for generalized large language model (LLM) training but is definitely not needed when training models for specific subject areas - such as interpreting scans in the health market. And it has customers doing exactly this.
This became clear during a briefing with UK-based founder Mark Klarzynski and US-based president and CEO Roger Cummings. The big-picture view is that PEAK:AIO developed and proved its technology in the UK and is now taking it to the much larger US market, where it is growing fast. The software-based technology features a rewritten NFS stack and Linux RAID code that cut latency and raise the speed of NFS data transfers to a GPU server. In effect, a PEAK:AIO user gets parallel file system access speed from standard NFS, without deploying a complex HPC-style parallel file system or potentially having to refactor NFS-using applications.
PEAK:AIO's software can get a small 1RU server with a PCIe 5 bus sending 80GB/sec of data to a single GPU client server for AI processing.
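For rough context, a back-of-envelope sketch (using my own approximate PCIe figures, not numbers from PEAK:AIO) shows why hitting 80 GB/sec from one box implies more than one x16 network card:

```python
import math

# Back-of-envelope check with illustrative figures, not PEAK:AIO's published numbers.
# PCIe Gen 5 runs at 32 GT/s per lane; after 128b/130b encoding and protocol
# overhead, usable throughput is roughly 3.8 GB/s per lane.
GB_PER_SEC_PER_LANE = 3.8   # approximate usable GB/s per PCIe Gen 5 lane
LANES_PER_SLOT = 16         # a full-width x16 slot, e.g. hosting a 400Gb NIC

slot_bw = GB_PER_SEC_PER_LANE * LANES_PER_SLOT   # ~61 GB/s per x16 slot
target = 80.0                                    # GB/sec quoted to one GPU client

slots_needed = math.ceil(target / slot_bw)       # at least two x16 NIC slots
print(f"~{slot_bw:.0f} GB/s per x16 slot -> {slots_needed} slots for {target} GB/s")
```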
We started out the interview session by asking if PEAK:AIO's technical lead is sustainable.
Blocks & Files: PEAK:AIO is basically selling an NFS box built on commodity servers and so competes with every other NFS-based file system. Then you've got this highly specific, AI-focused data feeding, which gives you entry into a niche. However, there's an issue in that many of the large file storage companies support GPUDirect and say they pump data fast to GPU servers. Then along you come with PEAK:AIO and say, yes, but we pump it twice as fast, three times as fast, five times as fast. Is this a sustainable advantage that you've got?
Second, Hammerspace has just come up with its Tier 0 concept, which is feeding data from external storage into a GPU server's locally attached NVMe SSDs, and then feeding data from them to the GPUs at much higher speed than GPUDirect. How does PEAK:AIO's technology relate to this Tier 0 idea of Hammerspace?
Mark Klarzynski: Yes, we just made everything fast by getting close to the actual hardware, close to the drives, close to the Mellanox stack, all that sort of stuff. And that makes us super fast, which also makes us super fast for people like Hammerspace. I've known David Flynn for many, many years, and we made a specific version for their Flex Files [see bootnote].
We actually rewrote some parts of the NFS stack so that we work just that bit better. Because, of course, Hammerspace ... still needs NFS to scale, and they don't really do that. They rely on other people. ... So we have a version, and they've just released [software which] actually led to a mini white paper with us, a solution with us. We work a lot with Hammerspace.
What I would say about the tier 0 idea is that, actually, we did this three years ago, and it got messy. It sounds really great, okay, but it's not replacing GPUDirect. It's not bad, and maybe they're doing it better. They're actually saying that, when you buy a DGX or HGX, you've got some storage in it, why not use it? And they do, but then include it in a Hammerspace layer. That's actually part of the kernel stack anyway. The difficulty with it is the way Nvidia works ... it gets messy because that's on a RAID server.
The entire Tier 0 storage has one purpose for the GPUs, which is scratch-based. ... Maybe they've done this better, but Ceph has been doing this for a while. We did it in the early days, and what we decided to do was to let Nvidia use their local storage for what they designed it for. And, to be honest, we don't see DGXs going out with that much local storage. Normally, we've got a handful of terabytes, so pulling it into the stack is not really adding much.
Maybe, if you've got 10,000 GPU servers, [it] makes a difference, but with the people that we deal with, it became more of a headache, because you're constantly dealing with Nvidia doing updates and playing around with their RAID server.
If you make that part of your virtualized stack, it gives you problems because every time Nvidia makes an update, you then have to work with your customers to decide whether you're going to implement it or not.
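To make the Tier 0 pattern under discussion concrete, here is a minimal read-through staging sketch, assuming hypothetical mount paths; it illustrates the general idea of using the GPU server's local NVMe as scratch in front of shared NFS storage, and is not Hammerspace's or PEAK:AIO's implementation:

```python
import os
import shutil

# Conceptual sketch of the "tier 0" pattern discussed above: stage data from
# shared NFS storage onto the GPU server's local NVMe scratch before training
# reads it. Paths are hypothetical; this is not any vendor's actual code.
NFS_MOUNT = "/mnt/shared_nfs"     # hypothetical shared storage mount
LOCAL_SCRATCH = "/raid/scratch"   # hypothetical local NVMe RAID on the DGX/HGX

def stage(relative_path: str) -> str:
    """Return a local copy of a file, pulling it from NFS on first access."""
    local_path = os.path.join(LOCAL_SCRATCH, relative_path)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        shutil.copy(os.path.join(NFS_MOUNT, relative_path), local_path)
    return local_path

# A training loop would open stage("dataset/shard-0001.tar") instead of the NFS
# path. The operational cost raised above is that this scratch copy then has to
# be managed alongside Nvidia's own updates to the local RAID setup.
```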
Blocks & Files: Do you work with GPUDirect?
Mark Klarzynski: We work amazingly on that. What separates us from every other NFS server is not just that we're rocket fast and very focused on that GPU world. We realized that energy density was a bigger problem, and getting what would traditionally be in, say, six or ten RU down to 2RU was what we wanted to do for pricing, which also meant that we reduced the energy and increased the density.
We've been working with Solidigm for the last year to get six petabytes in 2RU and doing some quite funky stuff on the byte path. What we've really been doing with that is not just using the endurance and speed of SLC to buffer QLC [in the drive]. We're actually using it to stage data so that we can power down the QLC and reduce the power needed.
When you've got a few hundred or thousands of these drives, most of the time they're in an idle state. That actually makes a dramatic difference to the amount of power. And just about every datacenter we walk into these days is really struggling with power. Because the one thing GPUs do is get bigger and demand more power, and there's no stopping the GPU, so it's only the storage that can come down on power consumption.
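As a rough illustration of that fleet-level power argument, here is a small arithmetic sketch; the per-drive wattages and duty cycle are assumptions of mine, not figures from PEAK:AIO or Solidigm:

```python
# Rough fleet-level illustration of the power argument above. The per-drive
# wattages and duty cycle are assumptions for the arithmetic, not vendor figures.
ACTIVE_WATTS = 20.0    # assumed draw of a busy enterprise QLC NVMe drive
IDLE_WATTS = 5.0       # assumed draw when the drive can sit idle
DRIVE_COUNT = 500      # "a few hundred" drives, per the interview
ACTIVE_FRACTION = 0.1  # assume staging lets drives stay idle 90% of the time

always_active_kw = DRIVE_COUNT * ACTIVE_WATTS / 1000
mostly_idle_kw = DRIVE_COUNT * (ACTIVE_FRACTION * ACTIVE_WATTS
                                + (1 - ACTIVE_FRACTION) * IDLE_WATTS) / 1000

print(f"all-active: {always_active_kw:.1f} kW vs staged: {mostly_idle_kw:.1f} kW")
```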
I'd love to say this was an amazing strategy of mine, but, in honesty, it just sort of got organically adopted. We found ourselves in large labs in the USA, like Los Alamos and other national labs. You know, we've worked with HPE, we're working with Dell, etc., and that's why Roger came on, because that was certainly beyond my scale. I'm sort of the wacky guy at the back end who just comes up with ideas.
Blocks & Files: I've got the impression from some of the material I've read about PEAK:AIO that a typical customer has a single-digit number of GPU servers, not thousands of GPUs. And that suggests to me that you're used mainly for AI inferencing rather than AI training.
Mark Klarzynski: That would seem the case. But let me give you some examples. Possibly the largest medical installation in the world, King's College London, is running on six DGXs; similarly, the Serological Society of London has only two at this moment. Most of those really large DGX clusters are in hyperscalers. ... I can think of two clients that I know of that genuinely have a SuperPOD themselves, and that's not our play.
We distribute through PNY, who are one of Nvidia's main distributors. Something like 87 percent of their customers are below 10 DGXs. A handful are bigger, and most of those are GPU-as-a-service type people.
What we really see, and this has been our focus, is AI going more horizontal than up. So, for instance, when we think about King's College London, and I'll stay with healthcare just as an example, they create an amazing model on brain scans. That spurs another five institutions to do their [own] specific training on that model.
Blocks & Files: This is lower scale training than 100,000 GPUs building a new OpenAI version. I didn't realize this was going on.
Mark Klarzynski: It's gigantic. So, we're in the UK. Every university that I know has probably ten projects, if not more, that have all got less than ten DGXs and they're training models.
Roger Cummings: This week at the Supercomputing Show was very eye-opening for us. Now that I'm on board and talking to some of the folks in the US, they're desperately searching for a solution that can help them with the 20-or-fewer GPU kind of scenario. We talked to one research institution that was supporting 1,000 or 1,100 labs, and each one of them had maybe one or two GPUs, but there was not a solution he could offer them.
Blocks & Files: Is this like training models for specific niches? This is not massive, generalized, OpenAI ChatGPT-type training. This is specific?
Roger Cummings: Yes. Staying with healthcare, that is the work we're doing on MONAI with the organizations working in medical imaging. You can imagine having data from an MRI machine and running algorithms against that. There are a lot of applications for that. It is a horizontal use case across many, many healthcare and life science institutions. That's a great example of the kind of horizontal growth that Mark is referencing.
***
Part 2 of this interview will follow in a second article.
Bootnote
Hammerspace uses NFS v4.2 and its client supports parallel NFS (pNFS) through the Flex Files feature, which provides parallel read and write access to the data servers (cluster nodes).