how to run a gpu cluster and what to consider
building a saas business is great, isn't it? you charge money for something you've built, you pay for a server and a database somewhere, and you're good with 95% profit margins and a decent life in a hammock on a beach in sri lanka once you cross $3k MRR.
in the generative AI space, this saas paradigm has shifted to something else. much of the product involves interfacing with gpu's, and gpu's can be really expensive, so the profit margins are much slimmer.
we've tried all kinds of gpu providers with neural frames and in this post i'll go through what we've learnt while doing so.
0. serverless vs instances
nowadays there are many service providers which offer "serverless" inference of AI scripts, such as modal or runpod.
i don't have a lot of experience with those. they typically have something like a "cold-start" time, i.e. their servers need to spin up your docker container in order to run it, which can take anywhere from 2 to 200 seconds, depending on the size of your container.
for neural frames, our containers are kind of big, because we offer many AI models, so the cold-start time can be too long for serverless to be a nice solution without heavy engineering.
so we went with running our own instances right away. this means we kind of have permanent computers running in the cloud, and if more people are rendering, an autoscaling system increases the number of machines dynamically. ideally, we always have some buffer machines running, so that people never have to wait for gpu's. but of course the buffer gpu's cost money, so you want to keep this buffer as small as possible.
optimizing all of this took a lot of engineering, especially as i basically didn't know anything about any of this.
nowadays i really like our setup and am proud of it. i think we optimized it quite a bit: we wrote our own autoscaler that checks every 20 seconds how many people are rendering, predicts in a semi-smart way how many gpu's we should have, and then acts accordingly.
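to make that concrete, here is a stripped-down sketch of the scaling logic in python. it is not our production code, and the buffer size and renders-per-gpu numbers are just illustrative:

```python
# simplified sketch of the autoscaler idea: every ~20 seconds, count how many
# people are rendering and compute how many gpu machines we should be running,
# keeping a small warm buffer so nobody has to wait for a gpu.
import math

BUFFER_GPUS = 2        # idle machines kept warm (costs money, saves waiting)
RENDERS_PER_GPU = 1    # concurrent renders one machine can handle

def desired_gpu_count(active_renders: int) -> int:
    """gpus needed for the current load, plus the warm buffer."""
    return math.ceil(active_renders / RENDERS_PER_GPU) + BUFFER_GPUS

def scale_decision(active_renders: int, running_gpus: int) -> int:
    """positive: launch this many machines, negative: terminate, zero: do nothing."""
    return desired_gpu_count(active_renders) - running_gpus

# example: 5 people rendering, 4 machines running -> launch 3 more
print(scale_decision(active_renders=5, running_gpus=4))
```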
1. use cloud credits
the biggest cloud services in the world are aws, google cloud, and azure. those are very corporate. aws alone has hundreds of services with strange names, and it is sometimes really difficult to understand how they interact with each other. on the other hand, it is very well documented, chatgpt knows everything about it, and both their python API and their stability are unmatched.
the big cloud providers are also very generous with credits, if you find the right people to talk to. i probably shouldn't brag about this, but i've spent a total of $280k in cloud credits on aws and google cloud. both are great platforms and work really well. with these credit programs, they often give you a dedicated account executive who has very good technical knowledge too, or at least helps you find someone who can answer your question.
when our google cloud credits ran out, aws offered a very generous package of credits for us to migrate to them. so we did. this migration was probably one of the most painful things i've ever done. all our systems & scripts were designed for google cloud, i had no idea how anything worked on aws, and i was alone in thailand with a 12hr time difference to the US. i didn't really think of stress testing the new cloud setup, so i just stress tested in production. things kept breaking for a week or so. painful. but it kept us going for another year without needing to pay for compute.
many startup programs offer cloud credits, such as nvidia inception & ycombinator startup school. once you are with one platform and you have some users, you can leverage this and go to the next platform and say "hey, we’d love to run our service in your cloud instead. what can you do for us?".
2. gpu types
there are server gpu's and consumer gpu's. i think server gpu's are more reliable and power-efficient, but also MUCH more expensive than consumer gpu's. on aws, we ran on one type of those server gpu's as spot instances (spot instances can be randomly terminated at any time, but cost less money).
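for reference, launching a gpu spot instance on aws looks roughly like this with boto3 (a minimal sketch, not what we actually ran; the ami id and instance type are placeholders):

```python
# minimal sketch: request a single gpu spot instance on ec2 via boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder: your gpu-ready ami
    InstanceType="g5.xlarge",          # placeholder: whichever gpu instance type you use
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",          # spot = cheaper, but can be terminated at any time
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```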
but then we found something crazy: we could spend far less per hour on a consumer gpu and run inference 5x faster! so we migrated to runpod, a really nice gpu provider running their own consumer gpu data centers, and we've never looked back.
there are also other consumer gpu providers out there, and i am not getting paid to mention runpod here, but they're nice people.
3. api vs self-hosted
api providers such as fal are powerful, cost-effective, and give you access to many ai models via a standardized api.
you can just run ai inference there, and never worry about infrastructure in your life.
but of course, this comes with the drawback of not having your own inference script at your disposal.
for instance, if you wanted to hack a text2video model to do a certain type of thing, you could not use fal for it, you'd need to self-host.
but if you just want to use one of the existing ai models, relying on an external api like that is actually quite nice. you pay a premium, of course, but you also never need to worry about deployment and autoscalers again, plus you can rely on them to push inference times down better than most of us ever could.
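in practice, "just calling an api" looks something like the sketch below. the endpoint, payload fields and response field are made up for illustration; every provider (fal included) has its own client library and schema, so check their docs:

```python
# hedged sketch of calling a hosted text2video inference api over plain http.
# the url, payload and response shape are hypothetical placeholders.
import os
import requests

API_URL = "https://api.example-provider.com/v1/text2video"   # placeholder endpoint

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['PROVIDER_API_KEY']}"},
    json={"prompt": "a neon city dissolving into paint", "num_frames": 48},
    timeout=300,
)
resp.raise_for_status()
print("rendered video at:", resp.json()["video_url"])   # hypothetical response field
```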
we are spending around 30% of our revenue on server costs. that also includes storage and database costs, but it is mostly raw gpu cost. we have been profitable for a few months now, which is really nice (just in time, as our credits ran out).

our q2 2024 cloud costs. make it rain 💸
4. stability & quotas
i just want to mention something which might be obvious but is important to consider, especially as startups become bigger.
for a service provider, it is super important to have a stable and reliable supply of gpu's, i.e., when you want to deploy more gpu's, you should actually be able to do so.
when neural frames first went viral on hackernews, i was only able to deploy 4 gpu's at once, thanks to service quotas on aws. i then needed to file a request, and they increased the limit to 8 gpu's, which was still far from enough (i probably would have needed something like 50 back then).
this was at a time when gpu shortages were very much a reality. thankfully, that strain has lifted a bit. still, service quotas are real, and often you need to earn the trust of a cloud provider by spending money (or credits) with them before they let you spend more.
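if you are on aws, it can help to have your autoscaler know its own ceiling by asking the service quotas api. a rough sketch; the quota code below is a placeholder, so look up the real one for your gpu instance family in the service quotas console:

```python
# sketch: read the current ec2 quota for gpu instances so the autoscaler
# never tries to launch more than aws will allow.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

resp = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-XXXXXXXX",   # placeholder: quota code for your gpu instance family
)
# note: ec2 instance quotas are typically counted in vCPUs, not instances
print("current gpu quota:", resp["Quota"]["Value"])
```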
it also has happened that there simply were no gpu's available when we needed them, even though we did have the service quotas. this really sucks and has happened on all cloud providers we've been at, i.e. google cloud, aws and runpod.
it is a good idea to have one or two backup gpu types in your autoscaling script, so it can deploy an instance with a fallback gpu if the original deployment fails.
some people would even argue that it's a good idea to have a backup cloud provider in case one fails, but we've not gone that far yet.
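a minimal sketch of that fallback idea; the gpu names and the launch wrapper are hypothetical, so plug in whatever your provider's deploy call looks like:

```python
# try the preferred gpu type first and fall back to the next one if the
# provider has no capacity. launch_instance() is a hypothetical wrapper
# around your cloud provider's deploy call, assumed to raise on failure.
GPU_PREFERENCE = ["RTX 4090", "RTX A5000", "A40"]   # placeholder type names

def launch_with_fallback(launch_instance) -> str:
    errors = []
    for gpu_type in GPU_PREFERENCE:
        try:
            return launch_instance(gpu_type)    # returns an instance id on success
        except Exception as err:                # e.g. "no capacity available"
            errors.append(f"{gpu_type}: {err}")
    raise RuntimeError("no gpu capacity for any type: " + "; ".join(errors))
```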
5. diy cloud
one thing we've been thinking about more and more is building our own data center.
for comparison, a 4090 on runpod costs around $500/month. this means you spend the cost of a new 4090 within about 5 months of renting one (and runpod is cheap!).
but of course, running your own data center is a huge engineering lift: you need to make sure everything works at all times (cooling!), there are electricity costs (especially in germany!), it is unclear where to put these beasts, etc.
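as a rough sanity check on the rent-vs-buy math, including electricity: all numbers below are loose assumptions (card price, power draw, german electricity price), not quotes:

```python
# back-of-the-envelope: how many months until owning a 4090 beats renting one,
# once you subtract (assumed) german electricity costs from the rent savings.
RENT_PER_MONTH = 500.0        # $/month for a 4090 on runpod (from above)
CARD_PRICE = 2500.0           # assumed all-in price of a 4090 plus surrounding box
POWER_KW = 0.45               # assumed average draw under load, in kW
PRICE_PER_KWH = 0.40          # assumed german electricity price, roughly, in $
HOURS_PER_MONTH = 730

electricity = POWER_KW * HOURS_PER_MONTH * PRICE_PER_KWH
savings = RENT_PER_MONTH - electricity
print(f"electricity ~${electricity:.0f}/month, break-even after ~{CARD_PRICE / savings:.1f} months")
```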
maybe we'll do it one day, personally i'd love it. and economically, i think it would be a great decision. but we need to pick our battles, and building a gpu cloud doesn't bring the product forward one bit, which is our number one priority at the moment.
6. ai video of the week
beautiful madness made with kling on neural frames.