The PivotNine Blog

Pure Flash Powers Meta AI Research SuperCluster

25 January 2022
Justin Warren

Pure Storage is pleased to note that Meta/Facebook's new AI Research SuperCluster is built, in part, on a bunch of Pure's FlashBlade and FlashArray//C arrays.

Meta's blog on the RSC notes that the storage tier is built on a combination of Pure Storage FlashArray and FlashBlade systems.

Pure Storage's announcement clarifies that the FlashArray in question is FlashArray//C, the QLC-flash capacity array product line.

Assembling Branded Components

Beyond the obvious advertising reasons for crowing about a big-name customer buying a lot of your gear, I find this interesting more for what Meta/Facebook hasn't done here.

Meta/Facebook didn't build its own storage arrays.

You would think that a company the size of Meta/Facebook (with a market capitalisation of around $870 billion (USD) at the time of writing) could command a decent bulk-purchase discount from the flash foundries if it wanted to. It also employs a large number of software developers, engineers, and other boffins who could, in theory at least, be deployed to do storage things. And it does, some of the time.

But for this AI research platform, Meta/Facebook instead bought a bunch of infrastructure from a range of vendors, including Pure Storage, Nvidia, and Penguin Computing, and worked with them to design the system. It manifestly didn't do everything in-house.

This indicates to me that there is more value in assembling a bunch of branded infrastructure, using the standard interfaces those vendors have all developed to work with the rest of the infrastructure ecosystem, than in trying to roll your own from component parts.
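
To make that concrete: because arrays like FlashBlade speak standard protocols such as NFS and S3, application code can consume data through ordinary client libraries without caring whose logo is on the box. Here's a minimal sketch using boto3 against an S3-compatible endpoint; the endpoint URL, credentials, bucket, and key are all hypothetical placeholders, not details of Meta's or Pure's deployments.

    import boto3

    # Point a standard S3 client at the array's S3-compatible endpoint.
    # Endpoint, credentials, bucket, and key are illustrative placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://flashblade.example.internal",
        aws_access_key_id="EXAMPLE_KEY",
        aws_secret_access_key="EXAMPLE_SECRET",
    )

    # The application reads its data exactly as it would from any other
    # S3-compatible store; the vendor behind the endpoint is invisible.
    obj = s3.get_object(Bucket="training-data", Key="corpus/shard-0001.parquet")
    data = obj["Body"].read()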

There's a money-for-time tradeoff happening here: Meta/Facebook no doubt managed a decent discount on this much gear, but more importantly, the premium for branded infrastructure is nothing compared to the time saved by building something out of easy-to-assemble parts that are then easy to operate.

Large-dataset research depends on being able to set up a modelling run, execute it, collate the results, and then do it all again, often in parallel with other runs and across multiple stages. It's a great big data factory, with lots of people trying to build different kinds of models out of the same big chunks of raw data.
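
A toy sketch of that factory pattern, assuming nothing about Meta's actual tooling: several hypothetical model configurations are trained in parallel against the same shared dataset, and the results are collated at the end. Every name and path here is illustrative.

    from concurrent.futures import ProcessPoolExecutor

    # Placeholder path: every run reads the same big chunk of raw data.
    SHARED_DATASET = "/mnt/shared/corpus"

    def run_experiment(config: dict) -> dict:
        """Hypothetical stand-in for one modelling run: train a model
        against the shared dataset and return its metrics."""
        # ... load SHARED_DATASET, train per `config`, evaluate ...
        return {"config": config, "score": 0.0}  # placeholder result

    if __name__ == "__main__":
        configs = [{"model": "a", "lr": 1e-3}, {"model": "b", "lr": 1e-4}]
        # Many runs proceed in parallel; collate results, then go again.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(run_experiment, configs))
        print(results)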

Modern AI/ML techniques depend on these massive datasets to function. It's one big advantage that major companies like Meta/Facebook (and Google, Microsoft, Apple, Tencent, Alibaba, Baidu, TikTok/ByteDance, and a few others) have over everyone else: they have collected all this data, and they can afford to store and use it.

You Are Not Facebook

For this kind of workload, the marginal savings of roll-your-own infrastructure don't make sense even for Meta/Facebook, and that should be instructive to other enterprise organisations still thinking of building their own stuff.

Don't.

Talk to your vendors and assemble systems based on what they sell. For 90%+ of your workloads, you are very unlikely to be doing anything special enough to warrant using something that isn't available commercially.

This is harsh news for startups looking to get enterprises to buy their shiny new kit that does things a different way, but that's not the right way to think about it. There is still room for you, but it's where it always has been: on the margins, in the few cutting-edge or special workloads that are hard enough to do with mainstream approaches that taking a risk on something new is worthwhile.

You should be partitioning your workloads into roughly three categories: new, high-value, and inventive workloads; transitional mid-life workloads; and stable, proven, long-term workloads. Workloads will move between these stages if they have value to your organisation.

Save the weird experiments for the workloads that are themselves experimental; for everything else, buy regular, proven tech and operate it using known-good methods. You'll deliver a lot more value to your organisation by mostly doing what seems relatively boring and straightforward than you will by trying to run email on a custom memristor-based distributed cluster.

AI/ML research is just another workload now and it can use standard storage you buy from a vendor. I'm surprised, too, but here we are.

The genuinely novel is somewhere else, and I'm keen to hear what you think it is.