The PivotNine Blog

The Open Standard Underpinning Modern AI You (Probably) Haven’t Heard Of

The recent explosion of interest in large-scale data processing is highlighting the limitations of tools built for an era of large, but not massive, amounts of data. The data comes from millions—billions—of sources, flowing through a towering structure of complex systems built by different vendors, all of them desperate to claim their slice of the massive AI pie. Without one little-known, deep-tech standard holding it all together, the whole structure could come crashing down.

An open source project, Apache Arrow, has emerged as the de facto mechanism used by most of the industry for data interoperability. It’s shepherded by a company called Voltron Data using the now-familiar open core approach, where the core product is free and open source, subsidised by enterprise support and complementary product sales.

Arrow fulfils a need that was previously met by shipping CSV or JSON structured data around, or by making direct connections between systems via JDBC or ODBC. None of these methods could keep up with the volume and velocity demanded by data science in recent years, and proprietary attempts at a solution never won widespread industry support.
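For a concrete sense of what that looks like, here is a minimal sketch using the pyarrow library, Arrow’s Python implementation; the column names and file name are purely illustrative. Rather than serialising rows to CSV or JSON text that every consumer must re-parse, Arrow stores data in a typed, columnar format that any Arrow-aware system can memory-map and read directly.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Build an in-memory Arrow table from columnar data.
table = pa.table({
    "sensor_id": pa.array([1, 2, 3], type=pa.int64()),
    "reading": pa.array([20.1, 19.8, 21.3], type=pa.float64()),
})

# Write it out in the Arrow IPC file format.
with pa.OSFile("readings.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back via memory mapping: no text parsing, no row-by-row conversion.
with pa.memory_map("readings.arrow", "r") as source:
    restored = ipc.open_file(source).read_all()

print(restored.schema)
```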

Josh Patterson, CEO at Voltron Data

“There’s a lot of modular, composable parts of a data system that should be standardised. People should just build it once and adopt it versus replicating the exact same thing over and over in slightly different ways,” said Josh Patterson, CEO at Voltron Data. “It allows us to build these really, really large-scale systems that are much more elegant and performant than they otherwise would be.”

While this is true enough, it’s notoriously difficult to get competitors to agree on a common standard. The success of Arrow is all the more remarkable given the competitive pressures at play. Traditionally, competitors would try to outdo each other rather than collaborate.

“Early days, even in open source, everyone was like ‘I’m going to come up with a better CSV parser than you,’” said Patterson. “‘We’re not going to share a CSV parser, I’m going to rewrite my own!’ You know? ‘My version of doing a join is gonna be better than your version of a join!’”

Patterson observed that collaborating on a common set of foundational primitives and an interoperable, open standard has proved extremely beneficial. “This decomposable stack is relatively new, and I think it’s an exciting thing.”
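As a rough illustration of what that composability buys in practice (assuming pandas and pyarrow are installed; the data here is made up), a table built by one tool can be handed to any other Arrow-aware engine through the same standard columnar representation, rather than through a bespoke converter for every pairing.

```python
import pandas as pd
import pyarrow as pa

# A DataFrame produced by one component of the stack...
df = pd.DataFrame({"user": ["a", "b", "c"], "score": [0.9, 0.7, 0.4]})

# ...converted to Arrow's standard, language-independent columnar layout.
table = pa.Table.from_pandas(df)

# Any Arrow-aware engine (DuckDB, Polars, Spark, GPU libraries, ...) can
# consume this table directly; converting back is just as straightforward.
roundtrip = table.to_pandas()

print(table.schema)
```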

The open core model is being questioned by several companies that were once firm believers. Priced and packaged well, an open core model can sustain a sizeable business—Red Hat being the most famous example—though it’s not without its challenges. Companies such as MongoDB, Elastic, and more recently HashiCorp have chosen to move away from a fully open core approach, citing commercial imperatives. Though they remain in the minority, a live discussion about the merits of alternative approaches is running through the tech industry.

“I personally believe we’re going to move a little bit away from open source,” said Patterson. “We’re seeing this evolution of open source, but I do think open standards are going to become more common.”

In the AI sector, we’re seeing some signs of this with so-called open models, though they more closely resemble the commercial licensing approach of MongoDB, Elastic, or HashiCorp than the open source traditions of Apache, MIT or GPL licenses. An open standard has yet to emerge and instead we’re seeing the usual fight for prime position as the proprietary controller of a de facto standard.

It’s unclear where this jostling for position will take us. OpenStack remains a cautionary tale for many, while Kubernetes and Arrow show us a different path. Or we could end up with a market dominated by a single company that controls the de facto standard, like Microsoft once did with Windows, and AWS does now for S3.

The tech industry loves to democratise things, so perhaps this is another such opportunity?