Databricks Makes Major Open Source Contributions
June 30, 2022
Databricks made several contributions to popular data and AI open source projects, including
Delta Lake, MLflow, and Apache Spark.
Databricks will contribute all features and enhancements it has made to Delta
Lake to the Linux Foundation and open source all Delta Lake APIs as part of the
Delta Lake 2.0 release. In addition, the company announced MLflow 2.0, which
includes MLflow Pipelines, a new feature to accelerate and simplify ML model
deployments. Finally, the company introduced Spark Connect, which enables the
use of Spark on virtually any device, and Project Lightspeed, a next-generation
Spark Structured Streaming engine for data streaming on the lakehouse.
“From the beginning, Databricks has been committed to open standards and the
open source community. We have created, contributed to, fostered the growth of,
and donated some of the most impactful innovations in modern open source
technology,” said Ali Ghodsi, Co-Founder and CEO of Databricks. “Open data
lakehouses are quickly becoming the standard for how the most innovative
companies handle their data and AI. Delta Lake, MLflow and Spark are all core to
this architectural transformation, and we’re proud to do our part in
accelerating their innovation and adoption.”
Delta Lake 2.0 Brings the Lakehouse to Everyone
Delta Lake 2.0 will bring unmatched query performance to all Delta Lake users
and enable everyone to build a highly performant data lakehouse on open
standards. With this contribution, Databricks customers and the open source
community will benefit from the full functionality and enhanced performance of
Delta Lake 2.0. The Delta Lake 2.0 Release Candidate is now available, and the
full release is expected later this year. The breadth of the Delta Lake
ecosystem makes it flexible and powerful in a wide range of use cases. Fueling
this is a vibrant community of over 6,400 members, with developers from more
than 70 contributing organizations.
“Databricks provides Akamai with a table storage format that is open and
battle-tested for demanding workloads such as ours. The lakehouse powers
interactive analytics at scale so that our customers can have near real-time
analysis of security events within our Edge platform,” said Aryeh Sivan, VP
Engineering at Akamai. “We are very excited about the rapid innovation that
Databricks, along with the rapidly growing community, is bringing to Delta Lake.
We are also looking forward to collaborating with other developers on the
project to move the data community to greater heights.”
“The Delta Lake project is seeing phenomenal activity and growth trends
indicating the developer community wants to be a part of the project.
Contributor strength has increased by 60% during the last year and the growth in
total commits is up 95% and the average lines of code per commit is up 900%. We
are seeing this upward velocity from contributing organizations like Uber
Technologies, Walmart and CloudBees, Inc., among others,” said Executive
Director of the Linux Foundation, Jim Zemlin.
MLflow 2.0 Introduces MLflow Pipelines to Templatize and Automate MLOps
As one of the most successful open source machine learning (ML) projects, MLflow
set the standard for ML platforms. The release of MLflow 2.0 introduces MLflow
Pipelines to the platform, substantially decreasing time to production and
improving execution at scale through standardization. MLflow Pipelines offers
data scientists pre-defined, production-ready templates based on the model type
they’re building to allow them to reliably bootstrap and accelerate model
development without requiring intervention from production engineers.
Next Generation Streaming Engine and Spark Whenever and Wherever
As the leading unified engine for large-scale data analytics, Spark scales
seamlessly to handle data sets of all sizes. However, the lack of remote
connectivity and the burden of developing and running applications on the driver
node make it hard to meet the requirements of modern data applications. To tackle this, Databricks
introduced Spark Connect, a client and server interface for Apache Spark based
on the DataFrame API that will decouple the client and server for better
stability, and allow for built-in remote connectivity. With Spark Connect, users
will be able to access Spark from any device.
In collaboration with the Spark community, Databricks also announced Project
Lightspeed, the next generation of the Spark streaming engine. As the diversity
of applications moving into streaming data has increased, new requirements have
emerged to support data streaming, one of the most in-demand workloads for the
lakehouse. Spark Structured Streaming has been widely adopted since the early
days of streaming because of its ease of use, performance, large ecosystem, and
developer communities. With that in mind, Databricks will collaborate with the
community and encourage participation in Project Lightspeed to improve
performance, expand ecosystem support for connectors, enhance functionality for
processing data with new operators and APIs, and simplify deployment,
operations, monitoring and troubleshooting.