Microsoft's partnership with Databricks adds new analytics tools to Azure's data platform
We're living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It's a flood of data that's only going to get bigger as we add the sensors of the industrial internet of things, and the data that's needed to deliver even the simplest predictive-maintenance systems.
Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That's where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today's big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.
Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.
Microsoft's new support for Databricks on Azure, called Azure Databricks, signals a new direction for its cloud services, bringing Databricks in as a partner rather than through an acquisition.
Although you've always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.
Configuring the Azure Databricks virtual appliance
The heart of Microsoft's new service is a managed Databricks virtual appliance built using containers running on Azure Container Service. You choose the number of VMs in each cluster that it controls and uses, and once it's configured and running the service handles load automatically, adding new VMs as it scales.
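To make the scaling idea concrete, here is a minimal sketch of the kind of decision an autoscaler makes: pick a worker count that covers the queued work, clamped to the limits set for the cluster. Everything in it (the `ClusterLimits` class, `target_workers`, and the tasks-per-worker heuristic) is hypothetical and for illustration only; it is not how Azure Databricks actually decides when to add or remove VMs.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterLimits:
    """Hypothetical autoscaling bounds chosen when the cluster is created."""
    min_workers: int
    max_workers: int


def target_workers(pending_tasks: int, tasks_per_worker: int, limits: ClusterLimits) -> int:
    """Return how many worker VMs a simple autoscaler would request."""
    # Workers needed to cover the queued work, rounding up.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Never drop below the configured floor or exceed the ceiling.
    return max(limits.min_workers, min(limits.max_workers, needed))


limits = ClusterLimits(min_workers=2, max_workers=8)
light = target_workers(pending_tasks=5, tasks_per_worker=4, limits=limits)
moderate = target_workers(pending_tasks=20, tasks_per_worker=4, limits=limits)
heavy = target_workers(pending_tasks=100, tasks_per_worker=4, limits=limits)
print(light, moderate, heavy)
```

With these assumed limits, a light load stays at the two-worker floor, a moderate load gets just what it needs, and a heavy load is capped at eight workers, which is the behavior that keeps an idle cluster cheap.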
Databricks' tools interact directly with the Azure Resource Manager, which adds a security group, a dedicated storage account, and a virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster, so if you're planning on using it to train machine learning systems, you'll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn't right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.
Querying in Spark brings engineering to data science
Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.
The DataFrames API is essentially a domain-specific language for your data, one that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.
Because Azure Databricks is inherently data-parallel, and its queries are evaluated lazily (only when an action calls for results), results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.
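The idea of deferring work until an action asks for results can be shown with a toy pipeline in plain Python. This is not Spark's implementation, and the `LazyPipeline` class and its methods are made up for the illustration; it only mimics the pattern in which transformations are recorded cheaply and nothing is read until a final call such as `collect()`.

```python
from typing import Any, Callable, Iterable, Iterator, List


class LazyPipeline:
    """Toy stand-in for a lazily evaluated query; not Spark's implementation."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._steps: List[Callable[[Iterable[Any]], Iterable[Any]]] = []
        self.rows_touched = 0  # how many source rows have actually been read

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyPipeline":
        # Transformation: record the step, but do no work yet.
        self._steps.append(lambda rows: (r for r in rows if predicate(r)))
        return self

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Transformation: also just recorded for later.
        self._steps.append(lambda rows: (fn(r) for r in rows))
        return self

    def _counted(self, rows: Iterable[Any]) -> Iterator[Any]:
        # Wrap the source so we can observe when rows are really read.
        for r in rows:
            self.rows_touched += 1
            yield r

    def collect(self) -> List[Any]:
        # Action: chain the recorded steps and pull the data through them.
        rows: Iterable[Any] = self._counted(self._source)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


pipeline = LazyPipeline(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
touched_before = pipeline.rows_touched  # nothing has been read yet
result = pipeline.collect()             # the action triggers the work
print(touched_before, result, pipeline.rows_touched)
```

Because the whole chain is known before any data is read, a real engine has the chance to plan and combine steps before running them, which is part of why deferred evaluation helps with speed.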
Although Azure Databricks provides a high-speed analytics layer across multiple sources, it's also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.
The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service, or even Hadoop.
Microsoft plus Databricks: a new model for Azure Services
Microsoft hasn't yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure's infrastructure services. If Microsoft's claim bears out, that promises to be a significant saving, especially when you factor in no longer having to run your own Spark infrastructure.
Azure's Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There's also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility, as well as Cosmos DB's graph APIs. It should also work well with Azure's data-streaming tools, giving you a new option for near real-time IoT analytics.
If you're already using Databricks' Spark tools, this new service won't affect you or your relationship with Databricks. It's only if you take the models and analytics you've developed on-premises to Azure's cloud that you'll get a billing relationship with Microsoft. You'll also have fewer management tasks, leaving you more time to work with your data.
Microsoft's decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on the tools businesses are already using and making them part of the Azure hybrid cloud without absorbing them into Microsoft.


