Interoperability, duplicate data, and sprawl can all be challenges, members of Protocol's Braintrust say.
Good afternoon! In today's edition, we asked Braintrust members to tell us about the thorniest challenges related to managing data in the cloud and the best ways to get around them. Questions or comments? Send us a note at firstname.lastname@example.org
CTO at Pure Storage
Today’s organizations store and manage their applications and data across heterogeneous environments, from on-premises data centers to the edge to public clouds. However, many find themselves challenged by duplicate copies of data, unpredictable or uncontrollable costs, differences in storage performance and capabilities across environments, and a lack of clear data governance and controls.
With applications and services spread across environments and no clear data integration approach, data management becomes incredibly complex. For example, data is commonly transferred and loaded into multiple environments, proliferating copies of data. These copies not only widen governance and security challenges but also create new cost challenges. As traditional IT organizations increase their in-cloud data footprint, they are often caught off guard by unexpected and hard-to-forecast costs tied to how their data is used. While many are accustomed to thinking about the cost of storage, the costs of data access (e.g., transit and access charges, added fees for burst performance) are not typically encountered on-premises and require operational changes to manage. Cost control is not the only area requiring operational or application changes: customers must also manage differences in data service availability, capability, reliability, and performance levels across environments.
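To make the cost point concrete, the toy calculation below compares two months with the same storage footprint but very different access patterns. All per-GB and per-request rates here are invented for the illustration and do not reflect any provider's actual pricing.

```python
# Toy model: in-cloud data costs depend on usage, not just capacity.
# All rates are illustrative placeholders, not real provider pricing.

STORAGE_RATE_PER_GB = 0.02      # monthly storage, $/GB (assumed)
EGRESS_RATE_PER_GB = 0.09       # data transferred out, $/GB (assumed)
REQUEST_RATE_PER_10K = 0.004    # access requests, $/10k calls (assumed)

def monthly_cost(stored_gb: float, egress_gb: float, requests: int) -> float:
    """Estimate a monthly bill from capacity plus usage-driven charges."""
    storage = stored_gb * STORAGE_RATE_PER_GB
    egress = egress_gb * EGRESS_RATE_PER_GB
    access = (requests / 10_000) * REQUEST_RATE_PER_10K
    return round(storage + egress + access, 2)

# Same 10 TB footprint, very different bills once access patterns differ:
quiet = monthly_cost(stored_gb=10_000, egress_gb=100, requests=50_000)
busy = monthly_cost(stored_gb=10_000, egress_gb=20_000, requests=5_000_000)
print(quiet, busy)  # → 209.02 2002.0
```

The point of the sketch: an on-premises budget tracks the first term only, while the in-cloud bill is dominated by the usage-driven terms, which is exactly what catches teams off guard.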
Cloud infrastructure and platform services (IaaS, PaaS) offer the freedom and flexibility of the cloud operating model to IT organizations. Realizing those benefits in-cloud versus on-premises, however, requires awareness and diligent management of these challenges.
SVP of product management at Databricks
Poor data quality and data integration issues, coupled with a lack of data discoverability, are often some of the biggest challenges facing organizations today when it comes to managing data in the cloud. According to recent research from MIT Technology Review and Databricks, data leaders reported that their teams spend 41% of their time on data integration and preparation, and nearly every respondent (96%) reported negative business effects as a result of data integration challenges. Data can be a powerful tool, but if organizations spend a disproportionate amount of their time cleaning, organizing, and migrating their data instead of analyzing and acting on it, the value is lost. Moreover, the effects can significantly impact the bottom line: without discoverability for the right data, teams face massive data duplication and are far less productive, spending more time on manual data scrubbing. As a result, decision-makers may unknowingly be relying on incomplete or inaccurate data.
Adding to this, many of today’s enterprises are adopting a multi-cloud approach, wanting the efficiency of cloud scalability without being locked in to a single provider. This multi-cloud approach offers greater flexibility and resiliency in a data strategy, but it is not without challenges: data teams may need to reimplement workloads and data models between platforms and, ultimately, need common data tools that work seamlessly across each cloud.
Entrepreneur and investor
Data is moving to the cloud because it is an excellent place to store, manage, and analyze data. The cloud breaks down information silos that exist in on-premises computing, making it much easier to share data internally and with business partners and customers.
However, when you put all your data in one place, you also must implement safeguards that govern the use of the data — most importantly data access control. This has proven to be a challenge for technology vendors and for the organizations that are managing their data in the cloud.
The underlying problem is caused by SQL. The industry-standard database query language is a core element of the Modern Data Stack, the ecosystem of technologies that enables us to manage data in the cloud. But while SQL is great for business analytics, it cannot support the complex, graph-oriented relationships required for data governance. As a result, every governance vendor uses its own purpose-built database, and there is no basis for interoperability. This makes it difficult to build governance applications.
What’s missing? Governance of data in the cloud requires a shared database foundation that can support graph-oriented relationships and queries. Governance products should all work together based on a common understanding of an organization’s business. The technology that promises to make this possible is a new type of database called a relational knowledge graph.
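To see why governance questions are graph-shaped, consider the transitive query "can this user ultimately reach this dataset?" The minimal Python sketch below (a hypothetical model, not any vendor's product) answers it with a reachability walk — a recursive query that is natural on a graph but awkward to express in plain SQL.

```python
# Minimal governance graph: an edge means "is a member of", "is granted
# access to", or "contains". A hypothetical example for illustration only.
edges = {
    "alice": {"analysts"},            # alice is a member of the analysts group
    "analysts": {"sales_schema"},     # the group is granted the sales schema
    "sales_schema": {"orders_table"}, # the schema contains the orders table
}

def can_access(subject: str, resource: str) -> bool:
    """Transitive reachability: follow grants/memberships until we hit resource."""
    seen, stack = set(), [subject]
    while stack:
        node = stack.pop()
        if node == resource:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False

print(can_access("alice", "orders_table"))  # True: alice -> analysts -> sales_schema -> orders_table
print(can_access("alice", "hr_table"))      # False: no path exists
```

A relational table of grants can store these edges, but answering the question requires following an unbounded chain of joins, which is why governance tools gravitate toward graph-native storage.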
Modern governance requires interoperability. Relational knowledge graphs are emerging and they have the potential to provide an industrywide solution.
SVP of enterprise data platforms and risk management technologies at Capital One
The ability to make sense of great volumes of data from an endless number of sources has become paramount to a company’s long-term success. However, managing data in the cloud at scale is not without challenges.
Capital One migrated to the cloud and built new data management platforms to make the best use of its own data. Reflecting on our journey, we have identified some of the most common, yet most difficult challenges to avoid.
- Difficulty controlling costs: An increase in the amount, proliferation, duplication, and variable usage patterns of data make it difficult to control data costs. Data professionals must manage and track usage to understand where inefficiencies are costing money. An effective cost-optimization strategy can help manage spending.
- Lack of understanding data estates: Data analysts often feel lost trying to understand a complex data estate. Any confusion around access, ownership, intent, and relationships between data can lead to valuable data sitting dormant. A federated approach with centralized tooling and policy may help solve this lack of understanding.
- Confusing data governance policies: It can be challenging to track and enforce all of the data governance policies required. However, not all data is created equal, so not all data requires the same level of protection. Data platform owners should consider using a sloped governance approach, increasing governance and controls based on the data's sensitivity and criticality.
Ultimately, a comprehensive data management strategy is required to overcome these challenges and unleash the power of your data in the cloud.
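The sloped approach described above can be sketched as a simple policy lookup: classify each dataset once, then apply progressively stricter controls as sensitivity rises. The tier names and controls below are illustrative assumptions, not Capital One's actual policy.

```python
# Illustrative sloped-governance policy: controls tighten with sensitivity.
# Tier names and control values are assumptions made for this sketch.
POLICY = {
    "public":       {"encryption_at_rest": False, "access_review_days": 365, "approval_required": False},
    "internal":     {"encryption_at_rest": True,  "access_review_days": 180, "approval_required": False},
    "confidential": {"encryption_at_rest": True,  "access_review_days": 90,  "approval_required": True},
    "restricted":   {"encryption_at_rest": True,  "access_review_days": 30,  "approval_required": True},
}

def controls_for(classification: str) -> dict:
    """Return the governance controls for a dataset's classification tier."""
    try:
        return POLICY[classification]
    except KeyError:
        # Fail closed: an unknown classification gets the strictest controls.
        return POLICY["restricted"]

print(controls_for("internal")["access_review_days"])  # 180
print(controls_for("unknown")["approval_required"])    # True (fail closed)
```

The design choice worth noting is the fail-closed default: data whose classification is unknown is treated as the most sensitive tier rather than the least.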
CEO at NetApp
As the cloud is now the de facto platform for businesses today, one thing has become abundantly clear: Data management isn’t easy. Whether you have a single cloud, a hybrid cloud, or a multicloud environment, the challenges of data management are amplified in the cloud.
That’s because in the initial journey to cloud, most businesses have been faced with uncontrolled cloud sprawl that greatly increased the complexity of managing applications, data, and infrastructure in the cloud. They’ve experienced new silos for applications and data being created due to the disparate implementations or lack of application portability, telemetry, and cloud interoperability frameworks. Their security risks have increased exponentially, alongside cost management and containment issues, and they’ve had to deal with new challenges around data visibility, governance, control, and compliance.
Despite these challenges, we still see the cloud as the key to unlocking endless possibilities for most companies today. When cloud is fully integrated into your architecture and operations, and not just another walled garden, it has the potential to live up to its full promise. Moving data and migrating and deploying applications also becomes remarkably easy when your storage foundation is the same on-premises and across every cloud. With this approach, applications can pull data effortlessly from multiple clouds, data can move freely, securely, and consistently between clouds to keep business logic moving forward, and businesses can quickly adapt to deliver the outcomes they need in a dynamic and uncertain macro environment.
CEO at Cohesity
Data is currency in our digital-first world. With the increasing amount and value of data, the risks only grow exponentially. Cloud, data management and security are the top business imperatives the C-suite must closely manage to ensure business continuity.
Organizations can no longer afford to trade off security posture for innovation, especially as ransomware attacks become more complex and new privacy regulations continue to be implemented. Existing security standards have become outdated: traditional solutions are no longer cutting it, and businesses are ill-equipped to respond to and proactively address today's vulnerabilities. As more and more of the enterprise resides in the cloud, businesses must bring data management and data security together in a different approach that ensures business continuity. Organizations should consider how well integrated vendors are across the technology stack to ensure airtight security for managing data in the cloud.
Chief technology officer at D2iQ
As organizations look for a competitive advantage, they are building the most disruptive products by leveraging data, capturing more and more of it to dynamically customize user experiences. However, this growing data poses some real challenges. These range from finding appropriate storage, which is often split between cold and hot tiers depending on how frequently the data is accessed, to securing the data and making it available for analytics. And these challenges come on top of data engineering work: cleaning, transforming, and preparing data for consumption.

Data has gravity, which is to say it is really costly to move from one location to another. It is much easier to move the compute closer to the data than to move the data closer to the compute. However, most data gets generated at the edge, near the end user, which poses a challenge in moving that data to the cloud or data center to train models.

The solution is to adopt containers, Kubernetes, and cloud-native technologies. Containers package the compute in a portable, repeatable manner; Kubernetes helps run those containers closer to the data; and cloud-native technologies provide the governance needed to manage the entire infrastructure from a centralized platform. Even if your compute is distributed, you still have centralized governance, and you no longer have to deal with expensive data migration.
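The data-gravity argument above can be made concrete with a back-of-envelope comparison: shipping a large edge dataset to a central cluster versus shipping a small container image to the edge and running the job in place. The dataset size, image size, and transfer rate below are invented for the illustration.

```python
# Back-of-envelope: move the data to compute vs. move compute to the data.
# All numbers are illustrative assumptions, not real measurements or pricing.
TRANSFER_RATE_PER_GB = 0.09  # assumed cost to move bytes off the edge site, $/GB

def cost_move_data(dataset_gb: float) -> float:
    """Transfer the whole dataset to a central cluster for processing."""
    return round(dataset_gb * TRANSFER_RATE_PER_GB, 2)

def cost_move_compute(image_gb: float) -> float:
    """Pull a container image down to the edge and run the job in place."""
    return round(image_gb * TRANSFER_RATE_PER_GB, 2)  # same rate, far fewer bytes

# 50 TB of edge telemetry vs. a 2 GB containerized training job:
print(cost_move_data(50_000))   # → 4500.0
print(cost_move_compute(2))     # → 0.18
```

Whatever the actual rates, the asymmetry holds: the container image is orders of magnitude smaller than the dataset, which is the whole case for moving compute to the data.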
Kevin McAllister (@k__mcallister) is a Research Editor at Protocol, leading the development of Braintrust. Prior to joining the team, he was a rankings data reporter at The Wall Street Journal, where he oversaw structured data projects for the Journal's strategy team.