[tl;dr The cloud enables novel ways of handling an expected explosion in data store types and instances, allowing stakeholders to know exactly what data is where at all times without human process dependencies.]
Data management and governance is a big and growing concern for organizations of all sizes. Effective data management is critical for compliance, resilience, and innovation.
Data governance is necessary to know what data you have, when you got it, where it came from, where it is being used, and whether it is of good quality or not.
While the field is relatively mature, the rise of cloud-based services and service-enabled infrastructure will, I believe, fundamentally change the nature of how data is managed in the future and enable greater agility if leveraged effectively.
Data Management Meta-Data
Data and application architects are concerned with ensuring that applications use the most appropriate data storage solution for the problem being solved. To better manage cost and complexity, firms tend to converge on a handful of data management standards (such as Oracle or SQL Server for databases; NFS or NTFS for filesystems; Netezza or Teradata for data warehousing; Hadoop/HDFS for data processing, etc). Expertise is concentrated in central teams that manage provisioning, deployments, and operations for each platform. This introduces dependencies that project teams must plan around. It also requires forward planning and long-term commitment, so it is not particularly agile.
Keeping up with data storage technology is a challenge. Technologies like key/value stores, graph databases, columnar databases, object stores, and document databases exist because they represent different kinds of datasets in a way that is more natural for applications to consume, reducing or eliminating the ‘impedance mismatch’ between how applications view state and how that state is stored.
In particular, many datastore technologies are designed to scale up rather than out; i.e., the only way to make them perform faster is to add more CPU/memory, or faster IO hardware. While this keeps applications simpler, it requires significant forward planning and longer-term commitments, and is out of the control of application development teams. Cloud-based services can typically handle scale-out transparently, although applications may need to be aware of the data dimensions across which scale-out happens (e.g., sharding by primary key).
On-premise, provisioning requests for a new datastore are mostly ticket-driven, and within enterprises those tickets are still mostly fulfilled by humans rather than by software, which means an “infrastructure-as-code” approach is not feasible.
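As a rough illustration of what "sharding by primary key" asks of an application, the sketch below maps a key to one of a fixed number of shards using a stable hash. The function names, the record layout, and the shard count are assumptions made for this example; managed services usually do this routing internally and may use a different scheme.

```python
import hashlib


def shard_for(primary_key: str, shard_count: int) -> int:
    """Return a stable shard index in [0, shard_count) for a primary key."""
    # Python's built-in hash() is salted per process for strings, so a digest
    # is used to get the same answer on every run and on every host.
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(records: list, shard_count: int) -> dict:
    """Group records by the shard that owns their 'id' field."""
    shards = {index: [] for index in range(shard_count)}
    for record in records:
        shards[shard_for(record["id"], shard_count)].append(record)
    return shards


# Hypothetical records, keyed by an 'id' field.
orders = [{"id": f"order-{n}", "total": n * 10} for n in range(8)]
placement = partition(orders, shard_count=4)
```

Simple modulo placement like this moves most keys when the shard count changes, which is one reason the choice of sharding dimension cannot be entirely hidden from the application.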
Data Store Manageability vs Application Complexity
Most firms decide that it is better to simplify the data landscape such that fewer datastore solutions are available, but to resource those solutions so that they are properly supported to handle business critical production workloads with maximum efficiency.
The trade-off is in the applications themselves, where the data storage solutions available end up driving the application architecture, rather than the application architecture (i.e., requirements) dictating the most appropriate data store solution, which would result in the lowest impedance mismatch.
A typical example of an impedance mismatch is an object-oriented application (written in, say, C++ or Java) that uses a relational database. Here, object/relational mapping technologies such as Hibernate or Gigaspaces are used to map the application view of the data (which likes to view data as in-memory objects) to the relational view. These middle layers, while useful for naturally relational data, can be overly expensive to maintain and operate if what your application really needs is a more appropriate type of datastore (e.g., graph).
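To make the mismatch concrete, here is a small sketch of one in-memory object being persisted two ways: flattened into rows for two relational tables, and stored whole as a single document. The Order and OrderLine classes and the table names are invented for this example and do not come from any particular ORM.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class OrderLine:
    sku: str
    quantity: int


@dataclass
class Order:
    order_id: str
    customer: str
    lines: list = field(default_factory=list)


def to_relational_rows(order: Order) -> dict:
    """Flatten the object graph into rows for two hypothetical tables."""
    return {
        "orders": [(order.order_id, order.customer)],
        "order_lines": [
            (order.order_id, position, line.sku, line.quantity)
            for position, line in enumerate(order.lines)
        ],
    }


def to_document(order: Order) -> str:
    """Serialize the same object as one self-contained JSON document."""
    return json.dumps(asdict(order), sort_keys=True)


order = Order("o-1", "acme", [OrderLine("sku-1", 2), OrderLine("sku-2", 1)])
rows = to_relational_rows(order)
document = to_document(order)
```

The relational form needs a foreign key and a join to rebuild the object, which is the work an ORM takes on; the document form keeps the application's shape but gives up ad-hoc querying across order lines.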
This mismatch gets exacerbated in a microservices environment where each microservice is responsible for its own persistence, and individual microservices are written in the language most appropriate for the problem domain. Typical imperative, object-oriented languages implementing transactional systems will lean heavily towards relational databases and ORMs, whereas applications dealing with multi-media, graphs, very-large objects, or simple key/value pairs will not benefit from this architecture.
The rise of event-driven architectures (in particular, transactional ‘sagas’, and ‘aggregates’ from DDD) will also tend to move architectures away from ‘kitchen-sink’ business object definitions maintained in a single code-base into multiple discrete but overlapping schemas maintained by different code-bases, and triggered by common or related events. This will ultimately lead to an increase in the number of independently managed datastores in an organization, all of which need management and governance across multiple environments.
For on-premise solutions, the pressure to keep the number of datastore options down, while dealing with an explosion in instances, is going to limit application data architecture choices, increase application complexity (to cope with datastore impedance mismatch), and reduce the benefits from migrating to a microservices architecture (shared datastores favor a monolithic architecture).
Cloud Changes Everything
So how does cloud fundamentally change how we deal with data management and governance? The most obvious benefit cloud brings is the variety of data storage services available, covering all the typical use cases applications need. Capacity and provisioning are no longer operational concerns, as they are handled by the cloud provider. So data store resource requirements can now be formulated in code (e.g., in CloudFormation, Terraform, etc).
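As a minimal sketch of a datastore requirement expressed as code, the snippet below builds a CloudFormation template for a single on-demand DynamoDB table and renders it as JSON. The helper function, the logical resource name, the key name, and the tag keys are all assumptions chosen for illustration; a real template would normally live in a YAML or JSON file under version control.

```python
import json


def dynamodb_table_template(logical_name: str, key_name: str, tags: dict) -> dict:
    """Build a minimal CloudFormation template for one on-demand DynamoDB table."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            logical_name: {
                "Type": "AWS::DynamoDB::Table",
                "Properties": {
                    "AttributeDefinitions": [
                        {"AttributeName": key_name, "AttributeType": "S"}
                    ],
                    "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
                    # On-demand billing: no capacity figures to plan up front.
                    "BillingMode": "PAY_PER_REQUEST",
                    "Tags": [
                        {"Key": key, "Value": value}
                        for key, value in sorted(tags.items())
                    ],
                },
            }
        },
    }


# Hypothetical names and tag values, for illustration only.
template = dynamodb_table_template(
    "OrdersTable", "order_id", {"data-domain": "orders", "data-owner": "order-team"}
)
template_json = json.dumps(template, indent=2)
```

Because the whole request is a reviewable text artifact, it can go through the same pull-request and pipeline checks as application code rather than through a ticket queue.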
This, in principle, allows applications (microservices) to choose the most appropriate storage solution for their problem domain, and to minimize the need for long-term forward planning.
Using code to specify and provision database services also has another advantage: cloud service providers typically offer the means to tag all instantiated services with your own meta-data. So you can define and implement your own data management tagging standards, and enforce these using tools provided by the cloud provider. These can be particularly useful when integrating with established data discovery tools, which depend on reliable meta-data sources. For example, tags can be defined based on a data ontology defined by the chief data office (see my previous article on CDO).
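A tagging standard of this kind can be checked mechanically. The sketch below assumes a hypothetical standard with three required tags, two of which are restricted to an approved vocabulary; the tag keys and allowed values are invented for this example and would in practice come from the chief data office's ontology.

```python
# Hypothetical tagging standard: tag key -> allowed values.
# None means any non-empty value is accepted.
TAG_STANDARD = {
    "data-domain": {"customer", "orders", "finance"},
    "data-classification": {"public", "internal", "confidential"},
    "data-owner": None,
}


def tag_violations(tags: dict) -> list:
    """Return readable violations of TAG_STANDARD; an empty list means compliant."""
    violations = []
    for key, allowed in TAG_STANDARD.items():
        value = tags.get(key, "")
        if not value:
            violations.append(f"missing required tag '{key}'")
        elif allowed is not None and value not in allowed:
            violations.append(f"tag '{key}' has unapproved value '{value}'")
    return violations


compliant = tag_violations(
    {
        "data-domain": "orders",
        "data-classification": "internal",
        "data-owner": "order-team",
    }
)
non_compliant = tag_violations({"data-domain": "marketing"})
```

A check like this could run in a deployment pipeline before a stack is created, or periodically against already-provisioned resources, depending on which enforcement tools the cloud provider offers.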
These mechanisms can be highly automated via service catalogs (such as AWS Service Catalog or ServiceNow), which allow compliant stacks to be provisioned without requiring developers to directly access the cloud provider's APIs.
Let a thousand flowers bloom
The obvious downside to letting teams select their storage needs is the likely explosion of data stores – even if they are selected from a managed service catalog. But the expectation is that each distinct store would be relatively simple – at least compared to relational stores which support many application use cases and queries in a single database.
In on-premise situations, data integration is also a real challenge – usually addressed by a myriad of ad-hoc jobs and processes whose purpose is to extract data from one system and send it to another (i.e., ETL). Usually no meta-data exists around these processes, except that afforded by proprietary ETL systems.
In best-case integration scenarios, ‘glue’ data flows are implemented in enterprise service buses that generally have some form of governance attached. This usually has the undesirable side-effect of introducing yet another dependency for development teams, which needs planning and resourcing. Ideally, teams want to be able to use ‘dumb’ pipes for messaging, and be able to self-serve their message governance, such that enterprise data governance tools can still know what data is being published/consumed, and by whom.
Cloud provides two main game-changing capabilities to manage data complexity at scale. Specifically:
- All resources that manage data can be tagged with appropriate meta-data, without needing to, for example, examine tables or know anything about the specifics of the data service. This can also extend to messaging services.
- Serverless functions (e.g., AWS Lambda, Azure Functions, etc) can be used to implement ‘glue’ logic, and can themselves be tagged and managed in an automated way. Serverless functions can also be used to do more intelligent updates of data management meta-data – for example, update a specific repository when a particular service is instantiated, etc. Serverless functions can be viewed as on-demand microservices which may have their own data stores – usually provided via a managed service.
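The second capability, a serverless function that updates a metadata repository when a service is instantiated, might look roughly like the sketch below. The handler(event, context) shape follows the usual AWS Lambda convention for Python, but the event layout, the field names, and the in-memory dictionary standing in for the repository are all assumptions made for illustration.

```python
# Stand-in for a real metadata repository (for example a data catalog's API).
METADATA_REPOSITORY = {}


def handler(event, context=None):
    """Record a newly created data resource in the metadata repository.

    The event layout used here is hypothetical; it is not the format of any
    particular cloud provider's notifications.
    """
    detail = event["detail"]
    tags = dict(detail.get("tags", {}))
    entry = {
        "resource_type": detail["resource_type"],
        "region": detail["region"],
        "tags": tags,
        # Flag resources that arrived without an owner so they can be chased up.
        "governed": "data-owner" in tags,
    }
    METADATA_REPOSITORY[detail["resource_id"]] = entry
    return entry


# Hypothetical "resource created" notification.
sample_event = {
    "detail": {
        "resource_id": "orders-table",
        "resource_type": "key-value-store",
        "region": "eu-west-1",
        "tags": {"data-domain": "orders", "data-owner": "order-team"},
    }
}
recorded = handler(sample_event)
```

Since the function is itself a deployable resource, it can carry the same tags as the datastores it watches, so the glue logic shows up in the same inventory.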
Data, Data Everywhere
By adopting a cloud-enabled microservice architecture, using datastore services provisioned by code, applying event-driven architecture, leveraging serverless functions, and engaging with the chief data officer for meta-data standards, it will be possible to have an unprecedented, up-to-date view of what data exists in an organization and where. It may even provide a static view of data in motion (through tagging queue and notification topic resources). The data would be maintained via policies and rules implemented in service catalog templates and Lambda functions triggered automatically by cloud configuration changes, so it would always be current and correct.
The CDO, as well as data and enterprise architects, would be the chief consumers of this metadata, either directly or as inputs into other applications, such as data governance tools.
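As a sketch of the kind of view this enables, the snippet below groups a tagged resource inventory by data domain and separates data at rest from data in motion. The inventory records, the "kind" values, and the tag key are hypothetical; in practice the list would come from the cloud provider's tagging or configuration services.

```python
from collections import defaultdict

# Hypothetical inventory, shaped as a tagging or configuration feed might return it.
RESOURCES = [
    {"id": "orders-table", "kind": "datastore", "tags": {"data-domain": "orders"}},
    {"id": "customer-db", "kind": "datastore", "tags": {"data-domain": "customer"}},
    {"id": "order-events", "kind": "topic", "tags": {"data-domain": "orders"}},
    {"id": "scratch-bucket", "kind": "datastore", "tags": {}},
]


def data_map(resources: list) -> dict:
    """Group resource ids by data domain, split into data at rest and in motion."""
    view = defaultdict(lambda: {"at_rest": [], "in_motion": []})
    for resource in resources:
        domain = resource["tags"].get("data-domain", "untagged")
        # Queues and topics carry data in motion; everything else holds it at rest.
        state = "in_motion" if resource["kind"] in {"queue", "topic"} else "at_rest"
        view[domain][state].append(resource["id"])
    return dict(view)


view = data_map(RESOURCES)
```

The "untagged" bucket is the useful part for governance: it lists exactly the resources that slipped past the tagging standard.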
The ultimate goal is to avoid data management and governance processes which rely on reactive human (IT) input to maintain high-quality data management metadata. Reliable metadata can give rise to a whole new range of capabilities for stakeholders across the enterprise, and finally take IT out of the loop for business-as-usual data management queries, freeing up valuable resources for building even more data-driven applications.