Data & AI Platform Architecture

Tawfiq Bajjali, January 2018 

Introduction

The big picture behind having an enterprise data & analytics platform

As you and the other enterprises you collaborate with roll out digital applications, your customers, partners, and stakeholders will generate valuable digital data that you will want to collect. They will generate even more data as they adopt wearables and IoT devices, and this will only increase as advanced networking beyond 5G becomes more commonplace.

Your goal is to harness this new data and use it to build insights that help you offer more personalized experiences, maximize convenience, optimize your business operations, and grow your market share. These additional data and newly gained insights will transform the way you transact. They will lead you to build new digital applications; modify and enrich existing systems to enhance the experience or collect more information; establish new partnerships, or change the dynamics of existing ones, to transact in new ways; and make step changes in core business processes, which may include integrating disparate systems or automating manual processes in use today. All of this activity will generate additional new data, reinforcing the cycle of business optimization through data and newly gained insights.

Uses for the data and AI platform

To capitalize on the business opportunity described above and maximize your chances of success, you will need speed and scale. More specifically, you need to collect data and develop insights rapidly, manage them effectively, and be able to explain them using visualization and other business intelligence tools with roll-up and drill-down capabilities. Also, as more and more business operators start to rely on data and insights to optimize their functions, you will need to streamline search, analysis, and insight development for users who vary in their depth of business knowledge, their understanding of data at its various levels of granularity, their experience and skill working with data, and their preferred analysis tools. Lastly, you will need to streamline the integration of data and insights into existing and new systems. This combination of data, insights, processes, policies, guidelines, architectures, tools, and infrastructure makes up your enterprise data & analytics platform.

Let us expand on the above and outline features the platform must have. 

First, streamline data collection and collect data as it is generated, anywhere, and as soon as collection is possible.

Second, streamline building traditional analytics, AI/ML models and algorithms to find optimizations in either real-time transactions or in multi-step multi-day operational processes. This includes deriving data to simplify understanding the underlying data used in the derivation, and insights that explain or provide recommended actions to optimize outcomes.

Third, ensure that data and insights are consistently interpreted and applied, they can be explained, are dependable and their quality is known, are only used in ways that comply with their data use agreements, and are secured and protected from unauthorized users.

Fourth, streamline legacy and modern enterprise systems integration to support increased automation and use of AI/ML interventions and recommendations. This includes streamlining the interoperability of data across systems within and outside of your enterprise.

Fifth, streamline access, discovery, search and analysis of enterprise, customer, and partner data for use in new application development.

Sixth, streamline data visualization, dashboard development, and interpretation of enterprise data and analytics.

In addition to building this platform, you should also focus on consolidating and modernizing the systems where your business transactions occur. How to go about that is outside the scope of this article, but it is important to highlight that you should pursue it in parallel with building your platform. This matters because, for example, suppose you have multiple systems performing very similar business processes, and one of the sub-steps is exactly the same across all of them. Your analysis finds value in collecting more data elements, or in introducing or eliminating a step within this shared process. With multiple systems performing the process, the change must be made consistently across every one of them.

As for modernization, it is important so that you can easily enhance your systems to transmit all the data possible to your platform and maximize its effectiveness. I will expand on what I mean by "data possible" later.

The benefit of using this platform to manage data and build AI across the enterprise

The value of the platform to teams responsible for delivering these data and analytics capabilities to the enterprise is significant. It will allow you to move quickly and efficiently.

When enterprise stakeholders need data, they do not need to navigate the enterprise to find it. The platform offers access to all the data in the enterprise, so teams can focus on writing value-adding code and delivering solutions sooner. Data from various sources that resided in various data platforms now exists on a single platform with compatible tools and connectors to query it and simplify its consumption. The platform also provides access to the data at any time, eliminating the access-restriction windows that source systems impose because of peak resource-utilization concerns. Furthermore, the platform provides scalable access to enterprise data, so a virtually unlimited number of workloads can access it simultaneously and the platform will scale to support them. Lastly, the platform offers abstracted layers to simplify interoperability (batch and real-time) with systems you need to invoke behaviors in.

As for the data itself, it is refreshed as often as the source system is willing to publish it (or allow you to consume it), so there is no way to get fresher data. The data has been checked and certified from a data quality perspective. Data values for the same attribute (representing the same business value) that are represented differently across systems have been standardized on the platform. The platform masters data belonging to critical business entities. It catalogs data elements and values to simplify consuming and governing the data. Data is also curated for workloads that require it and can accept latency. Lastly, data associated with a multitude of business events can be identified and is easily accessible in a timely manner.

What we will and will not cover in this article

I am writing with the assumption that you already have multiple data stores that do some of the jobs of this platform. This does not imply, nor do I encourage, that you throw everything away and start over, or that you must build whatever you end up building in one big bang. As we go through the sections below, we will make sure this assumption holds.

The other thing to note is that this writeup is not technology specific. There are many technologies out there that you can use, both on-premises and in the cloud. You should rely on your technologists who love and live to master technology and evangelize it. What I want to do with this writeup is inform the functional and nonfunctional specifications and give implementation teams the freedom to choose technologies and tools based on the tech stacks and standards in their enterprise.

For example, I will encourage you to build this platform in the cloud to take advantage of all it has to offer, including fast computing at scale, economical storage costs, proprietary AI capabilities, and the ability to scale and host applications just in time to reach all your constituents simultaneously without a large upfront capital investment. I will not recommend a specific cloud vendor, or which of their cloud services to use. I may reference some, and will always use them as examples or options to consider. I will leave this decision up to you because you may have many constraints to manage within.

Part 1: Architecting the platform

The architecture is made up of multiple data zones with distinct responsibilities, reusable infrastructure, and platform services that support the different workloads the platform enables. Platform services include data management services; they are responsible for cross-cutting concerns and will be used in the implementation and ongoing maintenance of the data zones, and in the implementation of the different workloads on the platform. Lastly, I may describe the different workloads that can be built on the platform as different layers of the platform. For example, the platform will include an AI/ML layer to support AI/ML workloads. I will also introduce the infrastructure components as part of describing each platform service or the workloads that will use them.

Step 1: Building the platform’s data zones

As I mentioned earlier, a big component of the platform is its various data zones. The first data zone is the RAW zone. It acts as the initial repository for data ingested from various sources, without altering the data in any way. When the data is stored, no schema is enforced; the schema is applied on read. Lastly, by definition, the data can be structured, semi-structured, or unstructured. We'll cover the different techniques for ingesting data into this layer later.
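The schema-on-read behavior of the RAW zone can be sketched in a few lines. This is a minimal illustration, not an implementation: real RAW zones live on object storage or a data lake, and the field names here are assumptions.

```python
import json

# Hypothetical RAW zone: payloads are stored exactly as received, as opaque
# strings. No schema is enforced at write time.
raw_zone = []

def land(payload: str) -> None:
    """Append the payload to the RAW zone without altering it."""
    raw_zone.append(payload)

def read_with_schema(fields: list) -> list:
    """Schema-on-read: parse and project only at consumption time.
    Records missing a requested field surface None rather than being
    rejected at ingest."""
    out = []
    for payload in raw_zone:
        record = json.loads(payload)
        out.append({f: record.get(f) for f in fields})
    return out

land('{"claim_id": "C1", "status": "03", "amount": 120.5}')
land('{"claim_id": "C2", "status": "01"}')  # missing "amount" is fine at ingest

rows = read_with_schema(["claim_id", "amount"])
# rows[1]["amount"] is None: the schema was applied on read, not on write
```

The key property is that ingestion never fails on shape: consumers decide what structure to impose, and different consumers can impose different structures on the same landed data.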

The second data zone is the STANDARDIZED zone. It is responsible for transforming data from the RAW zone, mainly by standardizing data values across the various sources, applying row-level business logic, and adding derived columns to prepare the data for further processing. New records should be persisted to maintain the immutability of the RAW zone.
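A minimal sketch of a STANDARDIZED-zone transform, assuming an illustrative status crosswalk and field names, shows all three responsibilities: value standardization, row-level logic, a derived column, and emitting a new record rather than mutating the RAW one.

```python
from copy import deepcopy

# Illustrative crosswalk; codes and field names are assumptions.
STATUS_XWALK = {"01": "paid", "03": "rejected"}

def standardize(raw_record: dict) -> dict:
    """Emit a new STANDARDIZED-zone record; the RAW record is never mutated."""
    rec = deepcopy(raw_record)
    rec["status_std"] = STATUS_XWALK.get(rec.get("status"))            # value standardization
    rec["net_amount"] = rec.get("amount", 0) - rec.get("discount", 0)  # derived column
    rec["standardized_from"] = raw_record["record_id"]                 # link back to RAW
    return rec

raw = {"record_id": "R1", "status": "03", "amount": 100.0, "discount": 5.0}
std = standardize(raw)
# raw is unchanged; std carries both the original and standardized values
```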

The third data zone is the CONFORMED zone. It is responsible for further processing of records from the STANDARDIZED zone, mainly integrating data from multiple sources to provide a unified format and data values.

The fourth zone is for creating and maintaining traditional database repositories like Data Marts and Operational Data Stores (ODSs), but it will also be home to other repositories, including NoSQL repositories that contain curated entities. This zone is responsible for creating and promoting the reuse of the repositories needed to support near real-time data integration and consumption, as well as near real-time and traditional reporting.

Step 2: Platform Data Management Services

In this section we will discuss the data management services and subsystems your platform offers on top of the landed data, including data quality, reference data (used for standardization), master data management, metadata management, access control management, and a data catalog.

Data Quality Management

Data quality management ensures that data is accurate, complete, reliable, and consistent. Depending on the zone or layer you are checking, different requirements must be met. This is a critical investment because the accuracy of the data directly translates into the accuracy of the business decisions made using it. High-quality data also builds trust in the platform, increases adoption, and is critical for compliance with government regulations and business contracts.

When you are checking data in the RAW layer, data shall match real-world values, all required fields specified by the source should have values, and data must be up to date based on timeliness requirements agreed with the source. In the STANDARDIZED layer, data values shall be uniform across data stores and shall conform to defined formats and business rules. In the CONFORMED layer, data should be de-duplicated, and derived field values that roll up values from multiple rows should be validated.
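The per-layer checks above can be sketched as three small validators. These are illustrative rules under assumed field names, not a quality framework; real platforms would drive them from configuration and a rules catalog.

```python
import re

def check_raw(record: dict, required: list) -> list:
    """RAW-layer check: every required field supplied by the source has a value."""
    return [f for f in required if record.get(f) is None]

def check_standardized(record: dict, formats: dict) -> list:
    """STANDARDIZED-layer check: values conform to agreed formats (regexes here)."""
    return [f for f, pat in formats.items()
            if record.get(f) is not None and not re.fullmatch(pat, str(record[f]))]

def check_conformed(records: list, key: str) -> list:
    """CONFORMED-layer check: records are de-duplicated on the business key."""
    seen, dupes = set(), []
    for r in records:
        if r[key] in seen:
            dupes.append(r[key])
        seen.add(r[key])
    return dupes

raw_issues = check_raw({"claim_id": "C1", "status": None}, ["claim_id", "status"])
fmt_issues = check_standardized({"status": "3"}, {"status": r"\d{2}"})
dupe_keys = check_conformed([{"id": 1}, {"id": 2}, {"id": 1}], "id")
```

Each validator returns the offending fields or keys, which is the shape a data steward's monitoring tool would consume.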

Data quality management also includes tools for data stewards to monitor data issues in various data sets and apply corrections. Corrected records should be persisted as new records that link back to incorrect ones.

Reference Data Management (Standardization)

Reference data management includes the processes of defining, relating (to facilitate lookups of data values that differ in their native systems), maintaining, and distributing standardized data values across the organization. This ensures consistency across the business processes, systems integrations, and reporting systems that use these values. Reference data also changes over time as data stewards manage it, so it should be versioned.
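The versioning requirement can be sketched as follows. This is a toy in-memory model under assumed codes and values: stewards publish new versions rather than overwriting, so a historical lookup reproduces the values that were in force at the time.

```python
class ReferenceSet:
    """Sketch of versioned reference data: each steward change creates a
    new version, preserving the ability to look up values as-of a version."""
    def __init__(self):
        self._versions = []  # ordered list of (version, {code: standard_value})

    def publish(self, version: int, values: dict) -> None:
        self._versions.append((version, dict(values)))

    def lookup(self, code: str, as_of_version=None):
        candidates = [vals for v, vals in self._versions
                      if as_of_version is None or v <= as_of_version]
        return candidates[-1].get(code) if candidates else None

statuses = ReferenceSet()
statuses.publish(1, {"03": "rejected"})
statuses.publish(2, {"03": "denied", "01": "paid"})  # stewards revised the term
```

A current lookup returns the latest standard value, while `lookup(code, as_of_version=1)` reproduces the older one, which is what lets historical reports stay consistent.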

Master Data Management (MDM)

MDM includes the process of creating, or retrieving, the identifier key value used to uniquely identify a record belonging to a critical business entity (e.g. subscriber). This identifier ensures consistency and accuracy across business processes, and supports systems integration, compliance, and reporting. The MDM system should retrieve the master identifier from a set of business attributes supplied to it. If the MDM system is unable to retrieve an identifier, the input should be captured for a data steward to review and create a master record. In some cases, an auto-generated record will be created and returned automatically, then checked later. This is a tradeoff to be made based on the criticality of the business entity, including the impact of retrospectively correcting data.

Furthermore, in some cases there may be multiple sources of truth, or multiple repositories where a critical business entity is stored. This is a very common scenario when one company acquires another and both share a critical business entity (e.g. subscriber). In this scenario, your MDM system should return a keychain with the unique identifiers for the same subscriber in each system (e.g. subscriber_id in system A = 1234 and subscriber_id in system B = 4532), and you can join two tables in the STANDARDIZED layer using these values.

Note: you can persist all the keys available on the keychain, create views that simplify consumption, or in downstream layers of the platform, persist the keys pertaining to the future state system that will live on when you consolidate and decommission legacy systems. This is all about tradeoffs.
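The keychain join from the acquisition scenario can be sketched with toy tables. The ids match the example in the text; the master id, table shapes, and field names are assumptions for illustration.

```python
# Hypothetical keychain returned by the MDM system: the same subscriber
# carries a different id in each system of record.
keychain = {"MASTER-1": {"system_a": "1234", "system_b": "4532"}}

# Toy STANDARDIZED-layer tables from the two systems.
system_a = [{"subscriber_id": "1234", "plan": "gold"}]
system_b = [{"subscriber_id": "4532", "claims_ytd": 7}]

def join_via_keychain(master_id: str) -> dict:
    """Join the two systems' tables through the keychain's per-system keys."""
    keys = keychain[master_id]
    a = next(r for r in system_a if r["subscriber_id"] == keys["system_a"])
    b = next(r for r in system_b if r["subscriber_id"] == keys["system_b"])
    return {"master_id": master_id, "plan": a["plan"], "claims_ytd": b["claims_ytd"]}

unified = join_via_keychain("MASTER-1")
```

Whether you persist the whole keychain, hide it behind views, or keep only the surviving system's key downstream is the tradeoff described in the note above.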

Metadata Management

Metadata is information about data, including information about the content and context of the data (descriptive metadata), about data structures and their relationships (structural metadata), about data formats, storage, and processing (technical metadata), and about other data management functions including provenance and lineage, rights for use and maintenance, and ownership (administrative metadata). The different types of metadata play critical roles in understanding data, ensuring data integrity, and performing and understanding data integrations.

Data Access Management

The data access management system is responsible for authorizing access to data elements. Data stewards (or the owners of the data) should grant access. This includes role-based access to data repositories, row-level access, and field-level access. The system should also log access and provide audit reporting. It should also monitor access and detect anomalies. Lastly, it should support masking and encryption of fields. This is a critical system for ensuring the compliance and security of the platform.

Data Catalog Management

A data catalog is a subsystem responsible for maintaining an inventory of all the data assets and repositories within the organization, along with their data management characteristics. It includes tools to discover data, profile it (understand quality, structures, and data values), understand its metadata including lineage, understand the controls and permissions set on it, and identify subject matter experts in the organization. This ensures that data is accurate and used as intended (governance and compliance). Platform users interact with this system as part of developing solutions on the platform or during data analysis.

Step 3: Sourcing Data

The data to populate the platform must come from the source system that is considered the system of record for it. Sounds easy, and in the grand scheme of things it is, but there are some facts and obstacles here that you must understand and figure out how to overcome in your enterprise.

Fact 1: Getting data out of a source system is not so much a technological challenge. There might be challenges in doing that, but the real challenge is in getting the source system to contribute their data, explain it to you, help you understand what you can and can’t do with it, help you assess the quality of the data, and master it. This is challenging because source system teams are busy running and enhancing their systems and probably don’t have time to do all the above with the current staff they have.

Fact 2: There are many source systems, some do the same thing for a different segment of the business, and no matter how many migrations or consolidations are done, you can’t bet that the business won’t buy another company that has a new system that does “exactly”, at least from a technology lens, what a few other systems you have already in the enterprise do. The reality is that different systems have completely different storage technologies, and the business will expect to run them in parallel for potentially years to come.

Fact 3: Data in source systems is getting updated, potentially every second during operational hours, which for some systems is 24/7.

Fact 4: Before a transaction in the source system is entirely done, it could have gone through multiple updates (where data was persisted in different states and records had different statuses) involving multiple people that work on it over a period of time including periods of inactivity (potentially having minutes, hours, even days of inactivity).

Given all the above, here is a comprehensive set of requirements for sourcing data from source systems that are willing to put in effort to build you what you need. What I mean by this is that there will be data contained in systems where you won’t have the ability to change how and what data you get (most 3rd party data vendors). Let’s exclude those data sources when reviewing the following requirements and we’ll get to them later.

First, the source system shall define one or more interfaces to distribute data to you. You don't want to define the interface yourself, nor simply take whatever you can get.

Second, the data being distributed shall meet the following requirements.

It shall be distributed immediately when it is available, and as often as the source system can be scaled to distribute it.

It shall include all metadata available. I'll expand on this further below, but for now, a very important component of this is knowing any privacy and compliance considerations mandated by the government or contractually.

It shall include both original and mastered elements (fields), produced by matching the application-specific elements against the master data management system before the data is distributed to you; not all elements are mastered.

It shall include both original and modified reference elements that are standardized to your reference data (i.e. restricted to your range of possible values) before the data is distributed to you; not all elements are restricted to a range of possible values.

It shall include every version of a record or resource at every intermediate state, i.e. whenever an update to the record is persisted in the system's data store.

It shall include every element available, including non-business elements like timestamps, the id of the user that touched the record, and all other metadata about a record that is available.

It shall include application logs covering application and user behaviors, both regular and irregular (e.g. errors), even when those logs are persisted outside the system's primary data stores.

It shall be encrypted using an enterprise encryption scheme so that the data can be used across the enterprise.

It shall include a description of the business event(s) triggering the transaction of data you are receiving; this will help trigger subsequent steps off of business events and offer an easy way to identify the event's associated transaction data.

Third, the source system shall transmit the data to you instead of you pulling it.

Fourth, the interfaces shall evolve. You want them to, so you don’t miss out on capturing data that is available.

Fifth, the source system shall own the code that satisfies all of the requirements above and maintain it with the same level of importance as any other system component. This is critical to making sure that as an application evolves, the interfaces distributing its data are honored and never break. Even if you code the initial version, and somehow were able to figure out the ins and outs of where the application stored its critical data, what it means, and how to retrieve it on your own, you shall transition code ownership over to the contributing system.
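To make the second requirement concrete, a distribution payload can be sketched as an envelope. Every field name and value here is an illustrative assumption, not a standard; the point is that the original values, standardized and mastered counterparts, metadata, compliance markers, and the triggering business event all travel together.

```python
import json

# Hypothetical envelope for one distributed record.
event = {
    "business_event": "claim.paid",                # triggering business event
    "occurred_at": "2018-01-15T10:22:31Z",         # non-business metadata
    "source_system": "claims_system_a",
    "record_version": 3,                           # every intermediate state is sent
    "compliance": {"contains_phi": True,           # privacy/compliance considerations
                   "use_restrictions": ["contract-77"]},
    "payload": {
        "claim_id": "C-10045",
        "status": "03",                            # original source value
        "status_std": "rejected",                  # standardized before distribution
        "adjuster_id": "CS123",                    # application-specific id
        "adjuster_global_employee_id": "E887766",  # mastered id from the MDM system
        "last_touched_by": "CS123",                # non-business element
    },
}

envelope = json.dumps(event)  # what actually crosses the interface
```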

What do you do with these 5 requirements? Based on the business use case, pick the applicable requirements, and over time evolve the interfaces with the source system to meet your needs.

I’ll give you an example. Your initial use case might be to get claims from one of your enterprise claims systems so you can display them in your customer facing mobile app as soon as they get paid. In this scenario, from a sourcing perspective, the claims system needs to transmit to you every single claim as soon as it gets paid or rejected (you are concerned with the final version of the transaction for this scenario, which is easier to deliver than every intermediate step along the way). 

When another use case pops up later, where you need to supply data to a data scientist looking to improve the effectiveness of the team manually adjudicating claims, you need to evolve the interface. To support this use case, the claims system now needs to transmit every single claim along with the fields necessary to identify who is adjudicating it manually, needs to do so as soon as claims progress through the manual adjudication process, and needs to include every step or update to the claim, however many there may be.

I’ll expand on this second use case to touch on a few other requirements from above.

You would want the source system to send you the status code as it appears in the source system and include a standardized enterprise value for that status code (e.g. say the status code was 03, meaning "rejected"). You would want to get both values, 03 and "rejected", in separate elements. You might further standardize the value to something other than "rejected", but it helps to know what the source system thinks code 03 means. You, or the enterprise, would make the list of standard statuses available so the source system can determine that status code 03 means "rejected".

The other element included in the interface identifies the employee responsible for processing the claim at every step of the way. Here, the source system might maintain its own table of employee identifiers that is different from the master list of identifiers for employees who work for your company (we'll call it global_employee_id). For example, employee Tawfiq in the claims system might have id "CS123", but my global_employee_id is "E887766". You want the source system to send both the id element specific to it and the mastered global_employee_id, which it should have gotten from the master data management system.

An important note here, and I’ll touch on it in another section, is that the mastered value needs to have a time to live (TTL) value associated with it, which should be no greater than how often the matching algorithms run to perform identity matching.
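The TTL behavior can be sketched as a small wrapper. The class name, one-hour cadence, and explicit clock parameter are all illustrative assumptions; the essential rule is that a mastered value expires no later than the next identity-matching run and must then be re-resolved against the MDM system.

```python
class MasteredValue:
    """Sketch: a mastered id carries a TTL no greater than the cadence of
    the identity-matching runs; once expired it must be re-resolved."""
    def __init__(self, value: str, ttl_seconds: int, issued_at: float):
        self.value = value
        self.expires_at = issued_at + ttl_seconds

    def get(self, now: float):
        if now >= self.expires_at:
            return None  # stale: caller re-resolves via the MDM system
        return self.value

# Assume matching runs hourly, so the TTL is one hour.
mastered = MasteredValue("E887766", ttl_seconds=3600, issued_at=0.0)
```

Passing the clock in explicitly keeps the sketch deterministic; a real implementation would read the wall clock and likely cache re-resolution.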

To summarize, we talked about source systems and put them into two categories: those that will build interfaces for you and send you data, and those you'll have to ingest data from, taking whatever they have. To simplify, we'll call them involved and noninvolved systems respectively. Your goal on this journey is to get ALL your internal systems (systems that your enterprise owns) to be involved, and to maximize the number of involved external systems (most likely partners, or commercial 3rd party data vendors). However, there will still be data you need from noninvolved systems.

Adapting data from the noninvolved systems

Now let’s talk some more about external systems where you can’t influence the interface and will most likely need to pull data.

First you need to set expectations that these sources could temporarily break (a change in the interface; there are ways to minimize the impact, which we'll discuss in another section), won't always be available, or could vanish with very little notice. What helps you deliver the news when something like this happens is being able to quantify and articulate the value of the source data. This is where the metadata you get with the data itself is critical. I am not just talking about the resource (record) schema and element types, formats, possible values, hierarchies, relationships, and rules. Those are important, but there is a lot more metadata you should be after, as described in the metadata section above.

Before we get into where we'll land the data being sent or ingested, let's create some separation of concerns. We'll build an adaptation layer that we own and maintain to ingest data from the noninvolved systems and relay it to the landing zone. This way the landing zone receives all data pushed to it, regardless of the source. In the adaptation layer, we won't do much to the data other than master it and standardize key fields. You must also scale this layer so that you transmit data downstream as often as the use cases you support call for.
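A minimal sketch of the adaptation layer, with toy stand-ins for the crosswalk and the MDM lookup (the field names are assumptions), shows its whole job: enrich the pulled record, then push it so the landing zone only ever sees pushes.

```python
# Toy stand-ins for reference data and the MDM system.
STATUS_XWALK = {"03": "rejected"}
MDM_LOOKUP = {"CS123": "E887766"}

def adapt_and_relay(pulled_record: dict, push) -> None:
    """Enrich a record pulled from a noninvolved system (standardize and
    master key fields), then push it downstream via the supplied callable."""
    enriched = dict(pulled_record)
    if "status" in enriched:
        enriched["status_std"] = STATUS_XWALK.get(enriched["status"])
    if "adjuster_id" in enriched:
        enriched["adjuster_global_employee_id"] = MDM_LOOKUP.get(enriched["adjuster_id"])
    push(enriched)  # the landing zone receives pushed data only

landing_zone = []
adapt_and_relay({"claim_id": "C1", "status": "03", "adjuster_id": "CS123"},
                landing_zone.append)
```

Injecting `push` as a callable keeps the adapter independent of the transport, so swapping the list for a stream producer changes nothing upstream.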

Step 4: Ingesting data into the platform, propagating it through the layers, and storing it

Data arrives at your platform in different volumes and velocities. This includes real-time data and batch data sets. It also sometimes arrives multiple times, after corrections have been made. We'll explain how to handle each of these scenarios and the platform components involved. Also, propagating data from one layer or zone to another should follow the same pattern as ingesting and landing data from a source system into the RAW zone.

Real-time data

Let's start with real-time transactions sent in by the source systems. This is where event streaming, with its supporting infrastructure, becomes a core capability of your platform. The idea is that you define topics into which published data lands. By doing this, the source system has one and only one integration point with your platform, and N consumers can consume the data from the various streams you publish, avoiding N point-to-point integrations with the source. The data is available to consumers shortly after it arrives at your platform. We'll get into consuming data from the platform in the sections that follow.
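The one-publish, N-consumers property can be shown with a toy in-memory broker. This is a teaching stand-in for a real event-streaming platform (Kafka and similar systems add partitioning, retention, and delivery guarantees this sketch ignores); the topic name is an assumption.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory stand-in for an event-streaming platform: the source
    publishes to a topic once, and any number of subscribers consume."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subs[topic]:
            handler(event)

broker = Broker()
app_feed, analytics_feed = [], []
broker.subscribe("claims.paid", app_feed.append)        # consumer 1
broker.subscribe("claims.paid", analytics_feed.append)  # consumer N
broker.publish("claims.paid", {"claim_id": "C1"})       # one integration point
```

Adding a new consumer is one `subscribe` call; the source system's single integration point never changes.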

Now let’s talk about a few considerations and re-emphasize a few items that are required to pull this off.

The infrastructure that you use for this event streaming architecture must be scalable.

Since the source system oversees the interface, or the source system interface was adapted to be the transmitter of data, the data must come in with a quality certification.

Because of that, you can feel comfortable publishing it to subscribers as soon as it comes in. Note that depending on the criticality of data, there is some leeway in certifying the data for quality.

The source system will apply standardization and master the data so you can feel comfortable publishing it to subscribers as soon as it comes in.

Since the interface will be built specifically to extend data to the platform, I recommend extending the data via REST, mainly to shield or simplify any complexity in the source system’s data architecture.

We covered the case where a noninvolved external system is in play and you have to get the data (by calling an API, or capturing change data), adapt the interface, enrich the data elements (master, standardize, etc.), and then push the data to your platform. In some cases, you might need to call back to the source APIs; we will cover this in a later section (either because they are an encapsulation necessary to get the data, or because you don't want additional hops in your pipeline for a consumer).

Lastly, you will probably find yourself configuring your topics to hold data for a period of days, whereas you would want to keep your data for years. The solution is to archive the data in your topics periodically, sometimes as often as multiple times a day, into longer-term storage, and "never" get rid of it, where never is the maximum amount of time you are allowed to keep the data.
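The archival step above reduces to a drain from short-retention topic storage into long-term storage. This sketch uses in-memory lists as stand-ins; a real job would move topic segments into object storage on a schedule.

```python
def archive_topic(topic_buffer: list, long_term_store: list) -> int:
    """Drain the short-retention topic into long-term storage and return
    how many records moved; the topic holds days of data, the store keeps
    it for as long as retention policy allows."""
    moved = len(topic_buffer)
    long_term_store.extend(topic_buffer)
    topic_buffer.clear()
    return moved

topic = [{"claim_id": "C1"}, {"claim_id": "C2"}]
archive = []
moved = archive_topic(topic, archive)
```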

Batch data

Normally large amounts of data are transmitted in batch.

Your platform needs to be able to ingest multiple batch files that build on each other and store them in a long-term storage repository for multiple years. This is straightforward and traditionally how data has been warehoused, but there is one more step to take. After each batch is processed, you should write to a topic in your event stream indicating that the processing of the batch completed. This is useful for a subscriber to use as a trigger to retrieve the data.
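That extra step can be sketched in a few lines: land the batch, then publish a completion event for subscribers to trigger off. The event shape and names here are assumptions for illustration.

```python
def process_batch(batch: list, warehouse: list, stream: list) -> None:
    """Land the batch in long-term storage, then publish a completion event
    that subscribers can use as a trigger to retrieve the data."""
    warehouse.extend(batch)
    stream.append({"event": "batch.completed", "record_count": len(batch)})

warehouse, stream = [], []
process_batch([{"claim_id": "C1"}, {"claim_id": "C2"}], warehouse, stream)
# subscribers to the stream now know a two-record batch is ready
```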

Correcting data you already landed

I've always had scenarios where someone needed to send a corrected file or record over. This is ok. You'll want to retain both the original copy and the corrected copy. You just write the correction, and if there is a way to correlate it with the original bad data, it's up to the consumer later on to pick the correct data values.
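A minimal sketch of this, with an assumed `corrects` link field, shows the rule: the original is never overwritten, the correction is a new record pointing back at it, and consumers resolve which value to use.

```python
# The original record stays as landed; "corrects" is an assumed link field.
store = [{"record_id": "R1", "amount": 100.0, "corrects": None}]

def land_correction(corrected: dict, original_id: str) -> None:
    """Persist the correction as a new record that links back to the bad
    one; the consumer later picks the correct value in the chain."""
    store.append({**corrected, "corrects": original_id})

land_correction({"record_id": "R2", "amount": 90.0}, original_id="R1")

# A consumer choosing the corrected value:
latest = next(r for r in store if r["corrects"] == "R1")
```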

Step 5: Data Standardization

As we discussed in the introduction, most enterprises will have multiple systems that house the same type of data (e.g. claims adjudication systems). In many cases, mostly for analytical purposes, these datasets need to be unified and joined with related data types. The reality is that each system will have different data schemas and different data element values that mean the same thing from a business perspective (business attributes and valid values). Data values for the same business attribute that are represented differently in different systems will need to be standardized into a new attribute name and a corresponding standard set of values. For example, System A has attribute "EmploymentStatus" with possible values "Employed, Retired, etc." and System B has attribute "STSCD" with possible values "01, 02, etc." A new attribute that standardizes the attribute name and set of possible values should be introduced. In other cases, the standardization process is not as simple. Your input may be less structured than a data set that can be represented in a table. In those cases, you should attempt to give the data some structure and provide a way to reconcile the standardized values with the original structure the data was acquired in. Any new data generated should be cataloged.

The platform needs to do all of the above and maintain a way to look up the original attribute values and the system they came from. To do this, create a new layer of data, the standardized layer, that always maps back to the raw data you landed.
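The EmploymentStatus/STSCD example above can be sketched as a crosswalk that both standardizes and preserves the lookup back to the original. The standardized values ("EMPLOYED", "RETIRED") are my own illustrative choices, not a prescribed vocabulary.

```python
# Crosswalk keyed by (source system, original attribute name).
XWALK = {
    ("system_a", "EmploymentStatus"): {"Employed": "EMPLOYED", "Retired": "RETIRED"},
    ("system_b", "STSCD"):            {"01": "EMPLOYED", "02": "RETIRED"},
}

def standardize_value(source: str, attr: str, value: str) -> dict:
    """Produce the new standardized attribute while retaining a lookup
    back to the original system, attribute name, and value."""
    return {
        "employment_status_std": XWALK[(source, attr)][value],
        "orig_source": source, "orig_attr": attr, "orig_value": value,
    }

a = standardize_value("system_a", "EmploymentStatus", "Retired")
b = standardize_value("system_b", "STSCD", "02")
# a and b now agree on the standardized value while preserving their origins
```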

Step 6: Data Marts, Operational Data Stores, and Repositories of Curated Entities

Data marts are repositories that hold transformed and aggregated data from the standardized and conformed layer, designed to give end-users of the platform accessible, high-quality data optimized for specific business applications and workloads. In these repositories, data is organized by business area (e.g. clinical) and represented in a star or snowflake schema, with fact tables containing quantitative data and dimension tables containing their descriptive attributes. These repositories also hold aggregate tables with precomputed or summarized data to improve access speeds, which is critical in supporting Business Intelligence (BI) workloads.

Operational Data Stores (ODSs) are repositories designed to integrate data from various data zones including the raw zone. They provide a consolidated view of operational data and offer near real-time data access.

These data repositories contain entities with attributes curated from multiple independent data objects stored in various layers or zones of the platform, each persisting at different intervals (e.g. one participating repository is updated hourly while another is updated weekly). Let’s refer to these as curated entities. Curated entities exist to simplify consumption and ensure information accuracy for the intended use. They can be standalone or have fields that reference other curated entities they can join with.

For example, a Patient curated entity may contain attributes like name, address, and a risk score for each condition the patient has. Each of these attributes could be updated at different intervals and live in independently built and maintained data repositories. To simplify bringing this information into a report, we create a Patient curated entity for a report developer to consume and build the report from.

Furthermore, curated entities are very useful when attributes used to build a report need to be updated independently of the underlying data those attributes are calculated from. For example, the risk score attribute in the Patient curated entity may be calculated as the average of the Patient’s daily risk scores over a month. While the daily risk score is appropriate for supporting a care management application, a risk score updated monthly from the average of all the daily scores is more appropriate for a monthly report to the member’s employer.
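The Patient example above can be sketched as follows. The entity shape, field names, and monthly-average rule are illustrative assumptions, not a prescribed schema; in practice each input would come from an independently refreshed repository:

```python
from dataclasses import dataclass

# Hypothetical curated Patient entity: each attribute is sourced from an
# independently maintained repository (demographics daily, risk monthly).
@dataclass
class PatientCuratedEntity:
    patient_id: str
    name: str
    address: str
    monthly_risk_score: float  # average of daily scores, refreshed monthly

def build_patient_entity(patient_id, demographics, daily_risk_scores):
    """Assemble the curated entity from independently maintained sources."""
    monthly = sum(daily_risk_scores) / len(daily_risk_scores)
    d = demographics[patient_id]
    return PatientCuratedEntity(
        patient_id, d["name"], d["address"], round(monthly, 2)
    )
```

A report developer consumes `PatientCuratedEntity` directly instead of re-deriving the monthly average from the daily scores each time.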

When thinking about all the attributes that make up a curated entity, we want to first consider curating them from existing repositories and entities before building them all from scratch. This promotes reusability and minimizes data quality issues because it limits end-user confusion around which attribute in which entity to select. This is true regardless of how much data cataloguing we do.

How do we know which metrics or attributes make it into curated entities and which ones don’t? Ask whether we care about those metrics at an enterprise level. If we do, check whether they already exist; if not, we need to build them. From there, we should see if they are joinable to other metrics or attributes that should also be included in our curated entity, either to support an end-user trying to build drill-downs or to eliminate the need for an end-user to implement complex joining logic.

When we need to build our own new metrics, it is important to determine which attributes are needed as inputs to calculate them and where to source those attributes. We also need to know how reliable those sources are, and therefore how reliable the calculated metric will be. Look for ways to ensure the reliability of the data when there are multiple sources that can be used as input, and know the business criteria for when one source trumps another. You also need to know how often the metric needs to be calculated or recalculated, based on what decisions it informs and how it will be consumed. From there, think about how you would catalog the new metric and what it can and can’t be used for.

Also consider whether the new metric you decided to build should live within an existing entity other than the curated entity you are creating. If you think another entity should house the attribute, that decision may require collaboration with the owner of the repository where that entity lives.

Lastly, consider the type of repository you should use to store your curated entity. By persisting the data needed to support consumption independently of the variety of compute that acts on it, you can deploy N specialized compute facilities (search, reactive APIs, etc.). This makes your curated entities accessible to a broader set of users. However, you may at times have to make tradeoffs and persist the data in multiple repositories, each specific to a different use of the data, to meet non-functional requirements such as query performance; that is, consider when you would use an ODS versus a data mart, a transactional database, a NoSQL repository, and so on.

Step 7: Managing Data using an industry standard data model

By creating a version of your data repositories that uses data definitions and structures that are not proprietary to your organization, the various systems used across your organization and by trade partners in your industry become easier to integrate. An example is the Fast Healthcare Interoperability Resources (FHIR) standard, which provides an industry-standard format for storing patient data from electronic health record and claims systems, exchanging it between payers, providers, and patients, and keeping code portable across the different repositories that house this data.
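For illustration, here is the well-known FHIR example Patient resource ("Peter Chalmers") expressed as a Python dict; only a handful of fields are shown. Any two systems that agree on this shape can exchange and process the record without a proprietary mapping between them:

```python
# A minimal FHIR Patient resource. The field names here (resourceType,
# name, gender, birthDate) come from the FHIR standard itself; the
# values are FHIR's own published example, shown for illustration.
fhir_patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Chalmers", "given": ["Peter"]}],
    "gender": "male",
    "birthDate": "1974-12-25",
}
```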

Step 8: Creating data integration abstractions to simplify interoperability

In this section we’ll discuss how you can simplify and future-proof integrations with systems you source data from and need to integrate data back into, particularly those responsible for critical business functions. The thinking here is that an abstraction should be built when the integration is a very complex problem to solve (i.e. the system you figured out how to source data from now needs data integrated back, or behaviors invoked); when the system wasn’t designed for extensibility (if a system already has an integration interface, why add another abstraction); or when multiple systems share a common behavior and contextual knowledge is needed to determine which system the integration should be done with (e.g. you have different call center systems for different lines of business and need to know which line of business the customer transacts with). An abstraction here simplifies the job for the Nth system that needs to integrate after the problem is solved the first time around. Just like sourcing the data, these abstractions should be built as needed and over time, based on the importance of the behaviors each system being integrated with is responsible for.

Part 2: Using the platform for various workloads

So far we’ve architected a data lake with raw data and services to simplify its consumption, additional curation for workloads that can tolerate some data latency, and abstraction layers to simplify interoperability (batch and real-time) with systems that vary in technical implementation but share behavior responsibility (standardizing how behaviors are invoked).

Now let’s cover how platform developers and users will use the platform for their workloads. Before we continue, I want to define the term workload to include both users (analysts, data scientists, AI engineers, developers, etc.) and systems (system accounts, apps, ETL jobs, etc.) with access to data that want to perform some operation or compute. This platform caters to both.

Workload 1: Using the platform for real-time transactions and systems integrations

Previously, we talked about the platform capturing business events and associated transaction data. We’ll want consumers of the platform to be able to subscribe to those business events and trigger subsequent workloads (i.e. the platform facilitates orchestration). They can use the business event transaction data both to validate the data and to orchestrate a series of transformations or behaviors that act on that data. This pattern can also be used for integrating data with legacy business process management tools. When a workload triggered by an event finishes executing, it can optionally publish its own meaningful business event back to the platform for other workloads to fire off of, and so on. This significantly reduces point-to-point systems integrations across your enterprise and promotes loose coupling.
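A toy sketch of this publish/subscribe orchestration chain follows. The `EventBus` class, topic names, and handlers are all hypothetical; the point is that each workload only knows about events, never about the workload upstream or downstream of it:

```python
# Illustrative loose coupling: workloads subscribe to business events
# and optionally publish their own event when they finish.
class EventBus:
    def __init__(self):
        self.subscribers = {}
        self.log = []  # every published (topic, event) pair, in order

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        self.log.append((topic, event))
        for handler in self.subscribers.get(topic, []):
            handler(event)

bus = EventBus()

def enroll_member(event):
    # Triggered by the approval event; publishes its own business event.
    bus.publish("member-enrolled", {"member_id": event["member_id"]})

def send_welcome_kit(event):
    event["kit_sent"] = True  # fires off the enrollment event

bus.subscribe("application-approved", enroll_member)
bus.subscribe("member-enrolled", send_welcome_kit)
bus.publish("application-approved", {"member_id": "M42"})
```

Adding a new downstream workload is a new `subscribe` call; no existing publisher changes, which is exactly how the point-to-point integrations are avoided.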

The other use case for application developers is to use the platform infrastructure within their application, to reuse code by calling one of your platform data services (e.g. master data management), and to rely on the platform as the source of truth for data that their solution needs. They can then publish data to the platform from their application and become a contributing source system.

Then there are operational workloads. These are often behaviors that are not interested in a rich data set; they want to interact with organized and structured aggregates, and most likely consume that data by calling APIs. There are many ways to implement this, so I’ll offer an opinion based on what I have done in the past.

If you are not familiar with aggregates, read up on domain driven design (DDD). Aggregates will make it much easier to manage relationships across the different data sets you have. They will also force you to think really hard about what datasets you want to include in your interfaces. Another pattern that you should read up on is Command-Query Responsibility Segregation (CQRS).

The first important thing to remember here is that you are building a major component of your platform, not an API to support a single application. So step 1 is building an abstraction layer over the data you’ve collected so you can serve it to many applications that scale independently (i.e. your API isn’t supporting a UI directly). The first goal of this abstraction layer is to let each app built on top of it create its own representation of the data it serves up (i.e. multiple read layers supporting multiple apps); however, these N layers are still limited to the objects and relationships you represented in your domain (i.e. you are offering flexibility and independent scaling along with data consistency). You also have to remember where your responsibility ends. App teams depend on your data, but you are not responsible for the freshness of the data they decide to present to their users; that is not your concern. In fact, you should expect that after an app reads from your layer, the data it shows its users will be stale. What you need to worry about is whether real-time data flows into your event streaming layer and how you reflect those updates in your domain layer as quickly as possible. Read up on the observer pattern and check out how Netflix made their API reactive. Also read up on GraphQL.

The second goal of your abstraction layer is to provide a global, scalable interface to create, update, and delete records. These are the commands your various applications will issue. How do you scale this interface and deal with race conditions, global business rules, and so on? This is where your event streaming layer plays a role. All commands should be issued and queued in your event stream. From there, it is the platform’s responsibility, not that of the app issuing commands through your abstraction layer, to process each command back to the systems of record. Your SLA for processing those commands should be based on how often the systems of record (source systems) are willing to accept updates. Work with the source systems to build their processing logic off of your event stream. This simplifies the complexity of multiple systems needing to be involved in processing a command, and of notifying anyone who cares when processing completes and what the errors were, if any. When processing completes, you update your event stream, and that is when the apps that originally issued commands complete their transactions. As for what they do during the processing period (which could be minutes, but could also be hours or days), they will need to rely on their own repositories (persisted or in memory, and independent of your infrastructure) to explain to their users what’s going on.
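The command flow described above might look like the following sketch. The queue, the status topic, and the "update_address" action are illustrative assumptions; the essential point is that the issuing app only enqueues and observes status, while the platform drains the queue into the system of record on its own schedule:

```python
from collections import deque

command_queue = deque()     # stand-in for commands queued in the event stream
status_topic = {}           # command_id -> outcome the issuing app observes
system_of_record = {}       # stand-in for the source system

def issue_command(command_id, action, payload):
    """App side: enqueue the command and surface a pending status."""
    command_queue.append({"id": command_id, "action": action, "payload": payload})
    status_topic[command_id] = "PENDING"

def process_commands():
    """Platform side: drain the queue into the system of record."""
    while command_queue:
        cmd = command_queue.popleft()
        if cmd["action"] == "update_address":
            system_of_record[cmd["payload"]["member_id"]] = cmd["payload"]["address"]
            status_topic[cmd["id"]] = "COMPLETED"
        else:
            status_topic[cmd["id"]] = "ERROR: unknown action"

issue_command("c1", "update_address", {"member_id": "M42", "address": "1 Main St"})
process_commands()
```

Between `issue_command` and the status flipping to COMPLETED, the issuing app shows its users the PENDING state from its own local repository, matching the processing-period behavior described above.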

Lastly, this layer’s infrastructure will have an API gateway used for data security, authentication, logging, and as the centralized access point for all of the endpoints that the microservices being built will rely on.

Workload 2: Developing and training AI/ML models

This layer of the platform is responsible for enabling data scientists to develop, deploy, and manage AI/ML models effectively and efficiently. It consists of sublayers for model development and training; model deployment, execution, and operationalization; and model validation and optimization to ensure effectiveness and the responsible use of AI.

For model development and training, data zones are available for AI/ML engineers to integrate and transform data from. They can also create and persist feature stores. In addition to data sets, they’ll have notebook environments, integrated development environments (IDEs), coding libraries used to build and train models, and cloud compute resources to train those models on.

These users are more technical and have had some training in cleaning up data and moving it around. No matter how much of that problem you’ve solved, they’ll want the data munged a little differently. That’s OK, within some limits you have to set.

First, you need to make sure that all sourcing of data into the data science environment is done by the platform team. Why? You need to think beyond model development. If models were developed using data that isn’t part of the platform and you need to bring those models to production, you’ll find yourself having to build production pipelines. That could be easy, but it could also be very complex and limiting. Think of data sources a data scientist might use that aren’t being managed (e.g. development databases, or databases built by a business team for limited purposes): you can’t rely on when those databases get refreshed, how long refreshes take because of long-running jobs, whether the data in them is comprehensive, or whether they will stay up, because they aren’t production grade.

It is very important that the AI/ML teams understand the value of the data zones we covered earlier so that they are used as intended. It is also important that data sourcing and the core services of the platform are implemented as discussed for the value to be realized.

Beyond that, give each user the ability to spin up their own compute and private data stores, along with read access to most if not all of the data you’ve gathered and the core services you’ve deployed (e.g. the master data management interface). Do this based on whatever controls your policies call for. If you are in the cloud, which I highly recommend, and you have restrictions on the level of compute offered in your service catalog, remember that these are power users who will need the ability to spin up any and all levels of compute you offer (again, with the appropriate controls to manage costs, etc.).

From there, work with the users to identify and publish standard tools and technologies you want to deploy, or cloud services you want to enable. Beyond that, if you can agree to roll out a standard technical framework to streamline model development that everyone uses, that’s even better.

Give these teams the ability to create and share data sets that they govern themselves (e.g. feature repositories, embeddings, etc.). They should be able to version them and evolve them over time, but the data sets should always be based on data that was pipelined into the data science environment.
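A minimal sketch of a team-governed, versioned feature set follows. The `FeatureStore` class, the version labels, and the `lineage` field are assumptions for illustration; the lineage field encodes the rule above that every version must trace back to a platform-managed dataset:

```python
# Hypothetical team-governed feature store: versions are immutable once
# published, and each one records the platform dataset it was built from.
class FeatureStore:
    def __init__(self):
        self.versions = {}  # (name, version) -> feature table + lineage

    def publish(self, name, version, table, source_dataset):
        self.versions[(name, version)] = {
            "table": table,
            "lineage": source_dataset,  # must be a platform-managed dataset
        }

    def get(self, name, version):
        return self.versions[(name, version)]["table"]

store = FeatureStore()
store.publish("patient_risk_features", "v1",
              {"p1": [0.2, 0.4]}, source_dataset="raw_zone.daily_risk")
store.publish("patient_risk_features", "v2",
              {"p1": [0.2, 0.4, 0.6]}, source_dataset="raw_zone.daily_risk")
```

Because old versions remain addressable, a model trained against v1 stays reproducible even after the team evolves the feature set to v2.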

Provide a code repository and a wiki where these users can manage their code and promote code reusability.

For model deployment, execution, and operationalization, this layer includes tools and architecture frameworks to compile, containerize, and orchestrate the execution of models, including streaming and event-driven architecture to deliver model outputs and streamline the integration of insights into the systems that will act on them.

For model validation and optimization, this layer includes tools and components for frequent ongoing A/B testing, sampling model outputs for bias and safety, and reporting to help determine if models need further tweaking or decommissioning in production.

Workload 3: Enabling search

This layer allows users to run full-text search with filtering over large volumes of structured and unstructured data and retrieve it efficiently. To support this capability, new data needs to be indexed in real time, and the indexes need to be stored, monitored, and maintained to satisfy performance requirements. Most cloud vendors have managed, native services that provide this functionality.
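At its core, what such a managed service maintains for you is an inverted index updated as documents are ingested. A toy version makes the idea visible (whitespace tokenization and lowercasing are simplifications; real services handle stemming, relevance ranking, and much more):

```python
from collections import defaultdict

index = defaultdict(set)  # token -> ids of documents containing it
documents = {}            # id -> original text

def index_document(doc_id, text):
    """Index a document at ingest time so it is immediately searchable."""
    documents[doc_id] = text
    for token in text.lower().split():
        index[token].add(doc_id)

def search(term):
    """Return the ids of all documents containing the term."""
    return sorted(index.get(term.lower(), set()))

index_document("d1", "diabetes care plan")
index_document("d2", "cardiac care pathway")
```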

Workload 4: Enabling analysis and Business Intelligence (BI)

This layer includes tools and data constructs, such as online analytical processing (OLAP) cubes, that support data analysis, complex analytics, reporting, and visualization, including self-service report development and visualization capabilities targeted at business end-users, which eliminate the dependency on IT teams. Most of these capabilities will leverage the data marts, operational data stores, and curated entities created specifically to simplify analysis, analytics, report, and dashboard development. The tools also allow for access control to prevent end-users from accessing data they shouldn’t. End-users will also rely on the data catalog to ensure the quality and accuracy of their analysis and reporting. Lastly, the platform team should monitor the performance of the tools and the reports being executed, and optimize performance for business end-users.