
Shifting Left

Tom Scott



In the modern data stack, every data job has an offline and an online component. Typically these are served by vastly different systems and rely on a multi-hop architecture to make datasets available to each. This can be enormously inefficient, as each hop typically involves physical data movement, transformation and processing. These hops are commonly developed in isolation across many technologies, languages and approaches, resulting in a Data MESS rather than a Data Mesh.

The shift left approach adopts a single, consistent processing model for both components (typically by shifting offline processing left, towards online), enabling access to complete, reliable and trustworthy data across the entire data estate. This reduces duplication of work, lowers costs and eases the maintenance burden associated with surfacing new data.

ETL

As a pattern, ETL (Extract, Transform, Load) is one of the oldest still in use today. It involves fetching data from a given source (Extract), preparing it for a given destination (Transform) and inserting it into that destination (Load). This pattern has been used to build data pipelines for decades and works well when data is largely static and pipelines are generally short. This, however, is not the case in modern, data-driven organisations. The prevalence of real-time data systems and use cases has led to a requirement for low or near-zero latency data availability that traditional ETL pipelines cannot satisfy.

Moving past the Medallion architecture

Most ETL-based data warehouses and lakes employ the medallion architecture. This places each dataset into one of three categories:

  • Bronze data represents the raw data collected from sources. This is the entry point of a dataset into an organisation's data estate. This data is commonly unstructured and too incomplete for use by downstream applications, but can be used for exploratory purposes such as data science or machine learning use cases.
  • Silver data represents cleaned, filtered and standardised data that is suitable for forming the building blocks of most analytical jobs.
  • Gold data represents “business level” data suitable for specific applications and use cases. These datasets are curated for a specific purpose and built up from the silver data blocks.


The medallion architecture, a popular version of the multi-hop architecture
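To make the hops concrete, here is a minimal sketch of a medallion pipeline in plain Python. The order records, field names and the `to_silver` and `to_gold` functions are all hypothetical and stand in for real ETL jobs; the point is only that each hop materialises another physical copy of the data.

```python
from collections import defaultdict

# Hypothetical raw order events as they might land in the bronze layer.
bronze = [
    {"order_id": "1", "country": "uk", "amount": "10.50"},
    {"order_id": "2", "country": "UK", "amount": "4.50"},
    {"order_id": "3", "country": "de", "amount": None},  # incomplete record
    {"order_id": "4", "country": "DE", "amount": "20.00"},
]


def to_silver(raw_rows):
    """Clean, filter and standardise bronze rows into a new physical copy."""
    silver_rows = []
    for row in raw_rows:
        if row["amount"] is None:
            continue  # drop incomplete records
        silver_rows.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": float(row["amount"]),
            }
        )
    return silver_rows


def to_gold(silver_rows):
    """Aggregate silver rows into a business-level dataset (revenue per country)."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)  # hop 1: a second copy of the data
gold = to_gold(silver)      # hop 2: a third copy of the data
print(gold)
```

Any change to the cleaning rules in `to_silver` means rebuilding both `silver` and `gold`, which is the reprocessing cost discussed in this section.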

The medallion architecture is wildly successful but has a number of issues:

  1. Data cannot be re-used for operational purposes. Medallion architecture data is copied from operational stores for the purpose of analytics only and thereby loses its relationship with the operational tasks that generated it.
  2. The medallion architecture involves a lot of duplication and processing that can cause costs to spiral.
  3. Changing things is hard. Whether responding to an issue or just experimenting, any change results in significant reprocessing of data. As there are multiple, interdependent copies of each dataset, this also introduces consistency issues between the layers.
  4. The bronze layer can easily get out of control (the data swamp!). Bronze is a dumping ground for all newly ingested data and is, by design, not well curated. Wild variance in structure and volume, combined with poor monitoring, can lead to issues in the source data that are then magnified in the other layers.
  5. All hops are driven by pulling data from the previous stage. If a new data source becomes available, ETL jobs must be created to materialise data in Bronze, Silver and Gold layers before the data can be fully utilised.

Shifting Left addresses these issues

The Shift Left concept moves the focus of data for analytical purposes left in the architecture diagram, away from dedicated analytical stores and towards the operational systems that already run the organisation's “business as usual” activities.

By taking this approach, you can leverage the work you're already doing to provide high quality, secure and governed data for operations in the analytical plane too. The best part is that, because it builds on work already completed or in progress for operations, it can be applied iteratively and partially and still provide benefit. This is a departure from the large up-front commitments usually required by approaches based on physical data movement.


Post shift left, the only physical copy of the data resides in the operational system, addressing issues 1 and 2 of the medallion architecture. The typical cleaning, filtering and copying that happens in the ETL process has been abstracted into a set of logical operations.

This set of logical operations is easy to modify or expand, and any changes are instantly applied to the dataset, addressing issue 3. As the analytical view is purely logical, new operational sources and flows are immediately available for analytics, which solves issue 5.
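As a rough illustration of what "purely logical" means here, the sketch below expresses the cleaning and aggregation steps as functions evaluated at read time over a single operational store, rather than as jobs that write new copies. The in-memory list and the names `operational_orders`, `analytical_view` and `revenue_by_country` are assumptions made for the example, not part of any particular product.

```python
# Hypothetical operational store: the only physical copy of the data.
operational_orders = [
    {"order_id": 1, "country": "uk", "amount": 10.5},
    {"order_id": 2, "country": "de", "amount": None},  # incomplete record
]


def analytical_view(store):
    """Logical 'silver' view: cleaning and filtering applied at read time."""
    for record in store:
        if record["amount"] is None:
            continue  # filtered logically; nothing is deleted from the store
        yield {**record, "country": record["country"].upper()}


def revenue_by_country(store):
    """Logical 'gold' view, defined on top of the silver view."""
    totals = {}
    for row in analytical_view(store):
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


before = revenue_by_country(operational_orders)

# A new operational record is visible to analytics on the next read;
# no ETL job has to run and no intermediate copy has to be rebuilt.
operational_orders.append({"order_id": 3, "country": "UK", "amount": 4.5})
after = revenue_by_country(operational_orders)

print(before, after)
```

Changing a rule in `analytical_view` takes effect on the next read, with nothing to reprocess. The trade-off in this toy version is that the work is repeated on every query; a real system would need some form of caching or indexing, which is outside the scope of this sketch.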

The way in which shifting left addresses issue 4 is more subtle. In the approach above, the single source of truth for data has two purposes: supporting operational applications and providing analytical capabilities. This means that either purpose can leverage the security, governance and maintenance practices of the other, doubling the coverage that protects against stale, corrupt and irrelevant data (the data swamp).

Note: Cleaning and filtering processes are relevant to operational systems too, so they could be handled by stream processing and written back to the operational store. You will often find that you are already doing this work to power non-analytical applications.

Stream Table Duality

The major difference between a traditional ETL-based approach and the Shift Left approach championed here is the storage format for the data. Operational stores typically favour the stream, an unbounded and continuous flow of data, whereas analytical stores favour the table, a bounded, point-in-time view. Thankfully, the concepts embodied in these two views are interoperable so that, once more, only logical constructs are required to switch between them:

Operational Unit | Analytical Unit | Description
---------------- | --------------- | -----------
Cluster | Namespace | Each operational cluster is represented as an analytical namespace to indicate the separation of resources between them.
Topic | Table | As logical collections of data points, topics and tables are equivalent.
Message | Row | An individual message within a topic can be viewed as a row within a table when combined with a schema that gives it a consistent structure.
Message field | Column | A field within an event streaming message represents a single attribute of a data point, a similar concept to a column within a table row.
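The mapping above can be sketched in a few lines of Python. Everything here is hypothetical: the `Message` type, the `payments_topic` contents and the column list stand in for a real topic and schema. The second function also assumes that messages sharing a key are successive updates to the same entity, which is one common way to obtain a point-in-time table from a stream but is not stated in the mapping itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    key: str
    value: dict  # message fields, named by the schema


# Hypothetical topic: unbounded in practice, a short list here.
payments_topic = [
    Message("a1", {"account": "a1", "balance": 100}),
    Message("b2", {"account": "b2", "balance": 50}),
    Message("a1", {"account": "a1", "balance": 70}),
]

# Hypothetical schema: the ordered column names shared by every message.
columns = ("account", "balance")


def table_name(cluster, topic):
    """Cluster maps to namespace, topic maps to table."""
    return f"{cluster}.{topic}"


def as_rows(topic, cols):
    """Message -> row, message field -> column (one row per message)."""
    return [tuple(m.value[c] for c in cols) for m in topic]


def as_snapshot(topic, cols):
    """Point-in-time view: the latest row per key (assumes keyed updates)."""
    latest = {}
    for m in topic:
        latest[m.key] = tuple(m.value[c] for c in cols)
    return latest


print(table_name("payments_cluster", "payments"))
print(as_rows(payments_topic, columns))
print(as_snapshot(payments_topic, columns))
```

Both views are computed from the same topic without copying it elsewhere, so the choice between "every event" and "current state" is a logical one made at read time.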

In Summary

Shifting left creates a single source of truth for the entire data estate that is clean, reliable and well maintained. It combines the concerns of operational and analytical workers and customers into a single flow that reduces complexity and cost while improving the consistency and scope of the data available. Shift left decentralises the responsibility for datasets away from central data stores and technology-focused teams and into the hands of those who are close to the generation and usage of the data, empowering those who understand the data best with the tools to manipulate and represent it.