ETL (Extract, Transform, Load). It’s a term that scares many founders. But at some point, any company relying heavily on data processing and ingestion (we’re looking at you, AI companies) will need to move away from manual processes and toward a streamlined ETL pipeline.
I recently worked with a client who collects vast amounts of measurements from a specialised niche industry, gathered from various locations worldwide. These data points lack consistency: they measure different things, are formatted differently, and may even use varying languages or units. The company’s vision is to Extract these disparate data points, Transform them into a unified data model, and Load them into a single SaaS product. In other words: ETL. With this data at their disposal, their users can finally query and analyse the data seamlessly, running calculations and visualisations at a global scale.
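To make the Transform step concrete, here is a minimal sketch of what unifying two inconsistent source records might look like. The schema, field aliases, and unit conversions are invented for illustration; the client’s actual data model is more involved.

```python
from dataclasses import dataclass

# Hypothetical unified data model every source gets mapped onto.
@dataclass
class Measurement:
    site_id: str
    quantity: str   # what was measured, e.g. "temperature"
    value: float    # stored in the canonical unit
    unit: str       # canonical unit, e.g. "celsius"

# Illustrative unit conversions into the canonical unit.
TO_CELSIUS = {
    "celsius": lambda v: v,
    "fahrenheit": lambda v: (v - 32) * 5 / 9,
}

# Illustrative translations: sources label the same field in different languages.
FIELD_ALIASES = {"temperatur": "temperature", "température": "temperature"}

def transform(raw: dict) -> Measurement:
    """Map one raw, source-specific record onto the unified model."""
    field = raw["field"].lower()
    return Measurement(
        site_id=raw["location"],
        quantity=FIELD_ALIASES.get(field, field),
        value=TO_CELSIUS[raw["unit"].lower()](float(raw["value"])),
        unit="celsius",
    )

# Two inconsistent source records end up in the same shape:
print(transform({"location": "BE-01", "field": "Temperatur", "unit": "Celsius", "value": "21.5"}))
print(transform({"location": "US-07", "field": "temperature", "unit": "Fahrenheit", "value": "70.7"}))
```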
The Mechanical Turk Approach
So, how did they kick off this ambitious project? Enter the Mechanical Turk. If you’re unfamiliar, the original “Mechanical Turk” was an 18th-century machine that appeared to play chess on its own, though it was actually controlled by a hidden human operator. (By the way, Amazon wouldn’t be Amazon if they didn’t have a service for that.) My client took the same approach: they hired a data analyst and a team of students to manually process the enormous variety of data. Naturally, as the workload grew, they started scripting bits and pieces, but it remained a costly, labour-intensive process. This was their “aha” moment: they needed an ETL pipeline, not a swarm of humans. So, they brought in a specialised profile to spearhead the effort.
Enter the Data Engineer
In the realm of data, a data engineer acts as the technical bridge between data science and software engineering. Their role involves building, deploying, and maintaining the infrastructure that powers automated data and AI workflows. In this case, the data engineer was tasked with analysing the existing manual workflows and designing a scalable ETL solution. Their initial plan included:
- Automating the final step of the process: uploading the cleaned dataset to the SaaS platform.
- Deploying infrastructure to run automated scripts.
- Creating a user interface to manage automations and monitor the status and output of each run.
- Gradually adding more automation scripts.
At first glance, this might seem like a solid, incremental approach. However, it risked overlooking key business priorities.
With no automation in place, the company could achieve quicker wins by automating the most repetitive tasks locally, rather than prematurely deploying an empty production environment. Moreover, the company’s high data quality standards demanded human review and approval before the pipeline could continue, which meant the architecture had to accommodate manual intervention.
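One possible shape for that requirement, as a minimal sketch rather than the client’s actual implementation: transformed batches land in a staging area, and the Load step only ever sees batches a reviewer has approved. The file layout and the upload_to_saas stub below are invented for illustration.

```python
import json
from pathlib import Path

STAGING = Path("staging")    # transformed batches awaiting human review
APPROVED = Path("approved")  # batches a reviewer has signed off on

def stage_batch(batch_id: str, records: list[dict]) -> None:
    """The Transform step stages its output for review instead of loading it."""
    STAGING.mkdir(exist_ok=True)
    (STAGING / f"{batch_id}.json").write_text(json.dumps(records, indent=2))

def approve_batch(batch_id: str) -> None:
    """Called by the reviewer, e.g. from a small CLI or dashboard."""
    APPROVED.mkdir(exist_ok=True)
    (STAGING / f"{batch_id}.json").rename(APPROVED / f"{batch_id}.json")

def upload_to_saas(records: list[dict]) -> None:
    """Stand-in for the real upload to the SaaS platform."""
    print(f"Loaded {len(records)} records")

def load_approved() -> None:
    """The Load step only ever sees approved batches."""
    for path in sorted(APPROVED.glob("*.json")):
        upload_to_saas(json.loads(path.read_text()))
        path.unlink()
```

Making the gate an explicit stage like this means later automation can be added on either side of it without routing around the reviewer.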
Without a clear strategy, the team risked wasting time and resources on overly complex solutions that failed to address the company’s immediate needs and goals.
Introducing the ETL Roadmap
This story highlights an important lesson: ETL isn’t a simple plug-and-play tool you can tack onto an existing application. It’s a core component of your product that requires thoughtful planning, prioritization, and a clear roadmap.
So, how do you get started? The “Mechanical Turk” approach offers a key insight: it helps you identify a process that works and pinpoint where time is being wasted. Armed with this understanding, you can organize a process flow diagram workshop. In this workshop, you map out the data processing workflow, detailing each step, potential divergences, and the inputs and outputs at every stage.
This process flow diagram becomes the foundation of your ETL architecture and roadmap. It defines the end-to-end workflow and guides development, with each step iteratively added and automated over time. Moreover, the diagram serves as a starting point for addressing broader business requirements, such as integrating manual review and approval into the pipeline.
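To give a feel for how the diagram can drive development, here is a hypothetical encoding of one as plain data, with every step naming its input, its output, and whether it is still manual. All step names and values are invented.

```python
# Hypothetical encoding of a process flow diagram: every step names its
# input and output, plus whether it is still performed by a human.
FLOW = [
    {"step": "fetch source files", "in": "source portals", "out": "raw files",      "manual": False},
    {"step": "normalise units",    "in": "raw files",      "out": "clean records",  "manual": False},
    {"step": "review & approve",   "in": "clean records",  "out": "approved batch", "manual": True},
    {"step": "upload to SaaS",     "in": "approved batch", "out": "live dataset",   "manual": False},
]

# Each output must feed the next step's input, so the encoding doubles as a
# checklist: steps can be automated one at a time without breaking the chain.
for prev, nxt in zip(FLOW, FLOW[1:]):
    assert prev["out"] == nxt["in"], f"gap between {prev['step']} and {nxt['step']}"
```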
By assessing each step against benchmarks like average time investment, impact on data quality, and automation complexity, you can prioritize the ETL roadmap more effectively, focusing on high-impact tasks first. This approach also fosters creative solutions for the more complex challenges. In this case, for example, we postponed building a dedicated user interface for managing automations and leveraged Notion as a practical alternative: a custom integration let us visualise each automation, including its status and (error) responses, and launch automations directly from within Notion.
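As an illustration of how lightweight such an integration can be, the sketch below polls a Notion database for automations marked as queued and writes each run’s outcome back to the row. The database ID, the property names (“Status”, “Response”), and the run_automation dispatcher are placeholders; the request shapes follow Notion’s public REST API.

```python
import requests

NOTION = "https://api.notion.com/v1"
HEADERS = {
    "Authorization": "Bearer <integration-token>",  # placeholder
    "Notion-Version": "2022-06-28",
    "Content-Type": "application/json",
}
DATABASE_ID = "<automations-database-id>"           # placeholder

def queued_automations() -> list[dict]:
    """Fetch rows whose (hypothetical) 'Status' select equals 'Queued'."""
    resp = requests.post(
        f"{NOTION}/databases/{DATABASE_ID}/query",
        headers=HEADERS,
        json={"filter": {"property": "Status", "select": {"equals": "Queued"}}},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def report(page_id: str, status: str, message: str) -> None:
    """Write the run's status and (error) response back onto the Notion row."""
    requests.patch(
        f"{NOTION}/pages/{page_id}",
        headers=HEADERS,
        json={"properties": {
            "Status": {"select": {"name": status}},
            "Response": {"rich_text": [{"text": {"content": message}}]},
        }},
    ).raise_for_status()

def run_automation(page: dict) -> None:
    """Hypothetical dispatcher that runs the script this row describes."""
    ...

for page in queued_automations():
    try:
        run_automation(page)
        report(page["id"], "Done", "OK")
    except Exception as exc:  # surface failures inside Notion itself
        report(page["id"], "Failed", str(exc))
```

The trade-off: nothing to build or host for the operators, at the cost of a less polished experience than a dedicated dashboard.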
With a well-defined roadmap, ETL evolves from a simple add-on into a core component of the business strategy—scaling with the company’s growth and delivering lasting value.