Software Engineering vs. Data Engineering – An Analogy
The following question came up in a recent conversation:
“Why do you gravitate towards data-centric projects?”
I was surprised by my answer, which is why I’m sharing. Essentially, I believe that data projects are less emotional than traditional software projects. The expected outputs can be clearly defined and are less subjective. On a data project, although there are several paths to success, they all look and feel quite similar. For a software project, success spans a much wider range. I’ll give an analogy – Data Engineering is like a grocery store, while Software Engineering is like a restaurant. Let me explain…
With traditional software, companies often pay 6/7/8 figures for projects or SaaS tools to support mission critical business processes – think CRMs, financial systems, help desks, etc… These systems are typically customizable to fit company-specific preferences, much like a restaurant allows substitutions. And just like restaurants, user sentiment tends to be polarized. People love or hate systems like Salesforce – rarely anything in between. The same goes for McDonald’s. These industries require highly specific customer targeting. Grocery stores, on the other hand, cater more broadly and don’t rely as heavily on emotional appeal or fine-tuned experiences.
Another aspect worth considering is that Data Engineers typically work with data produced by software products. That means they’re often brought in after the fact – usually when querying data becomes slow, inconsistent, or expensive. To bring back the analogy: imagine someone who isn’t an experienced home chef but likes to eat good food. At first, they might rely heavily on restaurants or takeout. Eventually, health or financial concerns force a shift – maybe to frozen meals (think PF Chang’s in the grocery freezer), and eventually, to cooking from scratch with groceries.
Similarly, companies may start with SaaS dashboards or run ad-hoc SQL on siloed systems. But over time, data sprawl leads to confusion, inconsistencies, and frustration. That’s when data engineering becomes essential. Like cooking your own meals, it requires more effort up front, but it creates a healthier, more scalable data ecosystem in the long run.
Happy Querying
– DQC -
Cloud Data Migration – Part II (Failure)
Of course, things don’t always go as planned. To illustrate, I’ll summarize a failed database migration that started more than a year before I arrived on the scene. The technology stack had already been selected and the mission was relatively clear – so what went wrong? Part I briefly referenced “Project Management” as a special set of skills crucial to the outcome of these types of initiatives. This is where the organization was lacking. The project lead must have both the technical ability and the executive authority to drive the project forward. If either component is lacking, the red flags are a-flapping.
Luckily for them, my involvement greatly enhanced the technical capacity of the team. Not that the existing staff weren’t capable; they just needed a little extra bandwidth. When it came to executive authority, however, we had almost none. The dedicated project manager didn’t fully grasp the nature of the work, but also wouldn’t trust the team to make independent decisions. When every decision has to go through an executive (not necessarily C-Suite, but someone with real project influence) who neither understands the work nor trusts the team, nothing moves. Thus was the fate of this project. We got all of the infrastructure set up and configured, but the project stalled completely when it came to migrating the business-facing reports.
My suggestion was to inventory all reporting assets, generate a time estimate per report to update connection strings and republish, then multiply the unit estimate by the number of reports. I wrote a script to handle the inventory process, estimated 1 hour per unit, and multiplied by 400 reports (the exact number escapes me). That meant 400 hours of work, which could be distributed across 4 developers committing to 10 reports per week – a 10-week project with clear goals and deliverables. Did this come to fruition? No. The individuals with executive authority decided not to execute, the project stalled indefinitely, and I left when it was clear nothing was happening any time soon.
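For the curious, the estimation logic is simple enough to sketch in a few lines of Python. This isn’t my original script – the directory path and the .rdl extension are stand-in assumptions – but it captures the approach: inventory the assets, apply a unit estimate, and divide by team throughput.

import os

# Illustrative numbers from the post; the path and extension are hypothetical.
REPORTS_DIR = "/path/to/report/definitions"
REPORT_EXTENSION = ".rdl"
HOURS_PER_REPORT = 1
DEVELOPERS = 4
REPORTS_PER_DEV_PER_WEEK = 10

# Inventory: walk the directory tree and collect every report definition file.
reports = [
    os.path.join(root, name)
    for root, _dirs, names in os.walk(REPORTS_DIR)
    for name in names
    if name.endswith(REPORT_EXTENSION)
]

# Unit estimate x report count = total effort; divide by team throughput for a timeline.
total_hours = len(reports) * HOURS_PER_REPORT
weeks = len(reports) / (DEVELOPERS * REPORTS_PER_DEV_PER_WEEK)
print(f"{len(reports)} reports -> {total_hours} hours -> ~{weeks:.0f} weeks")

With 400 reports, that works out to 400 hours and roughly 10 weeks – the same numbers from the proposal above.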
The End.
So, what are the key takeaways from this?
1.) Sometimes skill alone is not enough. You also need the authority to march forward.
2.) If you’re less technical than the team you’re leading, trust them or your success will be limited.
3.) The last 10% – 20% of the project is usually where teams get stuck.
4.) I’ll bail on a failing project when I have no authority to course-correct. Consider doing the same if you can see the red flags in real time.
5.) It’s possible to make a positive impact, even on a sinking ship. Don’t let one sour experience change your entire outlook.
Signing off…
– DQC -
Cloud Data Migration – Part I (Success)
One trend that doesn’t look like it’s slowing down anytime soon is the desire for data teams to move reporting infrastructure from on-premises to one of the major cloud providers: Amazon Web Services, Microsoft Azure, or Google Cloud Platform. There are several reasons for this: cloud implementations tend to be more secure by default, easier to hire for, and generally more scalable. The flip side is that you need a special set of skills to make the migration happen smoothly – or at all. In the following paragraphs, we’ll cover what this looks like in practice, using an actual project I worked on as a case study.
“Move everything to the cloud!” was the directive from the C-Suite. Before diving into the project details, it’s important to note that any cloud migration project should have executive buy-in. Technical willingness and ability come second to alignment with leadership’s vision. In this case, there was both clear vision and alignment. My portion of the project was mostly focused on front-end work: the ETL pipelines were already established using cloud technology; however, the reports were still served on-premises.
I started by researching the steps needed to translate the reports from one language to another. The core SQL queries would remain the same; however, the semantic layer needed modification. Once the translation pattern was clear, the next step was to roll up my sleeves and apply the pattern report by report until each one was fully reconciled and surfaced in the cloud platform. The research process took roughly two weeks; the development process took about six weeks.
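To make “translation pattern” a bit more concrete, here’s a simplified Python sketch of the kind of mechanical transformation involved. It assumes each report is an XML file with a connection element pointing at the on-premises server – the element name and connection strings are hypothetical, and the real semantic-layer changes were more involved than a string swap.

import xml.etree.ElementTree as ET

# Hypothetical connection strings, for illustration only.
ON_PREM = "Data Source=onprem-sql01;Initial Catalog=Reporting"
CLOUD = "Data Source=cloud-warehouse.example.com;Initial Catalog=Reporting"

def translate_report(path: str) -> None:
    """Repoint a report's data source at the cloud; leave the embedded SQL untouched."""
    tree = ET.parse(path)
    for elem in tree.iter("ConnectString"):  # hypothetical element name
        if elem.text and ON_PREM in elem.text:
            elem.text = elem.text.replace(ON_PREM, CLOUD)
    tree.write(path, encoding="utf-8", xml_declaration=True)

Once a pattern like this is proven on one report, the remaining work becomes repetition plus reconciliation – tedious, but predictable.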
The organization was able to retire the on-premises server once development was complete. The business immediately realized the value of the project thanks to added features and improved ease of use. This is what success looks like – Short and Sweet.
Stay tuned for Part II, which will cover Long and Bitter :).
– DQC -
Impactful Datasets – Part II
The reality in the corporate world is that executives are responsible for seeing into the future. This assertion comes with no implicit mysticism. If the C-suite predicts the future correctly, shareholders, employees, and customers will generally all be positively impacted. A great example of this is Apple’s 2007 partnership with AT&T to launch the iPhone. Apple now enjoys a multi-trillion-dollar market cap, up from roughly $150B in 2007. Had leadership made different choices along the way, the story would be wildly different.
As we move into the anticipated era of Generative AI, data will be increasingly relied upon to make better bets on infrastructure, staffing needs, R&D investments, and more. A quick web search for “data driven case studies” will surface several examples of Business Intelligence and data as the catalyst for true competitive advantage at various organizations. If a dataset can help executives make better decisions, I consider it impactful. I don’t think this will be changing anytime soon.
We can all agree that when it comes to data – garbage in / garbage out. Moving forward, ensuring data quality will be non-negotiable. If you’re an executive reading this, it’s not too late to start. Solely relying on your direct reports to interface with the data team can be risky. Vision is often lost in translation…
– DQC -
Impactful Datasets – Part I
The topic of impactful datasets can generate many different opinions, especially concerning the definition of ‘impactful’. I consider an impactful dataset to be one that does one or more of the following:
1.) Saves significant amounts of time
2.) Significantly increases revenue or reduces expenses
3.) Facilitates improved executive decision making
Let’s start by unpacking the time-saving element. Far too many organizations have teams that spend hours upon hours manually updating spreadsheets. Usually this is due to the low barrier to entry – almost anyone can get started with Google Sheets or Excel. The problem is that these tools are meant for ad-hoc analysis but end up being leveraged as long-lived data repositories. Automating one of these processes can easily save between 2 and 10 hours per FTE per week. That means a department of 10 analysts could win back as much as 100 working hours per week as the result of a carefully crafted dataset. This doesn’t mean 2.5 analysts should be terminated, but it does mean the team can focus on more strategic work with the extra time.
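If you want to pressure-test that math for your own team, the back-of-envelope calculation is trivial to script. The numbers below are the illustrative ones from this post, not benchmarks.

# Weekly capacity reclaimed by automating a manual spreadsheet process.
ANALYSTS = 10                  # department headcount
HOURS_SAVED_PER_ANALYST = 10   # top of the 2-10 hours/week range
FTE_WEEK = 40                  # standard full-time week

reclaimed = ANALYSTS * HOURS_SAVED_PER_ANALYST
print(f"{reclaimed} hours/week reclaimed (~{reclaimed / FTE_WEEK:.1f} FTEs of capacity)")

That ~2.5 FTEs of capacity is the “extra time” referenced above – redeploy it, don’t cut it.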
Let’s now discuss revenue-increasing/cost-saving datasets. The rise of LLMs has shown us that data in the right format can be extremely valuable. Quite literally, an industry has been formed on the fact that you can collect large amounts of data from the internet, feed it into Machine Learning algorithms, and store the condensed knowledge base in a dataset called an LLM. Developers and consumers alike are happy to pay for tokens, subscriptions, and other derived services – which is essentially paying for access to valuable datasets. The flip side of this phenomenon is the cost-saving potential of the trend. Let’s face it: the current developer job market is stagnant because companies can produce similar output with fewer, AI-assisted developers. Even though long-term code quality is questionable, short-term savings are definitely being driven by LLMs. We’ll cover executive decision making in the next one…
– DQC -
The Definition of Data
What is data? This seems like a simple question until you really sit down to think about it. Is writing the number two on a piece of paper considered data? What about the exploding message that contains Inspector Gadget’s newest mission? Admittedly, these are a lot of questions, but that is the nature of data. Data often generates more questions even as it delivers more answers.
To me, data is defined as, “The attempt to describe reality in a format that is designed to be accessed more than once.” So, to revisit our initial questions:
Is writing the number two on a piece of paper considered data?
It probably won’t surprise you if I say, “it depends”. The piece of paper definitely meets the formatting requirement, but if writing the number two isn’t an attempt to describe reality, then it should not be considered data.
What about the exploding message that contains Inspector Gadget’s newest mission?
The mission is an attempt to describe reality by communicating what the author wants to happen. Additionally, even though the message is intended to be transient, both the author and the recipient accessed its content. So yes, the mission qualifies as data.
However, an unrecorded conversation does not meet these two requirements. Information is shared, but it is not stored in an accessible format. You could argue that the human brain is an accessible format, but I’d counter with a friendly challenge – try to remember what you had for dinner on Tuesday of last week.
Goodbye for now.
– DQC -
Thoughts on AI – v0
The elephant in the room is Artificial Intelligence (AI). Is it just media hype, or will AI actually change the world in ways that are difficult to fully grasp? I won’t pretend to know for sure, but I’ve definitely got an opinion to share.
First, the hype train is real, but there is substance to this current iteration of what we know as AI. AI can write content, AI can generate photos, AI can produce videos, and yes – AI can write code. But will all of this be enough to replace humans with robots? I think not. What’s happening now is that individuals in creative spaces are using these tools to be more productive and less bogged down in historically tedious tasks. This does spell trouble for newly minted professionals; however, there is equal opportunity for them to adapt and level the playing field.
In the short term, many companies will try to streamline their operations by lowering headcount and introducing AI assistance. This will lead to fewer jobs across many industries – we’re already seeing this trend in the marketplace. However, I believe that in the long term, the talent pools that have been shut out from corporate opportunities will find a way to coalesce their creativity and shift the paradigm. Think Bitcoin, but on a larger scale.
I encourage you to read this article by the creator of Open WebUI. I think we’ll see more of this mindset find its way into the mainstream – not just reserved for techies on the bleeding edge of LLM ecosystem development. Part of the reason for this blog is to make sure the human element continues to define our work. When you think about it, much of the data created nowadays serves as a proxy for how humans behave, think, and feel.
The concept of ‘digit’ in digital is literally a reference to the human hand. Marinate on that.
– DQC -
Tools vs. Fundamentals
There’s constant chatter amongst data professionals about what tools to use to get the job done. I could spend all day rattling them off: Snowflake, Spark, Databricks, BigQuery, Kafka, dbt, SQLMesh, and the list goes on and on… But here’s the real question – when’s the last time you heard a discussion about getting back to the basics?
Personally, I think the fundamentals are what separate average performers from “10x” Data Engineers. It’s generally not the tooling that makes the big difference. I’ve seen more business problems solved with a relational database and SQL queries than with the latest and greatest of the modern data stack. Better tools can actually mask poor decision making – until they can’t any longer. That’s the pivotal moment when you’ll be forced to get back to the basics anyway.
Do your stakeholders really care that you migrated from a SQL Server database to Databricks? Probably not. Adopting new tools can actually increase time to insight due to the resulting learning curve, which is largely unavoidable. So, while the data team is happily exploring new technologies and padding their résumés, the business isn’t making any real strides from a data perspective. I encourage my peers to focus on communication, effective requirements gathering, customer service, and actually understanding what the business needs so we can deliver and make an impact.
Even though we are usually in a support role, it’s amazing what a motivated data professional can accomplish when focusing on the fundamentals instead of shiny objects. Rant over. Thanks for reading.
– DQC