One Big Table (OBT) – Why?
You’ve probably heard of the data modeling technique referred to as ‘One Big Table’. The technique essentially revolves around preferring wide, denormalized tables so that analysts can ingest data into BI tools without stitching sources together themselves. Less talked about is why analyst ease of use is important to data teams. Does the business care? I’m almost certain the answer is a resounding “No”! The business tends to care about timely and accurate data products, not necessarily the developer experience.
Based on what I’ve seen professionally, data teams gravitate towards OBT because analysts generally do not enjoy writing SQL. If you are an analyst reading this and you DO enjoy writing SQL, congratulations! You’ll likely be a Data Engineer within a few years, if not sooner. For the others, what tends to happen is they either (1) struggle with SQL, which leads to unnecessary errors, rework, and frustration, or (2) rely on the BI tool to perform their joins, which leads to performance issues. In both scenarios, the problem bubbles upstream until Data Engineering solves it with OBT. There’s nothing inherently wrong with this; however, I do think companies would benefit from having their Data Analysts upskill. Repetitive joins don’t need to be addressed with engineering time if the analysts are sufficiently capable.
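To make that concrete, here’s a minimal sketch in Python/pandas. The orders, customers, and products tables and their columns are made up for illustration; the point is simply that Data Engineering performs the joins once, upstream, so the analyst’s query against the OBT becomes a flat select from a single wide table.
```python
import pandas as pd

# Hypothetical normalized sources an analyst would otherwise join in the BI tool.
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11],
                       "product_id": [100, 101], "amount": [25.0, 40.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["East", "West"]})
products = pd.DataFrame({"product_id": [100, 101], "category": ["Toys", "Books"]})

# "One Big Table": the joins happen once, upstream, so downstream consumers
# never have to repeat them in the BI tool.
obt = (
    orders
    .merge(customers, on="customer_id", how="left")
    .merge(products, on="product_id", how="left")
)

print(obt)  # order, customer, and product attributes in one analyst-friendly table
```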
I think a side effect of abstracting away source system relationships is that the data product becomes a black box very quickly. When there is a data quality concern, the only team that can address it is Data Engineering, since all of the logic now lives in their OBT. If analysts wrote the queries themselves, triaging issues would be faster and involve fewer team members. We’re all about efficiency here 🙂 – Thanks for reading.
– DQC -
Cloud Data Migration – Part II (Failure)
Of course, things don’t always go as planned. To illustrate, I’ll summarize a failed database migration that started more than a year before I arrived on the scene. The technology stack was already selected and the mission was relatively clear, so what went wrong? Part I briefly referenced “Project Management” as a special set of skills crucial to the outcome of these types of initiatives. This is where the organization was lacking. The project lead must have the technical ability and executive authority to drive the project forward. If either of these components is missing, the red flags are a-flapping.
Luckily for them, my involvement greatly enhanced the technical capacity of the team. Not that the existing staff weren’t capable; they just needed a little extra bandwidth. When it came to executive authority, however, we had almost none. The dedicated project manager didn’t fully grasp the nature of the work, but also wouldn’t trust the team to make independent decisions. When every decision has to go through an executive (not necessarily C-Suite, but someone with real project influence) who neither understands the work nor trusts the team, nothing moves. Thus was the fate of this project. We were able to get all of the infrastructure set up and configured, but the project stalled completely when it came down to migrating the business-facing reports.
My suggestion was to inventory all reporting assets, generate a time estimate per report to update connection strings and republish, then multiply the unit estimate by the number of reports. I wrote a script to handle the inventory process, estimated 1 hour per unit, and multiplied by roughly 400 reports (the exact number escapes me). That 400 hours of work could be distributed across 4 developers, each committing to 10 reports per week, or 40 reports per week in total. That translates into a 10-week project with clear goals and deliverables. Did this come to fruition? No. The individuals with executive authority decided not to execute, the project stalled indefinitely, and I left when it was clear nothing was happening any time soon.
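Purely to illustrate the sizing math (this is not the actual inventory script), here’s what that estimate looks like as a few lines of Python. The report names, the 1-hour unit estimate, and the team size are stand-ins.
```python
# Illustrative sizing sketch: turn a per-report estimate into an overall timeline.
reports = [f"report_{i:03d}" for i in range(400)]  # stand-in for the real inventory output

HOURS_PER_REPORT = 1            # update connection string, republish, spot-check
DEVELOPERS = 4
REPORTS_PER_DEV_PER_WEEK = 10

total_hours = len(reports) * HOURS_PER_REPORT
weekly_throughput = DEVELOPERS * REPORTS_PER_DEV_PER_WEEK
weeks_needed = len(reports) / weekly_throughput

print(f"{len(reports)} reports -> ~{total_hours} hours -> ~{weeks_needed:.0f} weeks of work")
```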
The End.
So, what are the key takeaways from this?
1.) Skill is not enough sometimes. You also need authority to march forward.
2.) If you’re less technical than the team you’re leading, trust them or your success will be limited.
3.) The last 10% – 20% of the project is usually where teams get stuck.
4.) I’ll bail on a failing project when I have no authority to course-correct. Consider doing the same if you can see the red flags in real time.
5.) It’s possible to make a positive impact, even on a sinking ship. Don’t let one sour experience change your entire outlook.
Signing off…
-DQC -
Cloud Data Migration – Part I (Success)
One trend that doesn’t look like it’s slowing down anytime soon is the desire for data teams to move reporting infrastructure from on-premises to one of the major cloud providers: Amazon Web Services, Microsoft Azure, or Google Cloud Platform. There are several reasons for this: cloud implementations tend to be more secure by default, easier to hire for, and generally more scalable. The flip side is that you need a special set of skills to make the migration happen smoothly, or even at all. In the following paragraphs, we’ll cover what this looks like in practice, using an actual project I worked on as a case study.
“Move everything to the cloud!” was the directive from the C-Suite. Before diving into the project details, it’s important to note that any cloud migration project should have executive buy-in. Technical willingness and ability come second to alignment with leadership’s vision. In this case, there was both clear vision and alignment. My portion of the project was mostly focused on front-end work: the ETL pipelines were already established using cloud technology, but the reports were still served on-premises.
I started by researching the steps needed to translate the reports from one language to another. The core SQL queries would remain the same; however, the semantic layer needed modification. Once the translation pattern was clear, the next step was to roll up my sleeves and implement it report by report until each one was fully reconciled and surfaced in the cloud platform. The research process took roughly two weeks; the development process took about six weeks.
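As a rough illustration of what “fully reconciled” meant in practice, here is a minimal sketch of the kind of check you can run after re-pointing each report. It assumes you can pull the on-premises and cloud result sets into pandas DataFrames; the column names and values are hypothetical.
```python
import pandas as pd

def reconcile(onprem: pd.DataFrame, cloud: pd.DataFrame, keys: list[str]) -> bool:
    """Return True when both result sets match on row count and key-level totals."""
    if len(onprem) != len(cloud):
        return False
    # Compare aggregated measures per key so row ordering differences don't matter.
    left = onprem.groupby(keys).sum(numeric_only=True).sort_index()
    right = cloud.groupby(keys).sum(numeric_only=True).sort_index()
    return left.equals(right)

# Hypothetical result sets from the old and new semantic layers for one report.
onprem_result = pd.DataFrame({"region": ["East", "West"], "sales": [100.0, 250.0]})
cloud_result = pd.DataFrame({"region": ["East", "West"], "sales": [100.0, 250.0]})

print(reconcile(onprem_result, cloud_result, keys=["region"]))  # True -> safe to cut over
```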
The organization was able to retire the on-premises server once development was complete. The business immediately saw the value of the project thanks to added features and ease of use. This is what success looks like – Short and Sweet.
Stay tuned for Part II, which will cover Long and Bitter :).
– DQC -
Impactful Datasets – Part II
The reality in the corporate world is that executives are responsible for seeing into the future. This assertion comes with no implicit mysticism: if the C-suite predicts the future correctly, shareholders, employees, and customers will generally all be positively impacted. A great example of this is Apple’s 2007 partnership with AT&T to launch the iPhone. Apple now enjoys a multi-trillion-dollar market cap, up from roughly $150B in 2007. Had leadership made different choices along the way, the story would be wildly different.
As we move into the anticipated era of Generative AI, data will be increasingly relied upon to make better bets on infrastructure, staffing needs, R&D investments, and more. A quick web search for “data-driven case studies” will turn up several examples of Business Intelligence and data as the catalyst for true competitive advantage at various organizations. If a dataset can help executives make better decisions, I consider it impactful. I don’t think this will be changing anytime soon.
We can all agree that when it comes to data – garbage in / garbage out. Moving forward, ensuring data quality will be non-negotiable. If you’re an executive reading this, it’s not too late to start. Solely relying on your direct reports to interface with the data team can be risky. Vision is often lost in translation…
– DQC -
Impactful Datasets – Part I
The topic of impactful datasets can generate many different opinions, especially concerning the definition of ‘impactful’. I consider an impactful dataset to be one that does one or more of the following:
1.) Saves significant amounts of time
2.) Significantly increases revenue or reduces expenses
3.) Facilitates improved executive decision making
Let’s start by unpacking the time-saving element. Far too many organizations have teams that spend hours upon hours manually updating spreadsheets. Usually this is due to the low barrier to entry – almost anyone can get started with Google Sheets or Excel. The problem is that these tools are meant for ad-hoc analysis but end up being leveraged as long-lived data repositories. Automating one of these processes can easily save between 2 and 10 hours per FTE per week. At the high end, a department of 10 analysts could win back 100 working hours per week as a result of a carefully crafted dataset. That doesn’t mean 2.5 analysts should be terminated, but it does mean the team can focus on more strategic work with the extra time.
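As a deliberately simplified sketch of what automating one of these processes can look like, here is a Python/pandas example that replaces a manual weekly spreadsheet update. The sales data, columns, and output file name are hypothetical; in practice the raw data would come from a warehouse query or an exported extract, and the script would run on a schedule.
```python
import pandas as pd

# Hypothetical raw data that an analyst would otherwise copy/paste and total by hand.
raw = pd.DataFrame({
    "week": ["2024-W01", "2024-W01", "2024-W02", "2024-W02"],
    "region": ["East", "West", "East", "West"],
    "sales": [1200.0, 900.0, 1500.0, 1100.0],
})

# The "carefully crafted dataset": one tidy summary, rebuilt automatically.
summary = (
    raw.groupby(["week", "region"], as_index=False)["sales"].sum()
       .sort_values(["week", "region"])
)

# Publish wherever the team already looks for the numbers (shared drive, BI extract, etc.).
summary.to_csv("weekly_sales_summary.csv", index=False)
print(summary)
```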
Let’s now discuss revenue-increasing / cost-saving datasets. The rise of LLMs has shown us that data in the right format can be extremely valuable. Quite literally, an industry has been formed around the fact that you can collect large amounts of data from the internet, feed it into Machine Learning algorithms, and compress that knowledge into a model, the LLM, which is itself an extraordinarily valuable derived dataset. Developers and consumers alike are happy to pay for tokens, subscriptions, and other derived services, which is essentially paying for access to valuable datasets. The flip side of this phenomenon points to the cost-saving potential of the trend. Let’s face it, the current developer job market is stagnant because companies can produce similar output with fewer, AI-assisted developers. Even though long-term code quality is questionable, short-term savings are definitely being driven by LLMs.
We’ll cover executive decision making in the next one…
– DQC