
Don't look down! Look at the data instead!

You might get the same vertigo

Updated
6 min read

Story time

When I was working at Macquarie Bank in Sydney, I would implement new features for the bonds trading back-office. Then I would sit down with a business analyst (Greg, if I remember correctly), and he would test various cases with me. One thing that struck me was how meticulously he would select a specific bond name, expiry date, and price. I thought, "It doesn't matter if the price is 1099.94 AUD or 100 AUD! Why is he spending so much time on this?"

Still, that made me think. I realized (much later) that I was wrong for at least two reasons. First, I was missing a good testing opportunity to use the system in conditions closer to reality. From a testing perspective, this had a better chance of revealing interesting bugs, where a business analyst might say, "Huh, that withholding tax amount doesn't make sense at all."

The second reason is a bit more subtle. I didn't really care about the service I was building; I wasn't living and breathing “bonds back-office”. The purpose of this post isn't to focus on this aspect, but I've often noticed the difference it makes when you genuinely care about what you are building, not just how you are building it.

Real-world data ingestion

The next step in my “data journey” was running ingestion pipelines. Every single time I ingested a large amount of data, I found records that didn’t have the shape or properties I was expecting: “Oh, there could be no associated product metadata?”, “Yes, of course, because some products were created before the Big Migration”.

Property-based testing

Then came property-based testing (PBT). Wow, I can write properties that test my code in many different conditions! And I would find bugs. But I would also sometimes realize that my nicely crafted property could not catch a specific bug. Why? Well, looking at the generated data I would realize, for example, that I was generating lists of strings that were mostly rubbish and not really suited to my domain. Hmm, I really needed to have a look at that data…

This can be an interesting exercise. Take a moderately complex domain model and create data generators for it. Now examine the data and ask yourself: “How good is the coverage?” and “Am I really generating interesting cases?” Then do a quick calculation of how many possible combinations exist for that model. You'll likely reach millions quite fast. What is the chance of hitting an interesting case when you run your property test only 100 times? What if you run it 100000 times? Just a drop in the ocean. As useful as PBT is, we cannot generate data mindlessly.
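To make that concrete, here is a back-of-the-envelope sketch, with entirely made-up numbers, of how unlikely a uniform generator is to stumble on an interesting case:

```rust
/// Probability of hitting at least one of `interesting` cases out of
/// `total` possible values, when sampling uniformly `runs` times.
fn hit_probability(total: f64, interesting: f64, runs: u32) -> f64 {
    1.0 - (1.0 - interesting / total).powi(runs as i32)
}

fn main() {
    // A hypothetical model with 5 fields of ~50 possible values each:
    let total = 50f64.powi(5); // 312_500_000 combinations
    let interesting = 1_000.0; // suppose 1000 of them trigger a bug

    println!("100 runs:    {:.6}", hit_probability(total, interesting, 100));
    println!("100000 runs: {:.4}", hit_probability(total, interesting, 100_000));
}
```

With these numbers, 100 runs give you a fraction of a percent, and even 100000 runs leave you more likely to miss than to hit.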

For a while I was interested in “enumeration testing”. The idea is to exhaustively enumerate all the possible shapes of a given data model, on the assumption that most bugs can be found with small examples. My idea was then to randomly sample that enumeration once it becomes too large to explore.

When I got the chance to discuss this idea with John Hughes (of QuickCheck fame), he quickly dispelled my illusions. First of all, some bugs only show up after a rather complicated set of steps. John has an example of an issue in a stateful system that only shows up after 17 steps, but that counter-example was shrunk from an original failing test case of more than 100 steps! Then he told me that in his experience business knowledge is indispensable when driving your data generation strategy.

That shouldn’t be surprising to any seasoned QA or test-minded person, but the consequence is that we can’t treat data as a purely randomly generated artefact. The better we know its business meaning and edge cases, the better our generators will be.
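As a sketch of what letting business knowledge drive generation can look like, here is a dependency-free comparison of a naive string generator with a domain-aware one (the tiny LCG, the issuer codes, and the coupons are all invented for illustration):

```rust
// Minimal linear congruential generator, so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
    fn pick<'a, T>(&mut self, xs: &'a [T]) -> &'a T {
        &xs[(self.next() as usize) % xs.len()]
    }
}

/// Naive: random lowercase rubbish, unlikely to resemble any real bond.
fn naive_name(rng: &mut Lcg) -> String {
    (0..8).map(|_| (b'a' + (rng.next() % 26) as u8) as char).collect()
}

/// Domain-aware: draw from plausible issuers and coupons,
/// deliberately including an edge case (a zero coupon).
fn bond_name(rng: &mut Lcg) -> String {
    let issuers = ["ACGB", "NSWTC", "QTC"]; // hypothetical issuer codes
    let coupons = ["1.75%", "4.50%", "0.00%"];
    format!("{} {} 2032", rng.pick(&issuers), rng.pick(&coupons))
}

fn main() {
    let mut rng = Lcg(42);
    println!("naive:  {}", naive_name(&mut rng));
    println!("domain: {}", bond_name(&mut rng));
}
```

The point is not the specific values but where they come from: the second generator encodes decisions that only someone who knows the domain can make.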

Serialization

The next example comes from serialization. I previously worked on a project where a "Ledger" (think of it as an immutable list of transactions) was shared across nodes. The data was serialized using CBOR to keep it compact. Serialization was a major concern because, as the system evolved, the data types changed, new data was stored differently, and we still needed to read data from the very beginning.

So, I began examining different pieces of data, both before and after changes: which fields were serialized and in what way. Then, I noticed we were quite inefficient in how we stored data. For example, we would allocate an array nested three times: [[[a]]] when [a] would have been sufficient.
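The overhead is easy to see if you hand-encode the array headers. In CBOR, a short array (fewer than 24 elements) costs one header byte, 0x80 plus its length, so each extra level of nesting adds bytes that carry no information. A minimal sketch, handling only small arrays of small integers:

```rust
/// Encode a short array of small integers (CBOR "immediate" values):
/// assumes xs.len() < 24 and every element < 24.
fn encode_small_ints(xs: &[u8]) -> Vec<u8> {
    let mut out = vec![0x80 + xs.len() as u8]; // array header byte
    out.extend_from_slice(xs); // each small int is its own byte
    out
}

/// Wrap an already-encoded item in a one-element array (header 0x81).
fn wrap_in_array(inner: Vec<u8>) -> Vec<u8> {
    let mut out = vec![0x81];
    out.extend(inner);
    out
}

fn main() {
    let flat = encode_small_ints(&[1, 2, 3]); // [a]
    let nested = wrap_in_array(wrap_in_array(encode_small_ints(&[1, 2, 3]))); // [[[a]]]
    println!("[a]:     {} bytes", flat.len()); // 4 bytes
    println!("[[[a]]]: {} bytes", nested.len()); // 6 bytes
}
```

Two wasted bytes per item may look trivial, but multiplied across every transaction in an immutable ledger it adds up.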

Why? Because of the power (or curse) of macro annotations. Many languages now support serialization where you simply annotate the fields of a data type to automatically get serialize/deserialize instances. This is very useful, especially for prototyping when you want to quickly import or export data. However, I have come to dislike this feature in production code for two reasons:

  1. Serialization requirements are not “one size fits all”. For example, I had issues with serde in Rust where some annotations were incompatible depending on the target format (a bare binary format versus JSON).

  2. It becomes impractical to implement data model evolutions on large models.

I'm going off-topic a bit, but my point is this: I only realized how inefficient our data serialization strategy was once I took a close look at the data.

Visualizing data

Over time, I've realized how helpful it is to display data clearly. This starts with logs. Sure, we can dump a Debug instance of some data type and get created: Node { name: Name("n1"), peers: std::Vec(Peer { peerName: "p1" }, Peer { peerName: "p2" }) }. But it would be much nicer to read something like created node n1 [p1, p2] instead. When you're reading hundreds of lines of logs, this can make a big difference and help you spot discrepancies faster.
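In Rust, for instance, this is the difference between relying on a derived Debug and writing a small Display instance by hand. A minimal sketch, with hypothetical Node and Peer types:

```rust
use std::fmt;

#[derive(Debug)]
struct Peer {
    peer_name: String,
}

#[derive(Debug)]
struct Node {
    name: String,
    peers: Vec<Peer>,
}

// A hand-written Display keeps log lines short and scannable,
// in contrast to the noisy derived Debug output.
impl fmt::Display for Node {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let peers: Vec<&str> = self.peers.iter().map(|p| p.peer_name.as_str()).collect();
        write!(f, "node {} [{}]", self.name, peers.join(", "))
    }
}

fn main() {
    let node = Node {
        name: "n1".into(),
        peers: vec![
            Peer { peer_name: "p1".into() },
            Peer { peer_name: "p2".into() },
        ],
    };
    println!("created {:?}", node); // the noisy Debug dump
    println!("created {}", node); // created node n1 [p1, p2]
}
```

It is a few minutes of extra work per type, repaid every time someone has to scan a few hundred lines of logs.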

Another situation is when data, whether it's incoming messages or your system state, is somewhat complex. In such cases, being able to visualize it over time is very helpful. Here is a recent example:

I created a simple HTML page that helps us visualize the main stages of a simulation trace for the system I'm currently working on. While running the simulation and observing messages move from stage to stage, I noticed that the final stage, forward_chain, was receiving some duplicate messages. This should not happen! It means our node is sending unnecessary information to other nodes.

This isn't really a correctness issue—the data is still accurate. However, in a distributed system, this could mean that our node might get evicted due to its spammy behavior.

The great thing is that I only noticed this issue because I closely examined the data. The code was compiling, and all our tests were passing. It's important to note that I probably wouldn't have discovered this issue just by looking at the logs.
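Such a check doesn't have to live in a visualization, either. Assuming the trace is a list of (stage, message id) events (the stage and id names here are made up), a few lines suffice to flag message ids that reach a stage more than once:

```rust
use std::collections::HashSet;

/// Return the ids of messages that arrive at `stage` more than once.
fn duplicates<'a>(events: &'a [(&'a str, &'a str)], stage: &str) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for (st, id) in events {
        // `insert` returns false when the id was already seen at this stage.
        if *st == stage && !seen.insert(*id) {
            dups.push(*id);
        }
    }
    dups
}

fn main() {
    let trace = [
        ("validate", "m1"),
        ("forward_chain", "m1"),
        ("forward_chain", "m2"),
        ("forward_chain", "m1"), // duplicate delivery
    ];
    println!("duplicates: {:?}", duplicates(&trace, "forward_chain")); // ["m1"]
}
```

Once you have spotted the anomaly visually, a check like this can be turned into a property so it never regresses silently.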

Conclusion

I am increasingly convinced that it's worthwhile to carefully examine the data, whether it's:

  • Test data

  • Incoming data

  • System state

  • Outgoing data

But data quickly gets messy and voluminous, so it can really pay off to build small visualization tools that abstract the relevant parts and let you skim the rest. The good news is that we don’t have to be UI wizards anymore: LLMs can create these tools for us 😊.

Don't look down! Look at the data instead!