Improving Version Management of Data

Todd Perry
The Finatext Tech Blog
8 min read · Dec 15, 2023


Hello! I’m Todd, a Data Engineer at Nowcast. This article is the 16th day of the Finatext Group 10th Anniversary Advent Calendar. Yesterday, Sugawara-san published an article titled “『投資がわからない』の正体 ~投資を始めるベストタイミングとは~”.

What is data versioning?

Much like how software engineers version their software, data engineers need to version their data.

Generally speaking, if the contents of the data change, we should introduce a new version so we can distinguish between the “old” and “new” data. A change in the data can mean a number of different things — for instance adding or removing columns, introducing new files or tables to the dataset, or something more subtle like changing the data type of a column or the way the data in a given column is generated. If a client is receiving an on-going data feed and we suddenly change something, it could cause issues in the software that consumes the feed.

In most cases, the datasets we work with at Nowcast are composed of more than 1 table — there are many different components that all require their own versioning — so the problem isn’t as simple as attaching a number to each dataset that increments whenever we make a change.

Why is data versioning important

Especially in the case of on-going feeds that are being consumed by other data pipelines, it is important to implement some kind of data versioning. As mentioned above, sudden changes to the data can cause issues — large changes, such as adding, removing, or renaming columns, can cause the pipelines that consume the data to fail or breach their SLAs. Subtle changes, such as to the way a column is generated, may seem less damaging, but they can cause systems such as machine learning models to behave in unexpected ways, so they should also be considered carefully. Simply put, to prevent problems downstream we should avoid changing live data feeds and properly introduce versions instead.

As well as avoiding sudden changes in a data feed, it is also important to continue to provide old versions of a data feed to consumers, such that they have time to transition from old versions to new versions.

Some Best Practices

So we know why we should version data — but how should it be done? Below are some best practices we’ve developed at Nowcast over the last few years.

Version names should be easy to understand. Data consumers need to understand the order of the data versions — something like Major-Minor-Patch (1.0.0, 1.0.1, 1.0.2, etc.) or simply the release date works well for this. At Nowcast we often use Major-Minor-Patch, where a Patch bump marks a change to the data generation method or a mapping, a Minor bump marks a larger change such as adding or removing columns or a change to the source data, and a Major bump is reserved for large versioning changes in the source data.

Old versions of a feed should continue to be delivered in the same directory. It isn’t always possible, but when it is, older versions of the data should continue to be delivered to consumers, even if only for long enough for them to transition to newer versions. Shutting off a feed without giving anyone time to migrate to an alternative is a good way to break any downstream systems that depend on it. You can think of this flow like deprecation — feeds shouldn’t just be shut off, they need to be deprecated with functioning replacements in place beforehand, and then maintained for a suitable amount of time. Depending on the criticality of the feed and the number of consumers, this deprecation period can last years — so the software side of the data pipeline should be designed with this in mind.

The data generation process associated with a given version should be idempotent. For any given version, the ETL code (ELT, data pipeline, whatever you want to call it) that generates the data should work idempotently if possible. In other words, running an ETL job over a given input data source and time period should always produce the same output. This is important for the reproducibility of the data, and it also makes checking the data’s integrity much easier. As a data engineer it is always a relief to know that I can regenerate any dataset that I’m using. Building this kind of idempotent pipeline generally means avoiding randomly generated IDs, such as UUID4.
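For illustration, below is a minimal SQL sketch of this idea (the table and column names are just examples, not our actual schema). Instead of attaching a random UUID to each row, the row ID is derived deterministically from the row’s contents, so rerunning the job over the same input produces identical output:

-- Hypothetical transformation: derive row_id from the natural key of the row
-- rather than a random UUID, so reruns yield identical results.
select
    md5(
        cast(user_id as varchar)
        || '|' || cast(date as varchar)
        || '|' || cast(transaction_location as varchar)
    ) as row_id,
    date,
    user_id,
    transaction_location,
    transaction_amount
from original_transaction

(When using DBT, the dbt_utils package also provides a generate_surrogate_key macro that implements this pattern.)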

New versions should be delivered in new directories. We should always separate data from different versions in some easy-to-understand way. If using cloud file storage like S3 or GCS, it is good practice to separate versions by directory. In Snowflake something like a schema can be used — or at the very least table suffixes. The main thing to avoid is having a single directory containing the latest version as the main point of data ingestion, and then updating that directory in place every time a new version is released. This is not much better than having no versioning at all, in that it will still cause consumers’ code to break.
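For example, in Snowflake this kind of separation might look like the following (the schema and table names here are hypothetical):

-- One schema per feed version: existing consumers keep reading v1
-- while new consumers opt in to v2 explicitly.
select * from data_feed_v1.transaction;
select * from data_feed_v2.transaction;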

How not to handle versioning

To illustrate the difference between good and bad data version management, let’s consider an example consumer transaction dataset with three different tables:

transaction: contains consumer transaction data
metadata: contains some information about the entire panel, such as monthly users, monthly sales volume, etc.
mapping: contains mappings that allow us to join transactions to companies

As the data engineer in charge of managing this data feed, suppose we need to reflect a change in the data in our feed — for instance, removing one column and adding a new one, while the mapping and metadata stay the same.

In the beginning we have a single directory called data_feed, that contains all of our data:

└── data_feed
├── transaction (v1)
├── metadata (v1)
└── mapping (v1)

One way to release this new data is just to replace the v1 data with the new (v2) data:

└── data_feed
├── transaction (v2)
├── metadata (v2)
└── mapping (v2)

But what if the consumer was using the column we just deleted? We would have just broken their data feed — so let’s try something a bit different. We need to keep delivering the old data so that they have time to migrate — why don’t we try delivering both v1 and v2?

└── data_feed
├── v1
│ ├── transaction (v1)
│ ├── metadata (v1)
│ └── mapping (v1)
└── v2
├── transaction (v2)
├── metadata (v2)
└── mapping (v2)

This is slightly better than the original attempt, but there is still an issue: we moved the data without giving consumers time to migrate from the old, unversioned structure. While this structure will work well going forward, we should provide consumers with both the original layout and the newly versioned layout to give them time to migrate:

└── data_feed
├── transaction (v1, temporary)
├── metadata (v1, temporary)
├── mapping (v1, temporary)
├── v1
│ ├── transaction (v1)
│ ├── metadata (v1)
│ └── mapping (v1)
└── v2
├── transaction (v2)
├── metadata (v2)
└── mapping (v2)

We should provide exactly the same data in exactly the same locations as before until all consumers have moved over to the new v1 directory. After they have migrated, we can remove the files stored directly under the data_feed directory — from then on, clients who want v1 can read it from the v1 directory.

As mentioned earlier, this process is analogous to deprecation in software engineering. You wouldn’t delete a function from library code before users have had a chance to move over to its replacement — in principle this is the same.

How can we keep track of data versioning

As mentioned earlier, datasets are generally composed of multiple tables — each table should be versioned, and a global version should be used to refer to a specific configuration of table versions. Up until the adoption of DBT, many projects managed their versioning using a JSON file much like the one below:

[
  {
    "version_number": "1.0.0",
    "release_date": "2022-01-15",
    "table_versions": {
      "raw_transaction": "1",
      "aggregated_transaction": "1",
      "metadata": "1",
      "mapping": "1"
    },
    "storage_locations": [
      "/data_feed/",
      "/data_feed/v1/"
    ]
  },
  {
    "version_number": "1.0.1",
    "release_date": "2022-02-26",
    "table_versions": {
      "raw_transaction": "1",
      "aggregated_transaction": "2",
      "metadata": "1",
      "mapping": "1"
    },
    "storage_locations": [
      "/data_feed/v2/"
    ]
  },
  {
    "version_number": "1.0.2",
    "release_date": "2022-04-11",
    "table_versions": {
      "raw_transaction": "2",
      "aggregated_transaction": "3",
      "metadata": "1",
      "mapping": "1"
    },
    "storage_locations": [
      "/data_feed/v3/"
    ]
  }
]

Using this approach we can easily track the composition of each global version, as well as any metadata associated with the versions such as the release date or storage locations. It is also language-agnostic and easy to implement, although there are some downsides.

Disadvantages of this approach

This approach is hard to standardize because it’s not constrained by any programming frameworks — JSON structures can be subtly different from project to project.

Another issue that isn’t easily solved is that different tables can depend on each other — for instance, we have two different types of ‘transaction’, `raw` and `aggregated`, where `aggregated` depends on `raw`. If we bump the version of `raw`, `aggregated` also needs to be bumped — whereas an update to the code that builds the `aggregated` data would bump only `aggregated`. It’s hard to capture these relationships in a JSON structure like this, and for complex datasets with many components and many layers it becomes unwieldy.

Advantages of Data Versioning in DBT

Since late 2022, we have been writing our ETL/ELT code at Nowcast using DBT. Among its many advantages over the Python-based ETL code we were using previously, the way it handles versions — and especially dependencies between different tables — is much cleaner than our previous approach.

For any given table, the versions of the tables it depends on are stored in the table’s definition — for instance, let’s say we have our `transaction` table from earlier:

select
    row_id,
    date,
    user_id,
    transaction_location,
    transaction_amount
from {{ source('example_dataset', 'original_transaction') }}

Let’s call this `raw_transaction`. Now we want to make an aggregated transaction table — we can define it by referencing the raw table:

select
    date,
    transaction_location,
    count(*) as txn,
    sum(transaction_amount) as aggregated_sales
from {{ ref('raw_transaction', v=1) }}
group by 1, 2
order by 1, 2

By using the `v` parameter we have specified the exact version of the dependency we want to use. Let’s say we want to make a second version of the aggregated data that adds a unique user count, still built from the same v1 raw data — we can do so like below:

select
    date,
    transaction_location,
    count(*) as txn,
    sum(transaction_amount) as aggregated_sales,
    count(distinct user_id) as unique_users
from {{ ref('raw_transaction', v=1) }}
group by 1, 2
order by 1, 2

This is very simple to implement: we just make a new model file called aggregated_transaction_v2.sql and add our new column, then declare the new version in the model’s YAML properties.
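A minimal sketch of that declaration might look like the following; the exact configuration here is illustrative rather than our production setup:

models:
  - name: aggregated_transaction
    latest_version: 2
    versions:
      - v: 1
      - v: 2   # lives in aggregated_transaction_v2.sql by default

With this in place, a plain `ref('aggregated_transaction')` resolves to the latest version, while consumers can pin an older one via the `v` argument.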

Further down the line, the raw data could be updated; when this happens, we can make v3 of the aggregated_transaction data by bumping the dependency:

select
    date,
    transaction_location,
    count(*) as txn,
    sum(transaction_amount) as aggregated_sales,
    count(distinct user_id) as unique_users
from {{ ref('raw_transaction', v=2) }}
group by 1, 2
order by 1, 2

Effectively, what we are doing here is handling the dependency tree as part of the code, which makes it much easier to understand which tables depend on which. You can read more about version management in the DBT documentation.

Conclusion

In this article we introduced the concept of data versioning, explained why it’s important, and then went over some approaches to versioning data including DBT, which Nowcast uses along with Snowflake to build our data pipelines. We showed the advantages of DBT over a more manual approach.
If you are interested in Alternative Data or Data Engineering, please don’t hesitate to reach out!

Tomorrow, Kawamoto-san will publish an article about his experience transitioning from an established company to a startup.
