T2R2

Updated 2023-09-21

Bernard Deffarges

Transparency, Traceability, Replicability and Reproducibility

When designing the architecture of a new platform, one needs to define and then focus on the most important goals!

We had spent so many years trying to reproduce complex bioinformatics data workflows that we decided T2R2 would be our first and main goal. We wanted to tackle the AI and Data Science reproducibility crisis.

So, from the very beginning, when we started thinking about the architecture of our workflow platform, we put T2R2 at its core.

That is also the reason why we called the platform T2R2-One.

In this post, we define each component of T2R2.

We will discuss T2R2 software architecture in another post.

Transparency

Transparency in AI and Data Science is about providing all information and means that have been used to produce a result.

That would include source code (which, by definition, would almost certainly exclude black boxes), parameters, input data, references to external data, etc.

But that already poses many problems. For example, how can we be sure that the executed binaries are really the compiled version of the provided source code?

How can we seal parameters or input data? If they are stored in a traditional database, how can we be sure they have not been modified?
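One common way to approach the sealing question is to record a cryptographic digest of the parameters and input data at run time, and to recompute it later. The sketch below is only an illustration of that idea, not a description of how T2R2-One works; the names `seal` and `is_unmodified` are hypothetical. It also assumes the recorded digest itself is kept somewhere the database owner cannot silently rewrite.

```python
import hashlib
import json


def seal(params: dict, input_data: bytes) -> str:
    """Return a SHA-256 digest covering both the parameters and the input data."""
    # Serialize the parameters canonically so key order does not change the digest.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":")).encode("utf-8")
    h = hashlib.sha256()
    # Length-prefix the parameters so the boundary with the data is unambiguous.
    h.update(len(canonical).to_bytes(8, "big"))
    h.update(canonical)
    h.update(input_data)
    return h.hexdigest()


def is_unmodified(params: dict, input_data: bytes, recorded_seal: str) -> bool:
    """Recompute the seal and compare it with the one recorded at run time."""
    return seal(params, input_data) == recorded_seal
```

Any change to a parameter value or to a single byte of the input produces a different digest, so a mismatch tells us that something changed, although not what or when.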

Traceability

Traceability adds the time factor to Transparency.

A data analysis workflow can seem transparent, but to be fully transparent (!) we need to know in which order the actions have been performed, and that is Traceability.

Actually, that piece might not be too difficult to achieve. At the end of the day, we just need to log all the events into a journal.

Of course, the issue is the content of each entry, which we must be able to trust.
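To make the trust question concrete, here is a minimal sketch of an append-only journal in which each entry stores the hash of the previous one. This is a generic hash-chain technique, not necessarily what our platform does; `append_event`, `verify_journal` and the entry layout are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def _entry_hash(index: int, event: dict, prev_hash: str) -> str:
    """Hash the entry content together with the hash of the previous entry."""
    payload = json.dumps(
        {"index": index, "event": event, "prev_hash": prev_hash},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(journal: list, event: dict) -> dict:
    """Append an event, chaining it to the last entry of the journal."""
    prev_hash = journal[-1]["hash"] if journal else GENESIS
    index = len(journal)
    entry = {
        "index": index,
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(index, event, prev_hash),
    }
    journal.append(entry)
    return entry


def verify_journal(journal: list) -> bool:
    """Check that no entry was edited, removed from the middle, or reordered."""
    prev_hash = GENESIS
    for index, entry in enumerate(journal):
        if entry["index"] != index or entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(index, entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only reveals tampering if the most recent hash is also anchored somewhere trusted; someone able to rewrite the whole journal could recompute every hash.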

Replication

Replication is the process of redoing a data processing workflow in exactly the same way but starting with a different data source.

It might seem easy and obvious. But it is not, and the history of science shows it. Even though it's not our goal to go into philosophical discussions here, it's interesting to see (in an excerpt from the book by Shapin and Schaffer about Boyle's air-pump and the problem of replication) that this problem has been around for centuries.

The problem nowadays is that in most cases, when the input data changes only a little bit (a parameter or something else), sound replication becomes difficult.

Reproducibility

Reproducibility is like Replication but with the same data.

So, everything is identical, and the purpose can only be to verify that a result is sound and has not been tampered with (intentionally or not!).

Again, this might seem obvious, but it is often impossible and should only be seen as a general goal.

Indeed, in most cases, something somewhere in the process will make a data analysis just a little different from the original.
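One well-known example of such a "something" is floating-point arithmetic: adding the same numbers in a different order can change the last digits of the result. The sketch below is our own illustration (the helper names are hypothetical). It shows a bit-for-bit comparison failing on a rerun that differs only in summation order, while a comparison within a tolerance succeeds.

```python
import hashlib
import math
import struct


def naive_sum(values) -> float:
    """Add the values one by one, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def result_digest(value: float) -> str:
    """Digest of the exact 64-bit representation of a floating-point result."""
    return hashlib.sha256(struct.pack(">d", value)).hexdigest()


def same_within_tolerance(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Looser check: equal up to a relative tolerance (an arbitrary choice here)."""
    return math.isclose(a, b, rel_tol=rel_tol)


values = [0.1, 0.2, 0.3]
original = naive_sum(values)        # 0.1 + 0.2 + 0.3
rerun = naive_sum(reversed(values)) # 0.3 + 0.2 + 0.1

bitwise_identical = result_digest(original) == result_digest(rerun)
close_enough = same_within_tolerance(original, rerun)
```

Here `bitwise_identical` is `False` and `close_enough` is `True`. Which of the two checks counts as "reproduced" is a decision that has to be made, and documented, for each workflow.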

Conclusion

T2R2 are of paramount importance in Biopharma as well as in many other industries.

We believe the way the industry addresses them can have a huge impact on how fast and how well new medicines and devices can be brought to market.

Failing on T2R2 can have terrible consequences within companies as well as for society.

Not to mention the amount of resources wasted for lack of T2R2!

Those are the reasons why we put T2R2 at the core of our platforms.

Get in touch if you are interested in knowing more!