Transparency, Traceability, Replicability and Reproducibility
When designing a new software platform, we need to be clear about the most important goals we are aiming for.
For Palaamon, T2R2 (Transparency, Traceability, Replicability and Reproducibility) was definitely our top priority.
So, from the very beginning, when we started thinking about the architecture of the platform, we put T2R2 at its core.
In this post, we just want to define each component of T2R2 in Data Science. We will dive into software architecture discussions later.
Transparency in Data Science is about providing all the information and means that were used to produce a result.
That would include source code (which by definition almost certainly excludes black boxes), parameters, input data, references to external data, etc.
But that already poses a lot of problems. For example, how can we be sure that the executed binaries are really the compiled version of the provided source code? How do we seal parameters or input data?
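To make the idea of sealing a little more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not a description of how Palaamon works: the function names seal and verify are hypothetical, and we assume the parameters fit in a JSON-serializable dictionary and the input data is available as bytes. The seal is a SHA-256 digest over both, so any later change to either one produces a different digest.

```python
import hashlib
import json


def seal(parameters: dict, input_data: bytes) -> str:
    """Return a SHA-256 digest covering both the parameters and the input data.

    Hypothetical helper for illustration only.
    """
    # Canonical JSON (sorted keys, fixed separators) so that two dictionaries
    # with the same content always serialize to the same bytes.
    canonical = json.dumps(
        parameters, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    digest = hashlib.sha256()
    # Length-prefix the parameters so the boundary between the two parts
    # of the sealed content is unambiguous.
    digest.update(len(canonical).to_bytes(8, "big"))
    digest.update(canonical)
    digest.update(input_data)
    return digest.hexdigest()


def verify(parameters: dict, input_data: bytes, expected_seal: str) -> bool:
    """Check that the parameters and data still match a previously stored seal."""
    return seal(parameters, input_data) == expected_seal


if __name__ == "__main__":
    params = {"alpha": 0.05, "method": "t-test"}
    data = b"1,2,3\n"
    stored = seal(params, data)
    print(verify(params, data, stored))        # unchanged inputs
    print(verify(params, b"1,2,4\n", stored))  # modified data
```

A digest like this only shows that nothing changed after sealing. It says nothing about who sealed the content, and it is only useful if the seal itself is kept somewhere that cannot be silently rewritten.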
Traceability adds the time factor to Transparency.
A data analysis workflow can seem transparent, but to be fully transparent (!) we need to know in which order the actions were performed, and that is Traceability.
Actually, that might not be too difficult to achieve. At the end of the day, we just need to log all the events into a journal. Of course, the issue is the content of each entry, which we must be able to trust.
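One common way to make journal entries harder to alter after the fact is to chain them with hashes, so that each entry includes the hash of the previous one. The sketch below illustrates that general technique only. The Journal class, its field names and the GENESIS constant are hypothetical, and the timestamp is passed in by the caller to keep the example deterministic.

```python
import hashlib
import json

# Placeholder "previous hash" for the very first entry (an assumption of this sketch).
GENESIS = "0" * 64


def _entry_hash(index: int, timestamp: str, event: str, previous_hash: str) -> str:
    """Hash the full content of an entry, including the hash of its predecessor."""
    payload = json.dumps(
        {
            "index": index,
            "timestamp": timestamp,
            "event": event,
            "previous_hash": previous_hash,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Journal:
    """Append-only event journal in which every entry is chained to the previous one."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: str, timestamp: str) -> dict:
        previous_hash = self.entries[-1]["hash"] if self.entries else GENESIS
        index = len(self.entries)
        entry = {
            "index": index,
            "timestamp": timestamp,
            "event": event,
            "previous_hash": previous_hash,
            "hash": _entry_hash(index, timestamp, event, previous_hash),
        }
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash in order; any edited or reordered entry breaks the chain."""
        previous_hash = GENESIS
        for index, entry in enumerate(self.entries):
            if entry["index"] != index or entry["previous_hash"] != previous_hash:
                return False
            expected = _entry_hash(index, entry["timestamp"], entry["event"], previous_hash)
            if entry["hash"] != expected:
                return False
            previous_hash = entry["hash"]
        return True


if __name__ == "__main__":
    journal = Journal()
    journal.append("load input data", "2024-01-01T10:00:00Z")
    journal.append("filter outliers", "2024-01-01T10:00:05Z")
    print(journal.verify())
    journal.entries[0]["event"] = "load other data"  # simulate tampering
    print(journal.verify())
```

The chain makes the order of events and the content of each entry verifiable, but only relative to the latest hash. Someone able to rewrite the whole journal could recompute every hash, so the most recent hash still has to be stored or published somewhere trusted.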
Replication is the process of redoing a data processing workflow in exactly the same way but starting with a different data source.
It might seem easy and obvious. But it is not, as the history of science shows. Even though it is not our goal to go into philosophical discussions here, it is interesting to see (in an excerpt from Shapin and Schaffer's book about Boyle's air-pump and the problem of replication) that this problem has been around for centuries.
The problem nowadays is that, in most cases, as soon as the input changes even a little (a parameter or something else), sound replication becomes difficult.
Reproducibility is like Replication but with the same data.
So, everything is identical, and the purpose can only be to verify that a result is sound and has not been tampered with (intentionally or not!).
Again, this might seem obvious, but it is often impossible and should only be seen as a general goal.
Indeed, in most cases, something somewhere in the process will make a data analysis just a little different from the original.
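A tiny illustration of how "just a little different" can happen: floating-point addition is not associative, so adding the same numbers in a different order (as a parallel or re-implemented pipeline might do) can change the last digits of a result. The sketch below, with hypothetical helper names, contrasts a bit-exact comparison with a tolerance-based one using Python's math.isclose. Which tolerance is acceptable is a decision for each analysis, not something this example settles.

```python
import math


def naive_sum(values: list[float]) -> float:
    """Add the values strictly from left to right, one at a time."""
    total = 0.0
    for value in values:
        total += value
    return total


def reproduces(original: float, rerun: float, rel_tol: float = 1e-9) -> bool:
    """Tolerance-based check: the rerun agrees with the original up to rel_tol."""
    return math.isclose(original, rerun, rel_tol=rel_tol)


if __name__ == "__main__":
    original = naive_sum([0.1, 0.2, 0.3])
    rerun = naive_sum([0.3, 0.2, 0.1])  # same data, different order
    print(original == rerun)            # bit-exact comparison
    print(reproduces(original, rerun))  # comparison within a tolerance
```

Here the bit-exact comparison fails even though nothing was tampered with, while the tolerance-based one succeeds. A reproducibility check therefore has to state up front what counts as "the same result".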
T2R2 are of paramount importance in Biopharma.
We believe the way the industry addresses them can have a huge impact on how fast and how well new medicines and devices can be brought to market.
Failing on T2R2 can have terrible consequences within companies as well as for society.
That is the reason why we put T2R2 at the core of our platforms.