Daniel Piqué - How I discovered a missing data point in a paper with 8,000 citations

The second talk of the day was from Daniel Piqué. Based in the USA, Daniel is a bioinformatician/data scientist and medical student with a particular interest in using data to address health inequalities. He spoke about discovering a missing data point in a paper with more than 8,000 citations.

Introductions

Like many of those involved on the day, Daniel was keen to give his own definition of reproducibility:

‘When authors provide all the necessary data and the computer codes to run the analysis again, recreating the results.’

He went on to cite a survey in Nature where 52% of responding scientists said there was a reproducibility crisis in science; the sixth highest contributing factor to this crisis was said to be a lack of information and codes. One of the ways that Daniel proposed to alleviate this issue was to distribute the data itself as an R package.

Reproducing the Paper's Figure

Daniel was trying to reproduce the only figure from a short 1971 paper titled: Mutation and cancer: statistical study of retinoblastoma. The study is influential because it provides evidence for the ‘two-hit’ hypothesis, in which two hits, or mutations, in a tumour-suppressing gene are required for a cancer to develop.

Daniel set out to reproduce the figure using R. There were some initial challenges, notably that the paper’s data wasn’t available in a machine-readable format and there was no explanation of how the graph was generated.

The figure in question is a scatter graph showing two sets of data. Square points represent patients with cancer in both eyes (bilateral) and round points represent patients with cancer in one eye (unilateral). The x axis shows the age of the patient at diagnosis, from 0 to 60 months. The y axis shows the fraction of cases NOT yet diagnosed.

Looking over table 1 of the paper, Daniel was able to identify 48 cases. Some key information was missing, however: the table did not include the y axis data, and some information about the variables was also absent.

With such a small dataset it was possible to simply copy and paste it into R. Each of the 48 cases occupied a row and each of the study’s variables a column. Daniel also added an extra column to separate the unilateral and bilateral cases. To make this data more open he put it into a downloadable R package, something that was digitally applauded by the conference participants.

He then created the y-axis variable and used ggplot to reproduce the figure. The result looked very similar to the original figure from the paper. However, closer inspection revealed that Daniel’s reproduction had more data points than the original.
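The write-up does not say exactly how the y-axis variable was built, so the following is only a sketch of one plausible construction, written in Python rather than the R that Daniel used. It assumes that the fraction not yet diagnosed at a given age is the share of cases in the same group whose diagnosis came later than that age. The function name is hypothetical and the ages are made-up placeholders, not values from the paper.

```python
from collections import Counter


def fraction_not_yet_diagnosed(ages_months):
    """Map each distinct age at diagnosis to the fraction of cases diagnosed later.

    Assumption: a case counts as 'not yet diagnosed' at age t only if its
    own age at diagnosis is strictly greater than t.
    """
    n = len(ages_months)
    counts = Counter(ages_months)
    remaining = n
    fractions = {}
    for age in sorted(counts):
        remaining -= counts[age]  # cases diagnosed at this age leave the pool
        fractions[age] = remaining / n
    return fractions


# Placeholder ages in months, NOT data from the 1971 paper.
example_ages = [3, 8, 8, 15, 24]
example_fractions = fraction_not_yet_diagnosed(example_ages)
```

Under this assumption the fraction would be computed separately for the unilateral and bilateral groups, and the last case in each group always gets a fraction of zero.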

As a first sanity check Daniel began by checking how much of the data was shown on the original figure. There were 5 fewer unilateral cases and 6 fewer bilateral cases marked on the original figure compared with his attempt at reproduction. Close inspection of the data suggested that only one case for each age group was being plotted.
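This kind of sanity check can be sketched as a comparison between the number of cases and the number of distinct ages in each group. The sketch below is illustrative Python, not Daniel's R code. It assumes that "age group" means cases sharing the same age at diagnosis; the records are invented placeholders and the function name is hypothetical.

```python
from collections import defaultdict

# Placeholder records (laterality, age at diagnosis in months); NOT the paper's data.
cases = [
    ("unilateral", 8), ("unilateral", 8), ("unilateral", 22),
    ("bilateral", 3), ("bilateral", 3), ("bilateral", 3), ("bilateral", 10),
]


def hidden_points_per_group(records):
    """Count the cases that vanish if only one point is drawn per distinct age."""
    ages_by_group = defaultdict(list)
    for group, age in records:
        ages_by_group[group].append(age)
    return {
        group: len(ages) - len(set(ages))
        for group, ages in ages_by_group.items()
    }


hidden = hidden_points_per_group(cases)
```

If the counts produced this way matched the shortfall seen on the original figure, that would support the idea that cases diagnosed at the same age were drawn as a single point.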

Daniel also identified that the y axis used a log scale, which made it impossible to plot some of the data points. Even taking this into account, there was still one missing data point. To find it, he simply superimposed his plot over the original.
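One likely reason a log scale drops points, offered here as an assumption rather than something stated in the talk, is that a case whose fraction not yet diagnosed is zero has no finite logarithm. The short Python sketch below separates the values a log axis can show from those it cannot. The function name is hypothetical and the fractions are placeholders, not values from the paper.

```python
import math


def plottable_on_log_axis(fractions):
    """Split y values into those a log axis can show and those it cannot."""
    shown = [f for f in fractions if f > 0]
    dropped = [f for f in fractions if f <= 0]  # log10 is undefined here
    return shown, dropped


# Placeholder fractions, NOT values from the paper.
shown, dropped = plottable_on_log_axis([0.8, 0.4, 0.2, 0.0])
log_values = [math.log10(f) for f in shown]  # safe: every value is positive
```

A linear y axis has no such restriction, which fits with the later choice to redraw the figure on a linear scale.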

Daniel then re-drew the figure using a linear scale for the y axis, allowing all points to be plotted. He also extended the x axis, as the latest unilateral case was diagnosed at 73 months.

Daniel is awaiting a response from the journal's office regarding his findings. The paper’s author passed away in 2016 so further investigation of the missing data may be difficult.

Is the paper reproducible?

Largely it is, but there is still a question around the missing data point in the original figure. Daniel’s investigations also identified an error in the exponent of the fitted curve: it should have been -5, not -4 as the legend stated.

Conclusion and links

Daniel concluded by offering a number of links to resources that support reproducible computational workflows:

A practical guide to reproducible research in R: https://monashdatafluency.github.io/r-rep-res
The Turing way for reproducible data science: https://the-turing-way.netlify.com
Case studies, projects and tutorials in reproducible research: https://github.com/leipzig/awesome-reproducible-research

You can see a version of the presentation that Daniel gave here: https://www.youtube.com/watch?v=DjKlO8YFqAc

You can follow Daniel on Twitter: @dpique12