Imagine a world in which data were available to researchers all around the globe. A lab in Germany could take neural recordings of patients and make them available to anyone willing to devote the time and skills to analyzing them. And where n equaled 1 in one small study, perhaps an aggregate of the world’s data would yield sample sizes large enough to reach statistical significance. New lines of inquiry could also be pursued with the data, perhaps far beyond the original intent. For instance, a neuroimaging friend of mine was telling me how they used to observe cerebellar BOLD activation during certain working memory tasks and dismissed it as some kind of motion artifact. But in retrospect, with new evidence concerning the role of the cerebellum in higher order processes, this lost data could have been very valuable. It appears that researchers all over the country are sitting on data that they haven’t the time to fully sift through. In fact, it also seems that the rate of data collection in the digital age is far outpacing even the growth in the number of researchers in the field, as it becomes increasingly easy to collect and store copious amounts of data. We’re becoming digital packrats, and it’s easy to hope that the secrets of the brain are waiting to be revealed on hard drives in labs across the globe.
Several aspects of this kind of data availability are appealing. The pace of innovation could accelerate as more data become available to more people for analysis and aggregation. We’d also be one step closer to scientific transparency, since data availability, along with detailed methods, would enable anyone to go through and at least replicate the analysis portion of one’s work. However, there are a number of scientific, as well as pragmatic, issues that must be considered before such a project could ultimately be useful.
Currently, that level of transparency is not practiced for a number of reasons that are as tragic as they are understandable. Mistakes are bound to happen, but the price of admitting fault is far too high and unfortunately discourages total transparency. This is, in some sense, self-correcting, since it requires that researchers be very careful in the way they carry out experiments and present results. However, the scientific community thus far has seemed content to allow a certain level of skullduggery parading as brevity, and this means that certain methods continue to be willfully obscure.
The nature of competition and funding also means that, though methods are presented in limited forms in papers, incomplete disclosure is appealing, since it can ensure a long run of guaranteed, exclusive publications. Sometimes methods can be financially lucrative and made available in limited ways to other researchers, but as in all other businesses, the secret is being a good middleman between one’s method and those who need to use it. This is completely contrary to the spirit of scientific openness.
One issue with respect to modeling work is the release of source code. Open source modeling represents a level of transparency that is ideologically appealing. At some point in my career, I plan on being an open source scientist. A few efforts to this end are available currently, including the primary database called ModelDB, a collection of neural models and their associated papers. However, to meet this responsibility properly, a lot of work must be invested in making code accessible and compatible for it to be truly useful. Just throwing it into a database is not, inherently, helpful or meaningful.
Releasing source code, detailed methods, or data opens a lab up to criticism that may find crucial flaws. For quality control purposes, data and model databases should have peer review processes like those for scientific articles, in which standards are maintained and completeness and accuracy are examined before the data are released.
With respect to access, I believe that such databases should be wide open to the world. Backroom collaborations and private data sharing are completely understandable to some extent, but data meant for the world should not be locked behind restricted access. Of course, pragmatic ways of tracking downloads and access aren’t unreasonable, but that’s a finer point that need not be examined further here.
Two other problems of data sharing exist. The first is that the format of data varies greatly, from the more common Comma Separated Value (CSV) files to a myriad of odd binary file formats (such as MATLAB’s). Ensuring global compatibility is nearly impossible, though certain standards and automated converters could make this a reasonable step. It would require a lot of work, however.
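To make the idea of an automated converter a little more concrete, here is a minimal sketch in Python of turning a CSV recording into a self-describing JSON document that carries its units along with it. The column names (`time_s`, `voltage_mV`, `channel`), the sample values, and the JSON layout are all made up for illustration; this is not an existing standard, just one way such a step might look.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, converting numeric fields to float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {}
        for key, value in row.items():
            try:
                record[key] = float(value)
            except (ValueError, TypeError):
                record[key] = value  # leave non-numeric or missing fields as-is
        records.append(record)
    return records


def records_to_json(records, units):
    """Bundle the samples with a units table so the file describes itself."""
    return json.dumps({"units": units, "samples": records}, indent=2)


# Hypothetical two-sample recording, invented for this example.
sample_csv = "time_s,voltage_mV,channel\n0.000,-65.2,ch1\n0.001,-64.9,ch1\n"

records = csv_to_records(sample_csv)
document = records_to_json(
    records, {"time_s": "seconds", "voltage_mV": "millivolts"}
)
print(document)
```

Even a toy like this shows where the work hides: someone still has to decide what the columns mean and what units they are in, and that cannot be automated away.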
The second issue is much more serious. It involves the intimate and unique knowledge that only the primary data collector might have of a particular data set. Aside from detailed methods describing how the data were collected, cataloging the minute quirks and features of a particular data set that might affect any analysis requires a lot of work on the part of the person collecting the data. It’s very difficult to imagine this being a fruitful exercise for each and every data set.
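If someone did want to catalog those quirks, one lightweight option might be a small notes file that travels with the data. The sketch below is only an assumption about what that could look like: the `Quirk` and `DatasetNotes` classes, their fields, and the example entries are all hypothetical, not part of any existing database.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Quirk:
    """One known oddity in a recording, limited to a time window."""
    description: str
    affected_channels: list
    start_s: float
    end_s: float


@dataclass
class DatasetNotes:
    """Collector's notes meant to be stored alongside the data set."""
    dataset_id: str
    collected_by: str
    quirks: list = field(default_factory=list)

    def quirks_overlapping(self, start_s, end_s):
        """Return quirks whose time window overlaps [start_s, end_s)."""
        return [
            q for q in self.quirks
            if q.start_s < end_s and q.end_s > start_s
        ]


# Invented example entries, for illustration only.
notes = DatasetNotes("rat07_session3", "hypothetical collector")
notes.quirks.append(
    Quirk("Reference electrode came loose", ["ch4"], 120.0, 185.5)
)
notes.quirks.append(
    Quirk("Line noise from nearby equipment", ["ch1", "ch2"], 300.0, 360.0)
)

# Serialize the notes so they can be shipped next to the raw data.
sidecar = json.dumps(asdict(notes), indent=2)
print(sidecar)
```

An analyst looking at the window from 150 to 200 seconds could call `notes.quirks_overlapping(150.0, 200.0)` and learn about the loose electrode before trusting that stretch of data. The hard part, of course, is still getting the collector to write the notes in the first place.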
Perhaps more valuable than a database of real data would be a database of what kinds of data exist. For instance, researchers who perform LFP recordings of behaving rats could volunteer to be listed in a database. Modelers or data analysis folks could then get in contact with them, so that fruitful collaborations based on needs and availability could be created. The aim of this approach is different, in that it seeks to connect people, whereas the first approach empowers anyone to examine the data directly.
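A registry like this could be very simple. As a rough sketch, assuming each lab volunteers a short listing, a search over it might look like the following in Python. The `Listing` fields, the lab names, and the contact addresses are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One lab's self-reported description of the data it holds."""
    lab: str
    contact: str
    modality: str
    species: str
    behaving: bool


def find_listings(registry, modality=None, species=None, behaving=None):
    """Return listings matching every criterion that was supplied."""
    matches = []
    for listing in registry:
        if modality is not None and listing.modality.lower() != modality.lower():
            continue
        if species is not None and listing.species.lower() != species.lower():
            continue
        if behaving is not None and listing.behaving != behaving:
            continue
        matches.append(listing)
    return matches


# Placeholder entries; none of these labs or addresses are real.
registry = [
    Listing("Lab A", "lab-a@example.org", "LFP", "rat", True),
    Listing("Lab B", "lab-b@example.org", "LFP", "rat", False),
    Listing("Lab C", "lab-c@example.org", "fMRI", "human", True),
]

for hit in find_listings(registry, modality="lfp", species="rat", behaving=True):
    print(hit.lab, hit.contact)
```

Note that the registry holds no data at all, only who has what and how to reach them, which is exactly what keeps the burden on each lab small.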
There is certainly something appealing about the idea of knowledge, even in raw data form, being accessible to the world. In a sense, one could even make the reasonable argument that public funds paved the way to this research, which means that the public should get access to the fruits of the labor. But data sharing comes with a whole host of intrinsic and pragmatic problems, and it also highlights major ideological flaws in the way research is currently conducted. These are undoubtedly just a small subset of the issues that exist for this type of endeavor. Still, an organized system of data availability would be a great ideological contribution to the neuroscientific community and could help accelerate the pace of understanding significantly.