Why Biology Will Be Very Different, Even Medieval, Without Open Access to Publications

Everyone can now read scientific papers, warts and all. For students in less prosperous universities and colleges, this has become a new opportunity to know what is happening at the frontline.

A rainbow of journals in a library at the University of North Carolina. Credit: moonlightbulb/Flickr, CC BY 2.0

A major bone of contention in academic research in biology concerns the ownership of and third-party access to research data. Should research data be made available publicly for everyone to access, or should it be closely held by the producers of the data behind an iron curtain? Should research publications, typically written in a language that is unintelligible to those outside the domain of research concerned, be deposited in copyleft venues allowing unhindered access to all? One would have believed that the issue was settled, and that the answer would be an emphatic yes in favour of open access to research data and publications. But apparently not!

The prestigious New England Journal of Medicine (NEJM) published a “priggish” editorial last month expressing reservations about open data sharing, justified by fears that shared data would be inappropriately reanalysed by “research parasites”, a term which trended on Twitter. A concern expressed by the authors of this editorial was that these parasites would not understand the “choices made” in the “generation and collection of the data” and would therefore reinterpret the data wrongly.

This brings to mind a tangentially-related editorial published ten years ago in the respected journal Trends in Genetics, referring to a breed of bioinformaticians, who had analysed publicly available data from diverse primary sources with computers to discover new biology, as “bottom-feeders” of data produced by the hard work of others.

The authors of the NEJM editorial conveniently ignored the truism that it is the responsibility of the authors of published papers to make sure that the choices behind their experiments are explained well enough to encourage reproducibility of their data, and well enough to ensure that the limits of validity of their study and its caveats are fully understood by the audience. Reproducibility is the cornerstone of scientific research: the more often a finding is reproduced, the more reliable it becomes. In other words, every reproduction or independent reassertion of a finding reduces the space for falsifying it. This is all the more essential for clinical research, whose outcomes have an immediate influence on our lives.

The editorial justifiably received its fair share of brickbats, prompting the authors to issue a clarification, which could be interpreted either as a volte-face or as a more measured take on data-sharing.

The NEJM editorial invites an examination of the value of open data to scientific progress. What follows, however, isn’t meant to be a scholarly, long-form history of the development of open data in biology but a personal account of its worth.

Traditionally, journals in the life sciences have all been behind paywalls – purchase a subscription, get access. Subscribing to a scientific journal was and remains a costly affair, affordable only to well-endowed libraries and laboratories and rarely to individuals. This is problematic, not least because most academic research is funded by the taxpayer, who at the day’s end can’t afford to learn about the research that her money helped fund. Worse, the author of a paper will also have had to transfer her copyright to the journal, and may then find herself without access to the typeset and published version of her own writing.

Though attempts to get around this problem had been made even in the pre-Internet days – including the avant-garde proposal of the physicist-turned-biologist Leo Szilard to enforce an author-pays model of scientific publishing, and later efforts in the 1990s by leading scientists such as Pat Brown, Harold Varmus and Michael Eisen to get published papers deposited in public repositories – real transformation occurred only at the turn of the century, with the launch of BioMed Central.

BioMed Central is a for-profit open access publisher based in the UK, which makes all published papers accessible to everyone for free online. ‘Open access’ is often conflated with the less desirable ‘free access’ – the two are quite different. In the ‘open access’ mode, the author and not the journal owns copyright and, more importantly, anyone anywhere has the right to reuse the published material for any purpose as long as the original source is acknowledged. The business model is that the author pays. In ‘free access’, however, papers may be read for free but reuse of the material may be hindered by copyright regulations. This difference between the two has obvious implications for the extent to which the fruits of scientific research can be propagated.

In the author-pays model, the scientist writes up a paper and submits it to an open access journal for publication. The paper gets reviewed as rigorously as it does in any other journal by a team of editors and expert reviewers; some papers are rejected and some accepted. If the paper is accepted, its authors pay a fixed sum of money to cover the publisher’s costs. Many legitimate open access publishers today have a system of fee-waivers for authors of limited means. (And this is how I read my first research papers, and also published my first, modest paper, thanks to the fee-waiver granted to a poor undergraduate student.)

And the BioMed model went viral. The Public Library of Science (PLoS) in the USA, itself a pioneer of the open access movement, launched a series of premier journals. Several traditional journals started to become open access or offered authors the option to make their paper open by paying for it.

Large research funding agencies in the US and Europe mandated that publications arising from work they had funded be made freely available in public repositories. The major science funding bodies of India joined the bandwagon last year. Our agencies mandate that publications arising out of research they fund be deposited in online institutional repositories or in freely accessible databases that they set up. This should be done ideally within two weeks after publication or after six months in case the journal has an embargo policy. This gets around the problem of having to pay for publication in truly open access journals.

However, this isn’t necessarily open access in its full glory but is in fact free access – which, despite allowing any interested party to read the papers free of cost, needn’t provide them uninhibited rights to reuse material from these publications. Then again, it’s not a bad compromise to make. For it to be effective, a single well-publicised repository is called for, rather than one per institute or one per funding body. Even better would be seamless integration with international repositories such as PubMed Central and Europe PMC, which are frequented by life-science researchers from across the world.

It’s also worth mentioning that the best life-sciences journals in the country, published by the Indian Academy of Sciences in Bengaluru, are freely available online and copies of them in print cost next to nothing. Moreover, the spread of the open access movement, and its support in some form by major research funders, meant that even subscription journals had to come up with mechanisms for making at least a subset of their papers freely available, either a few months after publication or immediately on publication in a pre-print format.

Thus, everyone can now read scientific papers, warts and all. For students in less prosperous universities and colleges, this has become a new opportunity to know what is happening at the frontline – even as they build their foundational knowledge from the ever-reliable textbooks and other open educational venues such as Wikipedia (with its caveats), OpenCourseWare and India’s own NPTEL channel on YouTube. This is of particular value to students in Indian universities – many of which can’t afford or would be unwilling to spend on institution-wide journal subscriptions.

The flipside of all this is that it was always going to spawn a bouquet of spurious journals, each happy to publish a trashy paper as long as the authors are willing to pay for it. By keeping these charges low, they encourage desperate scientists to use even personal resources to publish whatever they feel like publishing. This model is fed indirectly by evaluation systems in which career progression is contingent on the number of publications, with little regard for the quality of these works or whether they are published in legitimate venues. This appears to be a lucrative venture, as seen from the large number of predatory publishers in the market today.

Beyond open access publishing itself, open access to data and software has been a powerful driver of biology specifically and science in general. This has been driven not least by major projects in genomics, including the Human Genome Project. Many other such projects make their data freely available online as well, even prior to the publication of papers describing their findings. In other words, making a formal, citable claim of ownership over the data becomes secondary to making the data freely available, in principle, to everyone on the planet.

Every piece of DNA that has been sequenced in academic circles is deposited in public databases. Expert bioinformaticians are able to assemble disparate sequence data produced across multiple studies, perform sequence alignments, and predict the functions of these sequences. Such studies can even make predictions on the effect of mutations in a sequence on its function, which can then be tested experimentally. They provide a guide to the experimentalist – who would otherwise be trying to find the proverbial needle in a haystack. Now the experimentalist only needs to find the needle in a threadroll!

Today, we often read about new genomes being sequenced, including for example the sequencing of the genome of the tulsi plant by my colleagues at the National Centre for Biological Sciences. The importance of public data and bioinformatics in this realm is borne out by the fact that we can sequence the (near-)complete genetic material of a plant or an animal, or for that matter hundreds of disease-causing bacteria, in a matter of days to weeks. But we will require their comparison to the mountains of publicly available sequence data to know what functions these sequences code for. Without public data, any new genome sequence we discover today will be (nearly) useless.

The highly publicised projects on human genetic variation and the association of genetic variations with disease are all founded on the availability of the human genome in the public domain. Many modern approaches to drug design, which require information on the 3D structure of the protein molecules that the drug is supposed to bind to, make use of public databases containing such structural information. In my own experience, my most cited works are those that have generated large volumes of data and made them available in the public domain, not those that have reported the discovery of new aspects of biology.

The open source software movement (most visible in the world of the Linux operating system) has pervaded all of biology. Today, anyone with a computer can download freely available software and public data, play around for a while and with some luck and effort discover something interesting and useful about biology. In my lab, too, we don’t use any scientific software that needs to be paid for. These developments have meant that science could, in theory, come under public scrutiny. It is the scientific equivalent of the Right to Information – but without the need to fill forms and find activists, and one which, if used responsibly, could make life better for everyone.

Needless to say, biology would be very different, even medieval, today without open access to publications, data and software. The “priggish” NEJM editorial, despite being published in 2016, is an anachronism and an abomination that is best ignored.

Aswin Sai Narain Seshasayee runs a laboratory researching bacterial biology at the National Centre for Biological Sciences, Bengaluru. Beyond science, his interests are in classical art music and history.