Friday, March 06, 2020

Carl Malamud at the Open University

On Tuesday, 3 March, 2020, Carl Malamud visited The Open University and shared his thoughts on text and data mining in scientific journals. He opened with the story of Mahatma Gandhi's writing of the book Hind Swaraj (India self rule) on a boat trip between the UK and South Africa in 1909.

The book is relevant to the open access movement in two key respects. First, the first edition was published with "No rights reserved", Gandhi being the first author to explicitly eschew copyright. Second, Malamud has been inspired by Gandhi's resistance to colonialism. Scientific knowledge has been colonised and, as James Boyle has argued for a generation, we are in the midst of a second enclosure movement, an enclosure of the commons of the mind.

Malamud has written a book, Code Swaraj, about this, with Sam Pitroda, a former Indian cabinet minister and telecommunications businessman. Gandhi preached that you had to rule yourself, not let others colonise you. But nowadays if you want to do research you have to ask permission, and that permission is often not forthcoming because of the immoral and probably illegal assertion of ownership of human knowledge by vested economic gatekeepers such as the scientific publishers.

Christopher Booker read hundreds of books over more than thirty years before writing The Seven Basic Plots: Why We Tell Stories, first published in 2004. His three-decade-long analysis was an exercise in text and data mining. Text and data mining is now something we can automate with computers. One such study of gender in literature showed that the number of female characters has declined rather than increased, matching a proportionate decline in female authors.

Gitanjali Yadav, a plant genome researcher at Delhi’s National Institute of Plant Genome Research (NIPGR) and at Cambridge University, is working on the mechanics and chemistry of plant communication channels, using a plant chemicals database.

Elisabeth Bik is a scientist working on the fraudulent reuse of images in academic papers and on exposing paper mills. In China, one of the prerequisites for becoming a doctor is the publication of peer-reviewed papers. The incentive to buy them from paper mills is high.

Scientific literature has been locked up and, as a result, it is unclear how much research potential is going untapped.

Max Häussler is a researcher at the University of California, Santa Cruz (UCSC) and he has created a genome browser. The browser links human genome DNA sequences to sections of published articles that deal with the same sequences. He wrote to 43 publishers and explained he would like to do text and data mining on their articles. Many publishers did not want to cooperate, refused permission or did not engage at all. So he didn't get access to as much literature as he would have liked. Malamud considers there is an argument to be made that text and data mining of research is permitted in law, even if the publishers do not grant explicit permission. Häussler is unsure and doesn't mine articles for which permission is not forthcoming. It would seem clear that the power of his genome browser would be significantly greater if he had that broader access to data.
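To make the idea of linking DNA sequences to articles a little more concrete, here is a minimal sketch in Python. It is not Häussler's method, which was not described in the talk. It simply assumes that a sequence shows up in an article as an unbroken run of the letters A, C, G and T, and it builds an index from each such run to the articles that mention it. The function name, the minimum run length of 12 and the two sample articles are all made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a DNA sequence appears in the text as an unbroken run of
# at least 12 of the letters A, C, G, T. The threshold is arbitrary.
DNA_RUN = re.compile(r"[ACGT]{12,}")


def index_sequences(articles):
    """Map each DNA-like run to the list of article ids that mention it."""
    index = defaultdict(list)
    for article_id, text in articles.items():
        for seq in DNA_RUN.findall(text):
            if article_id not in index[seq]:
                index[seq].append(article_id)
    return dict(index)


# Hypothetical sample data, invented for this sketch.
articles = {
    "paper-A": "The primer GATTACAGATTACA was used to amplify the region.",
    "paper-B": "We reused primer GATTACAGATTACA and designed CCGGTTAACCGGTT.",
}

index = index_sequences(articles)
for seq, ids in index.items():
    print(seq, ids)
```

The more articles such an index can draw on, the more links it can find for a given sequence, which is why the publishers' refusals matter.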

Without asking publishers' permission, Malamud has put a lot of stuff online via a project at Jawaharlal Nehru University (JNU) in India - 125 million journal articles from many sources, from the mid 19th century up to the present.

The storage facility is air-gapped and not connected to the internet. Researchers who want access can bring their computers to the facility and text & data mine the materials there. Without having to read or download the articles, which is not permitted, they can nevertheless draw scientific insights, thereby circumventing any potential copyright problems. The terms and conditions are modelled on those of the HathiTrust and the store specialises in bioinformatics. The access model is 3-tiered:

Tier 0 is the PDFs of the articles and is air-gapped.

Tier 1 is the extracted texts and is also air-gapped.

Tier 2 is facts. As there is no copyright on facts, these can be made available openly to everyone.
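The step from Tier 1 to Tier 2 can be illustrated with a small Python sketch. This is only a guess at the general shape of the idea, not a description of the JNU system. The sample texts, the function name and the choice of "fact" (how many articles mention a given term) are all assumptions made for illustration. The point is that the mining code sees the texts, while only the derived counts come out.

```python
import re
from collections import Counter

# Hypothetical Tier 1 store: article id -> extracted text.
# In the model described above this would stay on the air-gapped side.
TIER1_TEXTS = {
    "article-1": "Plants emit volatile compounds. Volatile compounds attract insects.",
    "article-2": "Insects respond to volatile signals from plants.",
}


def extract_facts(texts, terms):
    """Return, for each term, the number of articles that mention it.

    Only these counts are returned; none of the article text is.
    """
    counts = Counter()
    for text in texts.values():
        words = set(re.findall(r"[a-z]+", text.lower()))
        for term in terms:
            if term in words:
                counts[term] += 1
    return dict(counts)


# Hypothetical Tier 2 output: facts derived from the texts.
tier2_facts = extract_facts(TIER1_TEXTS, ["volatile", "insects", "genome"])
print(tier2_facts)
```

Whether any particular derived output counts as an uncopyrightable fact is a legal question that this sketch does not settle.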

The HathiTrust were involved in providing Google with books for scanning for the Google Books project. Google in return gave the trust digital copies of the scanned books, and out of copyright works are now made freely available online. Publishers sued Google in the US for breach of copyright and the case took many years to make its way through the courts. The appeal court concluded, in Authors Guild v Google in 2015, that Google's use of the books was "transformative" and therefore permissible under US copyright law:
"1) Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use. 
2) Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement. Nor, on this record, is Google a contributory infringer. Accordingly, the court affirmed the judgment."
In 2016 the US Supreme Court rejected the Authors Guild's request to further appeal the decision, ending more than a decade of litigation. The Authors Guild also tried suing the HathiTrust but were unsuccessful in that case too. The technicalities of the case were different. One interesting angle was that the court made a point of noting the value of the HathiTrust approach in making the books available to print-disabled and visually impaired readers.

The bottom line was that Google Books and the HathiTrust were given the OK by the US courts.

In the UK, text and data mining is permitted only for non-commercial use. The text and data mining copyright exception was introduced in the UK in 2014. A format shifting exception, partly based on a report I co-wrote with two Oxford economists, Mark Rogers and Josh Tomalin, 'The economic impact of consumer copyright exceptions', was introduced at the same time. This latter exception was subject to a legal challenge by the music industry and a high court judge quashed the exception in the summer of 2015. In British Academy of Songwriters, Composers And Authors & Ors, R (On the Application Of) v Secretary of State for Business, Innovation And Skills [2015] EWHC 1723 (Admin) (19 June 2015), Mr Justice Green also based his decision to negate the format shifting exception, partly, on that same report I wrote with Mark and Josh. We had simply advocated evidence-based policy making on intellectual property.

Getting back to text and data mining, Malamud suggests the UK situation rests on the invalid assumption that we have an access subscription to everything and that publishers cooperate with researchers, which they don't.

In 2012, Delhi University got into a legal scrap with Oxford and Cambridge University presses and Taylor & Francis. The case revolved around a copy shop on the campus which lecturers used to make copies of course packs for students. Under Indian law, section 52 of the Copyright Act of 1957, copyright does not apply to materials issued by a teacher to a student. Copying is also permitted for research purposes. The cost of the textbooks that extracts were copied from was way beyond the means of most of the students. The publishers, nevertheless, demanded that the university pay them a licence fee to cover the copying. The High Court in Delhi ruled in favour of the university.

It seems to have been at the time Malamud read about the case that he began to think India might be a fertile territory for his campaign to provide access to knowledge. Those early inklings, backed up by expert legal opinions he has since solicited, which note that text & data mining is permitted under Indian law because it does not involve copying or reading the articles, have bloomed into the repository at Jawaharlal Nehru University (JNU) with his store of 125 million articles. Gitanjali Yadav's plant database is up and running and linked with another university research group.

The Indian government's chief scientific adviser has a plan to make all scientific abstracts of published papers openly available. Malamud is also beginning to work with a Wikipedian at the University of Virginia who is keen to integrate correct scientific references into Wikipedia.

In the US, work authored by federal employees in the course of their employment is not copyrightable. So Malamud decided it might be a fruitful activity to attempt to find journal articles written by federal employees. He sampled ten thousand articles and discovered many were done as part of official duties but they were still locked behind publishers' paywalls. When Barack Obama was president he wrote an article for the Harvard Law Review. Though the small print connected with the article says it is not copyrighted, the manner in which the Harvard Law Review presents the article makes it appear that it is subject to copyright. Malamud, when he finds works written by federal employees, can only guess whether they were produced as part of the authors' public service duties. He might get it wrong, so he chooses not to make them openly available. His principal goal is to challenge and push back against official and commercial copyright overreach but not break any law.

On the law, he has been sued by the state of Georgia for publishing the state code. Just in case you are doing a double take with that, I did really say that Carl Malamud is being sued by the state of Georgia for making the laws of Georgia freely available to the public.  The state sued and won at the court of first instance. Malamud appealed and won in the appeal court. This was appealed to the US Supreme Court which heard the case in December of last year. He is expecting a decision by the summer. Edicts of government are not subject to copyright protection, yet this case is in the US Supreme Court. You do sometimes have to wonder at the state of copyright law (excuse the pun).

Malamud cut his teeth on campaigning and access to knowledge activism with public codes that have the force of law. Building codes, along with electrical, plumbing, fire safety and similar codes, are edicts of government. Malamud bought copies from official standards bodies and put a lot of them freely online. Lots of standards get updated and we are obliged to work to them but they do not get released. Malamud has been sued by standards organisations in litigation that has been ongoing for 6 years. His annual legal costs are $1.6 million but he has the good fortune to be represented by lawyers who work pro bono. He can walk into a pub anywhere and strike up a conversation and it is easy for people to understand the work he does. He'll often get a plumber or builder etc offering to buy him a drink, explaining they had to fork out thousands of their hard-earned cash for standards codes they are obliged to work to.

India has a very strong right to information law. Malamud put nineteen thousand Indian standards online, reformatted for usability. He bought the standards from the Bureau of Indian Standards. When he got renewal notices from them asking for the next due licence fee he wrote back saying he had put the standards online. He got an angry, "unhinged" response, accusing him of breaking the law, telling him he was no longer welcome as a customer and making a variety of legal threats.

In the EU, member states must transpose standards into national laws within six months of being issued. Malamud got sued by the German standards organisation for posting the EU standard for baby soothers. The standard is just full of common sense - the mouth guard must be big enough so it doesn't present a swallowing/choking threat etc. The German court sided with the standards body. Malamud is now subject to a German court injunction punishable by a fine of up to €250k and a jail term of up to two years, should he decide to re-publish the standard online. He has, however, posted four EU toy standards focusing on environmental implications and petitioned the UK government on the matter. He got turned down by the standards bodies for access to these standards and is bringing a case to the Court of Justice of the European Union.

Malamud's friends, critics and acquaintances regularly ask him why he expends such energy on what he does, when there are so many bigger problems in the world like the climate crisis, conflict and disease. His answer is a simple and irrefutable one: without access to knowledge you cannot solve any of these problems and you cannot educate the citizenry to enable them to formulate their own solutions. Access to knowledge is the pre-condition for solving the world's fundamental problems.

Update: On 27 April 2020, the US Supreme Court ruled in favour of Malamud in a tight 5-4 split decision. Justice Ginsburg, interestingly, sided with the minority.