Wednesday, October 1, 2014

The Metadata Bubble

In an ideal world, scholars deposit their papers in an Open Access repository, because they know it will advance their research, support their students, and promote a knowledge-based society. A few disciplinary repositories, like arXiv, have shown that it is possible to close the virtuous cycle where scholars reinforce each other's Open Access habits. In these communities, no authority is needed to compel participation.

Institutional repositories have yet to build similar broad-based enthusiastic constituencies. Yet, many Open Access advocates believe that the decentralized approach of institutional repositories creates a more scalable system with a higher probability for long-term survival. The campaign to enact institutional deposit mandates hopes to jump-start an Open Access virtuous cycle for all scholarly disciplines and all institutions. The risk of such a campaign is that it may backfire if scholars should experience Open Access as an obligation with few benefits. For long-term success, most scholars must perceive their compelled participation in Open Access as a positive experience.

It is, therefore, crucial that repositories become essential scholarly resources, not dark archives to be opened only in case of emergency. The Open Archives Initiative (OAI) repository design provided what was thought to be the necessary architecture. Unfortunately, we are far from realizing its anticipated potential. The Protocol for Metadata Harvesting (OAI-PMH) allows service providers to harvest any metadata in any format, but most repositories provide only minimal Dublin Core metadata, a format in which most fields are optional and several are ambiguous. Extremely few repositories enable Object Reuse and Exchange (OAI-ORE), which allows for complex inter-repository services through the exchange of multimedia objects, not just metadata about them. As a result, OAI-enabled services are largely limited to the most elementary kind of searches, and even these often deliver unsatisfactory results, like metadata-only placeholder records for works restricted by copyright or other considerations.
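
To make the limitation concrete, here is a minimal sketch of what a harvester sees. It builds an OAI-PMH ListRecords request for the oai_dc format and reports which of the fifteen Dublin Core elements a record leaves out. The base URL, the sample response, and the function names are hypothetical and only serve as illustration; no real repository is queried.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The fifteen elements of simple Dublin Core; every one of them is optional.
DC_ELEMENTS = ["title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights"]

# XML namespaces used by OAI-PMH 2.0 and by Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def missing_elements(response_xml):
    """Map each record identifier to the Dublin Core elements it does not supply."""
    root = ET.fromstring(response_xml)
    report = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        present = {name for name in DC_ELEMENTS
                   if record.find(".//dc:" + name, NS) is not None}
        report[identifier] = sorted(set(DC_ELEMENTS) - present)
    return report


# A made-up response with the kind of minimal record many repositories expose.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A placeholder title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>2014</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://repository.example.org/oai")
gaps = missing_elements(SAMPLE_RESPONSE)
```

In this made-up record, twelve of the fifteen elements are absent, including rights, which is exactly the information a service would need to avoid sending users to a restricted placeholder.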

In a few years, we will entrust our life and limb to self-driving cars. Their programs have just milliseconds to compute critical decisions based on information that is imprecise, approximate, incomplete, and inconsistent: all maps are outdated by the time they are produced, GPS signals may disappear, radar and/or lidar signatures are ambiguous, and video or images provide obstructed views in constantly changing environments. When we can extract so much actionable information from such "dirty" information, it seems quaint to obsess about metadata.

Databases automatically record user interactions. Users fill out forms and effectively crowdsource metadata. Expert systems can extract, from any document in any format and in any language, author information, citations, keywords, DNA sequences, chemical formulas, mathematical equations, etc. Other expert systems have growing capabilities to analyze sound, image, and video. Technology is evaporating the pool of problems that require human intervention at the transaction level. The opportunities for human metadata experts to add value are disappearing fast.

The metadata approach is obsolete for an even more fundamental reason. Metadata are the digital extension of a catalog-centered paper-based information system. In this kind of system, today's experts organize today's information so tomorrow's users may solve tomorrow's problems efficiently. This worked well when technology changed slowly, when experts could predict who the future users would be, what kind of problems they would like to solve, and what kind of tools they would have at their disposal. These conditions no longer apply.

When digital storage is cheap, why implement expensive selection processes for an archive? When search technology does not care whether information is excruciatingly organized or piled in a heap, why spend countless hours organizing and curating content? Why agonize over potential future problems with unreadable file formats? Preserve all the information about current software and standards, and start developing the expert systems to unscramble any historical format. Think of any information-management task. How reasonable is the proposition that this task will require direct human intervention in two years? In five years? In ten years?

For content, more is more. We must acquire as much content as possible, and store it safely.

For content administration, less is more. Expert systems give us the freedom to do the bare minimum and to make a mess of it. While we must make content useful and enable as many services as possible, it is no longer feasible to accomplish that by designing systems for an anticipated future. Instead, we must create the conditions that attract developers of expert systems. This is remarkably simple: Make the full text and all data available with no strings attached.

Real Open Access.

Monday, June 30, 2014

Disruption Disrupted?

The professor who books his flights online, reserves lodging with Airbnb, and arranges airport transportation with Uber understands the disruption of the travel industry. He actively supports that disruption every time he attends a conference. When MOOCs threaten his job, when The Economist covers reinventing the university and titles it “Creative Destruction,” that same professor may have second thoughts. With or without disruption, academia surely is in a period of immense change. There is the pressure to reduce costs and tuition, the looming growth of MOOCs, the turmoil in scholarly communication (subscription prices, open access, peer review, alternative metrics), the increased competition for funding, etc.

The term disruption was coined and popularized by Harvard Business School Professor Clayton Christensen, author of The Innovator's Dilemma. [The Innovator's Dilemma, Clayton Christensen, Harvard Business Review Press, 1997] Christensen created a compelling framework for understanding the process of innovation and disruption. Along the way, he earned many accolades in academia and business. In recent years, a cooling of the academic admiration became increasingly noticeable. A snide remark here. A dismissive tweet there. Then, The New Yorker launched a major attack on the theory of disruption. [The Disruption Machine, Jill Lepore, The New Yorker, June 23rd, 2014] In this article, Harvard historian Jill Lepore questions Christensen's research by attacking the underlying facts. Were Christensen's disruptive startups really startups? Did the established companies really lose the war or just one battle? At the very least, Lepore is implying that Christensen misled his readers.

As of this writing, Christensen has only responded in a brief interview. [Clayton Christensen Responds to New Yorker Takedown of 'Disruptive Innovation', Bloomberg Businessweek, June 20th, 2014] It is clear he is preparing a detailed written response.

Lepore's critique appears at the moment when disruption may be at academia's door, seventeen years after The Innovator's Dilemma was published and when much of the underlying research is almost twenty years old. Perhaps, the article is merely a symptom of academics growing nervous. Yet, it would be wrong to dismiss Lepore's (or anyone else's) criticism based on any perceived motivation. Facts can be and should be examined.

In 1997, I was a technology manager tasked with dragging a paper-based library into the digital era. When reading (and re-reading) the book, I did not question the facts. When Christensen stated that upstart X disrupted established company Y, I accepted it. I assume most readers did. The book was based on years of research, all published in some of the most prestigious peer-reviewed journals. It is reasonable to assume that the underlying facts were scrutinized by several independent experts. Truth be told, I did not care much that his claims were backed by years of research. Christensen gave power to the simple idea that sticking with established technology can carry an enormous opportunity cost.

Established technology has had years, perhaps decades, to mitigate its weaknesses. It has a constituency of users, service providers, sales channels, and providers of derivative services. This constituency is a force that defends the status quo in order to maintain established levels of quality, profit margins, and jobs. The innovators do not compete on a level playing field. Their product may improve upon the old in one or two aspects, but it has not yet had the opportunity to mitigate its weaknesses. When faced with such innovations, all organizations tend to stick with what they know for as long as possible.

Christensen showed the destructive power of this mindset. While waiting until the new is good enough or better, organizations lose control of the transition process. While pleasing their current customers, they lose future customers. By not being ahead of the curve, by ignoring innovation, by not restructuring their organizations ahead of time, leaders may put their organizations at risk. Christensen told compelling disruption stories in many different industries. This allowed readers to observe their own industry with greater detachment. It gave readers the confidence to push for early adoption of inevitable innovation.

I am not about to take sides in the Lepore-Christensen debate. Neither needs my help. As an observer interested in scholarly communication, I cannot help but note that Lepore, a distinguished scholar, launched her critique from a distinctly non-scholarly channel. The New Yorker may cater to the upper-crust of intellectuals (and wannabes), but it remains a magazine with journalistic editorial-review processes, quite distinct from scholarly peer-review processes.

Remarkably, the same happened only a few weeks ago, when the Financial Times attempted to take down Piketty's book. [Capital in the Twenty-First Century, Thomas Piketty, Belknap Press, 2014] [Piketty findings undercut by errors, Chris Giles, Financial Times, May 23rd, 2014] Piketty had a distinct advantage over Christensen. The Financial Times critique appeared a few weeks after his book came out. Moreover, he had made all of his data public, including all technical adjustments required to make data from different sources compatible. As a result, Piketty was able to respond quickly, and the controversy quickly dissipated. Christensen has the unenviable task of defending twenty-year-old research. For his sake, I hope he was better at archiving data than I was in the 1990s.

What does it say about the status of scholarly journals when scholars use magazines to launch scholarly critiques? Was Lepore's article not sufficiently substantive for a peer-reviewed journal? Are scholarly journals incapable of handling, or unwilling to handle, academic controversy involving an eminent leader of their field? Is the mainstream press just better at it? Would a business journal even allow a historian to critique business research in its pages? If this is the case, is peer review less about maintaining standards and more about protecting an academic tribe? Is the mainstream press just a vehicle for some scholars to bypass peer review and academic standards? What would it say about peer review if Lepore's arguments should prevail?

This detached observer pours a drink and enjoys the show.


PS (7/15/2014): Reposted with permission at The Impact Blog of The London School of Economics and Political Science.

Friday, June 20, 2014

The Billionaires, Part 1: Elon Musk

Elon Musk did not need a journal to publicize his Hyperloop paper. [Hyperloop Alpha] No journal can create the kind of buzz he creates on his own. He did not need the validation of peer review; he had the credibility of his research teams, which had already revolutionized travel on earth and to space. He did not need the prestige of a journal's brand; he is his own brand.

Any number of journals would have published this paper by this author. They might even have expedited their review process. Yet, journals could hardly have done better than the public-review process that actually took place. Within days, experts from different disciplines had posted several insightful critiques. By now, there are too many to list. A journal would have insisted that the paper include author(s) and affiliations, a publication date (Aug. 12th, 2013), a bibliography... but those are irrelevant details to someone on a mission to change the world.

Does the Hyperloop paper even qualify as a scholarly paper? Or, is it an engineering-based political pamphlet written to undermine California's high-speed rail project? As a data point for scholarly communication, the Hyperloop paper may be an extreme outlier, but it holds some valuable lessons for the scholarly-communication community.

The gate-keeping role of journals is permanently over.

Neither researchers nor journalists rely on scholarly editors to dismiss research on their behalf.

In many disciplines, day-to-day research relies more on the grey literature (preprints, technical reports, even blogs and mailing lists) than on journal articles. In other words, researchers commit considerable time to refereeing one another, but they largely ignore each other's gate keeping. When it matters, they prefer immediacy over gate keeping and their own gate keeping over someone else's.

The same is true for journalists. If the story is interesting, it does not matter whether it comes from an established journal or the press release of a venture capitalist. Many journalists balance their reports with comments from neutral or adversarial experts. This practice may satisfy a journalistic concept of objectivity, but giving questionable research "equal treatment" may elevate it to a level it does not deserve.

Public review can be fast and effective. 

The web-based debate on Hyperloop remained remarkably professional and civil. Topics that attract trolls and conspiracy theorists may benefit from a more controlled discussion environment, but the public forum worked well for Hyperloop. The many critiques provide skeptical, but largely constructive, feedback that bold new ideas need.

Speculative papers that spark the imagination do not live by the stodgy rules of peer review.

The Hyperloop paper would be a success even if its only accomplishment were to inspire a handful of young engineers to research radically different modes of mass transportation. Unfortunately, publishing speculative, incomplete, sloppy, or bad research may cause real harm. The imagined link between vaccines and autism (published in a peer-reviewed journal and later retracted) serves as an unhappy reminder of that harm.

Not all good research belongs in the scholarly record.

This episode points to an interactive future of scholarly communication. After the current public discussion, Hyperloop may gain acceptance, and engineering journals may publish many papers about it. Alternatively, the idea may die a quiet death, perhaps documented by one or more historical review papers (or books).

The ideal research paper solves a significant problem with inspiration (creative bold ideas) and perspiration (proper methodology, reproducibility, accuracy). Before that ideal is in sight, researchers travel long winding roads with many detours and dead ends. Most papers are small incremental steps along that road. A select few represent milestone research.

The de-facto system to identify milestone research is journal prestige. No journal could survive if it advertised itself as a place for routine research. Instead, the number of journals has exploded, and each journal claims high prestige for the narrowest of specializations. All of these journals treat all submissions as if they are milestone research and apply the same costly and inefficient refereeing processes across the board.

The cost of scholarly communication is more than the sum of subscriptions and page charges. While refereeing can be a valuable experience, there is a point of diminishing returns. Moreover, overwhelmed scholars are more likely to conduct only cursory reviews after ignoring the requests for extended periods. The expectation that all research deserves to be refereed has reduced the quality of the refereeing process, introduced inordinate delays, increased the number of journals, and indirectly increased the pressure to publish.

Papers should earn the privilege to be refereed. By channeling informal scholarly communication to social-network platforms, research can gain some scholarly weight based on community feedback and usage-based metrics. Such social networks, perhaps run by scholarly societies, would provide a forum for lively debate, and they could act as submission and screening systems for refereed journals. By restricting refereed journals to milestone research supported and validated by a significant fraction of the profession, we would need far fewer, less specialized journals.

A two-tier system would provide the immediacy and openness researchers crave, while reserving the highest level of scrutiny to research that has already shown significant promise.

Wednesday, May 21, 2014

Sustainable Long-Term Digital Archives

How do we build long-term digital archives that are economically sustainable and technologically scalable? We could start by building five essential components: selection, submission, preservation, retrieval, and decoding.

Selection may be the least amenable to automation and the least scalable, because the decision whether or not to archive something is a tentative judgment call. Yet, it is a judgment driven by economic factors. When archiving is expensive, content must be carefully vetted. When archiving is cheap, the time and effort spent on selection may cost more than archiving rejected content. The falling price of digital storage creates an expectation of cheap archives, but storage is just one component of preservation, which itself is only one component of archiving. To increase the scalability of selection, we must drive down the cost of all other archive services.

Digital preservation is the best understood service. Archive content must be transferred periodically from old to new storage media. It must be mirrored at other locations around the world to safeguard against natural and man-made disasters. Any data center performs processes like these every day.

The submission service enters bitstreams into the archive and enables future retrieval of identical copies. The decoding service extracts information from retrieved bitstreams, which may have been produced by lost or forgotten software.

We could try to eliminate the decoding service by regularly re-encoding bitstreams for current technology. While convenient for users, this approach has a weakness. If a refresh cycle should introduce an error, subsequent cycles may propagate and amplify the error, making recovery difficult. Fortunately, it is now feasible to preserve old technology using virtualization, which lets us emulate almost any system on almost any hardware. Anyone worried about the long term should consider the Chrome emulator of Amiga 500 (1987) or the Android emulator of the HP 45 calculator (1973). The hobbyists who developed these emulators are forerunners of a potential new profession. A comprehensive archive of virtual old systems is an essential enabling technology for all other digital archives.

The submission and retrieval services are interdependent. To enable retrieval, the submission service analyzes bitstreams and builds an index for the archive. When bitstreams contain descriptive metadata constructed specifically for this purpose, the process of submission is straightforward. However, archives must be able to accept any bitstream, regardless of the presence of such metadata. For bitstreams that contain a substantial amount of text, full-text indexing is appropriate. Current technology still struggles with non-text bitstreams, like images, graphics, video, or pure data.
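
As a rough illustration of how submission and retrieval depend on each other, the following sketch stores each bitstream under its SHA-256 digest, checks on retrieval that the copy is still identical, and builds a naive full-text index at submission time. The ToyArchive class and its method names are hypothetical, and an in-memory dictionary stands in for real storage; this is a sketch of the idea, not a design for a production archive.

```python
import hashlib
import re
from collections import defaultdict


class ToyArchive:
    """In-memory stand-in for an archive (hypothetical, for illustration only)."""

    def __init__(self):
        self._store = {}                 # digest -> bitstream
        self._index = defaultdict(set)   # word -> digests of bitstreams containing it

    def submit(self, bitstream: bytes) -> str:
        """Store the bitstream and index whatever text it contains."""
        digest = hashlib.sha256(bitstream).hexdigest()
        self._store[digest] = bitstream
        # Naive full-text indexing: decode what we can and split into words.
        text = bitstream.decode("utf-8", errors="ignore").lower()
        for word in re.findall(r"[a-z0-9]+", text):
            self._index[word].add(digest)
        return digest

    def retrieve(self, digest: str) -> bytes:
        """Return the stored bitstream after verifying it is an identical copy."""
        bitstream = self._store[digest]
        if hashlib.sha256(bitstream).hexdigest() != digest:
            raise ValueError("fixity check failed for " + digest)
        return bitstream

    def search(self, word: str):
        """Return the digests of all bitstreams that contain the word."""
        return sorted(self._index.get(word.lower(), set()))


archive = ToyArchive()
receipt = archive.submit(b"Sustainable long-term digital archives")
```

The digest returned by submit doubles as the retrieval key, so any bitstream can be accepted even when it carries no descriptive metadata at all. Non-text bitstreams would pass through this index almost untouched, which is the gap described above.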

To simplify and automate the submission service, we need the participation of software developers. Most bitstreams are produced by mass-market software such as word processors, database or spreadsheet software, video editors, or image processors. Even data produced by esoteric experiments are eventually processed by applications that still serve hundreds of thousands of specialists. Within one discipline, the number of applications rarely exceeds a few hundred. To appeal to this relatively small number of developers, who are primarily interested in solving their customers' problems, we need a better argument than “making archiving easy.”

Too few application developers are aware of their potential role in research data management. Consider, for example, an application that converts data into graphs. Although most of the graphs are discarded after a quick glance, each is one small step in a research project. With little effort, that graphing software could provide transparent support for research data management. It could reformat raw input data into a re-usable and archivable format. It could give all files it produces unique identifiers and time stamps. It could store these files in a personal repository. It could log activity in a digital lab notebook. When a file is deleted, the personal repository could generate an audit trail that conforms to discipline-specific customs. When research is published, researchers could move packages of published and supporting material from personal to institutional repositories and/or to long-term archives.
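
The bookkeeping involved is modest. The sketch below shows, under simplifying assumptions, how an application might give each output a unique identifier, a time stamp, and a checksum, and log both creation and deletion in a notebook. The function names are hypothetical, and a plain list of JSON strings stands in for the digital lab notebook.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def _now():
    """Current time in UTC, as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def register_output(content: bytes, description: str, notebook: list) -> dict:
    """Give a newly produced file an identifier and log it in the notebook."""
    entry = {
        "event": "created",
        "id": str(uuid.uuid4()),                        # unique identifier
        "time": _now(),                                 # time stamp
        "sha256": hashlib.sha256(content).hexdigest(),  # ties the entry to the exact bytes
        "description": description,
    }
    notebook.append(json.dumps(entry))
    return entry


def register_deletion(entry: dict, notebook: list) -> dict:
    """Log a deletion, so that the notebook keeps an audit trail of the file."""
    record = {"event": "deleted", "id": entry["id"], "time": _now()}
    notebook.append(json.dumps(record))
    return record


notebook = []
graph = register_output(b"x,y\n1,2\n3,4\n", "scatter plot of run 1", notebook)
register_deletion(graph, notebook)
```

After these two calls, the notebook holds one creation record and one deletion record with the same identifier, even though the graph itself is gone.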

Ad-hoc data management harms the longer-term interests of individual researchers and the scholarly community. Intermediate results may be discarded before it is realized they were, after all, important. The scholarly record may not contain sufficient data for reproducibility. Research-misconduct investigations may be more complicated and less reliable.

For archivists, the paper era is far from over. During the long transition, archivists may prepare for the digital future in incremental steps. Provide personal repositories. Work with a few application developers to extend key applications to support data management. After proof of concept, gradually add more applications.

Digital archives will succeed only if they are scalable and sustainable. To accomplish this, digital archivists must simplify and automate their services by getting involved well before information is produced. Within each discipline, archives must work with researchers, application providers, scholarly societies, universities, and funding agencies to develop appropriate policies for data management and the technology infrastructure to support those policies.

Monday, April 14, 2014

The Bleeding Heart of Computer Science

Who is to blame for the Heartbleed bug? Perhaps, it does not matter. Just fix it, and move on. Until the next bug, and the next, and the next.

The Heartbleed bug is different from other Internet scares. It is a vulnerability at the core of the Internet infrastructure, a layer that provides the foundation for secure communication, and it went undetected for years. It should be a wake-up call. Instead, the problem will be patched. Some government and industry flacks will declare the crisis over. We will move on and forget about it.

There is no easy solution. No shortcut. We must redevelop our information infrastructure from the ground up. Everything. Funding and implementing such an ambitious plan may become feasible only after a major disaster strikes that leaves no other alternative. But even if a complete redesign were to become a debatable option, it is not at all clear that we are up to the task.

The Internet is a concurrent and asynchronous system. A concurrent system consists of many independent components like computers and network switches. An asynchronous system operates without a central clock. In synchronous systems, like single processors, a clock provides the heartbeat that tells every component when state changes occur. In asynchronous systems, components are interrupt driven. They react to outside events, messages, and signals as they happen. The thing to know about concurrent asynchronous systems is this: It is impossible to debug them. It is impossible to isolate components from one another for testing purposes. The cost of testing quickly becomes prohibitive for each successively smaller marginal reduction in the probability of bugs. Unfortunately, when a system consists of billions of components, even extremely low-probability events are a daily occurrence. These unavoidable fundamental problems are exacerbated by continual system changes in hardware and software and by bad actors seeking to introduce and/or exploit vulnerabilities.
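
A back-of-the-envelope calculation shows why scale defeats low probabilities. The numbers below are purely illustrative assumptions, not measurements: one billion components, each with a one-in-a-billion chance per day of hitting a given rare failure, and failures treated as independent.

```python
import math


def expected_failures(components: int, p_per_day: float) -> float:
    """Expected number of rare failures per day across all components."""
    return components * p_per_day


def prob_at_least_one(components: int, p_per_day: float) -> float:
    """Probability that at least one component fails on a given day.

    Computes 1 - (1 - p)^n through log1p and expm1, which stay accurate
    when p is tiny and n is huge.
    """
    return -math.expm1(components * math.log1p(-p_per_day))


N = 10**9   # assumed number of components
P = 1e-9    # assumed failure probability per component per day

per_day = expected_failures(N, P)
chance = prob_at_least_one(N, P)
```

Under these assumptions, the system expects about one such failure per day, and the chance of seeing at least one on any given day is roughly 63 percent. Making every component ten times more reliable lowers the daily chance only to about 10 percent.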

When debugging is not feasible, mathematical rigor is required. Current software-development environments are all about pragmatism, not rigor. Programming infrastructure is built to make programming easy, not rigorous. Most programmers develop their programs in a virtual environment and have no idea how their programs really function. Today's computer-science success stories are high-school geniuses that develop multimillion-dollar apps and college dropouts that start multibillion-dollar businesses. These are built on fast prototypes and viral marketing, not mathematical rigor. Who in their right mind would study computer science from people who made a career writing research proposals that never led to anything worth leaving a paltry academic job for?

Rigor in programming is the domain of Edsger W. Dijkstra, the most (in)famous, admired, and ignored computer-science eccentric. In 1996, he laid out his vision of Very Large Scale Application of Logic as the basis for the next fifty years of computer science. Although the examples are dated, his criticism of the software industry still rings true:
Firstly, simplicity and elegance are unpopular because they require hard work and discipline to achieve and education to be appreciated. Secondly we observe massive investments in efforts that are heading in the opposite direction. I am thinking about so-called design aids such as circuit simulators, protocol verifiers, algorithm animators, graphical aids for the hardware designers, and elaborate systems for version control: by their suggestion of power, they rather invite than discourage complexity. You cannot expect the hordes of people that have devoted a major part of their professional lives to such efforts to react kindly to the suggestion that most of these efforts have been misguided, and we can hardly expect a more sympathetic ear from the granting agencies that have funded these efforts: too many people have been involved and we know from past experience that what has been sufficiently expensive is automatically declared to have been a great success. Thirdly, the vision that automatic computing should not be such a mess is obscured, over and over again, by the advent of a monstrum that is subsequently forced upon the computing community as a de facto standard (COBOL, FORTRAN, ADA, C++, software for desktop publishing, you name it).
[The next fifty years, Edsger W. Dijkstra, circulated privately, 1996, Document 1243a of the E. W. Dijkstra Archive, https://www.cs.utexas.edu/users/EWD/ewd12xx/EWD1243a.PDF, or, for fun, a version formatted in the Dijkstra handwriting font]

The last twenty years were not kind to Dijkstra's vision. The hordes turned into horsemen of the apocalypse that trampled, gored, and burned any vision of rigor in software. For all of us, system crashes, application malfunctions, and software updates are daily occurrences. It is built into our expectations.

In today's computer science, the uncompromising radicals that prioritize rigor do not stand a chance. Today's computer science is the domain of genial consensus builders, merchants of mediocrity that promise everything to everyone. Computer science has become a social construct that evolves according to political rules.

A bottom-up redesign of our information infrastructure, if it ever becomes debatable, would be defeated before it even began. Those who could accomplish a meaningful redesign would never be given the necessary authority and freedom. Instead, the process would be taken over by political and business forces, resulting in an effective status quo.

In 1996, Dijkstra believed this:
In the next fifty years, Mathematics will emerge as The Art and Science of Effective Formal Reasoning, and we shall derive our intellectual excitement from learning How to Let the Symbols Do the Work.
There is no doubt that he would still cling to this goal, but even Dijkstra may have started to doubt his fifty-year timeline.

Monday, March 31, 2014

Creative Problems

The open-access requirement for Electronic Theses and Dissertations (ETDs) should be a no-brainer. At virtually every university in the world, there is a centuries-old public component to the doctoral-degree requirement. With digital technology, that public component is implemented more efficiently and effectively. Yet, a small number of faculty fight the idea of Open Access for ETDs. The latest salvo came from Jennifer Sinor, an associate professor of English at Utah State University.
[One Size Doesn't Fit All, Jennifer Sinor, The Chronicle of Higher Education, March 24, 2014]

According to Sinor, Creative Writing departments are different and should be exempted from open-access requirements. She illustrates her objection to Open Access ETDs with an example of a student who submitted a novel as his master's thesis. He was shocked when he found out his work was for sale online by a third party. Furthermore, according to Sinor, the mere existence of the open-access thesis makes it impossible for that student to pursue a conventional publishing deal.


Sinor offers a solution to these problems, which she calls a middle path: Theses should continue to be printed, stored in libraries, accessible through interlibrary loan, and never digitized without the author's approval. Does anyone really think it is a common-sense middle path of moderation and reasonableness to pretend that the digital revolution never happened?

Our response could be brief. We could just observe that it does not matter whether or not Sinor's Luddite approach is tenable, and it does not matter whether or not her arguments hold water. Society will not stop changing because a small group of people pretend reality does not apply to them. Reality will, eventually, take over. Nevertheless, let us examine her arguments.

Multiyear embargoes are a routine part of Open Access policies for ETDs. I do not know of a single exception. After a web search that took less than a minute, I found the ETD policy of Sinor's own institution. The second and third sentences of USU's ETD policy read as follows [ETD Forms and Policy, DigitalCommons@usu.edu]:
“However, USU recognizes that in some rare situations, release of a dissertation/thesis may need to be delayed. For these situations, USU provides the option of embargoing (i.e. delaying release) of a dissertation or thesis for five years after graduation, with an option to extend indefinitely.”
How much clearer can this policy be?

The student in question expressly allowed for third parties to sell his work by leaving a checkbox unchecked in a web form. Sinor excuses the student for his naïveté. However, anyone who hopes to make a living from creative writing in a web-connected world should have advanced knowledge of the business of selling one's works, of copyright law, and of publishing agreements. Does Sinor imply that a master's-level student in her department never had any exposure to these issues? If so, that is an inexcusable oversight in the department's curriculum.

This leads us to Sinor's final argument: that conventional publishers will not consider works that are also available as Open Access ETDs. This has been thoroughly studied and debunked. See:
"Do Open Access Electronic Theses and Dissertations Diminish Publishing Opportunities in the Social Sciences and Humanities?" Marisa L. Ramirez, Joan T. Dalton, Gail McMillan, Max Read, and Nan Seamans. College & Research Libraries, July 2013, 74:368-380.

This should put to rest the most pressing issues. Yet, for those who cannot shake the feeling that Open Access robs students of an opportunity to monetize their work, there is another way out of the quandary. It is within the power of any Creative Writing department to solve the issue once and for all.

All university departments have two distinct missions: to teach a craft and to advance scholarship in their discipline. As a rule of thumb, the teaching of craft dominates up to the masters-degree level. The advancement of scholarship, which goes beyond accepted craft and into the new and experimental, takes over at the doctoral level.

When submitting a novel (or a play, a script, or a collection of poetry) as a thesis, the student exhibits his or her mastery of craft. This is appropriate for a masters thesis. However, when Creative Writing departments accept novels as doctoral theses, they put craft ahead of scholarship. It is difficult to see how any novel by itself advances the scholarship of Creative Writing.

The writer of an experimental masterpiece should have some original insights into his or her craft. Isn't it the role of universities to reward those insights? Wouldn't it make sense to award the PhD, not based on a writing sample, but based on a companion work that advances the scholarship of Creative Writing? Such a thesis would fit naturally within the open-access ecosystem of other scholarly disciplines without compromising the work itself in any way.

This is analogous to any number of scientific disciplines, where students develop equipment or software or a new chemical compound. The thesis is a description of the work and the ideas behind it. After a reasonable embargo to allow for patent applications, any such thesis may be made Open Access without compromising the commercial value of the work at the heart of the research.

A policy that is successful for most may fail for some. Some disciplines may be so fundamentally different that they need special processes. Yet, Open Access is merely the logical extension of long-held traditional academic values. If this small step presents such a big problem for one department and not for others, it may be time to re-examine existing practices at that department. Perhaps, the Open Access challenge is an opportunity to change for the better.

Monday, March 17, 2014

Textbook Economics

The impact of royalties on a book's price, and its sales, is greater than you think. Lower royalties often end up better for the author. That was the publisher's pitch when I asked him about the details of the proposed publishing contract. Then, he explained how he prices textbooks.

It was the early 1990s. I had been teaching a course on Concurrent Scientific Computing, a hot topic then, and several publishers had approached me about writing a textbook. This was an opportunity to structure a pile of course notes. Eventually, I would sign on with a different publisher, a choice that had nothing to do with royalties or book prices. [Concurrent Scientific Computing, Van de Velde E., Springer-Verlag New York, Inc., New York, NY, 1994.]

He explained that a royalty of 10% increases the price by more than 10%. To be mathematical about it: With a royalty rate r, a target revenue per book C, and a retail price P, we have that C = P-rP (retail price minus royalties). Therefore, P = C/(1-r). With a target revenue per book of $100, royalties of 10%, 15%, and 20% lead to retail prices of $111.11, $117.65, and $125.00, respectively.
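The relation P = C/(1-r) is easy to check numerically. The following is a minimal Python sketch; the function name `retail_price` is an assumption made for illustration, and the inputs are the illustrative figures quoted above.

```python
def retail_price(target_revenue: float, royalty_rate: float) -> float:
    """Return the retail price P for which P - r*P equals the target revenue C."""
    if not 0 <= royalty_rate < 1:
        raise ValueError("royalty rate must be at least 0 and less than 1")
    return target_revenue / (1 - royalty_rate)


# Reproduce the three cases from the text with C = $100.
for rate in (0.10, 0.15, 0.20):
    price = retail_price(100.0, rate)
    print(f"royalty {rate:.0%}: retail price ${price:.2f}")
```

Because the royalty is a percentage of the retail price rather than of the target revenue, the markup over C always exceeds the royalty rate itself.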

In a moment of candor, he also revealed something far more interesting: how he sets the target revenue C. Say the first printing of 5000 copies requires an up-front investment of $100,000. (All numbers are for illustrative purposes only.) This includes the cost of editing, copy-editing, formatting, cover design, printing, binding, and administrative overhead. Estimating library sales at 1000 copies, this publisher would set C at $100,000/1,000 = $100. In other words, he recovered his up-front investment from libraries. Retail sales were pure profit.
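As a rough sketch of this pricing rule, the Python snippet below computes the target revenue C from the up-front investment and the estimated library sales, then shows the publisher's net position as copies are sold. The function names are hypothetical, the figures are the illustrative ones above, and the model deliberately ignores royalties, discounts, and reprint costs.

```python
def target_revenue(upfront_investment: float, library_copies: int) -> float:
    """Target revenue per book C: up-front cost spread over expected library sales."""
    if library_copies <= 0:
        raise ValueError("library_copies must be positive")
    return upfront_investment / library_copies


def publisher_net(copies_sold: int, upfront_investment: float, revenue_per_copy: float) -> float:
    """Net position after selling copies_sold copies (simplified model, first printing only)."""
    return copies_sold * revenue_per_copy - upfront_investment


investment = 100_000.0   # illustrative up-front investment
library_copies = 1_000   # illustrative estimate of library sales

c = target_revenue(investment, library_copies)
print(f"target revenue per book: ${c:.2f}")
print(f"net after library sales only: ${publisher_net(library_copies, investment, c):.2f}")
print(f"net after the full 5000-copy printing: ${publisher_net(5_000, investment, c):.2f}")
```

In this simplified model, the publisher breaks even on library sales alone, and every retail copy sold afterwards adds C to the bottom line.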

The details are, no doubt, more complicated. Yet, even without relying on a recollection of an old conversation, it is safe to assume that publishers use the captive library market to reduce their business risk. In spite of increasingly recurrent crises, library budgets remain fairly predictable, both in size and in how the money is spent. Any major publisher has reliable advance estimates of library sales for any given book, particularly if published as part of a well-known series. It is just good business to exploit that predictability.

The market should be vastly different now, but textbooks have remained stuck in the paper era longer than other publications. Moreover, the first stage of the move towards digital, predictably, consists of replicating the paper world. This is what all constituents want: Librarians want to keep lending books. Researchers and students like getting free access to quality books. Textbook publishers do not want to lose the risk-reducing revenue stream from libraries. As a result, everyone implements the status quo in digital form. Publishers produce digital books and rent their collections to libraries through site licenses. Libraries intermediate electronic-lending transactions. Users get the paper experience in digital form. Universities pay for site licenses and the maintenance of the digital-lending platforms.

After the disaster of site licenses for scholarly journals, repeating the same mistake with books seems silly. Once again, take-it-or-leave-it bundles force institutions into a false choice between buying too much for everyone or nothing at all. Once again, site licenses eliminate the unlimited flexibility of digital information. Forget about putting together a personal collection tailored to your own requirements. Forget about pricing per series, per book, per chapter, unlimited in time, one-day access, one-hour access, readable on any device, or tied to a particular device. All of these options are eliminated to maintain the business models and the intermediaries of the paper era.

Just by buying/renting books as soon as they are published, libraries indirectly pay for a significant fraction of the initial investment of producing textbooks. If libraries made that initial investment explicitly and directly, they could produce those same books and set them free. Instead of renting digital books (and their multimedia successors), libraries could fund authors to write books and contract with publishers to publish those manuscripts as open-access works. Authors would be compensated. Publishers would compete for library funds as service providers. Publishers would be free to pursue the conventional pay-for-access publishing model, just not with library dollars. Prospective authors would have a choice: compete for library funding to produce an open-access work or compete for a publishing contract to produce a pay-for-access work.

The Carnegie model of libraries fused together two distinct objectives: subsidize information and disseminate information by distributing books to many different locations. In web-connected communities, spending precious resources on dissemination is a waste. Inserting libraries in digital-lending transactions only makes those transactions more inconvenient. Moreover, it requires expensive-to-develop-and-maintain technology. By reallocating these resources towards subsidizing information, libraries could set information free without spending part of their budget on reducing publishers' business risk. The fundamental budget questions that remain are: Which information should be subsidized? What is the most effective way to subsidize information?

Libraries need not suddenly stop site licensing books tomorrow. In fact, they should take a gradual approach, test the concept, make mistakes, and learn from them. A library does not become a grant sponsor and/or publisher overnight. Several models are already available: from grant competition to crowd-funded ungluing. [Unglue.it for Libraries] By phasing out site licenses, any library can create budgetary space for sponsoring open-access works.

Libraries have a digital future with almost unlimited opportunities. Yet, they will miss out if they just rebuild themselves as a digital copy of the paper era.

Monday, January 20, 2014

A Cloud over the Internet

Cloud computing could not have existed without the Internet, but it may make Internet history by making the Internet history.

Organizations are rushing to move their data centers to the cloud. Individuals have been using cloud-based services, like social networks, cloud gaming, Google Apps, Netflix, and Aereo. Recently, Amazon introduced WorkSpaces, a comprehensive personal cloud-computing service. The immediate benefits and opportunities that fuel the growth of the cloud are well known. The long-term consequences of cloud computing are less obvious, but a little extrapolation may help us make some educated guesses.

Personal cloud computing takes us back to the days of remote logins with dumb terminals and modems. Like the one-time office computer, the cloud computer does almost all of the work. Like the dumb terminal, a not-so-dumb access device (anything from the latest wearable gadget to a desktop) handles input/output. Input evolved beyond keystrokes and now also includes touch-screen gestures, voice, image, and video. Output evolved from green-on-black characters to multimedia.

When accessing a web page with content from several contributors (advertisers, for example), the page load time depends on several factors: the performance of computers that contribute web-page components, the speed of the Internet connections that transmit these components, and the performance of the computer that assembles and formats the web page for display. By connecting to the Internet through a cloud computer, we bypass the performance limitations of our access device. All bandwidth-hungry communication occurs in the cloud on ultra-fast networks, and almost all computation occurs on a high-performance cloud computer. The access device and its Internet connection just need to be fast enough to process the information streams into and out of the cloud. Beyond that, the performance of the access device hardly matters.

Because of economies of scale, the cloud-enabled net is likely to be a highly centralized system dominated by a small number of extremely large providers of computing and networking. This extreme concentration of infrastructure stands in stark contrast to the original Internet concept, which was designed as a redundant, scalable, and distributed system without a central authority or a single point of failure.

When a cloud provider fails, it disrupts its own customers, and the disruption immediately propagates to the customers' clients. Every large provider is, therefore, a systemic vulnerability with the potential of taking down a large fraction of the world's networked services. Of course, cloud providers are building infrastructure of extremely high reliability with redundant facilities spread around the globe to protect against regional disasters. Unfortunately, facilities of the same provider all have identical vulnerabilities, as they use identical technology and share identical management practices. This is a setup for black-swan events, low-probability large-scale catastrophes.

The Internet is overseen and maintained by a complex international set of authorities. [Wikipedia: Internet Governance] That oversight loses much of its influence when most communication occurs within the cloud. Cloud providers will be tempted to deploy more efficient custom communication technology within their own facilities. After all, standard Internet protocols were designed for heterogeneous networks. Much of that design is not necessary on a network where one entity manages all computing and all communication. Similarly, any two providers may negotiate proprietary communication channels between their facilities. Step by step, the original Internet will be relegated to the edges of the cloud, where access devices connect with cloud computers.

Net neutrality is already on life support. When cloud providers compete on price and performance, they are likely to segment the market. Premium cloud providers are likely to attract high-end services and their customers, relegating the rest to second-tier low-cost providers. Beyond net neutrality, there may be a host of other legal implications when communication moves from public channels to private networks.

When traffic moves to the cloud, telecommunication companies will gradually lose the high-margin retail market of providing organizations and individuals with high-bandwidth point-to-point communication. They will not derive any revenue from traffic between computers within the same cloud facility. The revenue from traffic between cloud facilities will be determined by a wholesale market with customers that have the resources to build and/or acquire their own communication capacity.

The existing telecommunication infrastructure will mostly serve to connect access devices to the cloud over relatively low-bandwidth channels. When TV channels are delivered to the cloud (regardless of technology), users select their channel on the cloud computer. They do not need all channels delivered to the home at all times; one TV channel at a time per device will do. When phones are cloud-enabled, a cloud computer intermediates all communication and provides the functional core of the phone.

Telecommunication companies may still come out ahead as long as the number of access devices keeps growing. Yet, they should at least question whether it would be more profitable to invest in cloud computing instead of ever higher bandwidth to the consumer.

The cloud will continue to grow as long as its unlimited processing power, storage capacity, and communication bandwidth provide new opportunities at irresistible price points. If history is any guide, long-term and low-probability problems at the macro level are unlikely to limit its growth. Even if our extrapolated scenario never completely materializes, the cloud will do much more than increase efficiency and/or lower cost. It will change the fundamental character of the Internet.

Wednesday, January 1, 2014

Market Capitalism and Open Access

Is it feasible to create a self-regulating market for Open Access (OA) journals where competition for money is aligned with the quest for scholarly excellence?

Many proponents of the subscription model argue that a competitive market provides the best assurance for quality. This ignores that the relationship between a strong subscription base and scholarly excellence is tenuous at best. What if we created a market that rewards journals when a university makes its most tangible commitment to scholarly excellence?

While the role of journals in actual scholarly communication has diminished, their role in academic career advancement remains as strong as ever. [Paul Krugman: The Facebooking of Economics] The scholarly-journal infrastructure streamlines the screening, comparing, and short-listing of candidates. It enables the gathering of quantitative evidence in support of the hiring decision. Without journals, the work load of search committees would skyrocket. If scholarly journals are the headhunters of the academic-job market, let us compensate them as such.

There are many ways to structure such compensation, but we only need one example to clarify the concept. Consider the following scenario:

  • The new hire submitted a bibliography of 100 papers.
  • The search committee selected 10 of those papers to argue the case in favor of the appointment. This subset consists of 6 papers in subscription journals, 3 papers in the OA journal Theoretical Approaches to Theory (TAT), and 1 paper in the OA journal Practical Applications of Practice (PAP).
  • The university's journal budget is 1% of its budget for faculty salaries. (In reality, that percentage would be much lower.)

Divide the new faculty member's share of the journal budget, 1% of his or her salary, into three portions:

  • (6/10) x 1% = 0.6% of salary to subscription journals,
  • (3/10) x 1% = 0.3% of salary to the journal TAT, and
  • (1/10) x 1% = 0.1% of salary to the journal PAP.

The first portion (0.6%) remains in the journal budget to pay for subscriptions. The second (0.3%) and third (0.1%) portions are awarded yearly to the OA journals TAT and PAP, respectively. The university adjusts the reward formula every time a promotion committee determines a new list of best papers.
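The split in this scenario can be sketched in a few lines of Python. The function name `reward_shares` and the dictionary layout are assumptions made for illustration only; the paper counts and the 1% base percentage come from the scenario above. Exact fractions avoid rounding noise in the percentages.

```python
from fractions import Fraction


def reward_shares(selected_papers: dict[str, int], budget_share: Fraction) -> dict[str, Fraction]:
    """Split the journal-budget share of a salary in proportion to the selected papers per venue."""
    total = sum(selected_papers.values())
    if total == 0:
        raise ValueError("the search committee must select at least one paper")
    return {
        venue: Fraction(count, total) * budget_share
        for venue, count in selected_papers.items()
    }


# The 10 papers selected by the search committee, grouped by venue.
selected = {"subscription journals": 6, "TAT": 3, "PAP": 1}

# Journal budget assumed to be 1% of salary, as in the scenario.
shares = reward_shares(selected, Fraction(1, 100))

for venue, share in shares.items():
    print(f"{venue}: {float(share):.1%} of salary")
```

When a promotion committee later selects a new list of best papers, rerunning the same calculation with the new counts yields the adjusted rewards.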

To move beyond a voluntary system, universities should give headhunting rewards only to those journals with whom they have a contractual relationship. Some Gold OA journals are already pursuing institutional-membership deals that eliminate or reduce article processing charges (APCs). [BioMed Central] [PeerJ] [SpringerOpen] Such memberships are a form of discounting for quantity. Instead, we propose a pay-for-performance contract that eliminates APCs in exchange for headhunting rewards. Before signing such a contract, a university would conduct a due-diligence investigation into the journal. It would assess the publisher's reputation, the journal's editorial board, its refereeing, editing, formatting, and archiving standards, its OA licensing practices, and its level of participation in various abstracting-and-indexing and content-mining services. This step would all but eliminate predatory journals.

Every headhunting reward would enhance the prestige (and the bottom line) of a journal. A reward citing a paper would be a significant recognition of that paper. Such citations might be even more valuable than citations in other papers, thereby creating a strong incentive for institutions to participate in the headhunting system. Nonparticipating institutions would miss out on publicly recognizing the work of their faculty, and their faculty would have to pay APCs. There is no Open Access free ride.

Headhunting rewards create little to no extra work for search committees. Academic libraries are more than capable of performing due diligence, negotiating the contracts, and administering the rewards. Our scenario assumed a base percentage of 1%. The actual percentage would be negotiated between universities and publishers. With rewards proportional to salaries, there is a built-in adjustment for inflation, for financial differences between institutions and countries, and for differences in the sizes of various scholarly disciplines.

Scholars retain the right to publish in the venue of their choice. The business models of journals are used when distributing rewards, but this occurs well after the search process has concluded. The headhunting rewards gradually reduce the subscription budget in proportion to the number of papers published in OA journals by the university's faculty. A scholar who wishes to support a brand-new journal should not pay APCs, but lobby his or her university to negotiate a performance-based headhunting contract.

The essence of this proposal is the performance-based contract that exchanges APCs for headhunting rewards. All other details are up for discussion. Every university would be free to develop its own specific performance criteria and reward structures. Over time, we would probably want to converge towards a standard contract.

Headhunting contracts create a competitive market for OA journals. In this market, the distributed and collective wisdom of search/promotion committees defines scholarly excellence and provides the monetary rewards to journals. As a side benefit, this free-market system creates a professionally managed open infrastructure for the scholarly archive.