[LINK] Wired: 'Inside arXiv—the Most Transformative Platform in All of Science'
Roger Clarke
Roger.Clarke at xamax.com.au
Fri Mar 28 06:27:28 AEDT 2025
[ This topic and article feel like a throwback to a time when the
Internet was a lot younger (and so were some of us), and people did good
and interesting things. They still do. But the good and interesting
things are getting rather drowned out by the dross and the venal. ]
> Then Ginsparg heard about something called the “World Wide Web.”
Initially skeptical—“I can’t really pay attention to every single
fad”—he became intrigued when the Mosaic browser was released in 1993.
Soon after, Ginsparg built a web interface for arXiv, which over time
became its primary mode of access. [ And then he got to eat Tim
Berners-Lee's BBQ'd swordfish. ]
[ Here's how to get an exponential adoption curve: ]
> Early on, Ginsparg expected to receive on the order of 100
submissions to arXiv a year. It turned out to be closer to 100 a month,
and growing. “Day one, something happened, day two something happened,
day three, Ed Witten posted a paper,” as Ginsparg once put it. “That was
when the entire community joined.” Edward Witten is a revered string
theorist and, quite possibly, the smartest person alive.
[ And a warning about the unfortunate but crucial need for commercial
nous when establishing a commons: ]
> THE BIGGEST MYSTERY is not why arXiv succeeded. Rather, it’s how it
wasn’t killed by vested interests intent on protecting traditional
academic publishing. Perhaps this was due to a decision Ginsparg made
early on: Upon submission, users signed a clause that gave arXiv
nonexclusive license to distribute the work in perpetuity, even in the
event of future publication elsewhere. The strategic move ensured that
no major publishers, known for their typically aggressive actions to
maintain feudal control, would ever seriously attempt to shut it down.
[ And the hacker mentality in one sentence: ]
> “I learned Fortran in the 1960s, and real [ForTran] programmers
didn’t document”.
[ My counter-story as a non-hacker: "I did too - just - but I also
learned COBOL, and real COBOL programmers [once upon a time] did
document [at least embedded in the coding]." ]
_________
Inside arXiv—the Most Transformative Platform in All of Science
Modern science wouldn’t exist without the online research repository
known as arXiv. Three decades in, its creator still can’t let it go.
SHEON HAN
Wired
MAR 27, 2025 6:00 AM
https://www.wired.com/story/inside-arxiv-most-transformative-code-science/
“JUST WHEN I thought I was out, they pull me back in!” With a sly grin
that I’d soon come to recognize, Paul Ginsparg quoted Michael Corleone
from The Godfather. Ginsparg, a physics professor at Cornell University
and a certified MacArthur genius, may have little in common with Al
Pacino’s mafia don, but both are united by the feeling that they were
denied a graceful exit from what they’ve built.
Nearly 35 years ago, Ginsparg created arXiv, a digital repository where
researchers could share their latest findings—before those findings had
been systematically reviewed or verified. Visit arXiv.org today (it’s
pronounced like “archive”) and you’ll still see its old-school Web 1.0
design, featuring a red banner and the seal of Cornell University, the
platform’s institutional home. But arXiv’s unassuming facade belies the
tectonic reconfiguration it set off in the scientific community. If
arXiv were to stop functioning, scientists from every corner of the
planet would suffer an immediate and profound disruption. “Everybody in
math and physics uses it,” Scott Aaronson, a computer scientist at the
University of Texas at Austin, told me. “I scan it every night.”
Every industry has certain problems universally acknowledged as broken:
insurance in health care, licensing in music, standardized testing in
education, tipping in the restaurant business. In academia, it’s
publishing. Academic publishing is dominated by for-profit giants like
Elsevier and Springer. Calling their practice a form of thuggery isn’t
so much an insult as an economic observation. Imagine if a book
publisher demanded that authors write books for free and, instead of
employing in-house editors, relied on other authors to edit those books,
also for free. And not only that: The final product was then sold at
prohibitively expensive prices to ordinary readers, and institutions
were forced to pay exorbitant fees for access.
The “free editing” academic publishers facilitate is called peer review,
the process by which fellow researchers vet new findings. This can take
months, even a year. But with arXiv, scientists could post their
papers—known, at this unvetted stage, as preprints—for instant and free
access to everyone. One of arXiv’s great achievements was “showing that
you could divorce the actual transmission of your results from the
process of refereeing,” said Paul Fendley, an early arXiv moderator and
now a physicist at All Souls College, Oxford. During crises like the
Covid pandemic, time-sensitive breakthroughs were disseminated
quickly—particularly by bioRxiv and medRxiv, platforms inspired by
arXiv—potentially saving, by one study’s estimate, millions of lives.
While arXiv submissions aren’t peer-reviewed, they are moderated by
experts in each field, who volunteer their time to ensure that
submissions meet basic academic standards and follow arXiv’s guidelines:
original research only, no falsified data, sufficiently neutral
language. Submissions also undergo automated checks for baseline quality
control. Without these, pseudoscientific papers and amateur work would
flood the platform.
In 2021, the journal Nature declared arXiv one of the “10 computer codes
that transformed science,” praising its role in fostering scientific
collaboration. (The article is behind a paywall—unlock it for $199 a
year.) By a recent count, arXiv hosts more than 2.6 million papers,
receives 20,000 new submissions each month, and has 5 million monthly
active users. Many of the most significant discoveries of the 21st
century have first appeared on the platform. The “transformers” paper
that launched the modern AI boom? Uploaded to arXiv. Same with the
solution to the Poincaré conjecture, one of the seven Millennium Prize
problems, famous for their difficulty and $1 million rewards. Just
because a paper is posted on arXiv doesn’t mean it won’t appear in a
prestigious journal someday, but it’s often where research makes its
debut and stays openly available. The transformers paper is still
routinely accessed via arXiv.
For scientists, imagining a world without arXiv is like the rest of us
imagining one without public libraries or GPS. But a look at its inner
workings reveals that it isn’t a frictionless utopia of open-access
knowledge. Over the years, arXiv’s permanence has been threatened by
everything from bureaucratic strife to outdated code to even, once, a
spy scandal. In the words of Ginsparg, who usually redirects interview
requests to an FAQ document—on arXiv, no less—and tried to talk me out
of visiting him in person, arXiv is “a child I sent off to college but
who keeps coming back to camp out in my living room, behaving badly.”
GINSPARG AND I met over the course of several days last spring in
Ithaca, New York, home of Cornell University. I’ll admit, I was
apprehensive ahead of our time together. Geoffrey West, a former
supervisor of Ginsparg’s at Los Alamos National Laboratory, once
described him as “quite a character” who is “infamous in the community”
for being “quite difficult.” He also said he was “extremely funny” and a
“great guy.” In our early email exchanges, Ginsparg told me, upfront,
that stories about arXiv never impress him: “So many articles, so few
insights,” he wrote.
At 69 years old, Ginsparg has the lean build of a retired triathlete,
his knees etched with scars collected over a lifetime of hiking,
mountain climbing, and cycling. (He still leads hikes on occasion,
leaving younger scientists struggling to keep up.) His attire was always
relaxed, as though he’d just stepped off the Camino de Santiago, making
my already casual clothes seem overdressy. Much of our time together was
spent cycling the town’s rolling hills, and the maximum speed on the
ebike I rented could not keep up with his efficient pedaling.
Invited one afternoon to Ginsparg’s office in Cornell’s physics
building, I discovered it to be not “messy,” per se, because that
suggests it could be cleaned. Instead, the objects in the room seemed
inert, long since resigned to their fate: unopened boxes from the 1990s,
piles of Physics Today magazines, an inexplicable CRT monitor, a
tossed-aside invitation to the Obama White House. New items were
occasionally added to the heap. I spotted a copy of Stephen Wolfram’s
recent book, The Second Law, with a note from Wolfram that read, “Since
you can’t find it on arXiv :)” The only thing that seemed actively in
use was the blackboard, dense with symbols and equations related to
quantum measurement theory, sprawling with bra-ket notation.
As he showed me around the building and his usual haunts, Ginsparg was
gregarious, not letting a single detail slip by: the nesting patterns of
local red-tailed hawks, the comings and goings of the dining staff, the
plans for a new building going up behind his office. He was playful,
even prankish. Midway through telling me about a podcast he was
listening to, Ginsparg suddenly stopped and said, “I like your hair
color, by the way, it works for you”—my hair is dyed ash gray, if anyone
cares—before seamlessly transitioning to a story about a hard drive that
had failed him.
The drive, which he had sent for recovery, contained a language model,
Ginsparg’s latest intellectual fascination. Among his litany of peeves
is that, because arXiv has seen a surge in submissions in recent times,
especially in the AI category, the number of low-quality papers has
followed a similar curve—and arXiv has nowhere near enough volunteers to
vet them all. Hence his fussing with the drive, part of a quest to catch
subpar submissions with what he calls “the holy grail crackpot filter.”
And Ginsparg thinks, as he often has in arXiv’s three-decade history,
that the quality would not be up to snuff if he doesn’t do it himself.
LONG BEFORE ARXIV became critical infrastructure for scientific
research, it was a collection of shell scripts running on Ginsparg’s
NeXT machine. In June 1991, Ginsparg, then a researcher at Los Alamos
National Laboratory, attended a conference in Colorado, where a fateful
encounter took place.
First came a remark from Joanne Cohn, a friend of Ginsparg’s and a
postdoc at the Institute for Advanced Study in Princeton, who maintained
a mailing list for physics preprints. At the time, there was no
centralized way to access these preprints. Unless researchers were on
certain mailing lists—which were predicated on their affiliations with
prestigious institutions—or knew exactly whom to contact via email, they
had to wait months to read new work in published journals.
Then came an offhand comment from a physicist worried about his
computer’s storage filling up with emailed articles while he was traveling.
Ginsparg, who had been coding since high school, asked Cohn if she’d
considered automating the distribution process. She hadn’t and told him
to go ahead and do it himself. “My recollection is that the next day
he’d come up with the scripts and seemed pretty happy about having done
it so quickly,” Cohn told me. “It’s hard to communicate how different it
was at the time. Paul had really seen ahead.”
Hearing tales from and about Ginsparg, you can’t help but see him as a
sort of Forrest Gump figure of the internet age, who found himself at
crucial junctures and crossed paths with revolutionary figures. As an
undergrad at Harvard, he was classmates with Bill Gates and Steve
Ballmer; his older brother was a graduate student at Stanford studying
with Terry Winograd, an AI pioneer. The brothers both had email
addresses and access to Arpanet, the precursor to the internet, at a
time when few others did.
After earning his PhD in theoretical physics at Cornell, Ginsparg began
teaching at Harvard. A career there wasn’t to be: He wasn’t granted
tenure—Harvard is infamous for this—and started looking for a job
elsewhere. That’s when Ginsparg was recruited to Los Alamos, where he
was free to do research on theoretical high-energy physics full-time,
without other responsibilities. Plus, New Mexico was perfect for his
active lifestyle.
When arXiv started, it wasn’t a website but an automated email server
(and within a few months also an FTP server). Then Ginsparg heard about
something called the “World Wide Web.” Initially skeptical—“I can’t
really pay attention to every single fad”—he became intrigued when the
Mosaic browser was released in 1993. Soon after, Ginsparg built a web
interface for arXiv, which over time became its primary mode of access.
He also occasionally consulted with a programmer at the European
Organization for Nuclear Research (CERN) named Tim Berners-Lee—now Sir
Tim “Inventor of the World Wide Web” Berners-Lee—whom Ginsparg fondly
credits with grilling excellent swordfish at his home in the French
countryside.
In 1994, with a National Science Foundation grant, Ginsparg hired two
people to transform arXiv’s shell scripts into more reliable Perl code.
They were both technically gifted, perhaps too gifted to stay for long.
One of them, Mark Doyle, later joined the American Physical Society and
became its chief information officer. The other, Rob Hartill, was
working simultaneously on a project to collect entertainment data: the
Internet Movie Database. (After IMDb, Hartill went on to do notable work
at the Apache Software Foundation.)
Before arXiv was called arXiv, it was accessed under the hostname
xxx.lanl.gov (“xxx” didn’t have the explicit connotations it does today,
Ginsparg emphasized). During a car ride, he and his wife brainstormed
nicer-sounding names. Archive? Already taken. Maybe they could sub in
the Greek equivalent of X, chi (pronounced like “kai”). “She wrote it
down and crossed out the e to make it more symmetric around the X,”
Ginsparg said. “So arXiv it was.” At this point, there wasn’t much
formal structure. The number of developers typically stayed at one or
two, and much of the moderation was managed by Ginsparg’s friends,
acquaintances, and colleagues.
Early on, Ginsparg expected to receive on the order of 100 submissions
to arXiv a year. It turned out to be closer to 100 a month, and growing.
“Day one, something happened, day two something happened, day three, Ed
Witten posted a paper,” as Ginsparg once put it. “That was when the
entire community joined.” Edward Witten is a revered string theorist
and, quite possibly, the smartest person alive. “The arXiv enabled much
more rapid worldwide communication among physicists,” Witten wrote to me
in an email. Over time, disciplines such as mathematics and computer
science were added, and Ginsparg began to appreciate the significance of
this new electronic medium. Plus, he said, “it was fun.”
As the usage grew, arXiv faced challenges similar to those of other
large software systems, particularly in scaling and moderation. There
were slowdowns to deal with, like the time arXiv was hit by too much
traffic from “stanford.edu.” The culprits? Sergey Brin and Larry Page,
who were then busy indexing the web for what would eventually become
Google. Years later, when Ginsparg visited Google HQ, both Brin and Page
personally apologized to him for the incident.
THE BIGGEST MYSTERY is not why arXiv succeeded. Rather, it’s how it
wasn’t killed by vested interests intent on protecting traditional
academic publishing. Perhaps this was due to a decision Ginsparg made
early on: Upon submission, users signed a clause that gave arXiv
nonexclusive license to distribute the work in perpetuity, even in the
event of future publication elsewhere. The strategic move ensured that
no major publishers, known for their typically aggressive actions to
maintain feudal control, would ever seriously attempt to shut it down.
But even as arXiv’s influence grew, higher-ups at Los Alamos never
particularly championed the project—which was becoming, one could argue,
more influential than the lab itself. (This was, of course, long past
the heyday of Oppenheimer depicted in Christopher Nolan’s middling 2023
docudrama.) Those early years at Los Alamos were “dreamlike and
heavenly,” Ginsparg emphasized, the best job he ever had. But in 1999, a
fellow physicist at the lab, Wen Ho Lee, was accused of leaking
classified information to China. Lee, a Taiwanese American, was later
cleared of wrongdoing, and the case was widely criticized for racial
profiling. At the time, the scandal led to internal upheaval. There were
travel restrictions to prevent leaks, and even discussions about
subjecting employees to lie detector tests. “It just got glummer and
glummer,” Ginsparg said. It didn’t help that a performance review that
year labeled him “a strictly average performer” with “no particular
computer skills contributing to lab programs.” Also, his daughter had
just been born, and there weren’t schools nearby. He was ready to leave.
Ginsparg stops short of saying he “brought” arXiv with him, but the fact
is, he ended up back at his alma mater, Cornell—tenured, this time—and
so did arXiv. He vowed to be free of the project within “five years
maximum.” After all, his main job wasn’t supposed to be running arXiv—it
was teaching and doing research. At the university, arXiv found a home
within the library. “They disseminate material to academics,” Ginsparg
said, “so that seemed like a natural fit.”
A natural fit it was not. Under the hood, arXiv was a complex software
platform that required technical expertise far beyond what was typically
available in a university library. The logic for the submission process
alone involved a vast number of potential scenarios and edge cases,
making the code convoluted. Ginsparg and other early arXiv members I
spoke to felt that the library failed to grasp arXiv’s significance and
treated it more like an afterthought.
On the library’s side, some people thought Ginsparg was too hands-on.
Others said he wasn’t patient enough. A “good lower-level manager,”
according to someone long involved with arXiv, “but his sense of
management didn’t scale.” For most of the 2000s, arXiv couldn’t hold on
to more than a few developers.
THERE ARE TWO paths for pioneers of computing. One is a life of board
seats, keynote speeches, and lucrative consulting gigs. The other is the
path of the practitioner who remains hands-on, still writing and
reviewing code. It’s clear where Ginsparg stands—and how anathema the
other path is to him. As he put it to me, “Larry Summers spending one
day a week consulting for some hedge fund—it’s just unseemly.”
But overstaying one’s welcome also risks unseemliness. By the mid-2000s,
as the web matured, arXiv—in the words of its current program director,
Stephanie Orphan—got “bigger than all of us.” A creationist physicist
sued it for rejecting papers on creationist cosmology. Various other
mini-scandals arose, including a plagiarism one, and some users
complained that the moderators—volunteers who are experts in their
respective fields—held too much power. In 2009, Philip Gibbs, an
independent physicist, even created viXra (arXiv spelled backward), a
more or less unregulated Wild West where papers on
quantum-physico-homeopathy can find their readership, for anyone eager
to learn why pi is a lie.
Then there was the problem of managing arXiv’s massive code base.
Although Ginsparg was a capable programmer, he wasn’t a software
professional adhering to industry norms like maintainability and
testing. Much like constructing a building without proper structural
supports or routine safety checks, his methods allowed for quick initial
progress but later caused delays and complications. Unrepentant,
Ginsparg often went behind the library’s back to check the code for
errors. The staff saw this as an affront, accusing him of micromanaging
and sowing distrust.
In 2011, arXiv’s 20th anniversary, Ginsparg thought he was ready to move
on, writing what was intended as a farewell note, an article titled
“ArXiv at 20,” in Nature: “For me, the repository was supposed to be a
three-hour tour, not a life sentence. ArXiv was originally conceived to
be fully automated, so as not to scuttle my research career. But daily
administrative activities associated with running it can consume hours
of every weekday, year-round without holiday.”
Ginsparg would stay on the advisory board, but daily operations would be
handed over to the staff at the Cornell University Library.
It never happened, and as time went on, some accused Ginsparg of
“backseat driving.” One person said he was holding certain code
“hostage” by refusing to share it with other employees or on GitHub.
Ginsparg was frustrated because he couldn’t understand why implementing
features that used to take him a day now took weeks. I challenged him on
this, asking if there was any documentation for developers to onboard
the new code base. Ginsparg responded, “I learned Fortran in the 1960s,
and real programmers didn’t document,” which nearly sent me, a coder,
into cardiac arrest.
Technical problems were compounded by administrative ones. In 2019,
Cornell transferred arXiv to the school’s Computing and Information
Science division, only to have it change hands again after a few months.
Then a new director with a background in, of all things, for-profit
academic publishing took over; she lasted a year and a half. “There was
disruption,” said an arXiv employee. “It was not a good period.”
But finally, relief: In 2022, the Simons Foundation committed funding
that allowed arXiv to go on a hiring spree. Ramin Zabih, a Cornell
professor who had been a long-time champion, joined as the faculty
director. Under the new governance structure, arXiv’s migration to the
cloud and a refactoring of the code base to Python finally took off.
ONE SATURDAY MORNING, I met Ginsparg at his home. He was carefully
inspecting his son’s bike, which I was borrowing for a three-hour ride
we had planned to Mount Pleasant. As Ginsparg shared the route with me,
he teasingly—but persistently—expressed doubts about my ability to keep
up. I was tempted to mention that, in high school, I’d cycled solo
across Japan, but I refrained and silently savored the moment when, on
the final uphill later that day, he said, “I might’ve oversold this to you.”
Over the months I spoke with Ginsparg, my main challenge was
interrupting him, as a simple question would often launch him into an
extended monolog. It was only near the end of the bike ride that I
managed to tell him how I found him tenacious and stubborn, and that if
someone more meek had been in charge, arXiv might not have survived. I
was startled by his response.
“You know, one person’s tenacity is another person’s terrorism,” he said.
“What do you mean?” I asked.
“I’ve heard that the staff occasionally felt terrorized,” he said.
“By you?” I replied, though a more truthful response would’ve been “No
shit.” Ginsparg apparently didn’t hear the question and started talking
about something else.
Beyond the drama—if not terrorism—of its day-to-day operations, arXiv
still faces many challenges. The linguist Emily Bender has accused it of
being a “cancer” for the way it promotes “junk science” and “fast
scholarship.” Sometimes it does seem too fast: In 2023, a much-hyped
paper claiming to have cracked room-temperature superconductivity turned
out to be thoroughly wrong. (But equally fast was exactly that
debunking—proof of arXiv working as intended.) Then there are opposite
cases, where arXiv “censors”—so say critics—perfectly good findings,
such as when physicist Jorge Hirsch, of h-index fame, had his paper
withdrawn for “inflammatory content” and “unprofessional language.”
[ https://www.nature.com/nature-index/news/whats-wrong-with-the-h-index-according-to-its-inventor ]
[ I wrote in 2008:
http://www.rogerclarke.com/SOS/Cit-CAIS.html#RTFToC3
> Citation analysis can produce many impact measures, which have
various advantages and disadvantages. A pair of measures that may
represent a fair compromise is the so-called `h-index', supplemented by
the `h-count'. ]
How does Ginsparg feel about all this? Well, he’s not the type to wax
poetic about having a mission, promoting an ideology, or being a pioneer
of “open science.” He cares about those things, I think, but he’s
reluctant to frame his work in grandiose ways.
At one point, I asked if he ever really wants to be liberated from
arXiv. “You know, I have to be completely honest—there are various
aspects of this that remain incredibly entertaining,” Ginsparg said. “I
have the perfect platform for testing ideas and playing with them.”
Though he no longer tinkers with the production code that runs arXiv, he
is still hard at work on his holy grail for filtering out bogus
submissions. It’s a project that keeps him involved, keeps him active.
Perhaps, with newer language models, he’ll figure it out. “It’s like
that Al Pacino quote: They keep bringing me back,” he said. A familiar
smile spread across Ginsparg’s face. “But Al Pacino also developed a
real taste for killing people.”
PAPER TRAIL
There’s no paradox in saying that arXiv is both an inestimable resource
for the latest research and a kind of Reddit for scientists, where the
profound and the preposterous collide. String theory showdowns? Yes.
Lawsuits over rejected papers? Naturally. Here are seven of its more
memorable moments. —S. H.
1991: “Ground Ring of Two-Dimensional String Theory,” by Edward Witten
The string theorist’s first paper posted to arXiv. Witten’s early
adoption helped legitimize the platform.

1994: “The World as a Hologram,” by Leonard Susskind
A real brain-breaker: Just as a hologram creates a three-dimensional
image from a flat surface, everything inside a given space can be fully
described by information on its two-dimensional boundary. Right?

2001: “Flaws in the Big Bang Point to GENESIS, A New Millennium Model of
the Cosmos,” by Robert Gentry
When this “creationist” paper was rejected and Gentry’s access to arXiv
revoked, he filed a lawsuit against the platform, claiming violation of
constitutional rights.

2002–2003: Grigori Perelman’s Poincaré papers
With these, the Russian mathematician solved one of the seven Millennium
Prize problems (the only one solved to date). He declined the $1 million
prize and lives in seclusion.

2013: Two Papers on Word Representation, by Mikolov et al.
In which word2vec—the verbal math that allows machines to understand
words—was introduced. Around this time, computer science papers began to
dominate arXiv.

2017: “Attention Is All You Need,” by eight Google researchers
The paper that launched a thousand chatbots.

2023: “The First Room-Temperature Ambient-Pressure Superconductor,” by a
team of South Korean scientists
A room-temp superconductor? Researchers worldwide attempted to reproduce
the results but ultimately debunked the claim.
--
Roger Clarke mailto:Roger.Clarke at xamax.com.au
T: +61 2 6288 6916 http://www.xamax.com.au http://www.rogerclarke.com
Xamax Consultancy Pty Ltd 78 Sidaway St, Chapman ACT 2611 AUSTRALIA
Visiting Professorial Fellow UNSW Law & Justice
Visiting Professor in Computer Science Australian National University