[LINK] Wired: 'Inside arXiv—the Most Transformative Platform in All of Science'

Fri Mar 28 06:27:28 AEDT 2025

[ This topic and article feel like a throwback to a time when the 
Internet was a lot younger (and so were some of us), and people did good 
and interesting things.  They still do.  But the good and interesting 
things are getting rather drowned out by the dross and the venal. ]

 > Then Ginsparg heard about something called the “World Wide Web.” 
Initially skeptical—“I can’t really pay attention to every single 
fad”—he became intrigued when the Mosaic browser was released in 1993. 
Soon after, Ginsparg built a web interface for arXiv, which over time 
became its primary mode of access.  [ And then he got to eat Tim 
Berners-Lee's BBQ'd swordfish. ]

[ Here's how to get an exponential adoption curve: ]
 > Early on, Ginsparg expected to receive on the order of 100 
submissions to arXiv a year. It turned out to be closer to 100 a month, 
and growing. “Day one, something happened, day two something happened, 
day three, Ed Witten posted a paper,” as Ginsparg once put it. “That was 
when the entire community joined.” Edward Witten is a revered string 
theorist and, quite possibly, the smartest person alive.

[ And a warning about the unfortunate but crucial need for commercial 
nous when establishing a commons: ]
 > THE BIGGEST MYSTERY is not why arXiv succeeded. Rather, it’s how it 
wasn’t killed by vested interests intent on protecting traditional 
academic publishing. Perhaps this was due to a decision Ginsparg made 
early on: Upon submission, users signed a clause that gave arXiv 
nonexclusive license to distribute the work in perpetuity, even in the 
event of future publication elsewhere. The strategic move ensured that 
no major publishers, known for their typically aggressive actions to 
maintain feudal control, would ever seriously attempt to shut it down.

[ And the hacker mentality in one sentence: ]
 > “I learned Fortran in the 1960s, and real [ForTran] programmers 
didn’t document”.

[ My counter-story as a non-hacker:  "I did too - just - but I also 
learned COBOL, and real COBOL programmers [once upon a time] did 
document [at least embedded in the coding]". ]

_________

Inside arXiv—the Most Transformative Platform in All of Science
Modern science wouldn’t exist without the online research repository 
known as arXiv. Three decades in, its creator still can’t let it go.
SHEON HAN
Wired
MAR 27, 2025 6:00 AM
https://www.wired.com/story/inside-arxiv-most-transformative-code-science/

“JUST WHEN I thought I was out, they pull me back in!” With a sly grin 
that I’d soon come to recognize, Paul Ginsparg quoted Michael Corleone 
from The Godfather. Ginsparg, a physics professor at Cornell University 
and a certified MacArthur genius, may have little in common with Al 
Pacino’s mafia don, but both are united by the feeling that they were 
denied a graceful exit from what they’ve built.

Nearly 35 years ago, Ginsparg created arXiv, a digital repository where 
researchers could share their latest findings—before those findings had 
been systematically reviewed or verified. Visit arXiv.org today (it’s 
pronounced like “archive”) and you’ll still see its old-school Web 1.0 
design, featuring a red banner and the seal of Cornell University, the 
platform’s institutional home. But arXiv’s unassuming facade belies the 
tectonic reconfiguration it set off in the scientific community. If 
arXiv were to stop functioning, scientists from every corner of the 
planet would suffer an immediate and profound disruption. “Everybody in 
math and physics uses it,” Scott Aaronson, a computer scientist at the 
University of Texas at Austin, told me. “I scan it every night.”

Every industry has certain problems universally acknowledged as broken: 
insurance in health care, licensing in music, standardized testing in 
education, tipping in the restaurant business. In academia, it’s 
publishing. Academic publishing is dominated by for-profit giants like 
Elsevier and Springer. Calling their practice a form of thuggery isn’t 
so much an insult as an economic observation. Imagine if a book 
publisher demanded that authors write books for free and, instead of 
employing in-house editors, relied on other authors to edit those books, 
also for free. And not only that: The final product was then sold at 
prohibitively expensive prices to ordinary readers, and institutions 
were forced to pay exorbitant fees for access.

The “free editing” academic publishers facilitate is called peer review, 
the process by which fellow researchers vet new findings. This can take 
months, even a year. But with arXiv, scientists could post their 
papers—known, at this unvetted stage, as preprints—for instant and free 
access to everyone. One of arXiv’s great achievements was “showing that 
you could divorce the actual transmission of your results from the 
process of refereeing,” said Paul Fendley, an early arXiv moderator and 
now a physicist at All Souls College, Oxford. During crises like the 
Covid pandemic, time-sensitive breakthroughs were disseminated 
quickly—particularly by bioRxiv and medRxiv, platforms inspired by 
arXiv—potentially saving, by one study’s estimate, millions of lives.

While arXiv submissions aren’t peer-reviewed, they are moderated by 
experts in each field, who volunteer their time to ensure that 
submissions meet basic academic standards and follow arXiv’s guidelines: 
original research only, no falsified data, sufficiently neutral 
language. Submissions also undergo automated checks for baseline quality 
control. Without these, pseudoscientific papers and amateur work would 
flood the platform.

In 2021, the journal Nature declared arXiv one of the “10 computer codes 
that transformed science,” praising its role in fostering scientific 
collaboration. (The article is behind a paywall—unlock it for $199 a 
year.) By a recent count, arXiv hosts more than 2.6 million papers, 
receives 20,000 new submissions each month, and has 5 million monthly 
active users. Many of the most significant discoveries of the 21st 
century have first appeared on the platform. The “transformers” paper 
that launched the modern AI boom? Uploaded to arXiv. Same with the 
solution to the Poincaré conjecture, one of the seven Millennium Prize 
problems, famous for their difficulty and $1 million rewards. Just 
because a paper is posted on arXiv doesn’t mean it won’t appear in a 
prestigious journal someday, but it’s often where research makes its 
debut and stays openly available. The transformers paper is still 
routinely accessed via arXiv.

For scientists, imagining a world without arXiv is like the rest of us 
imagining one without public libraries or GPS. But a look at its inner 
workings reveals that it isn’t a frictionless utopia of open-access 
knowledge. Over the years, arXiv’s permanence has been threatened by 
everything from bureaucratic strife to outdated code to even, once, a 
spy scandal. In the words of Ginsparg, who usually redirects interview 
requests to an FAQ document—on arXiv, no less—and tried to talk me out 
of visiting him in person, arXiv is “a child I sent off to college but 
who keeps coming back to camp out in my living room, behaving badly.”

GINSPARG AND I met over the course of several days last spring in 
Ithaca, New York, home of Cornell University. I’ll admit, I was 
apprehensive ahead of our time together. Geoffrey West, a former 
supervisor of Ginsparg’s at Los Alamos National Laboratory, once 
described him as “quite a character” who is “infamous in the community” 
for being “quite difficult.” He also said he was “extremely funny” and a 
“great guy.” In our early email exchanges, Ginsparg told me, upfront, 
that stories about arXiv never impress him: “So many articles, so few 
insights,” he wrote.

At 69 years old, Ginsparg has the lean build of a retired triathlete, 
his knees etched with scars collected over a lifetime of hiking, 
mountain climbing, and cycling. (He still leads hikes on occasion, 
leaving younger scientists struggling to keep up.) His attire was always 
relaxed, as though he’d just stepped off the Camino de Santiago, making 
my already casual clothes seem overdressy. Much of our time together was 
spent cycling the town’s rolling hills, and the maximum speed on the 
ebike I rented could not keep up with his efficient pedaling.

Invited one afternoon to Ginsparg’s office in Cornell’s physics 
building, I discovered it to be not “messy,” per se, because that 
suggests it could be cleaned. Instead, the objects in the room seemed 
inert, long since resigned to their fate: unopened boxes from the 1990s, 
piles of Physics Today magazines, an inexplicable CRT monitor, a 
tossed-aside invitation to the Obama White House. New items were 
occasionally added to the heap. I spotted a copy of Stephen Wolfram’s 
recent book, The Second Law, with a note from Wolfram that read, “Since 
you can’t find it on arXiv :)” The only thing that seemed actively in 
use was the blackboard, dense with symbols and equations related to 
quantum measurement theory, sprawling with bra-ket notation.

As he showed me around the building and his usual haunts, Ginsparg was 
gregarious, not letting a single detail slip by: the nesting patterns of 
local red-tailed hawks, the comings and goings of the dining staff, the 
plans for a new building going up behind his office. He was playful, 
even prankish. Midway through telling me about a podcast he was 
listening to, Ginsparg suddenly stopped and said, “I like your hair 
color, by the way, it works for you”—my hair is dyed ash gray, if anyone 
cares—before seamlessly transitioning to a story about a hard drive that 
had failed him.

The drive, which he had sent for recovery, contained a language model, 
Ginsparg’s latest intellectual fascination. Among his litany of peeves 
is that, because arXiv has seen a surge in submissions in recent times, 
especially in the AI category, the number of low-quality papers has 
followed a similar curve—and arXiv has nowhere near enough volunteers to 
vet them all. Hence his fussing with the drive, part of a quest to catch 
subpar submissions with what he calls “the holy grail crackpot filter.” 
And Ginsparg thinks, as he often has in arXiv’s three-decade history, 
that the quality would not be up to snuff if he doesn’t do it himself.

LONG BEFORE ARXIV became critical infrastructure for scientific 
research, it was a collection of shell scripts running on Ginsparg’s 
NeXT machine. In June 1991, Ginsparg, then a researcher at Los Alamos 
National Laboratory, attended a conference in Colorado, where a fateful 
encounter took place.

First came a remark from Joanne Cohn, a friend of Ginsparg’s and a 
postdoc at the Institute for Advanced Study in Princeton, who maintained 
a mailing list for physics preprints. At the time, there was no 
centralized way to access these preprints. Unless researchers were on 
certain mailing lists—which were predicated on their affiliations with 
prestigious institutions—or knew exactly whom to contact via email, they 
had to wait months to read new work in published journals.

Then came an offhand comment from a physicist worried about his 
computer’s storage filling up with emailed articles while he was traveling.

Ginsparg, who had been coding since high school, asked Cohn if she’d 
considered automating the distribution process. She hadn’t and told him 
to go ahead and do it himself. “My recollection is that the next day 
he’d come up with the scripts and seemed pretty happy about having done 
it so quickly,” Cohn told me. “It’s hard to communicate how different it 
was at the time. Paul had really seen ahead.”

Hearing tales from and about Ginsparg, you can’t help but see him as a 
sort of Forrest Gump figure of the internet age, who found himself at 
crucial junctures and crossed paths with revolutionary figures. As an 
undergrad at Harvard, he was classmates with Bill Gates and Steve 
Ballmer; his older brother was a graduate student at Stanford studying 
with Terry Winograd, an AI pioneer. The brothers both had email 
addresses and access to Arpanet, the precursor to the internet, at a 
time when few others did.

After earning his PhD in theoretical physics at Cornell, Ginsparg began 
teaching at Harvard. A career there wasn’t to be: He wasn’t granted 
tenure—Harvard is infamous for this—and started looking for a job 
elsewhere. That’s when Ginsparg was recruited to Los Alamos, where he 
was free to do research on theoretical high-energy physics full-time, 
without other responsibilities. Plus, New Mexico was perfect for his 
active lifestyle.

When arXiv started, it wasn’t a website but an automated email server 
(and within a few months also an FTP server). Then Ginsparg heard about 
something called the “World Wide Web.” Initially skeptical—“I can’t 
really pay attention to every single fad”—he became intrigued when the 
Mosaic browser was released in 1993. Soon after, Ginsparg built a web 
interface for arXiv, which over time became its primary mode of access. 
He also occasionally consulted with a programmer at the European 
Organization for Nuclear Research (CERN) named Tim Berners-Lee—now Sir 
Tim “Inventor of the World Wide Web” Berners-Lee—whom Ginsparg fondly 
credits with grilling excellent swordfish at his home in the French 
countryside.

In 1994, with a National Science Foundation grant, Ginsparg hired two 
people to transform arXiv’s shell scripts into more reliable Perl code. 
They were both technically gifted, perhaps too gifted to stay for long. 
One of them, Mark Doyle, later joined the American Physical Society and 
became its chief information officer. The other, Rob Hartill, was 
working simultaneously on a project to collect entertainment data: the 
Internet Movie Database. (After IMDb, Hartill went on to do notable work 
at the Apache Software Foundation.)

Before arXiv was called arXiv, it was accessed under the hostname 
xxx.lanl.gov (“xxx” didn’t have the explicit connotations it does today, 
Ginsparg emphasized). During a car ride, he and his wife brainstormed 
nicer-sounding names. Archive? Already taken. Maybe they could sub in 
the Greek equivalent of X, chi (pronounced like “kai”). “She wrote it 
down and crossed out the e to make it more symmetric around the X,” 
Ginsparg said. “So arXiv it was.” At this point, there wasn’t much 
formal structure. The number of developers typically stayed at one or 
two, and much of the moderation was managed by Ginsparg’s friends, 
acquaintances, and colleagues.

Early on, Ginsparg expected to receive on the order of 100 submissions 
to arXiv a year. It turned out to be closer to 100 a month, and growing. 
“Day one, something happened, day two something happened, day three, Ed 
Witten posted a paper,” as Ginsparg once put it. “That was when the 
entire community joined.” Edward Witten is a revered string theorist 
and, quite possibly, the smartest person alive. “The arXiv enabled much 
more rapid worldwide communication among physicists,” Witten wrote to me 
in an email. Over time, disciplines such as mathematics and computer 
science were added, and Ginsparg began to appreciate the significance of 
this new electronic medium. Plus, he said, “it was fun.”

As the usage grew, arXiv faced challenges similar to those of other 
large software systems, particularly in scaling and moderation. There 
were slowdowns to deal with, like the time arXiv was hit by too much 
traffic from “stanford.edu.” The culprits? Sergey Brin and Larry Page, 
who were then busy indexing the web for what would eventually become 
Google. Years later, when Ginsparg visited Google HQ, both Brin and Page 
personally apologized to him for the incident.

THE BIGGEST MYSTERY is not why arXiv succeeded. Rather, it’s how it 
wasn’t killed by vested interests intent on protecting traditional 
academic publishing. Perhaps this was due to a decision Ginsparg made 
early on: Upon submission, users signed a clause that gave arXiv 
nonexclusive license to distribute the work in perpetuity, even in the 
event of future publication elsewhere. The strategic move ensured that 
no major publishers, known for their typically aggressive actions to 
maintain feudal control, would ever seriously attempt to shut it down.

But even as arXiv’s influence grew, higher-ups at Los Alamos never 
particularly championed the project—which was becoming, one could argue, 
more influential than the lab itself. (This was, of course, long past 
the heyday of Oppenheimer depicted in Christopher Nolan’s middling 2023 
docudrama.) Those early years at Los Alamos were “dreamlike and 
heavenly,” Ginsparg emphasized, the best job he ever had. But in 1999, a 
fellow physicist at the lab, Wen Ho Lee, was accused of leaking 
classified information to China. Lee, a Taiwanese American, was later 
cleared of wrongdoing, and the case was widely criticized for racial 
profiling. At the time, the scandal led to internal upheaval. There were 
travel restrictions to prevent leaks, and even discussions about 
subjecting employees to lie detector tests. “It just got glummer and 
glummer,” Ginsparg said. It didn’t help that a performance review that 
year labeled him “a strictly average performer” with “no particular 
computer skills contributing to lab programs.” Also, his daughter had 
just been born, and there weren’t schools nearby. He was ready to leave.

Ginsparg stops short of saying he “brought” arXiv with him, but the fact 
is, he ended up back at his alma mater, Cornell—tenured, this time—and 
so did arXiv. He vowed to be free of the project within “five years 
maximum.” After all, his main job wasn’t supposed to be running arXiv—it 
was teaching and doing research. At the university, arXiv found a home 
within the library. “They disseminate material to academics,” Ginsparg 
said, “so that seemed like a natural fit.”

A natural fit it was not. Under the hood, arXiv was a complex software 
platform that required technical expertise far beyond what was typically 
available in a university library. The logic for the submission process 
alone involved a vast number of potential scenarios and edge cases, 
making the code convoluted. Ginsparg and other early arXiv members I 
spoke to felt that the library failed to grasp arXiv’s significance and 
treated it more like an afterthought.

On the library’s side, some people thought Ginsparg was too hands-on. 
Others said he wasn’t patient enough. A “good lower-level manager,” 
according to someone long involved with arXiv, “but his sense of 
management didn’t scale.” For most of the 2000s, arXiv couldn’t hold on 
to more than a few developers.

THERE ARE TWO paths for pioneers of computing. One is a life of board 
seats, keynote speeches, and lucrative consulting gigs. The other is the 
path of the practitioner who remains hands-on, still writing and 
reviewing code. It’s clear where Ginsparg stands—and how anathema the 
other path is to him. As he put it to me, “Larry Summers spending one 
day a week consulting for some hedge fund—it’s just unseemly.”

But overstaying one’s welcome also risks unseemliness. By the mid-2000s, 
as the web matured, arXiv—in the words of its current program director, 
Stephanie Orphan—got “bigger than all of us.” A creationist physicist 
sued it for rejecting papers on creationist cosmology. Various other 
mini-scandals arose, including a plagiarism one, and some users 
complained that the moderators—volunteers who are experts in their 
respective fields—held too much power. In 2009, Philip Gibbs, an 
independent physicist, even created viXra (arXiv spelled backward), a 
more or less unregulated Wild West where papers on 
quantum-physico-homeopathy can find their readership, for anyone eager 
to learn why pi is a lie.

Then there was the problem of managing arXiv’s massive code base. 
Although Ginsparg was a capable programmer, he wasn’t a software 
professional adhering to industry norms like maintainability and 
testing. Much like constructing a building without proper structural 
supports or routine safety checks, his methods allowed for quick initial 
progress but later caused delays and complications. Unrepentant, 
Ginsparg often went behind the library’s back to check the code for 
errors. The staff saw this as an affront, accusing him of micromanaging 
and sowing distrust.

In 2011, arXiv’s 20th anniversary, Ginsparg thought he was ready to move 
on, writing what was intended as a farewell note, an article titled 
“ArXiv at 20,” in Nature: “For me, the repository was supposed to be a 
three-hour tour, not a life sentence. ArXiv was originally conceived to 
be fully automated, so as not to scuttle my research career. But daily 
administrative activities associated with running it can consume hours 
of every weekday, year-round without holiday.”

Ginsparg would stay on the advisory board, but daily operations would be 
handed over to the staff at the Cornell University Library.

It never happened, and as time went on, some accused Ginsparg of 
“backseat driving.” One person said he was holding certain code 
“hostage” by refusing to share it with other employees or on GitHub. 
Ginsparg was frustrated because he couldn’t understand why implementing 
features that used to take him a day now took weeks. I challenged him on 
this, asking if there was any documentation for developers to onboard 
the new code base. Ginsparg responded, “I learned Fortran in the 1960s, 
and real programmers didn’t document,” which nearly sent me, a coder, 
into cardiac arrest.

Technical problems were compounded by administrative ones. In 2019, 
Cornell transferred arXiv to the school’s Computing and Information 
Science division, only to have it change hands again after a few months. 
Then a new director with a background in, of all things, for-profit 
academic publishing took over; she lasted a year and a half. “There was 
disruption,” said an arXiv employee. “It was not a good period.”

But finally, relief: In 2022, the Simons Foundation committed funding 
that allowed arXiv to go on a hiring spree. Ramin Zabih, a Cornell 
professor who had been a long-time champion, joined as the faculty 
director. Under the new governance structure, arXiv’s migration to the 
cloud and a refactoring of the code base to Python finally took off.

ONE SATURDAY MORNING, I met Ginsparg at his home. He was carefully 
inspecting his son’s bike, which I was borrowing for a three-hour ride 
we had planned to Mount Pleasant. As Ginsparg shared the route with me, 
he teasingly—but persistently—expressed doubts about my ability to keep 
up. I was tempted to mention that, in high school, I’d cycled solo 
across Japan, but I refrained and silently savored the moment when, on 
the final uphill later that day, he said, “I might’ve oversold this to you.”

Over the months I spoke with Ginsparg, my main challenge was 
interrupting him, as a simple question would often launch him into an 
extended monolog. It was only near the end of the bike ride that I 
managed to tell him how I found him tenacious and stubborn, and that if 
someone more meek had been in charge, arXiv might not have survived. I 
was startled by his response.

“You know, one person’s tenacity is another person’s terrorism,” he said.

“What do you mean?” I asked.

“I’ve heard that the staff occasionally felt terrorized,” he said.

“By you?” I replied, though a more truthful response would’ve been “No 
shit.” Ginsparg apparently didn’t hear the question and started talking 
about something else.

Beyond the drama—if not terrorism—of its day-to-day operations, arXiv 
still faces many challenges. The linguist Emily Bender has accused it of 
being a “cancer” for the way it promotes “junk science” and “fast 
scholarship.” Sometimes it does seem too fast: In 2023, a much-hyped 
paper claiming to have cracked room-temperature superconductivity turned 
out to be thoroughly wrong. (But equally fast was exactly that 
debunking—proof of arXiv working as intended.) Then there are opposite 
cases, where arXiv “censors”—so say critics—perfectly good findings, 
such as when physicist Jorge Hirsch, of h-index fame, had his paper 
withdrawn for “inflammatory content” and “unprofessional language.”

[ https://www.nature.com/nature-index/news/whats-wrong-with-the-h-index-according-to-its-inventor ]

[ I wrote in 2008:
http://www.rogerclarke.com/SOS/Cit-CAIS.html#RTFToC3
 > Citation analysis can produce many impact measures, which have 
various advantages and disadvantages. A pair of measures that may 
represent a fair compromise is the so-called `h-index', supplemented by 
the `h-count'. ]

How does Ginsparg feel about all this? Well, he’s not the type to wax 
poetic about having a mission, promoting an ideology, or being a pioneer 
of “open science.” He cares about those things, I think, but he’s 
reluctant to frame his work in grandiose ways.

At one point, I asked if he ever really wants to be liberated from 
arXiv. “You know, I have to be completely honest—there are various 
aspects of this that remain incredibly entertaining,” Ginsparg said. “I 
have the perfect platform for testing ideas and playing with them.” 
Though he no longer tinkers with the production code that runs arXiv, he 
is still hard at work on his holy grail for filtering out bogus 
submissions. It’s a project that keeps him involved, keeps him active. 
Perhaps, with newer language models, he’ll figure it out. “It’s like 
that Al Pacino quote: They keep bringing me back,” he said. A familiar 
smile spread across Ginsparg’s face. “But Al Pacino also developed a 
real taste for killing people.”

PAPER TRAIL

There’s no paradox in saying that arXiv is both an inestimable resource 
for the latest research and a kind of Reddit for scientists, where the 
profound and the preposterous collide. String theory showdowns? Yes. 
Lawsuits over rejected papers? Naturally. Here are seven of its more 
memorable moments. —S. H.

1991: “Ground Ring of Two-Dimensional String Theory,” by Edward Witten
The string theorist’s first paper posted to arXiv. Witten’s early 
adoption helped legitimize the platform.

1994: “The World as a Hologram,” by Leonard Susskind
A real brain-breaker: Just as a hologram creates a three-dimensional 
image from a flat surface, everything inside a given space can be fully 
described by information on its two-dimensional boundary. Right?

2001: “Flaws in the Big Bang Point to GENESIS, A New Millennium Model of 
the Cosmos,” by Robert Gentry
When this “creationist” paper was rejected and Gentry’s access to arXiv 
revoked, he filed a lawsuit against the platform, claiming violation of 
constitutional rights.

2002–2003: Grigori Perelman’s Poincaré papers
With these, the Russian mathematician solved one of the seven Millennium 
Prize problems (the only one solved to date). He declined the $1 million 
prize and lives in seclusion.

2013: Two Papers on Word Representation, by Mikolov et al.
In which word2vec—the verbal math that allows machines to understand 
words—was introduced. Around this time, computer science papers began to 
dominate arXiv.

2017: “Attention Is All You Need,” by eight Google researchers
The paper that launched a thousand chatbots.

2023: “The First Room-Temperature Ambient-Pressure Superconductor,” by a 
team of South Korean scientists
A room-temp superconductor? Researchers worldwide attempted to reproduce 
the results but ultimately debunked the claim.

-- 
Roger Clarke                            mailto:Roger.Clarke at xamax.com.au
T: +61 2 6288 6916   http://www.xamax.com.au  http://www.rogerclarke.com

Xamax Consultancy Pty Ltd      78 Sidaway St, Chapman ACT 2611 AUSTRALIA 

Visiting Professorial Fellow                          UNSW Law & Justice
Visiting Professor in Computer Science    Australian National University