Introduction & Goals

This document is a description of and prescription for the arXiv Next Generation (arXiv NG, or NG) architecture. It is both a concrete product of NG Phase 1, and a living record of the technical decisions made throughout the project. In other words, this document will be updated and versioned continuously as implementation proceeds.

The NG charter describes the overarching vision for the project as follows:

Purpose: The overall goal of this project is to renew arXiv’s technical infrastructure, in order to continue to fulfill arXiv’s mission of providing rapid dissemination of research findings at no cost to readers and submitters. Work will proceed in two phases; the purpose of Phase I (18-month) is to develop a complete plan to renew arXiv’s technical infrastructure, and to deliver either a working proof-of-concept next generation system (arXiv-NG), or selected production-ready modules for a next generation system. Phase II will see through to completion the work initiated in Phase I, resulting in a fully functional, production system.

Rationale: While the current arXiv system has proven remarkably reliable even as the number of users and submissions has continued to grow, it has also become very difficult to extend and modify the system, and requires staff with knowledge of programming languages that are becoming obsolete.

Anticipated results: An arXiv infrastructure that will be less work to maintain and utilizes modern, standard programming practices, resulting in greater capacity to implement improvements and new features, and support the development of new features by others. We will retain the essential features users expect of the current system, while introducing improvements with minimal disruption to the user experience. The finished system will be user-focused, sustainable, and production-ready. The end result will be a superior user experience for readers, submitters, moderators, administrators, and arXiv member-supporters. 1

—arXiv NG Phase 1 Charter

What we mean by “introducing improvements with minimal disruption to the user experience” is that we will work to enhance arXiv users’ experiences by developing the features and improvements they value most, while not disrupting their existing work processes or the quality of existing services.

The arXiv leadership has explored a wide range of strategies for the NG process, from greenfield redevelopment, through evolutionary development, to adoption of off-the-shelf (OTS) solutions. Recognizing the unique business processes surrounding the arXiv.org system, as well as the advanced state of the existing system (notwithstanding the limitations described below), and after careful review of possible OTS products in the e-print and repository space, we have decided to pursue an incremental in-house redevelopment of the existing system.

Overview

The remainder of this chapter contextualizes the arXiv architecture by summarizing the main drivers of the NG project, and the key objectives and constraints that inform the architecture.

What is arXiv.org, and what does it do?

The mission of arXiv is to provide rapid dissemination of research findings at no cost to readers and submitters. The longevity and stability of arXiv have generated a reputation for (and expectation of) data integrity and preservation of the scholarly record.

The arXiv platform provides several core services to scholars in quantitative sciences and engineering. The following list is adapted from the Technical Overview written by Simeon Warner et al. in 2014.

Accept article submissions & replacements from authors.
Accept article submissions & replacements from external sources.
Ensure reasonable standards and classification of accepted submissions.
Publish accepted submissions in a timely fashion (usually 1 day turnaround).
Provide alerts of new submissions and updates.
Support discovery of articles, both through our own search tools and through external search engines and platforms that utilize arXiv content.
Support browsing of articles based on their classification, and promote understanding of provenance and context.
Support linking of arXiv e-prints to related scholarly objects, including peer-reviewed articles, datasets, and code.
Provide an accurate historical record of submissions, announcements, and revisions.

arXiv is not an isolated silo: it interacts with a rich ecosystem of external services that increase its value. The following features lie outside the core of arXiv, but are things that we want to help others to do.

Reference and citation extraction/handling.
Citation analysis.
Commentary and reviews.
Social networking.
Text analysis and mining.

What is the problem?

The existing arXiv system is highly stable in that it provides a consistent set of core services with high availability. The codebase that supports that system has grown organically over a long period of time, with varying and sometimes unclear architectural visions. The technology on which arXiv is built is variously antiquated or (due to cultural changes) obscure. As a joint result of those factors, it is exceedingly expensive to develop the existing codebase to fix bugs, address feature requests, and keep pace with end-user expectations of quality, usability, and security. The principal challenge of the classic renewal process will be to progressively evolve arXiv into a modern and architecturally sound software system while maintaining the level of consistency and availability of the system as a whole.

Guiding Principles

Several high-level concerns motivate the architecture described in this document.

Maintainability. The legacy arXiv system has fallen behind contemporary developments in web application architecture, standards, and best practices. The project should reset the clock, and put the IT team in a better position to leverage modern tools to solve problems for our users. In doing so, we want to minimize vendor and platform lock-in, and participate in open source software development.
Evolvability. The architecture should enable shorter release cycles and faster turn-around on feature development and bug-fixes. Wherever possible, the arXiv architecture and its implementation should minimize the cost of discarding or reimplementing components or subsystems. This work should position arXiv to respond to changes in the scholarly communication and scientific landscapes.
Extensible data architecture. The arXiv platform should have an extensible data architecture that provides a high level of stability for core scholarly objects, together with the flexibility to keep pace with changing stakeholder requirements and with best practices in scholarly publishing, library science, and data curation.
Resilience. Increase the resilience of the arXiv.org site to local failures and high load. Structural features of the classic system allow local faults to cause disproportionately large outages, and also make it difficult to scale under high load. arXiv should be prepared to deal with significantly higher overall request volume, as well as larger and more frequent spikes in traffic.
Improve moderation and administration workflows. Support complex moderation and administration workflows that integrate human and computational effort. The resulting model should increase the efficiency of the moderation system, facilitate long-term scalability, and increase the degree of control that administrators can exercise over the system.
Improve programmatic access through APIs. Promote and support innovative uses of arXiv content and metadata by external researchers and developers by providing rich, modern APIs and thorough documentation, and by engaging in ongoing dialogue with API consumers. We need better documentation, better access to arXiv resources, and consistent mechanisms for authentication and authorization.
Support integrations. Provide mechanisms for rich integrations with partner services in the scholarly publishing domain, for example through coordinated submission workflows, two-way linkages between e-prints and other scholarly objects (e.g. peer-reviewed articles, multimedia, code, data), and other related products that provide value for scholars who publish in arXiv.

Key Objectives

With the foregoing concerns in mind, the arXiv-NG project boils down to these five key objectives:

Maintain 99.999% uptime of the main public site, and 99.99% uptime for the submission system, while absorbing 10% per year growth in traffic, submissions, and frequency of high-traffic events with a five-fold safety margin. That means that by 2024 we should be prepared for 60% growth in those dimensions, and be able to handle a 300% increase without degradation of the service.
Incrementally replace the classic code-base with a more modular system of components, with thorough test coverage, documentation, and consistent quality and stylistic standards. This also involves making some changes to the underlying data architecture, e.g. to achieve clearer separation of concerns.
We must be able to deploy and scale the arXiv system with as little “manual” intervention as possible, and do so in a cloud environment that facilitates more fluid scaling, broader geographic reach & redundancy. We’re moving from on-premises infrastructure to AWS, adopting DevOps tooling and practices, and implementing Continuous Integration & Continuous Delivery (CI/CD) workflows.
The system must be comprehensibly documented, observable, and monitored such that it can be supported by any reasonably experienced member of the IT team. This involves writing and maintaining easy-to-navigate documentation, and establishing logging and metrics systems that provide actionable, timely, and specific information.
Build out a modern API platform that can support third-party integrations and research at scale. This includes a better metadata API, a submission API that can facilitate a broad range of integrations, cost-effective programmatic access to full-text content, and webhook notifications. Users must be able to access documentation, obtain authentication credentials, and request authorizations, via a single point of access.
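The growth and uptime figures in the first objective can be checked with a little arithmetic. The sketch below is illustrative only: it assumes annual compounding over a five-year planning horizon (the text gives the 2024 target but not the baseline year), and the function names are hypothetical. It also converts the two uptime targets into yearly downtime budgets.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def compound_growth(rate_per_year: float, years: int) -> float:
    """Total fractional growth after compounding (0.61 means 61%)."""
    return (1 + rate_per_year) ** years - 1


def downtime_budget_minutes(uptime: float) -> float:
    """Minutes of downtime per year permitted by an uptime target."""
    return (1 - uptime) * MINUTES_PER_YEAR


# Assumption: five years of compounding leading up to 2024.
expected_growth = compound_growth(0.10, 5)
# Five-fold safety margin on top of the expected growth.
margin_growth = 5 * expected_growth

print(f"expected growth: {expected_growth:.0%}")
print(f"growth with 5x margin: {margin_growth:.0%}")
print(f"public site budget: {downtime_budget_minutes(0.99999):.1f} min/year")
print(f"submission budget: {downtime_budget_minutes(0.9999):.1f} min/year")
```

Under these assumptions the expected growth comes to about 61% and the margin to about 305%, consistent with the rounded 60% and 300% figures above. The uptime targets correspond to roughly 5 minutes of downtime per year for the public site and roughly 53 minutes per year for the submission system.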

Specific development goals and milestones

We have identified specific development goals for the arXiv-NG project, with an expected set of deliverables related to each major subsystem of the arXiv platform. You can read about them here. The reader should keep in mind that this is not an exhaustive list of features and improvements. Rather, these goals provide a roadmap and architectural direction for major units of development work onto which more specific features and fixes will be attached.

Key Constraints

Long-term success depends on both a feasible strategy and appropriate management of cost and risks. The following key constraints shape the architectural and implementation decisions made throughout the NG project:

It must be possible to maintain the arXiv system at the current or similar funding levels (not including short-term funding for NG).
Decisions about source code licensing, security, and infrastructure are constrained by Cornell University and CU Library institutional policies.
To the greatest extent possible, arXiv should adopt open source practices and principles. This should guide both the selection of technologies as well as the management of NG source code and artifacts.
The NG development process should begin adding value to the production system as early as possible and throughout the project.

About this Document

This architecture documentation is based on the arc42 documentation model, and also draws heavily on the C4 software architecture model. The C4 model describes an architecture at four hierarchical levels, from the business context of the system to the internal architecture of small parts of the system.

In this document, I have departed slightly from the original language of C4 in order to avoid collision with names in adjacent domains. Specifically, I describe the system at three levels:

context: This includes both the business and technical contexts in the arc42 model. It describes the interactions between the arXiv system and external entities and systems.
service: This is similar to the “container” concept in the C4 model. A service is a part of the system that is developed, tested, and deployed quasi-independently. It may encapsulate a few applications, data stores, and other components that are tightly coordinated.
component: A component is a building block within a service. A component might be a Python module, a data store, an internal service layer, etc., that has specific responsibilities, behaviors, and interactions.
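As a rough illustration of how the three levels nest, the sketch below models them as plain Python dataclasses. The class names, fields, and the example "search" service are all hypothetical and chosen only for illustration. They are not taken from the arXiv NG codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A building block within a service, e.g. a Python module or data store."""
    name: str
    responsibility: str


@dataclass
class Service:
    """A part of the system developed, tested, and deployed quasi-independently."""
    name: str
    components: List[Component] = field(default_factory=list)


@dataclass
class Context:
    """The system together with the external entities it interacts with."""
    system: str
    external_entities: List[str] = field(default_factory=list)
    services: List[Service] = field(default_factory=list)


# Hypothetical example: one service containing two components.
search = Service(
    name="search",
    components=[
        Component("index", "data store holding searchable metadata"),
        Component("routes", "Python module handling incoming requests"),
    ],
)
arxiv = Context(
    system="arXiv",
    external_entities=["authors", "readers", "API consumers"],
    services=[search],
)

# Walk the hierarchy from context down to component.
for service in arxiv.services:
    for component in service.components:
        print(f"{arxiv.system} > {service.name} > {component.name}")
```

The point of the sketch is the containment relationship: a context holds services, and each service holds the components that it alone is responsible for.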

Chapter 2, Context & Scope, briefly describes the business context and stakeholders for the arXiv platform.

Chapter 3, arXiv Classic, describes the classic arXiv system and its constituent subsystems.

Chapter 4, Objectives & Requirements, elaborates on the guiding principles and key objectives summarized in this chapter, providing additional context and specifying subsidiary objectives and requirements.

Chapter 5, Solution Strategy, provides an overview of the NG Classic Renewal Process. It discusses the methodology that we will apply to meet the objectives laid out in chapter 4, and discusses the high-level architectural features of the NG system. Those features guide the specific architecture and implementation of each major domain of the arXiv system.

Chapter 6, Cross-Cutting Concepts, elaborates in further detail decisions about procedure and technology that apply to many or all arXiv NG subsystems.

Chapter 7, Architecture of the arXiv NG Software System, describes the services and other components that make up each of the major arXiv-NG subsystems.

Acknowledgments

The following individuals provided significant feedback about and contributions to the concepts described in this document:

Oya Rieger
Sandy Payette
Gail Steinhart
Jim Entwood
Martin Lessmeister
Brian Caruso
Liz Woods
David Fielding
Brandon Barker
Bennett Wineholt
Matt Bierbaum

1: https://confluence.cornell.edu/display/arxivpub/arXiv+NG+Phase+I+Project+Charter