Architecture of the arXiv NG Software System

This section describes the architecture of the arXiv Next Generation software system, from the perspective of the eventual “completed” product. Additional notes are included regarding transitional states where appropriate.

Overview

The arXiv-NG software system comprises five main subsystems, each focused on a separate aspect of user activity and with distinct availability and durability expectations. Deployment & networking infrastructure can be considered a sixth subsystem.

  1. Submission

  2. Announcement

  3. Dissemination

  4. Enhancement

  5. Authentication & Authorization

  6. Infrastructure

Each subsystem comprises specific services, agents, and data stores. Although these divisions capture significant behavioral and architectural joints of the arXiv system, the boundaries between them are not entirely impermeable. Some services may be used across subsystems. For example, the Compilation service may be used by both the submission & moderation subsystem and by the Announcement agent.

Fig. 11 Major subsystems of arXiv NG.

System Context

In earlier sections I described the Business Context for the classic arXiv system. As is the case in any transformational process, arXiv NG involves changes in both technology and operations. Consequently, the system context for arXiv NG differs slightly from that of the Classic system.

Fig. 12 System context for arXiv Next Generation.

There are several notable differences:

  1. I differentiate between three different kinds of external platforms and services, each of which entails a different kind of relationship with arXiv:

    • Partner platforms, such as ADS, INSPIRE, and others. These platforms have a strong alignment with arXiv’s mission and vision, and have a long and proven track record as stable components of the scholarly infrastructure. Most of these platforms aggregate or index arXiv content and provide discipline-specific discovery features to scholarly communities.

    • Authoring and submission platforms, including overlay journals. These platforms have entered into a trusted relationship with arXiv, and are authorized to provide alternative submission interfaces to arXiv users.

    • Other external platforms and services. These are third-party services that provide valuable services to arXiv users and others, but do not have a specific relationship with arXiv.

  2. The NG architecture introduces the concept of a third-party dark archive. One or more such archives are updated daily from the canonical record, strengthening arXiv’s preservation practices.

  3. NG eliminates the semi-external services that perform core quality assurance and policy enforcement functions. Reliance on such services has been a long-standing security and privacy risk; NG internalizes all necessary QA functionality and brings it into alignment with uniform policies and practices.

Submission

This subsystem is responsible for all submission-related activities up to (but not including) announcement. It includes:

  • The primary submission and moderation user interfaces.

  • The submission database and event-store.

  • APIs for both internal and external programmatic interaction with arXiv submissions.

  • Automated processes that perform quality assurance and policy enforcement activities in response to submission events.

  • Backend services to handle file upload, TeX compilation, submission preview, plain text extraction, autoclassification, and other supporting activities.

In addition to handling requests from submitters (directly or via alternative interfaces mediated by arXiv APIs) and moderators, this subsystem provides APIs that allow the announcement subsystem to identify announceable submissions, and update the state of those submissions as they are scheduled and announced.
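The interaction with the announcement subsystem described above can be pictured as a small state machine. The following is a minimal sketch only: the status names, the transition table, and the SubmissionStore class are hypothetical illustrations, not the actual arXiv NG data model or API.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    """Hypothetical submission states; the real model may differ."""
    SUBMITTED = "submitted"    # assumed to mean "announceable"
    ON_HOLD = "on_hold"        # assumed moderation hold
    SCHEDULED = "scheduled"
    ANNOUNCED = "announced"


# Assumed transition table: which states may follow which.
ALLOWED = {
    Status.SUBMITTED: {Status.ON_HOLD, Status.SCHEDULED},
    Status.ON_HOLD: {Status.SUBMITTED},
    Status.SCHEDULED: {Status.ANNOUNCED, Status.SUBMITTED},
    Status.ANNOUNCED: set(),
}


class SubmissionStore:
    """In-memory stand-in for the submission database."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}

    def add(self, submission_id: str, status: Status = Status.SUBMITTED) -> None:
        self._status[submission_id] = status

    def announceable(self) -> List[str]:
        """Identifiers the announcement subsystem could pick up."""
        return sorted(
            sid for sid, status in self._status.items()
            if status is Status.SUBMITTED
        )

    def transition(self, submission_id: str, new_status: Status) -> None:
        """Update state, rejecting transitions the table does not allow."""
        current = self._status[submission_id]
        if new_status not in ALLOWED[current]:
            raise ValueError(f"{current.value} -> {new_status.value} not allowed")
        self._status[submission_id] = new_status
```

In this sketch the announcement subsystem would call announceable() to find candidates, then transition() first to SCHEDULED and later to ANNOUNCED; an announced submission accepts no further transitions.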

Work on this subsystem involves implementing a new data architecture that can integrate human and automated processes, decomposing complicated functionality in the classic system into isolated services, and implementing more extensible and accessible user interfaces.

Announcement

This subsystem is responsible for announcing submitted e-prints on a daily basis, maintaining the core canonical record (including e-print metadata, content, and history), and ensuring the durability of the canonical record.

In contrast to the Classic arXiv system, replication of the canonical record to mirrors is an application-level rather than infrastructure-level concern. Instead of relying on low-level filesystem synchronization to propagate changes to the canonical record, replication of the record is handled by arXiv software. This affords opportunities for greater observability and improved semantics.
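To illustrate what application-level replication can offer over filesystem synchronization, here is a minimal sketch in which a mirror applies ordered, checksummed changes and can report how far behind it is. The RecordChange and Mirror types, the sequence numbering, and the outcome strings are all assumptions made for illustration; the source does not specify the replication protocol.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class RecordChange:
    """Hypothetical unit of replication for the canonical record."""
    sequence: int     # assumed monotonically increasing per change
    eprint_id: str
    content: bytes
    sha256: str       # checksum computed by the primary


def make_change(sequence: int, eprint_id: str, content: bytes) -> RecordChange:
    return RecordChange(sequence, eprint_id, content,
                        hashlib.sha256(content).hexdigest())


@dataclass
class Mirror:
    store: Dict[str, bytes] = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, change: RecordChange) -> str:
        """Apply one change and report an explicit, observable outcome."""
        if change.sequence <= self.last_applied:
            return "duplicate"    # already applied; safe to ignore
        if change.sequence != self.last_applied + 1:
            return "gap"          # a change is missing; do not apply
        if hashlib.sha256(change.content).hexdigest() != change.sha256:
            return "corrupt"      # content does not match its checksum
        self.store[change.eprint_id] = change.content
        self.last_applied = change.sequence
        return "applied"


def lag(primary_sequence: int, mirror: Mirror) -> int:
    """Number of changes the mirror still has to apply."""
    return primary_sequence - mirror.last_applied
```

Because each change carries its own sequence number and checksum, a mirror can distinguish duplicates, gaps, and corruption, which low-level file synchronization does not expose directly.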

Work in this subsystem involves migrating to a more consistent consolidated record structure for the arXiv canonical record, and strengthening preservation of the record by establishing one or more long-term dark archives with outside organizations.

Dissemination

This subsystem encompasses the parts of the site through which users access arXiv e-prints. This includes the search system, the subject listings pages, abstract pages, and content endpoints.

In contrast to the Classic system, in which dissemination-related components relied directly on the core database and filesystem, an overarching goal of the dissemination subsystem is to decouple and localize state that is specific to each service. For example, the search interface and query APIs do not rely on the legacy database or filesystem in any way; instead, an indexing agent listens for events produced by the announcement subsystem, and updates an index upon which the search interface and query APIs rely.
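The indexing-agent pattern described above can be sketched as follows. This is a toy illustration under stated assumptions: the AnnouncementEvent fields, the in-memory SearchIndex, and run_indexing_agent are hypothetical names, and a real deployment would use a message broker and a search backend rather than Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class AnnouncementEvent:
    """Hypothetical event emitted by the announcement subsystem."""
    eprint_id: str    # canonical identifier
    version: int
    title: str
    abstract: str


@dataclass
class SearchIndex:
    """Toy inverted index that is owned entirely by the search service."""
    postings: Dict[str, Set[str]] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)

    def update(self, event: AnnouncementEvent) -> None:
        # Ignore stale or duplicate events so that replays are harmless.
        if self.versions.get(event.eprint_id, 0) >= event.version:
            return
        # Drop postings from any earlier version before re-indexing.
        for ids in self.postings.values():
            ids.discard(event.eprint_id)
        for token in f"{event.title} {event.abstract}".lower().split():
            self.postings.setdefault(token, set()).add(event.eprint_id)
        self.versions[event.eprint_id] = event.version

    def query(self, term: str) -> List[str]:
        return sorted(self.postings.get(term.lower(), set()))


def run_indexing_agent(events: Iterable[AnnouncementEvent],
                       index: SearchIndex) -> None:
    """Consume announcement events and keep the index up to date."""
    for event in events:
        index.update(event)
```

Note that the query path touches only the index: nothing in SearchIndex.query depends on the legacy database or filesystem, which is the decoupling the paragraph describes.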

This model affords finer-grained control over backup and recovery practices for different kinds of state, and overall better evolvability characteristics; dissemination-related services can evolve independently of other system components, so long as they continue to consume announcement-related events.

Services and/or interfaces in the dissemination subsystem may also consume events or APIs in the enhancement subsystem. For example, the browse interface (which provides the abstract view) may load data about external content related to particular e-prints to enhance the presentation to readers. Similarly, the indexing agent may consume enhancement events (e.g. funding information) in order to provide additional query functionality in downstream interfaces.

Work on this subsystem involves decoupling the public site from the legacy database and integrating it with the NG announcement subsystem, so that the site can handle more traffic. Additionally, this subsystem encompasses the arXiv API Gateway, which involves providing consistent and thoroughly documented mechanisms for integrating with the arXiv platform programmatically.

Enhancement

This subsystem is a collection of independent services that provide features and data to improve discovery and consumption of arXiv content.

Examples include tools for author disambiguation and linking to external resources (such as code or datasets).

As in the dissemination subsystem, services in the enhancement subsystem are responsible for maintaining their own state. They may consume events from the announcement subsystem, and/or retrieve data from that subsystem via APIs.

A unique requirement of the enhancement subsystem is that (in most cases) services must be designed to seamlessly support both submission identifiers and canonical e-print identifiers. For example, users may provide external links for an e-print during or after the submission process and prior to announcement; the external links service must be attentive to announcement events, update its state with the canonical identifier when that submission is announced, and generate appropriate events so that downstream services in the dissemination subsystem may also update their state.
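The dual-identifier requirement can be illustrated with a minimal sketch of an external links service. Everything here is hypothetical: the ExternalLinksService class, the identifier formats, the "links_updated" event name, and the outbox list (which stands in for a real event stream) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ExternalLinksService:
    # Links keyed by whichever identifier is currently authoritative.
    links: Dict[str, List[str]] = field(default_factory=dict)
    # Mapping from submission identifier to canonical e-print identifier.
    canonical_for: Dict[str, str] = field(default_factory=dict)
    # Stand-in for events published to downstream services.
    outbox: List[Tuple[str, str, List[str]]] = field(default_factory=list)

    def _key(self, identifier: str) -> str:
        """Resolve a submission identifier to its canonical one, if known."""
        return self.canonical_for.get(identifier, identifier)

    def add_link(self, identifier: str, url: str) -> None:
        self.links.setdefault(self._key(identifier), []).append(url)

    def get_links(self, identifier: str) -> List[str]:
        return list(self.links.get(self._key(identifier), []))

    def on_announced(self, submission_id: str, eprint_id: str) -> None:
        """Handle an announcement event by re-keying state."""
        self.canonical_for[submission_id] = eprint_id
        pending = self.links.pop(submission_id, [])
        merged = self.links.setdefault(eprint_id, [])
        for url in pending:
            if url not in merged:
                merged.append(url)
        # Notify downstream consumers that the canonical record has links.
        self.outbox.append(("links_updated", eprint_id, list(merged)))
```

After on_announced runs, lookups by either identifier return the same links, so clients that still hold the submission identifier continue to work.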

Work on this subsystem largely entails development of new services to address specific functionality requested by stakeholders. Most of this work can be performed independently of work in other areas, with a small amount of work focused on integration of submission and/or dissemination components to leverage new functionality.

Authentication & Authorization

This subsystem is responsible for user registration, authentication, and authorization management. It encompasses user interfaces for registration and log-in and endorsement-related functionality. It also provides resources for external developers to obtain access to arXiv APIs, and to perform OAuth2 workflows to support external interfaces (e.g. for submission).

Work on this subsystem involves migrating to a more scalable backend framework for handling authenticated user sessions on the arXiv.org site, adding support for modern authentication and authorization workflows, and replacing the software that supports user registration, endorsement, and access control.

Subsystem architecture