Announcement Subsystem ¶

One of the core functions that arXiv provides is long-term high-fidelity storage of e-print announcements and attendant metadata. Ensuring the integrity of that service is vital to maintaining trust within the scientific communities that arXiv serves.

Key Requirements ¶

Permanent long-term preservation and availability of arXiv e-prints, including original submission source files, rendered PDFs, core metadata, and change history.
Integrity-monitoring functionality, including fixity checksums.
High availability and horizontal scaling of read access.
Issuance and resolution of permanent URIs.
Support for a wide range of content types, including text, binary, data, video, image, PDF, and other formats.
High level of redundancy, including full “off site” backups and long-term archiving.

Context ¶

Transitions announcement-ready papers from the submission system to permanent canonical repository.
Generates and disseminates arXiv IDs.
Generates event notifications about new announcements, for use by other agents/services (e.g. to update search index, generate email announcements).
Provides core metadata and content for use by other services (e.g. search, browse, overlap detection).

Canonical record ¶

The arXiv canonical record is comprised of all of the announcement events and e-print versions in the arXiv corpus, including their metadata, source content, and canonical renderings.

This section describes the concepts, structure, and maintenance of the canonical record.

Goals ¶

The canonical record is the primary source of truth for e-prints announced on the arXiv platform, and is comprised of both e-print metadata and content including original submission content and derived PDFs.

Content is organized so that it can be easily determined what papers were announced or altered on a given day.
At announcement time, the Announcement agent deposits new metadata records, source files, and PDFs in the core repository. Changes to this content can only occur through the announcement system.
The underlying storage technology must prioritize long-term durability. Availability and response latency characteristics are less important, so long as the daily announcement cycle is not significantly delayed (including deposit of new metadata and updates to secondary data stores). Since announcement is handled by a single process (and we do not expect that to change), write performance is much less important than read performance.
The content of the canonical record is verifiable. A checksum is stored for each object that is deposited in the repository that can later be used to monitor content integrity.
It must be easy to backup, replicate, mirror, and share the arXiv canonical record. This means that the system should have minimal dependencies (e.g. not require obscure software to parse metadata records, not require access to other data sources to interpret), and be conceptually as simple as possible. The metadata record should be both human readable and readily parsed for computational purposes.
Only publicly consumable information is allowed in the canonical record. This minimizes the complexity of access control, and makes it simpler to provide public APIs.

Definitions ¶

The following terms are used throughout this document. They are ordered in a narrative fashion for the sake of readability for those unfamiliar with the arXiv data model.

Version: A scientific work comprised of a Source Package, Canonical Render, and a Metadata Record.
Source Package: The original content of a submission to arXiv provided by the Submitter.
Canonical Render: A representation of the scientific work suitable for consumption by human readers. This is usually (but not always) a PDF compiled from TeX sources in the Source Package at the time of announcement.
Metadata Record: A collection of descriptive metadata associated with a scientific work. It includes both the descriptive metadata provided by the Submitter and details about the Source Package and Canonical Render. Note: this replaces the concept of the Abs File in the Classic system. See Canonical metadata record.
E-Print: An ordinal collection of Versions. The second and subsequent Versions are usually generated by replacement or withdrawal Events. Each E-Print is assigned a unique arXiv Identifier at the time of announcement of its first Version.
Event: An announcement-related activity that results in the creation or modification of a Version. See Event types.
Listing: A record containing a subset of Events.
Event Stream: The chronological series of all Events in the Canonical Record.
Submitter: The human user responsible for transmitting a scientific work to arXiv. This may or may not be an author of the work.
Client: The system that mediated the transmission of a scientific work to arXiv.
Canonical Record: The entire collection of Events and Versions in the arXiv system.
Primary Record: The authoritative copy of the Canonical Record. This is the source of truth for all other records and systems. See Primary announcement record.
Announcement Agent: The software system that maintains the Primary Record, and writes the Event Stream. See Announcement agent.
Replica: A non-authoritative copy of the Primary Record. A Replica may be Partial (i.e. containing only a subset of Events and Versions) or Complete.
Observer: A system that processes an Event Stream.
Replicant: An Observer that generates and maintains a Replica.
Repository: A server that provides access to the current state of the Canonical Record to other systems.
Announcement: The creation or modification of a Version, usually based on a submission.
Announcement Date: The date on which a Version was Announced.
Original Announcement Date: The date on which the first Version of an E-Print was Announced.
Fixity Checksum: The URL-safe Base64-encoded MD5 digest of binary content. The content may be the raw content of a file, or the concatenated checksums of the members of a larger unit (see Consistency).
Manifest: A record of resources and Fixity Checksums in a part of the Canonical Record.
arXiv Identifier: A unique identifier assigned to an E-Print on the day that its first Version is Announced. See arxiv-identifier.
Versioned Identifier: An arXiv Identifier with a version affix, e.g. v5, that refers to a specific Version. See arxiv-identifier.
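
The Fixity Checksum defined above can be computed with the Python standard library alone. The following is a minimal sketch; the function name fixity_checksum is an illustrative assumption and is not taken from the arXiv codebase.

```python
import base64
import hashlib


def fixity_checksum(content: bytes) -> str:
    """Return the URL-safe Base64-encoded MD5 digest of ``content``."""
    digest = hashlib.md5(content).digest()  # 16 raw bytes
    # urlsafe_b64encode substitutes '-' and '_' for '+' and '/'.
    return base64.urlsafe_b64encode(digest).decode("ascii")


# The checksum of an empty bitstream.
print(fixity_checksum(b""))  # 1B2M2Y8AsgTpgAmY7PhCfg==
```

Because an MD5 digest is always 16 bytes, every checksum produced this way is a 24-character string, including two characters of padding.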

Structure of the Canonical Record ¶

Each resource in the Canonical Record is stored as a bitstream value in a key-value store.

The key prefix structure for a Version record is:

e-prints/<YYYY>/<MM>/<arXiv ID>/v<version>/

Where YYYY is the year and MM the month during which the first Version of the E-Print was announced.

Sub-keys are:

Metadata record: <arXiv ID>v<version>.json
Source package: <arXiv ID>v<version>.tar.gz
PDF: <arXiv ID>v<version>.pdf

The purpose of this record is to provide the ultimate source of truth regarding a particular E-Print and its Versions.
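
As an illustration of the key layout above, the sketch below assembles the three keys for a Version. The function names and the dictionary labels are hypothetical, and the sketch assumes a new-style identifier; old-style identifiers that contain a slash may need additional handling that is not described here.

```python
from typing import Dict


def version_prefix(identifier: str, version: int, year: int, month: int) -> str:
    """Key prefix for a Version; year and month are those of the first Version."""
    return f"e-prints/{year:04d}/{month:02d}/{identifier}/v{version}/"


def version_keys(identifier: str, version: int, year: int, month: int) -> Dict[str, str]:
    """Return the metadata, source, and PDF keys for a Version."""
    prefix = version_prefix(identifier, version, year, month)
    stem = f"{identifier}v{version}"
    return {
        "metadata": f"{prefix}{stem}.json",
        "source": f"{prefix}{stem}.tar.gz",
        "pdf": f"{prefix}{stem}.pdf",
    }


keys = version_keys("2101.00123", 2, 2021, 1)
print(keys["metadata"])  # e-prints/2021/01/2101.00123/v2/2101.00123v2.json
```

Note that the year and month in the prefix stay fixed across all Versions of an E-Print, since they refer to the announcement of the first Version.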

Events are stored in Listing files. The key prefix structure for a Listing file is:

announcement/<YYYY>/<MM>/<DD>/

YYYY is the year, MM the month, and DD the day on which the Events encoded therein occurred and on which the subordinate Listing files were generated.

Each daily key prefix may contain one or more sub-keys. Each sub-key ending in .json is treated as a Listing file.
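
A small sketch of how Listing files for a given day might be located follows. It assumes the caller already has an iterable of keys from the key-value store; the function names are hypothetical.

```python
from datetime import date
from typing import Iterable, List


def listing_prefix(day: date) -> str:
    """Daily key prefix under which Listing files are stored."""
    return f"announcement/{day.year:04d}/{day.month:02d}/{day.day:02d}/"


def listing_keys(all_keys: Iterable[str], day: date) -> List[str]:
    """Select the keys under the daily prefix that end in .json."""
    prefix = listing_prefix(day)
    return sorted(
        key for key in all_keys
        if key.startswith(prefix) and key.endswith(".json")
    )


print(listing_prefix(date(2021, 3, 5)))  # announcement/2021/03/05/
```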

Canonical metadata record ¶

The canonical metadata record stored by the metadata repository is a descendant of the classic “abs file”.

Because of the strong integrity goals of the canonical record, and the cost of keeping external systems up to date with the canonical record, the core metadata record is intended to house only the most essential descriptive and procedural metadata that are not expected to change frequently (if at all). Metadata that are highly dynamic and/or likely to change outside of the daily announcement process are not good candidates for the canonical metadata record. For example, author-curated links associated with a paper should not be included in the canonical metadata record, as they are subject to frequent modification by the author after announcement.

The classic abs file contains the following fields, all of which we will continue to support. Many of these fields are described in https://arxiv.org/help/prep.

arXiv Identifier.
Submitter (“from”); includes full name and email address.
Submission dates: includes (separately) submission dates for all previous versions up to and including the version being described. E.g. if a paper has four versions, the record for v3 includes the submission dates for versions 1, 2, and 3 (but not v4).
Authors: this is the full author string in the canonical arXiv format.
Categories: this includes primary and secondary categories.
Comments: short text field containing notes by the submitter and/or admins.
License: URI of the license selected by the submitter.
Title.
Abstract.
DOI (optional).
Report number (optional).
MSC classes (optional).
ACM classes (optional).

In addition, the NG canonical metadata record introduces the following fields:

Creation date: the ISO-8601 timestamp when the Version described by the record was created.
Updated date: the ISO-8601 timestamp when the Version described by the record was last modified.
Changes: an array of Events describing changes made to the Version described by the record. Each entry includes an ISO-8601 timestamp and a brief description of the change that was made. The intent is to reduce the overhead of describing the history of an E-Print.
Language: the primary language (ISO 639-2) of the scientific work.
Admin notes: separate the admin notes from the Comments field (for example overlap notes).
Withdrawn: a boolean field that indicates whether the paper is withdrawn.
Withdrawal reason: a comment field that explains why the paper is withdrawn.
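
To make the shape of such a record concrete, the sketch below models a subset of the fields listed above as a Python dataclass and serializes it to JSON. Every field name here is a hypothetical choice made for illustration, and several fields (for example the submitter, DOI, and classification fields) are omitted for brevity; the authoritative definitions are the JSON Schema documents mentioned elsewhere in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MetadataRecord:
    # Field names are illustrative assumptions, not the canonical schema.
    arxiv_id: str
    version: int
    title: str
    abstract: str
    authors: str                      # full author string
    primary_category: str
    secondary_categories: List[str]
    license: str                      # URI of the selected license
    created: str                      # ISO-8601 timestamp
    updated: str                      # ISO-8601 timestamp
    language: str = "eng"             # ISO 639-2 code
    comments: Optional[str] = None
    admin_notes: Optional[str] = None
    is_withdrawn: bool = False
    withdrawal_reason: Optional[str] = None
    changes: List[dict] = field(default_factory=list)


record = MetadataRecord(
    arxiv_id="2101.00123", version=1, title="An example title",
    abstract="An example abstract.", authors="A. Author, B. Author",
    primary_category="cs.DL", secondary_categories=["cs.IR"],
    license="http://creativecommons.org/licenses/by/4.0/",
    created="2021-01-04T20:00:00+00:00", updated="2021-01-04T20:00:00+00:00",
)
# Indented, key-sorted JSON is both human readable and trivially parsed.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
```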

Consistency ¶

In order to efficiently verify the completeness and integrity of a Replica, and to identify the source of inconsistencies, consistency checks are performed at several levels of granularity. The completeness and integrity of all or a part of the Canonical Record can be verified by comparing the checksum values at the corresponding level of granularity. The way in which checksum values are calculated for each level is described below. This is inspired by the strategy for checksum validation of large chunked uploads to AWS S3. All checksum values are MD5 digests, stored and transmitted as URL-safe Base64-encoded strings.

Completeness and fixity can be verified at each level as follows:

Level	Completeness	Integrity
File	Presence/absence of a key in key-value store.	Hash of the content bitstream.
Version	Number and names of files.	Hash of the concatenated file integrity hashes, ordered lexicographically by the name of the content file.
E-print	Number of versions.	Hash of the concatenated version integrity hashes, sorted by ascending version number.
Day	Presence of all e-print keys.	Hash of the concatenated e-print integrity hashes, sorted lexicographically by identifier.
Month	Presence of keys for all calendar days in the month.	Hash of the concatenated day integrity hashes, sorted chronologically.
Year	Presence of keys for all calendar months in the year.	Hash of the concatenated month integrity hashes, sorted chronologically.
All	Presence of keys for all years since 1991.	Hash of the concatenated year integrity hashes, sorted chronologically.
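
The levels in the table above can be computed bottom-up. The sketch below covers the File, Version, and E-print levels; higher levels follow the same pattern with a different sort order. It assumes that "concatenated" means joining the encoded checksum strings before hashing. The text does not say whether the encoded strings or the raw digest bytes are concatenated, so treat that choice, along with all function names, as an assumption.

```python
import base64
import hashlib
from typing import Dict, List


def checksum(content: bytes) -> str:
    """File level: URL-safe Base64-encoded MD5 digest of a bitstream."""
    return base64.urlsafe_b64encode(hashlib.md5(content).digest()).decode("ascii")


def composite_checksum(ordered_members: List[str]) -> str:
    """Hash the concatenation of member checksums that are already ordered."""
    return checksum("".join(ordered_members).encode("ascii"))


def version_checksum(files: Dict[str, bytes]) -> str:
    """Version level: members ordered lexicographically by file name."""
    return composite_checksum([checksum(files[name]) for name in sorted(files)])


def eprint_checksum(versions: Dict[int, Dict[str, bytes]]) -> str:
    """E-print level: members ordered by ascending version number."""
    return composite_checksum(
        [version_checksum(versions[number]) for number in sorted(versions)]
    )
```

Because each level hashes only the checksums of its members, a mismatch at the Year level can be traced down through Month, Day, E-print, and Version to the individual file without re-reading unaffected content.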

The keys of members and each of their calculated checksums are recorded in Manifest records. These manifests are considered to be outside of the canonical record itself, so their integrity should be ensured independently.

Manifest records ¶

Manifest records are maintained at each level of the Canonical Record. Each record contains a mapping of member keys to integrity checksums. For example, a manifest record for a particular year would contain a mapping like:

{
    "2021-01": "[ fixity checksum for 2021-01 ]",
    "2021-02": "[ fixity checksum for 2021-02 ]",
    "2021-03": "[ fixity checksum for 2021-03 ]",
    "2021-04": "[ fixity checksum for 2021-04 ]",
    "2021-05": "[ fixity checksum for 2021-05 ]",
    "2021-06": "[ fixity checksum for 2021-06 ]",
    "2021-07": "[ fixity checksum for 2021-07 ]",
    "2021-08": "[ fixity checksum for 2021-08 ]",
    "2021-09": "[ fixity checksum for 2021-09 ]",
    "2021-10": "[ fixity checksum for 2021-10 ]",
    "2021-11": "[ fixity checksum for 2021-11 ]",
    "2021-12": "[ fixity checksum for 2021-12 ]"
}
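
Comparing two manifests at the same level gives both the completeness check (are all member keys present?) and the integrity check (do the checksums agree?). A minimal sketch, with a hypothetical function name and result layout:

```python
from typing import Dict, List


def diff_manifests(primary: Dict[str, str], replica: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare a Replica manifest against the Primary manifest for one level."""
    return {
        # Completeness: keys the Replica lacks.
        "missing": sorted(key for key in primary if key not in replica),
        # Keys the Replica has but the Primary does not.
        "unexpected": sorted(key for key in replica if key not in primary),
        # Integrity: keys present in both whose checksums differ.
        "mismatched": sorted(
            key for key in primary
            if key in replica and primary[key] != replica[key]
        ),
    }


primary = {"2021-01": "aaa", "2021-02": "bbb", "2021-03": "ccc"}
replica = {"2021-01": "aaa", "2021-02": "xxx"}
print(diff_manifests(primary, replica))
# {'missing': ['2021-03'], 'unexpected': [], 'mismatched': ['2021-02']}
```

Any key reported as mismatched identifies the member whose own manifest should be inspected next, one level down.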

Preservation record ¶

The preservation record is a daily digest containing the Versions and Events for that day. The purpose of the preservation record is to facilitate long-term archiving of arXiv content in cases where direct replication of the canonical record is unwanted or impractical. For example, many long-term dark archive service providers ingest content and transform it into a normalized format.

The key structure of the daily preservation record is:

announcement/<listing>.json               # Events for the day.
e-prints/<arXiv ID>v<version>/
    <arXiv ID>v<version>.json             # Metadata Record
    <arXiv ID>v<version>.tar.gz           # Source Package
    <arXiv ID>v<version>.pdf              # Canonical Render.
    <arXiv ID>v<version>.manifest.json    # Version Manifest.
suppress/<arXiv ID>v<version>/tombstone
preservation.manifest.json

The preservation.manifest.json record is similar to the version manifest record; it contains all of the keys and corresponding checksums for the items in the preservation record.

The suppress/ key prefix is used to indicate Versions whose contents have been removed from the Canonical Record (for legal reasons) and should be suppressed from dissemination in the event that the archive is “lit up” for public consumption. The tombstone record is a UTF-8 encoded plain text file containing a brief account of the reason for suppression.
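
The following sketch shows one way the contents of preservation.manifest.json could be derived from the items in a daily preservation record, and how suppressed Versions could be listed from the suppress/ prefix. The in-memory dictionary standing in for the record and the function names are assumptions made for illustration.

```python
import base64
import hashlib
from typing import Dict, List

MANIFEST_KEY = "preservation.manifest.json"


def fixity_checksum(content: bytes) -> str:
    return base64.urlsafe_b64encode(hashlib.md5(content).digest()).decode("ascii")


def preservation_manifest(record: Dict[str, bytes]) -> Dict[str, str]:
    """Map every key in the record (except the manifest itself) to its checksum."""
    return {
        key: fixity_checksum(content)
        for key, content in sorted(record.items())
        if key != MANIFEST_KEY
    }


def suppressed_versions(record: Dict[str, bytes]) -> List[str]:
    """Versioned identifiers that have a tombstone under suppress/."""
    head, tail = "suppress/", "/tombstone"
    return sorted(
        key[len(head):-len(tail)]
        for key in record
        if key.startswith(head) and key.endswith(tail)
    )
```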

Event types ¶

The following types of Event are supported. The “Δ Content” column indicates whether or not the Event results in modification or addition of content. All events result in changes to E-Print metadata via modification or creation of a Metadata Record for a Version.

Event type	Description	Δ Content
new	Creation of the first Version of an E-Print.	Yes
update	Changes to an existing Version.	Yes
update_metadata	Changes to the metadata of an existing Version.	No
replace	Creation of the second or subsequent Version of an E-Print.	Yes
cross	Addition of secondary classification terms to an existing Version.	No
jref	Modification of the DOI, journal reference, and/or report number metadata fields. Deprecated.	No
withdraw	Creation of a new Version of an E-Print that declares the E-Print to be withdrawn. This Version has no associated content.	Yes
migrate	A change to the structure or format of a Version. For example, adoption of a new encoding, or addition of a new bitstream.	Yes
migrate_metadata	A change to the structure or format of the metadata of a Version. For example, addition of a new core metadata field.	No
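
The table above maps naturally onto an enumeration. The sketch below encodes the event type names and the “Δ Content” column exactly as listed; the class and attribute names are illustrative assumptions.

```python
from enum import Enum


class EventType(Enum):
    NEW = "new"
    UPDATE = "update"
    UPDATE_METADATA = "update_metadata"
    REPLACE = "replace"
    CROSS = "cross"
    JREF = "jref"  # deprecated
    WITHDRAW = "withdraw"
    MIGRATE = "migrate"
    MIGRATE_METADATA = "migrate_metadata"

    @property
    def changes_content(self) -> bool:
        """True if the Event modifies or adds content (the Δ Content column)."""
        return self in _CONTENT_EVENTS


_CONTENT_EVENTS = frozenset({
    EventType.NEW,
    EventType.UPDATE,
    EventType.REPLACE,
    EventType.WITHDRAW,
    EventType.MIGRATE,
})

print(EventType("cross").changes_content)  # False
```

An Observer that only tracks metadata, such as a search indexer, could use changes_content to decide whether it needs to fetch any bitstreams at all.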

Announcement completion ¶

At the end of the daily announcement process, the announcement agent shall generate an announcement_complete event on the Event Stream. This event contains a summary of the announcements for that day.

Protocol ¶

This section describes how the Canonical Record is updated and replicated.

Announcement ¶

Events are generated by the Announcement Agent which implements arXiv-specific logic for selecting and preparing announcement-ready submissions. For example:

Query the submission subsystem for announcement-ready submissions.
Retrieve the Source Package for each submission, and generate the Canonical Render using the Compilation service.
Generate the Metadata Record.

The Announcement Agent shall generate a set of Events based on the aforementioned preparations. For each Event the Announcement Agent shall:

Write the Event, and the Metadata Record, Source Package, and Canonical Render of the attendant Version to the Primary Record.
Create or update the Manifest records with the new changes.
Encode and publish the Event on the Event Stream, along with new/updated fixity checksums.

The Announcement Agent shall process one Event at a time. In the case that any one of the above steps should fail permanently (i.e. all reasonable retry attempts are exhausted) the Announcement Agent shall log an error message and exit.
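
The per-Event procedure above can be sketched as a simple loop. The three callables stand in for the write, manifest, and publish steps; their names, the retry count, and the PermanentFailure exception are all assumptions, since the text only requires that reasonable retries be exhausted before the agent logs an error and exits.

```python
import logging
import sys
from typing import Any, Callable, Iterable

logger = logging.getLogger("announcement-agent")


class PermanentFailure(Exception):
    """Raised when a step still fails after all retry attempts."""


def with_retries(step: Callable[[], None], attempts: int = 3) -> None:
    """Run ``step``, retrying on any exception up to ``attempts`` times."""
    for attempt in range(1, attempts + 1):
        try:
            step()
            return
        except Exception as exc:  # a real agent would be more selective
            logger.warning("Attempt %d of %d failed: %s", attempt, attempts, exc)
            last_error = exc
    raise PermanentFailure(str(last_error)) from last_error


def announce(
    events: Iterable[Any],
    write: Callable[[Any], None],
    update_manifests: Callable[[Any], None],
    publish: Callable[[Any], None],
    attempts: int = 3,
) -> None:
    """Process one Event at a time; log an error and exit on permanent failure."""
    for event in events:
        try:
            with_retries(lambda: write(event), attempts)             # step 1
            with_retries(lambda: update_manifests(event), attempts)  # step 2
            with_retries(lambda: publish(event), attempts)           # step 3
        except PermanentFailure:
            logger.error("Permanent failure on event %r; exiting.", event)
            sys.exit(1)
```

Exiting rather than skipping the failed Event keeps the Primary Record and the Event Stream from diverging: no later Event is published until the failed one has been dealt with.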

Replication ¶

Upon receipt of a new Event on the Event Stream, a Replicant shall:

Decode the Event and fixity checksums.
Dereference URIs, e.g. indicating the location of the Source Package and Canonical Render bitstreams.
Write the Event, and the Metadata Record, Source Package, and Canonical Render of the attendant Version to the Replica.
Create or update the Manifest records with the new changes. This includes the fixity checksums.
Verify that the new checksums match the checksums decoded along with the Event.

The Replicant shall process one Event at a time, in order. In the case that the Replicant should fail to process an Event, it must log an error message and exit.
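
A sketch of the Replicant steps follows. The decoded Event is assumed to be a dictionary with a "resources" mapping of keys to URIs and a "checksums" mapping of keys to expected fixity checksums; this layout, the fetch callable, and the dictionaries standing in for the Replica and its Manifest are hypothetical.

```python
import base64
import hashlib
import logging
import sys
from typing import Callable, Dict, Iterable

logger = logging.getLogger("replicant")


class ChecksumMismatch(Exception):
    """Raised when a locally computed checksum differs from the Event's."""


def fixity_checksum(content: bytes) -> str:
    return base64.urlsafe_b64encode(hashlib.md5(content).digest()).decode("ascii")


def process_event(
    event: dict,
    fetch: Callable[[str], bytes],
    replica: Dict[str, bytes],
    manifest: Dict[str, str],
) -> None:
    for key, uri in event["resources"].items():
        content = fetch(uri)                      # dereference the URI
        replica[key] = content                    # write to the Replica
        manifest[key] = fixity_checksum(content)  # update the Manifest
    for key, expected in event["checksums"].items():
        if manifest.get(key) != expected:         # verify against the Event
            raise ChecksumMismatch(key)


def run(events: Iterable[dict], fetch, replica, manifest) -> None:
    """Process Events one at a time, in order; log and exit on any failure."""
    for event in events:
        try:
            process_event(event, fetch, replica, manifest)
        except Exception:
            logger.exception("Failed to process event; exiting.")
            sys.exit(1)
```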

Catch-up ¶

This section describes how a Replicant shall come up to date when it is new or has fallen behind in processing Events from the Event Stream.

Each Event is assigned a monotonically increasing integer value, starting at 0 on each announcement day. The Events for each announcement day, as well as the years, months, and days on which Events were generated, may be retrieved via the Repository for the Primary Record.

The Replicant shall connect to the Event Stream and retrieve the first available Event. It shall then call the Repository for the Primary Record to retrieve and process all Events prior to that Event, in chronological order.

During catch-up, the Replicant shall update its local fixity Manifests, but shall not verify checksums until all prior events are processed.

Once all prior Events are processed, the Replicant shall process the initially loaded Event and proceed with replication as normal.

Other Observers may use a similar approach if catch-up is required.
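
The catch-up procedure can be sketched as follows. Events are ordered by announcement day and then by their per-day integer value. The Event tuple, the apply and verify_all callables, and the idea of passing the full history as an iterable are assumptions for illustration; a real Replicant would page through the Repository API instead.

```python
from datetime import date
from typing import Callable, Iterable, NamedTuple


class Event(NamedTuple):
    day: date        # announcement day
    sequence: int    # integer value, starting at 0 each day
    payload: str


def position(event: Event):
    """Sort key that places Events in chronological order."""
    return (event.day, event.sequence)


def catch_up(
    first_live: Event,
    history: Iterable[Event],
    apply: Callable[[Event, bool], None],
    verify_all: Callable[[], None],
) -> None:
    """Replay Events prior to ``first_live``, verify, then resume normally."""
    prior = sorted(
        (e for e in history if position(e) < position(first_live)),
        key=position,
    )
    for event in prior:
        apply(event, False)   # update Manifests, skip checksum verification
    verify_all()              # verify once all prior Events are processed
    apply(first_live, True)   # normal replication resumes with this Event
```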

Domains & Services ¶

This section describes the domains within the announcement system, and the services that implement and maintain the canonical record.

Fig. 19 Overview of announcement services.¶

Primary announcement record
Mirrors
Archival replicant

Primary announcement record ¶

https://github.com/arXiv/arxiv-canonical

The Primary Record is the authoritative source of truth about the Canonical Record. It is comprised of the canonical record data itself, an Announcement agent that operates the daily announcement process and updates the record with data from the submission subsystem, and a Primary repository that provides a read-only API for the canonical record (including Source Package and Canonical Render) to other services.

Primary canonical record data ¶

The primary canonical record is stored in Amazon Simple Storage Service (S3), which provides a RESTful HTTP API for setting and getting binary content based on simple keys. Access to the canonical record bucket is strictly limited to the Announcement agent and the Primary repository.

Listings and Metadata Records are serialized in JSON, and stored as UTF-8 encoded text files. JSON Schema documents are used to describe and validate the metadata records, and are stored in https://github.com/arXiv/arxiv-canonical.

Announcement agent ¶

The announcement agent is a periodic process that runs in a single thread once per day. The announcement agent is responsible for:

Getting publishable E-Prints from the submission system, and marking them as Announced.
Minting new arXiv Identifiers.
Writing to the Primary announcement record.
Populating the Event Stream with information about each Event in order to facilitate downstream work by other agents/services (e.g. to update the search index, generate announcement emails).

Primary repository ¶

The primary Repository is a small web service that provides a RESTful JSON API for resources in the Primary Record. Operations supported by this API include:

Retrieve metadata for a specific Version of an E-Print.
Retrieve the Source Package or Canonical Render for a specific Version of an E-Print.
Retrieve a summary of all Versions of an E-Print.
Retrieve all of the Events associated with a Version or an E-Print.
Retrieve all of the Events in a specific time period, and/or that correspond to other content-based filters (such as primary or secondary classification).

This API is used by other services in the arXiv NG system, as well as external clients, to access the canonical record.

Note

It will be strongly preferable to include a backend cache in the implementation of the repository software, as requests for particular resources or sets of resources in a given period of time are likely to be biased toward a small subset of all resources.

Mirrors ¶

Historically, member institutions have operated a global network of mirror sites, but many mirrors have become unavailable over time due to lack of maintenance and aging technology. In late 2015, the decision was made to begin dismantling the mirror network, as maintaining it had become an impediment to developing new features on arXiv.

Currently, only four of the original thirteen mirrors remain open and are updated daily. Yet many of the drivers that motivated the existence of the mirrors in the first place—such as resilience to sustained outages and geopolitical redundancy—continue to exist.

This section describes the implementation of arXiv NG mirrors as Replicants in the canonical record framework.

Mirror replicant-agent ¶

The mirror Replicant is an agent process that consumes the Event Stream via the Replication protocol described above. Like all other NG applications, it runs as a standalone Docker container that can be deployed on any infrastructure.

As mentioned above, the backing storage system must support storage and retrieval of binary content by key. The software for the canonical record supports both local filesystem and S3 storage backends.
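
The requirement amounts to a very small interface: put bytes under a key, and get them back. The sketch below expresses that interface as a typing Protocol with a local filesystem implementation. It is not the interface of the canonical record software; the class and method names are hypothetical, and an S3-backed class would implement the same two methods.

```python
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """Minimal key-value contract for a Replica's backing store."""

    def put(self, key: str, content: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class FilesystemStorage:
    """Store each key as a file below a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        path = (self.root / key).resolve()
        # Refuse keys that would escape the root directory.
        if self.root not in path.parents:
            raise ValueError(f"Invalid key: {key!r}")
        return path

    def put(self, key: str, content: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

Because the keys of the Canonical Record already look like relative paths, the filesystem backend can map them to files directly, which keeps a local Replica easy to inspect with ordinary tools.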

Mirror repository API ¶

The mirror deploys the same repository API software as the Primary repository, backed by the Replica maintained by the Mirror replicant-agent.

Mirror site ¶

The mirror can provide the public arXiv site by deploying the browse application, which leverages the Mirror repository API.

Archival replicant ¶

The archival Replicant is responsible for assembling and distributing the daily Preservation record. Like the Mirror replicant-agent, the archival replicant processes the Event Stream in real time. Rather than writing the Events in the canonical format, however, it adds records (dereferencing and downloading bitstreams from the Primary repository as needed) to the daily preservation record.

The record closes at midnight ET, at which time the preservation record is pushed to third-party archives for processing. The archival replicant maintains a log of interactions with third-party systems.