.. _Announcement-integrity:

Announcement Subsystem
**********************

One of the core functions that arXiv provides is long-term high-fidelity
storage of e-print announcements and attendant metadata. Ensuring the integrity
of that service is vital to maintaining trust within the scientific communities
that arXiv serves.

.. contents::
   :depth: 4

Key Requirements
================

- Permanent long-term preservation and availability of arXiv e-prints,
  including original submission source files, rendered PDFs, core metadata, and
  change history;
- Integrity-monitoring functionality, including fixity checksums;
- High availability, horizontal scaling of read access;
- Must issue and resolve permanent URIs.
- Supports a wide range of content types, including text, binary, data, video,
  image, PDF, and other formats.
- High level of redundancy, including full "off site" backups and long-term
  archiving.

Context
=======

- Transitions announcement-ready papers from the submission system to permanent
  canonical repository.
- Generates and disseminates arXiv IDs.
- Generates event notifications about new announcements, for use by other
  agents/services (e.g. to update search index, generate email announcements).
- Provides core metadata and content for use by other services (e.g. search,
  browse, overlap detection).

Canonical record
================

The arXiv canonical record is comprised of all of the announcement events and
e-print versions in the arXiv corpus, including their metadata, source content,
and canonical renderings. This section describes the concepts, structure, and
maintenance of the canonical record.

Goals
-----

The canonical record is the primary source of truth for e-prints announced on
the arXiv platform, and is comprised of both e-print metadata and content
including original submission content and derived PDFs.

- Content is organized so that it can be easily determined what papers were
  announced or altered on a given day.
- At announcement time, the :ref:`announcement-agent` deposits new metadata
  records, source files, and PDFs in the core repository. Changes to this
  content can only occur through the announcement system.
- The underlying storage technology must prioritize long-term durability.
  Availability and response latency characteristics are less important, so long
  as the daily announcement cycle is not significantly delayed (including
  deposit of new metadata and updates to secondary data stores). Since
  announcement is handled by a single process (and we do not expect that to
  change), write performance is much less important than read performance.
- The content of the canonical record is verifiable. A checksum is stored for
  each object that is deposited in the repository that can later be used to
  monitor content integrity.
- It must be easy to back up, replicate, mirror, and share the arXiv canonical
  record. This means that the system should have minimal dependencies (e.g. not
  require obscure software to parse metadata records, not require access to
  other data sources to interpret), and be conceptually as simple as possible.
  The metadata record should be both human readable and readily parsed for
  computational purposes.
- Only publicly consumable information is allowed in the canonical record. This
  minimizes the complexity of access control, and makes it simpler to provide
  public APIs.

Definitions
-----------

The following terms are used throughout this document. They are ordered in a
narrative fashion for the sake of readability for those unfamiliar with the
arXiv data model.

.. glossary::

   Version
      A scientific work comprised of a :term:`Source Package`,
      :term:`Canonical Render`, and a :term:`Metadata Record`.

   Source Package
      The original content of a submission to arXiv provided by the
      :term:`Submitter`.

   Canonical Render
      A representation of the scientific work suitable for consumption by human
      readers. This is usually (but not always) a PDF compiled from TeX sources
      in the :term:`Source Package` at the time of announcement.

   Metadata Record
      A collection of descriptive metadata associated with a scientific work.
      It includes both the descriptive metadata provided by the
      :term:`Submitter` and also details about the :term:`Source Package` and
      :term:`Canonical Render`. *Note: this replaces the concept of the Abs
      File in the Classic system.* See :ref:`canonical-metadata-record`.

   E-Print
      An ordinal collection of :term:`Versions `. The second and subsequent
      :term:`Version` are usually generated by replacement or withdrawal
      :term:`Events `. Each E-Print is assigned a unique
      :term:`arXiv Identifier` at the time of announcement of its first
      :term:`Version`.

   Event
      An announcement-related activity that results in the creation or
      modification of a :term:`Version`. See :ref:`event-types`.

   Listing
      A record containing a subset of :term:`Events `.

   Event Stream
      The chronological series of all :term:`Event ` in the
      :term:`Canonical Record`.

   Submitter
      The human user responsible for transmitting a scientific work to arXiv.
      This may or may not be an author of the work.

   Client
      The system that mediated the transmission of a scientific work to arXiv.

   Canonical Record
      The entire collection of :term:`Events ` and :term:`Versions ` in the
      arXiv system.

   Primary Record
      The authoritative copy of the :term:`Canonical Record`. This is the
      source of truth for all other records and systems. See
      :ref:`primary-announcement-record`.

   Announcement Agent
      The software system that maintains the :term:`Primary Record`, and writes
      the :term:`Event Stream`. See :ref:`announcement-agent`.

   Replica
      A non-authoritative copy of the :term:`Primary Record`. A Replica may be
      Partial (i.e. containing only a subset of :term:`Events ` and
      :term:`Versions `) or Complete.

   Observer
      A system that processes an :term:`Event Stream`.

   Replicant
      An :term:`Observer` that generates and maintains a :term:`Replica`.

   Repository
      A server that provides access to the current state of the
      :term:`Canonical Record` to other systems.

   Announcement
      The creation or modification of a :term:`Version`, usually based on a
      submission.

   Announcement Date
      The date on which a :term:`Version` was :term:`Announced `.

   Original Announcement Date
      The date on which the first :term:`Version` of an :term:`E-Print` was
      :term:`Announced `.

   Fixity Checksum
      The URL-safe Base64-encoded MD5 digest of bytes content. The content may
      be the raw content of a file.

   Manifest
      A record of resources and :term:`Fixity Checksums ` in a part of the
      :term:`Canonical Record`.

   arXiv Identifier
      A unique identifier assigned to an :term:`E-Print` on the day that its
      first :term:`Version` is :term:`Announced `. See
      :ref:`arxiv-identifier`.

   Versioned Identifier
      An :term:`arXiv Identifier` with a version affix, e.g. ``v5``, that
      refers to a specific :term:`Version`. See :ref:`arxiv-identifier`.

Structure of the Canonical Record
---------------------------------

Each resource in the :term:`Canonical Record` is stored as a bitstream value in
a key-value store.

The key prefix structure for a Version record is:

.. code-block:: plain

   e-prints////v/

Where ``YYYY`` is the year and ``MM`` the month during which the first
:term:`Version` of the :term:`E-Print` was announced.

Sub-keys are:

- Metadata record: ``v.json``
- Source package: ``v.tar.gz``
- PDF: ``v.pdf``

The purpose of this record is to provide the ultimate source of truth regarding
a particular E-Print and its Versions.

:term:`Events ` are stored in :term:`Listing` files. The key prefix structure
for a :term:`Listing` file is:

.. code-block:: plain

   announcement////

``YYYY`` is the year, ``MM`` the month, and ``DD`` the day on which the
:term:`Events ` encoded therein occurred and on which the subordinate
:term:`Listing` files were generated.

Each daily key prefix may contain one or more sub-keys. Each sub-key ending in
``.json`` is treated as a :term:`Listing` file.

.. _canonical-metadata-record:

Canonical metadata record
-------------------------

The canonical metadata record stored by the metadata repository is a descendant
of the classic "abs file". Because of the strong integrity goals of the
canonical record, and the cost of keeping external systems up to date with the
canonical record, the core metadata record is intended to house only the most
essential descriptive and procedural metadata that are not expected to change
frequently (if at all).

Metadata that are highly dynamic and/or that are likely to change outside of
the daily announcement process are not good candidates for the canonical
metadata record. For example, author-curated links associated with a paper
should not be included in the canonical metadata record, as they are subject to
frequent modification by the author after announcement.

The classic abs file contains the following fields, all of which we will
continue to support. Many of these fields are described in ``_.

- :term:`arXiv Identifier`.
- :term:`Submitter` ("from"); includes full name and email address.
- Submission dates: includes (separately) submission dates for all previous
  versions up to and including the version being described. E.g. if a paper has
  four versions, the record for v3 includes the submission dates for versions
  1, 2, and 3 (but not v4).
- Authors: this is the full author string in the canonical arXiv format.
- Categories: this includes primary and secondary categories.
- Comments: short text field containing notes by the submitter and/or admins.
- License: URI of the license selected by the submitter.
- Title.
- Abstract.
- DOI (optional).
- Report number (optional).
- MSC classes (optional).
- ACM classes (optional).

In addition, the NG canonical metadata record introduces the following fields:

- Creation date: the ISO-8601 timestamp when the :term:`Version` described by
  the record was created.
- Updated date: the ISO-8601 timestamp when the :term:`Version` described by
  the record was last modified.
- Changes: an array of :term:`Events ` describing changes made to the
  :term:`Version` described by the record. Each entry includes an ISO-8601
  timestamp and a brief description of the change that was made. The intent is
  to reduce the overhead of describing the history of an :term:`E-Print`.
- Language: the primary language (ISO 639-2) of the scientific work.
- Admin notes: notes by admins (for example, overlap notes), kept separate from
  the Comments field.
- Withdrawn: a boolean field that indicates whether the paper is withdrawn.
- Withdrawal reason: a comment field that explains why the paper is withdrawn.

.. _consistency:

Consistency
-----------

In order to efficiently verify the completeness and integrity of a
:term:`Replica`, and to identify the source of inconsistencies, consistency
checks are performed at several levels of granularity. The completeness and
integrity of all or a part of the :term:`Canonical Record` can be verified by
comparing the checksum values at the corresponding level of granularity. The
way in which checksum values are calculated for each level is described below.
This is inspired by the strategy for checksum validation of large chunked
uploads to AWS S3.

All checksum values are MD5 hashes, stored and transmitted as URL-safe
Base64-encoded strings.

Completeness and fixity can be verified at each level as follows:

======== ================================= ==================================
Level    Completeness                      Integrity
======== ================================= ==================================
File     Presence/absence of a key         Hash of the content bitstream.
         in key-value store.
Version  Number and names of files.        Hash of the concatenated (ordered
                                           lexicographically by name of the
                                           content file) file integrity
                                           hashes.
E-print  Number of versions.               Hash of the concatenated version
                                           integrity hashes, sorted by
                                           ascending version number.
Day      Presence of all e-print keys.     Hash of the concatenated e-print
                                           integrity hashes, sorted
                                           lexicographically by identifier.
Month    Presence of keys for all calendar Hash of the concatenated day
         days in the month.                integrity hashes, sorted
                                           chronologically.
Year     Presence of keys for all calendar Hash of the concatenated month
         months in the year.               integrity hashes, sorted
                                           chronologically.
All      Presence of keys for all years    Hash of the concatenated year
         since 1991.                       integrity hashes, sorted
                                           chronologically.
======== ================================= ==================================

The keys of members and each of their calculated checksums are recorded in
:ref:`manifest-records`. These manifests are considered to be outside of the
canonical record itself, so their integrity should be ensured independently.

.. _manifest-records:

Manifest records
^^^^^^^^^^^^^^^^

Manifest records are maintained at each level of the :term:`Canonical Record`.
Each record contains a mapping of member keys to integrity checksums.

For example, a manifest record for a particular year would contain a mapping
like:

.. code-block:: python

   {
       "2021-01": "[ fixity checksum for 2021-01 ]",
       "2021-02": "[ fixity checksum for 2021-02 ]",
       "2021-03": "[ fixity checksum for 2021-03 ]",
       "2021-04": "[ fixity checksum for 2021-04 ]",
       "2021-05": "[ fixity checksum for 2021-05 ]",
       "2021-06": "[ fixity checksum for 2021-06 ]",
       "2021-07": "[ fixity checksum for 2021-07 ]",
       "2021-08": "[ fixity checksum for 2021-08 ]",
       "2021-09": "[ fixity checksum for 2021-09 ]",
       "2021-10": "[ fixity checksum for 2021-10 ]",
       "2021-11": "[ fixity checksum for 2021-11 ]",
       "2021-12": "[ fixity checksum for 2021-12 ]"
   }

.. _preservation-record:

Preservation record
-------------------

The preservation record is a daily digest containing the :term:`Versions ` and
:term:`Events ` for that day. The purpose of the preservation record is to
facilitate long-term archiving of arXiv content in cases where direct
replication of the canonical record is unwanted or impractical. For example,
many long-term dark archive service providers ingest content and transform it
into a normalized format.

The key structure of the daily preservation record is:

.. code-block:: plain

   announcement/.json                # Events for the day.
   e-prints/v/
       v.json                        # Metadata Record
       v.tar.gz                      # Source Package
       v.pdf                         # Canonical Render.
       v.manifest.json               # Version Manifest.
   suppress/v/tombstone
   preservation.manifest.json

The ``preservation.manifest.json`` record is similar to the version manifest
record; it contains all of the keys and corresponding checksums for the items
in the preservation record.

The ``suppress/`` key prefix is used to indicate :term:`Versions ` the contents
of which (for legal reasons) have been removed from the
:term:`Canonical Record` and should be suppressed from dissemination in the
event that the archive is "lit up" for public consumption. The ``tombstone``
record is a UTF-8 encoded plain text file containing a brief account of the
reason for suppression.

.. _event-types:

Event types
-----------

The following types of :term:`Event` are supported. The "Δ Content" column
indicates whether or not the :term:`Event` results in modification or addition
of content. All events result in changes to :term:`E-Print` metadata via
modification or creation of a :term:`Metadata Record` for a :term:`Version`.

================ ================================================== =========
Event type       Description                                        Δ Content
================ ================================================== =========
new              Creation of the first :term:`Version` of an        Yes
                 :term:`E-Print`.
update           Changes to an existing :term:`Version`.            Yes
update_metadata  Changes to the metadata of an existing             No
                 :term:`Version`.
replace          Creation of the second or subsequent               Yes
                 :term:`Version` of an :term:`E-Print`.
cross            Addition of secondary classification terms to an   No
                 existing :term:`Version`.
jref             Modification of the DOI, journal reference, and/or No
                 report number metadata fields. Deprecated.
withdraw         Creation of a new :term:`Version` of an            Yes
                 :term:`E-Print` that declares the :term:`E-Print`
                 to be withdrawn. This :term:`Version` has no
                 associated content.
migrate          A change to the structure or format of a           Yes
                 :term:`Version`. For example, adoption of a new
                 encoding, or addition of a new bitstream.
migrate_metadata A change to the structure or format of the         No
                 metadata of a :term:`Version`. For example,
                 addition of a new core metadata field.
================ ================================================== =========

.. _announcement-completion:

Announcement completion
^^^^^^^^^^^^^^^^^^^^^^^

At the end of the daily announcement process, the announcement agent shall
generate an ``announcement_complete`` event on the :term:`Event Stream`. This
event contains a summary of the announcements for that day.

Protocol
========

This section describes how the :term:`Canonical Record` is updated and
replicated.

Announcement
------------

:term:`Events ` are generated by the :term:`Announcement Agent` which
implements arXiv-specific logic for selecting and preparing announcement-ready
submissions. For example:

- Query the submission subsystem for announcement-ready submissions.
- Retrieve the :term:`Source Package` for each submission, and generate the
  :term:`Canonical Render` using the :ref:`compilation-service`.
- Generate the :term:`Metadata Record`.

The :term:`Announcement Agent` shall generate a set of :term:`Events ` based on
the aforementioned preparations. For each :term:`Event` the
:term:`Announcement Agent` shall:

1. Write the :term:`Event`, and the :term:`Metadata Record`,
   :term:`Source Package`, and :term:`Canonical Render` of the attendant
   :term:`Version` to the :term:`Primary Record`.
#. Create or update the :term:`Manifest` records with the new changes.
#. Encode and publish the :term:`Event` on the :term:`Event Stream`, along with
   new/updated fixity checksums.

The :term:`Announcement Agent` shall process one :term:`Event` at a time. In
the case that any one of the above steps should fail permanently (i.e. all
reasonable retry attempts are exhausted) the :term:`Announcement Agent` shall
log an error message and exit.

.. _protocol-replication:

Replication
-----------

Upon receipt of a new :term:`Event` on the :term:`Event Stream`, a
:term:`Replicant` shall:

1. Decode the :term:`Event` and fixity checksums.
#. Dereference URIs, e.g. indicating the location of the
   :term:`Source Package` and :term:`Canonical Render` bitstreams.
#. Write the :term:`Event`, and the :term:`Metadata Record`,
   :term:`Source Package`, and :term:`Canonical Render` of the attendant
   :term:`Version` to the :term:`Replica`.
#. Create or update the :term:`Manifest` records with the new changes. This
   includes the fixity checksums.
#. Verify that the new checksums match the checksums decoded along with the
   :term:`Event`.

The :term:`Replicant` shall process one :term:`Event` at a time, in order. In
the case that the :term:`Replicant` should fail to process an :term:`Event`, it
must log an error message and exit.

.. _protocol-catch-up:

Catch-up
--------

This section describes how a new :term:`Replicant` shall come up to date if it
falls behind in processing :term:`Events ` from the :term:`Event Stream`.

Each :term:`Event` is assigned a monotonically incrementing integer value,
starting at 0 on each announcement day.
The :term:`Events ` for each announcement day, as well as the years, months,
and days on which :term:`Events ` were generated may be retrieved via the
:term:`Repository` for the :term:`Primary Record`.

The :term:`Replicant` shall connect to and retrieve the first :term:`Event` in
the :term:`Event Stream`. It shall then call the :term:`Repository` for the
:term:`Primary Record` to retrieve and process all :term:`Events ` prior to the
loaded one, in chronological order. During catch-up, the :term:`Replicant`
shall update its local fixity :term:`Manifests `, but shall not verify
checksums until all prior events are processed.

Once all prior :term:`Events ` are processed, the :term:`Replicant` shall
process the initially loaded :term:`Event` and proceed with replication as
normal.

Other :term:`Observers ` may use a similar approach if catch-up is required.

Domains & Services
==================

This section describes the domains within the announcement system, and the
services that implement and maintain the canonical record.

.. _figure-ng-announcement-overview:

.. figure:: ../_static/diagrams/ng-announcement.png
   :width: 600px

   Overview of announcement services.

.. contents::
   :depth: 2
   :local:

.. _primary-announcement-record:

Primary announcement record
---------------------------

:fa:`github` https://github.com/arXiv/arxiv-canonical

The :term:`Primary Record` is the authoritative source of truth about the
:term:`Canonical Record`. It is comprised of the :ref:`canonical record ` data
itself, an :ref:`announcement-agent` that operates the daily announcement
process and updates the record with data from the submission subsystem, and a
:ref:`primary-repository` that provides a read-only API for the canonical
record (including :term:`Source Package` and :term:`Canonical Render`) to other
services.

.. _primary-canonical-record:

Primary canonical record data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The primary canonical record is stored in Amazon Simple Storage Service (S3),
which provides a RESTful HTTP API for setting and getting binary content based
on simple keys. Access to the canonical record bucket is strictly limited to
the :ref:`announcement-agent` and the :ref:`primary-repository`.

:term:`Listings ` and :term:`Metadata Records ` are serialized in JSON, and
stored as UTF-8 encoded text files. JSON Schema documents are used to describe
and validate the metadata records, and are stored in
https://github.com/arXiv/arxiv-canonical.

.. _announcement-agent:

Announcement agent
^^^^^^^^^^^^^^^^^^

The announcement agent is a periodic process that runs in a single thread once
per day. The announcement agent is responsible for:

1. Getting publishable :term:`E-Prints ` from the submission system, and
   marking them as :term:`Announced `.
2. Minting new :term:`arXiv Identifiers `.
3. Writing to the :ref:`primary-announcement-record`.
4. Populating the :term:`Event Stream` with information about each
   :term:`Event` in order to facilitate downstream work by other
   agents/services (e.g. to update the search index, generate announcement
   emails).

.. _primary-repository:

Primary repository
^^^^^^^^^^^^^^^^^^

The primary :term:`Repository` is a small web service that provides a RESTful
JSON API for resources in the :term:`Primary Record`. Operations supported by
this API include:

- Retrieve metadata for a specific :term:`Version` of an :term:`E-Print`.
- Retrieve the :term:`Source Package` or :term:`Canonical Render` for a
  specific :term:`Version` of an :term:`E-Print`.
- Retrieve a summary of all :term:`Versions ` of an :term:`E-Print`.
- Retrieve all of the :term:`Events ` associated with a :term:`Version` or an
  :term:`E-Print`.
- Retrieve all of the :term:`Events ` in a specific time period, and/or that
  correspond to other content-based filters (such as primary or secondary
  classification).

This API is used by other services in the arXiv NG system, as well as external
clients, to access the canonical record.

.. note::

   It will be strongly preferable to include a backend cache in the
   implementation of the repository software, as requests for particular
   resources or sets of resources in a given period of time are likely to be
   biased toward a small subset of all resources.

.. _mirrors:

Mirrors
-------

Historically, member institutions have operated a global network of mirror
sites, but many mirrors have become unavailable over time due to lack of
maintenance and aging technology. In late 2015, the decision was made to begin
dismantling the mirror network, because maintaining the mirrors had become an
impediment to developing new features on arXiv. Currently, only four of the
original thirteen mirrors remain open and are updated daily.

Yet many of the drivers that motivated the existence of the mirrors in the
first place, such as resilience to sustained outages and geopolitical
redundancy, continue to exist. This section describes the implementation of
arXiv NG mirrors as :term:`Replicants ` in the canonical record framework.

.. _mirror-replicant-agent:

Mirror replicant-agent
^^^^^^^^^^^^^^^^^^^^^^

The mirror :term:`Replicant` is an agent process that consumes the
:term:`Event Stream` via the :ref:`protocol-replication` protocol described
above. Like all other NG applications, it runs as a standalone Docker container
that can be deployed on any infrastructure.

As mentioned above, the backing storage system must support storage and
retrieval of binary content by key. The software for the canonical record
supports both local filesystem and S3 storage backends.

.. _mirror-repository-api:

Mirror repository API
^^^^^^^^^^^^^^^^^^^^^

The mirror deploys the same repository API software as the
:ref:`primary-repository`, backed by the :term:`Replica` maintained by the
:ref:`mirror-replicant-agent`.

Mirror site
^^^^^^^^^^^

The mirror can provide the public arXiv site by deploying the :ref:`browse`
application, which leverages the :ref:`mirror-repository-api`.

Archival replicant
------------------

The archival :term:`Replicant` is responsible for assembling and distributing
the daily :ref:`preservation-record`. Like the :ref:`mirror-replicant-agent`,
the archival replicant processes the :term:`Event Stream` in real time. Rather
than writing the :term:`Events ` in the canonical format, however, it adds
records (dereferencing and downloading bitstreams from the
:ref:`primary-repository` as needed) to the daily preservation record. The
record closes at midnight ET, at which time the preservation record is pushed
to third-party archives for processing. The archival replicant maintains a log
of interactions with third-party systems.
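
As an illustration of the last step before the preservation record is pushed,
the following is a minimal sketch of how an archival replicant might compute
fixity checksums (URL-safe Base64-encoded MD5 digests, as defined above) and
assemble the ``preservation.manifest.json`` mapping. The function names, the
in-memory ``items`` mapping, and the example keys are assumptions made for this
sketch; they are not taken from the arxiv-canonical codebase.

```python
import base64
import hashlib
import json
from typing import Dict, Mapping


def fixity_checksum(content: bytes) -> str:
    """Return the URL-safe Base64-encoded MD5 digest of ``content``."""
    digest = hashlib.md5(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def build_preservation_manifest(items: Mapping[str, bytes]) -> Dict[str, str]:
    """Map every key in a (hypothetical) daily record to its checksum.

    ``items`` maps record keys to raw bitstreams. Keys are sorted so that
    the manifest is deterministic for a given set of items.
    """
    return {key: fixity_checksum(items[key]) for key in sorted(items)}


def serialize_manifest(manifest: Mapping[str, str]) -> bytes:
    """Serialize the manifest as UTF-8 encoded JSON."""
    return json.dumps(manifest, indent=2, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    # Hypothetical keys and contents, for illustration only.
    example_items = {
        "example/item.json": b'{"title": "An example"}',
        "example/item.pdf": b"%PDF-1.5 example bytes",
    }
    manifest = build_preservation_manifest(example_items)
    print(serialize_manifest(manifest).decode("utf-8"))
```

A third-party archive could recompute the same checksums on receipt and compare
them against the manifest. Because the manifest sits outside the content it
describes, its own integrity would still need to be ensured separately.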