Current development priorities & milestones

Introduction

This document describes the development priorities for the arXiv IT team as part of the arXiv-NG project. This is not an exhaustive list of features and improvements. The focus here is on major changes that impact large portions of the system. Each development goal is accompanied by one or more specific deliverables. This document is updated throughout the project.

How do I request a new feature or improvement?

You can contact us by writing to help@arxiv.org. As our interfaces are replaced during the NG project, you can also provide comments (yes, we do read them!) by clicking on our “Feedback” button. Please keep in mind that while we can’t reply in detail to every request or comment that we receive, your feedback is extremely valuable and is certainly taken into account when making design decisions.

Our initial screening is based on whether the request is obviously consistent with arXiv’s core mission and principles. If it’s obviously not, then we’ll be quick to say “no.”

If your request is something that we’ve heard before, we’ll give you the same answer that we provided to the first requester. This might be, “no, sorry”, or “we have added this to our backlog”. Reasons for us to include something on the backlog include that it provides value for our users, and that it’s consistent with our mission and big-picture goals.

What does it mean for a request to be “on our backlog”?

It’s actually not as bad as it sounds. Our backlog is where we put all of the things that we want to do, including really important things. Each ticket on the backlog is assigned a relative priority (e.g. “low”, “critical”, etc).

When we start work on a new release for a software component, for example to deliver one of the milestones described below, we look at the backlog and identify requests (starting with the highest priority ones) that are appropriate and achievable as part of that release.

Who makes the final decision about what requests will be fulfilled and when?

The arXiv management team is led by our Program Director and our Scientific Director, and includes the Operations Manager, IT Team Lead, and Lead System Architect. The management team is ultimately responsible for approving and prioritizing development goals. The management team is advised by a Technical Advisory Group, and our Directors are advised by our Scientific Advisory Board and Members Advisory Board.

How do you test and release new components and features?

You can read about our testing and release process here.

Infrastructure

Secrets management infrastructure - Critical

The classic system has no secrets management system to speak of: database passwords, access keys, and other sensitive data are stored throughout the codebase in configuration files or code. Not only is this poor practice, but it doesn’t scale well. Our goal is to set up a secure system for creating, storing, and distributing secrets on a least-privilege basis.

Specific deliverables

  1. Done. We have deployed and are now using HashiCorp Vault running in Amazon Web Services (AWS) to manage secrets for all NG components deployed in the cloud.

Centralized logging - Critical

Web server logs are a critical data asset for arXiv. Logs are used to investigate and troubleshoot outages and monitor the system for suspicious activity. We also aggregate usage data from our server logs to provide feedback to our member institutions and other stakeholders about arXiv’s global impact.

In the classic system, server logs are stored as raw flat files on the server filesystem. Log rotation and backup are robust. However, log analysis is slow and cumbersome, requiring a large amount of developer time every year to produce relatively simple data products for reporting purposes. Making improvements to the classic log analysis is difficult, and the analysis itself requires a great deal of manual work that could be automated. The lack of real-time analysis also hampers the IT team’s ability to see patterns and trends in server traffic and performance. Our goal is to set up a secure and scalable log store that allows us to store and analyze logs that originate both from the classic system and from NG components deployed on cloud infrastructure.

Specific deliverables

  1. Done. We have deployed and are now using a combination of Logstash, Elasticsearch, and related tools as a centralized log store for classic and NG systems.

  2. We will use this system for our 2019 reports to member institutions.
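
As a rough illustration of the kind of aggregation that a centralized log store automates, the sketch below parses raw access-log lines into structured records and counts successful requests per path. It assumes the logs follow the Common Log Format; the sample lines, paths, and function names are hypothetical, and this is not the Logstash/Elasticsearch pipeline itself.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a structured record, or None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_successful_requests(lines: Iterable[str]) -> Counter:
    """Count requests that returned HTTP 200, grouped by request path."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["status"] == "200":
            counts[record["path"]] += 1
    return counts


# Made-up sample data for illustration only.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2019:00:00:01 +0000] "GET /abs/1234.5678 HTTP/1.1" 200 5120',
    '192.0.2.2 - - [01/Jan/2019:00:00:02 +0000] "GET /abs/1234.5678 HTTP/1.1" 200 5120',
    '192.0.2.3 - - [01/Jan/2019:00:00:03 +0000] "GET /missing HTTP/1.1" 404 312',
    'not a log line',
]
usage = count_successful_requests(SAMPLE)
```

Once each line is a structured record, per-institution or per-category reports reduce to queries over those fields rather than ad hoc scripts over flat files.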

DevOps & cloud deployment - Critical

The classic system is built around a “servers as pets” paradigm, in which the system runs on a handful of machines that are long-lived and manually configured. Manual configuration, maintenance, and deployment tend to be error-prone, make horizontal scaling more costly, and are difficult to test and verify.

Our goal is to be able to deploy and update new versions of software automatically, and to be able to test and verify those deployments programmatically. We need to be able to deploy, monitor, and scale individual components of the system so that we can better utilize infrastructure resources and respond to fluctuations in load gracefully. Provisioning and deprovisioning server resources should require little or no manual intervention. This will free the IT team up to focus on developing improvements to the arXiv software itself.

Specific deliverables

  1. Done. We have deployed and are now using Kubernetes in Amazon Web Services to run NG web services.

  2. Done. We are using the Travis CI platform for continuous integration (running automated tests on each code change).

  3. In progress. We are beginning to use Travis CI for continuous deployment (automatically deploying new versions of software to our Kubernetes cluster).

  4. In progress. Monitoring & alerting. Kubernetes also brings with it a variety of monitoring tools that allow us to better characterize the behavior of the system, including load, resource utilization, and performance. We are working to connect those tools to our existing alerting system.

Backup, recovery, and failover - Critical

Protecting arXiv data from loss is a critical priority of the arXiv-NG project. The arXiv system stores data in a variety of forms and facilities. The infrastructure on which the classic system runs does have robust backup protections in place to guard against data loss, and support recovery scenarios in the case of local failures. As we adopt new technologies and infrastructure, our goal is to maintain a high level of awareness of risks to our data, and how those data are protected. In addition, we want to provide better protection against long-term risks as we continuously improve our stewardship practices.

Specific deliverables

  1. Ongoing. We have developed a risk model for data preservation that can inform backup and redundancy decisions. This work involves developing threat models, inventories of data storage facilities and their redundancy characteristics, and recovery mechanisms for various failure scenarios. See Backup & Recovery.

  2. In progress. We will develop a new policy for service outages that describes what constitutes a service outage, and how we communicate about outages to our end users and other stakeholders.

  3. In progress. We will establish a third-party dark archive for our core e-print metadata and content, to further protect against long-term risks to the arXiv platform and organization. This depends on the data architecture changes described in Data architecture for e-prints - Critical.

Submission & moderation

Data architecture for submission - Critical

The classic arXiv submission system is built around an object-centric data model. Submissions are represented as objects whose properties map to rows in a database table, and workflows are implemented by developing web controllers that mutate those objects (and the underlying rows). This model works well for simple systems in which there is a single point of entry for submission data. A limitation of the classic architecture is that it requires new submission interfaces to reimplement the commands (and rules) that it exposes, and to reimplement updates to the administrative log.

Our goal is to produce a data architecture and supporting software that:

  • Allows for a range of interfaces and automated systems to operate on submissions with a consistent and maintainable set of transformation rules.

  • Provides a complete and auditable history of commands performed on the submission.

  • Makes it easier to implement configurable rules-based workflows and policies, including hooks for automated processes.

We are achieving this by treating instances of command execution (events) themselves as data. You can read more about the approach here.
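
As a rough illustration of treating command executions as data, the sketch below records each command as an immutable event and derives the current state of a submission by replaying those events in order. The class and field names (`Submission`, `Event`, `SetTitle`, `Finalize`, `replay`) are hypothetical and are not taken from the arXiv submission software.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Submission:
    """Current state of a submission, derived entirely from events."""
    title: Optional[str] = None
    finalized: bool = False


@dataclass(frozen=True)
class Event:
    """A command that was executed, recorded as data (hypothetical base class)."""
    creator: str
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def apply(self, submission: Submission) -> Submission:
        raise NotImplementedError


@dataclass(frozen=True)
class SetTitle(Event):
    title: str = ""

    def apply(self, submission: Submission) -> Submission:
        # Return a new state rather than mutating the old one.
        return replace(submission, title=self.title)


@dataclass(frozen=True)
class Finalize(Event):
    def apply(self, submission: Submission) -> Submission:
        return replace(submission, finalized=True)


def replay(events: List[Event]) -> Submission:
    """Rebuild the submission state by applying every event in order."""
    state = Submission()
    for event in events:
        state = event.apply(state)
    return state


history = [
    SetTitle(creator="alice", title="Draft"),
    SetTitle(creator="alice", title="Final title"),
    Finalize(creator="alice"),
]
current = replay(history)
```

Because the event list is never rewritten, it doubles as the audit history, and any interface that can append events gets the same transformation rules for free.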

Specific deliverables

  1. In progress. We have produced a software package that implements an event-centric data architecture while preserving integrations with classic components that depend on the submission database. This is used by the NG submission UI and API (below).

  2. We will extend that software package to support existing moderation workflows, which will be necessary to replace the backend for moderation UI later on.

  3. Later in the project, when all other classic components that depend on the submission database have been replaced, we will migrate submission data to an isolated data store that is better optimized for the NG submission data architecture.

Service architecture for backend operations - Critical

Many low-level operations that are important for submission and moderation workflows are entangled in the classic codebase in ways that make it difficult to recompose them in new ways that might improve the end-user experience. An important part of our work on the submission and moderation system is to isolate those units of functionality, and reimplement or encapsulate them as standalone services.

Specific deliverables

  1. In progress. Reimplement upload and file management routines as a standalone file management service. Not only does this make it easier to implement upload policies, it also provides better isolation of potentially hazardous content from the rest of the system. See https://github.com/arxiv/arxiv-filemanager.

  2. In progress. Encapsulate LaTeX compilation functionality (the “TeX tree”) as a standalone compilation service. This provides better isolation; indeed, poor isolation of the compilation process has led to outages in the past. This will also allow us to provide API access to our compilation process to trusted platforms. See https://github.com/arxiv/arxiv-compiler.

  3. Done. Implement a plaintext extraction service, with help from Paul Ginsparg and Matt Bierbaum in CIS. This is required for overlap detection, classification checks, and other QA/QC processes. See arXiv Fulltext Extraction Service.

  4. In progress. Encapsulate the auto-classifier as a production service in core arXiv, with help from Paul Ginsparg and Matt Bierbaum in CIS.

  5. In progress. Encapsulate overlap detection routines as a standalone service in core arXiv, with help from Paul Ginsparg and Matt Bierbaum in CIS. This will be used to facilitate full text search; see Full text search - Low.

Submission UI - High

The inflexibility of the classic submission UI has been a roadblock for long-requested improvements to the user experience. Reimplementing the submission UI using the NG data architecture for submission, and composing NG backend services, will put us in a position to implement improvements much more efficiently. Our goal is to produce a submission UI that initially has feature-parity with the classic UI, but is much more extensible. We are doing this by building on the software package that we have developed, and taking a fresh approach to the UI/UX design with extensive input from users.

Specific deliverables

  1. In progress. We are producing a feature-parity submission UI using NG technology and architectural patterns. We will deploy this in parallel to the classic UI after extensive alpha testing with users, and allow for a long public beta period during which we will make further refinements. See https://github.com/arxiv/arxiv-submission-ui.

  2. Once the initial public beta has ended for the NG submission UI, we will begin introducing more significant improvements from the backlog. This includes things like front-loading classification checks, providing feedback about possible overlap, and providing a much more user-friendly representation of TeX log output.

  3. When Secondary metadata - Moderate functionality has been implemented (below), we will incorporate interfaces to facilitate adding those metadata during or after the submission process.

  4. When Author name disambiguation - High functionality has been implemented (below), we will incorporate interfaces to facilitate adding author information during the submission process.

Submission API - Moderate

The classic system includes an API for bulk deposit based on version 1 of the SWORD protocol. This API is used by several key user groups, including overlay journals and conferences. We are aware of a great deal of interest from additional groups and platforms that serve our user base in using APIs for submission. Unfortunately, the design of the classic submission API, especially its authorization model, makes it difficult to support those use-cases in a way that respects user controls and maintains direct engagement between authors and the arXiv platform.

Replacing the classic submission API is necessary to ensure that all entry-points into the submission system do so via the NG submission data architecture described above.

Our goal is to offer a simple, modern submission API that can replace the classic SWORDv1 API, and that takes advantage of the authentication and authorization features developed as part of the API Gateway (below).

Specific deliverables

  1. In progress. We are developing a submission API behind the API Gateway, available for use by trusted partners. This depends on the infrastructure, client registry, and authorization support milestones of the API Gateway project. See https://github.com/arxiv/arxiv-submission-core/tree/develop/metadata.

Moderation, automation, & scaling - Critical

Although the moderation interface developed in 2017 made much-needed functional and usability improvements, its reliance on the classic system prevents more substantial changes to address significant stresses posed by the increasing volume of material handled by the small number of volunteer arXiv moderators. Our goal is to support the kinds of sophisticated rules-based and semi-automated workflows requested by our stakeholders, which will require replacing the backend components of the moderation system.

Specific deliverables

  1. Done. In response to moderator feedback, we developed a new moderation interface as a single-page application backed by APIs built into the classic system.

  2. We will reimplement the backend components of the moderation system using the Data architecture for submission - Critical (above). This will require extension of the submission core software package to support moderation commands. The first version will target feature-parity with the existing moderation workflows, plus opportunistic improvements.

  3. We will implement interfaces for administrators and moderators to specify handling rules for submissions, incorporating backend services like the auto classifier.

Announcement & metadata

Data architecture for e-prints - Critical

The core data architecture for e-print metadata and content poses several challenges for scaling, development, and preservation that are a high priority for NG. The classic system makes heavy use of a shared filesystem, which presents serious scaling and availability issues that have already led to unacceptable outages. It also uses a non-standard serialization format, which makes maintenance and further engineering difficult. Some parts of what are considered the core scientific record are distributed across core metadata records and database tables, making the public site brittle and limiting our failover options during outages.

Our goal is to increase the reliability of the public site, and the durability of the core scientific record, by migrating to a simpler and more robust architecture for our most critical data. This will allow us to replicate data in a much more reliable and cost-effective way, including to the planned third-party dark archive and to mirror sites.

Specific deliverables

  1. In progress. We are developing a formal schema for core arXiv metadata records and the announcement log, and a software package that implements that schema for controlled writing and reading.

  2. We will migrate core announcement metadata and content off of the shared filesystem, and onto a cloud-based key-binary store. This will allow us to scale our public site much more effectively, reducing the frequency and severity of outages. See Primary announcement record.

Announcement - Critical

The classic announcement process involves the daily execution of a single script that performs a large number of serial tasks culminating in distribution of e-mails to subscribers about new e-prints on the arXiv platform. Each of these steps is performed on a single host. This serial, single-host architecture makes the announcement process brittle, and difficult to develop further. Our goal is to improve the resilience of the announcement process by decomposing it into tasks that can be executed asynchronously and/or on distributed hardware.

Specific deliverables

  1. As we put into place the changes to core data architecture mentioned above, we will replace parts of the legacy publishing routine with refactored code that is testable and well-documented. This will include adding hooks that generate system notifications that can be consumed by services running on separate hardware.

  2. We will reimplement the e-mail announcement routine as a standalone service that provides better interfaces for users to manage subscriptions (filling a variety of long-standing feature requests related to this functionality). This will leverage the notification architecture and search backend, allowing us to run the daily e-mail announcements on separate infrastructure and provide a richer set of subscription options. See announcement-service.

  3. We will implement a webhook API based on the notification architecture to update API clients about new e-prints. This will alleviate the need for partner platforms and other API users to poll our system for updates. We anticipate that this will significantly reduce load on our APIs. See announcement-service.
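
The notification hooks described in these deliverables can be pictured as a publish/subscribe pattern: the publishing routine emits a notification once, and independent consumers (the e-mail service, a webhook dispatcher) react to it on their own. The sketch below uses an in-memory stand-in for a message broker; the class name, topic name, payload fields, and webhook URL are all hypothetical, and a production system would deliver notifications across hosts rather than within one process.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[Dict[str, str]], None]


class NotificationBus:
    """Minimal in-memory stand-in for a notification broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a consumer for a topic."""
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: Dict[str, str]) -> int:
        """Deliver the payload to every subscriber; return how many were notified."""
        handlers = self._handlers[topic]
        for handler in handlers:
            handler(payload)
        return len(handlers)


# Two independent consumers: one queues e-mail digests, one queues webhook calls.
email_queue: List[Dict[str, str]] = []
webhook_queue: List[Tuple[str, Dict[str, str]]] = []

bus = NotificationBus()
bus.subscribe("eprint.announced", email_queue.append)
bus.subscribe(
    "eprint.announced",
    lambda payload: webhook_queue.append(("https://client.example/hook", payload)),
)

# The publishing routine only needs to emit the notification once.
delivered = bus.publish("eprint.announced", {"paper_id": "1234.5678", "version": "1"})
```

With this shape, API clients that register a webhook are pushed each new e-print and no longer need to poll for updates.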

Author name disambiguation - High

The arXiv metadata model does not support individuation or disambiguation of author names. At submission time, submitters are asked to provide a single string with the names of all of the authors of a paper that follows a canonical format, but the format is not strictly enforced. Users have an expectation that we can provide both precise and accurate lists of papers written by individual authors; for example, that they can click on an author’s name on one paper, and see a list of all of the other papers on arXiv by that same person. While we do have a mechanism for authors to “claim” co-authorship, and so can relate papers to user accounts, we don’t have a reliable way to match those accounts to parsed fragments of the canonical author string in the metadata. Most users don’t realize this, and express frustration when they discover this fact.

Our goal is to provide mechanisms for disambiguating parts of the canonical author string, but do so in a way that minimizes additional complexity during the submission process, can be updated after announcement, and does not require curation by arXiv staff. We will do this by implementing an annotation-based service that allows paper owners to perform disambiguation themselves, with help from automated tools. We have benefitted from extensive input from the Metadata Services group within Cornell University Library during the planning process.

Specific deliverables

  1. We will develop an application that allows paper owners to disambiguate names on their papers, leveraging ORCID, arXiv authority records, and authority records curated by partner platforms (e.g. INSPIRE).

  2. We will incorporate suggestions from a disambiguation engine that suggests possible matches between named authors in arXiv papers.

  3. We will incorporate disambiguation data into our display and search functionality, so that users can get the best possible discovery experience given the available data.

Secondary metadata - Moderate

A wide range of requirements and feature requests that we have received from stakeholders and end users involve attaching relational metadata to arXiv e-prints. This includes things like funding information, links to datasets, code, and other online content, and better support for information about the published version of record. Including this kind of relational metadata in the core arXiv metadata record is a poor fit even after those records are reengineered, given the way that e-prints are versioned, the requirement that secondary metadata be maintainable outside of the submission process, and the requirement that support for secondary metadata be as evolvable and extensible as possible. Additionally, we need to bring forward into NG the automated routines that we use to harvest relational metadata (e.g. DOIs, journal citations) from other publishing platforms; a shortcoming of the classic system is that the provenance of these kinds of metadata is not tracked, which presents challenges for our partners to interpret and use those metadata downstream.

Our goal is to support the accession, display, and reuse of secondary metadata by implementing a separate data structure and backend service.

Specific deliverables

  1. We will implement a service that handles secondary metadata, starting with improved support for information about a paper’s version of record. As part of this work, we will reimplement the harvesting routines that collect metadata from external sources to enhance arXiv secondary metadata. This service will be accessible via the arXiv API Gateway.

  2. We will extend the service to support addition of author-curated links by paper-owners, and display those links on the abstract page.

  3. We will extend the service to support additional high-value metadata elements, based on input from member institutions and other stakeholders.

Public site (Browse)

Decouple the public site from classic infrastructure - Critical

The public “read only” parts of the classic system (what we call “browse”) rely on a monolithic database and an enormous networked filesystem. This design is not very resilient to outages, and makes the site difficult to replicate. In order to provide a truly fault-tolerant public site, our goal is to redesign the browse system so that it can be deployed on both the classic infrastructure and in the cloud.

Specific deliverables

  1. Done. We reimplemented the e-print abstract page using NG technology. Our goal was to precisely replicate the classic abstract page. The bulk of the work (and the primary value) was to facilitate deep analysis of the integrations that will need to be disentangled going forward.

  2. In progress. In conjunction with work on the Data architecture for e-prints - Critical, we will implement the remaining routes in the browse application, with the objective of eliminating dependency on the database and simplifying the dependency on the filesystem.

  3. In conjunction with work on the Data architecture for e-prints - Critical, we will extend the browse application to run entirely on a key-binary store, so that it can be deployed in Kubernetes. We will initially use this as a failover deployment in the case of the main site going down, and will ultimately shift to serving the main site from the cloud.

Improve accessibility, usability of the site - High

The classic site is difficult to use with screen readers and other assistive technology. While our users value the austere, retro look-and-feel of the site, we have prioritized incremental UI changes that will significantly improve accessibility. These improvements will be made throughout the project in tandem with UI redesigns.

Migration & revamp of public help documentation - High

The arXiv public documentation has accumulated a great deal of cruft over time, and is difficult to maintain. As part of a broader review of public documentation, the operations team is migrating static documents from the public site to a version-controlled repository in markdown format. The goal is to make it easier to maintain, version, and deploy the public documentation without extensive technical knowledge.

Specific deliverables

  1. In progress. We will deploy a reimplemented static site for help pages and other public documentation.

Reimplemented RSS feeds - Moderate

Our RSS feeds are used by partners and other API users for a wide range of downstream purposes that provide value to arXiv users. It will be important to continue to provide these RSS feeds as we reengineer the arXiv platform.

Specific deliverables

  1. We will reimplement the arXiv RSS feeds as a standalone application on top of the Elasticsearch cluster that we deployed for search.

Accounts & Authorization

Scalable session management & authentication - Critical

The classic system relies on a session management architecture that is tied to a centralized database. We require a mechanism for authenticating users that accommodates our architectural goals, including decreased reliance on a central database and incremental migration to cloud-based infrastructure.

Specific deliverables

  1. Done. We have deployed a distributed session store that is available for both on-premises and cloud-based applications.

  2. Done. We have implemented a new software package for user accounts and session management that integrates with the classic system.

  3. Done. We have reimplemented the authentication mechanisms in the classic system with a new NG application that uses the distributed session store and NG authentication and authorization package.

  4. We have implemented and deployed a new service that is used to handle authentication in our Kubernetes cluster in AWS.

User registration and profile management - Critical

Managing user accounts is a pain point for our users, including account recovery, account verification, and handling of duplicate accounts. In order to implement other high-priority changes to our authentication and authorization system, as well as authorization workflows for our APIs, we will need to wrest control of user registration from the classic system. Our goal is to reimplement user registration and profile management in the NG architecture, and make opportunistic improvements to user experience.

Specific deliverables

  1. In progress. We are expanding the NG authentication application to support user registration, account recovery, verification, and profile management.

Role based access control - High

The classic system uses a combination of flags and access-control lists to authorize users for pre-defined sets of activities within the arXiv system. As the number of authorized users grows, and as workflows for moderation and administration evolve, this arrangement has become increasingly limiting and is unwieldy to maintain. Our access control model must also support varying levels of trust afforded to API consumers and partner platforms. Our goal is to support a more flexible authorization model that puts more control in the hands of the operations team.

Specific deliverables

  1. We will extend the NG authentication and authorization package developed for Scalable session management & authentication - Critical to support the concept of user roles, which can be created and managed by the operations team. Full implementation depends on the replacement of classic endpoints that rely on authorization, which includes submission, moderation, and administrative interfaces.
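
To make the contrast with per-user flags concrete, the sketch below shows a minimal role-based check: permissions attach to roles, users hold roles, and an authorization decision is a lookup over the union of a user's roles. The role names and permission strings are hypothetical and do not reflect the actual arXiv role definitions.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical role definitions; in practice these would be managed by the
# operations team rather than hard-coded.
ROLES: Dict[str, FrozenSet[str]] = {
    "submitter": frozenset({"submission:create", "submission:read"}),
    "moderator": frozenset({"submission:read", "submission:classify"}),
    "admin": frozenset({"submission:read", "submission:classify", "role:manage"}),
}


def permissions_for(user_roles: Set[str]) -> FrozenSet[str]:
    """Union of the permissions granted by each of the user's roles."""
    granted: Set[str] = set()
    for role in user_roles:
        # Unknown roles grant nothing.
        granted |= ROLES.get(role, frozenset())
    return frozenset(granted)


def is_authorized(user_roles: Set[str], permission: str) -> bool:
    """True if any of the user's roles grants the requested permission."""
    return permission in permissions_for(user_roles)
```

Changing what moderators may do then means editing one role definition instead of updating flags on every moderator account.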

Improved endorsement system - Moderate

The classic endorsement system has been a roadblock for users not directly affiliated with academic institutions. Given the importance of arXiv for users in industries outside of higher education, our goal is to make the endorsement process much more user-friendly and transparent.

Specific deliverables

  1. We will extend the NG authentication application to replace the endorsement mechanisms in the classic system. We will work closely with the operations and moderation teams to design an endorsement workflow that reduces pain for new users while maintaining the core objectives of the classic endorsement system.

API Gateway

Infrastructure for consolidated access control, documentation - High

The classic system lacks consistent authentication and authorization mechanisms for API clients. In order to expose new APIs securely, and support varying levels of access for trusted partners (including the functionality described for the submission API), we require an authn/z mechanism that implements current best practices for security and access control. In addition, the documentation for our APIs is scattered, obscure, and hard to find. Our goal is to significantly improve the API user experience by implementing access control mechanisms that conform to contemporary best practices, and by providing documentation and other resources in a consistent, easy-to-use format in a single location.

Specific deliverables

  1. Alpha. We have implemented an API client registration application that allows authenticated arXiv users to obtain credentials for accessing NG APIs. This implements the OAuth2 protocol for two-legged and three-legged authorization.

  2. Ongoing. We have adopted the OpenAPI 3.0 and JSON Schema (draft-07) standards for documenting our public and backend APIs, and have implemented and will deploy an application to aggregate and display that documentation for both human and programmatic consumption.
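
For readers unfamiliar with two-legged OAuth2 authorization, it corresponds to the client credentials grant, in which a registered client exchanges its own credentials for an access token without involving a user. The sketch below only builds such a token request and does not send it. The token URL, client ID, client secret, and scope name are placeholders, not actual arXiv endpoints or values.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url: str, client_id: str,
                        client_secret: str, scope: str) -> Request:
    """Build (but do not send) an OAuth2 client-credentials token request."""
    # The client authenticates itself with HTTP Basic authentication.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # The grant type identifies this as the two-legged flow.
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return Request(token_url, data=body.encode("ascii"), headers=headers, method="POST")


# Placeholder values for illustration only.
request = build_token_request(
    "https://api.example.org/token",
    "my-client-id",
    "my-client-secret",
    "submission:read",
)
```

The three-legged flow differs in that the user is redirected to approve the client first, so the resulting token acts on behalf of that user rather than the client alone.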

Reimplement and/or upgrade OAI-PMH endpoint - High

arXiv is recognized as an early and exemplary adopter of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and our OAI-PMH endpoint is relied upon by partners and other API users. It is a high priority to bring this functionality forward as we move away from the underlying data sources upon which it relies. Our goal is to preserve critical integrations with our partners.

Specific deliverables

  1. We will reimplement the OAI-PMH endpoint as a standalone service available via the API gateway. We will consider upgrading this endpoint to the ResourceSync specification (ANSI/NISO Z39.99-2017), with input from partners and stakeholders.
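
For context on what partners rely on, an OAI-PMH harvest is a series of plain HTTP requests whose query string carries a protocol verb and its arguments. The sketch below builds `ListRecords` request URLs. The base URL and set name are placeholders rather than the actual arXiv endpoint, and the helper name is hypothetical; the verb and argument names come from the OAI-PMH specification.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     set_spec: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params: Dict[str, str]
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: when continuing a
        # harvest, it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and set name, for illustration only.
BASE_URL = "https://oai.example.org/oai2"
first_page = list_records_url(BASE_URL, from_date="2019-01-01", set_spec="physics")
next_page = list_records_url(BASE_URL, resumption_token="token123")
```

A harvester repeats the second form, using the token returned in each response, until the server returns an empty token.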

API client registry - Moderate

As part of our push to promote innovative development around our APIs, and based on input from our API users, we plan to develop an API client registry that provides a simple directory of API-based projects.