Overview & quick start guide

Goals and strategy

What are we doing?

We are incrementally reimplementing components of the legacy arXiv software as smaller services and other applications that can be deployed in the cloud.

Why?

The legacy system has evolved haphazardly over two decades into a sprawling monolith. The code and underlying data architecture have poor separation of concerns and few tests, and the system is exceptionally difficult to deploy and scale. The cost of developing the legacy system further far outweighs the cost of developing new systems that will be easier to maintain and extend going forward.

What’s the plan?

We have identified components of the system that will be carved off and replaced with new services. We are starting from the “ends” of the arXiv system (the public-facing site on one end, and the submission system on the other) and are working inward. The last piece will be to replace the components that run the daily e-print announcement process.

There are two critical paths:

  1. Submission + moderation:

  2. Public site:

    • Reimplement search using cloud-based technology.

    • Reimplement “browse” (abstract page, home page, listing pages).

    • Implement NG canonical record as a mirror of the legacy system, so that we can run the public site entirely in the cloud.

In addition, there are a host of other APIs, pages, and components that we are chipping away at opportunistically as we go along. This includes things like:

  • RSS feeds

  • OAI-PMH endpoint

  • External links and harvesters (e.g. DOI, adding support for code, datasets)

An extended description of the project can be found at https://arxiv.github.io/arxiv-arxitecture.

What are you doing with your APIs?

We are working toward a new arXiv API Gateway where we can expose NG services for external use. This includes things like:

  • Modernized search API, backed by Elasticsearch;

  • Compilation service, to compile TeX source using arXiv’s system;

  • Submission API, to facilitate external interfaces for submitting to arXiv;

  • Plain text API, to allow researchers and others to quickly obtain extracted plain text content from arXiv e-prints;

  • etc.

How is the project organized/managed?

The arXiv team uses Atlassian Jira internally for project management. We try to make as much as possible of that system publicly visible, to make it easier to communicate with stakeholders about our work. The arXiv Jira site is at https://arxiv-org.atlassian.net. Jira gives us a nice suite of tools for organizing tickets, communicating priorities, and estimating workload and timelines.

As more external developers and stakeholders get involved, we are increasingly using GitHub issues and projects to track our work. You can find projects where we are looking for external contributions at https://github.com/orgs/arXiv/projects. We use webhooks to keep Jira up to date with GitHub, so that we aren’t spending a lot of time manually updating tickets in two places.

Martin Lessmeister (IT Team Lead) and Erick Peirson (Lead System Architect) are jointly responsible for planning, prioritization, and management of the project.

Where can I find the source code?

We have made as much as possible of the arXiv codebase publicly available on GitHub, at https://github.com/orgs/arXiv. All arXiv-NG software is provided under the MIT License, so you can use it for pretty much anything.

Contributing to arXiv NG

How can I contribute to software development?

Contributions from external developers are greatly appreciated. Things to be sure and read:

Reach out to Erick (brp53@cornell.edu) and he’ll get you wired in.

How can I contribute to testing, reporting bugs, and requesting features?

Please reach out to Erick (brp53@cornell.edu) to express your interest.

You can always raise issues on any of our GitHub repositories.

Getting the work done

What kinds of software are we actually building?

In general, we are building two types of applications:

  1. Services. Python+Flask apps that provide APIs or user interfaces.

  2. Agents. Python apps that read from event streams and do work.

Services

Most components are being replaced with stand-alone web services implemented in Python 3.6 using the Flask microframework.

This includes:

  • Backend services that provide RESTful JSON APIs;

  • User-facing applications that serve up traditional HTML+CSS webpages.
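As a rough illustration of the backend-service case, a minimal sketch of a Flask app that exposes a small JSON API might look like the following. The service name, the routes, the response shapes, and the in-memory data are all hypothetical; they are assumptions made for the example and are not taken from any arXiv NG repository.

```python
from flask import Flask, jsonify


def create_app() -> Flask:
    """Application factory for a hypothetical backend service."""
    app = Flask("example_service")

    # Hypothetical in-memory data, standing in for whatever datastore
    # a real service would own.
    eprints = {"1234.56789": {"title": "An example e-print"}}

    @app.route("/status", methods=["GET"])
    def status():
        # Simple health-check endpoint.
        return jsonify({"status": "ok"})

    @app.route("/eprints/<string:paper_id>", methods=["GET"])
    def get_eprint(paper_id):
        # Look up a record and return it as JSON, or a 404 if unknown.
        record = eprints.get(paper_id)
        if record is None:
            return jsonify({"reason": "not found"}), 404
        return jsonify(record), 200

    return app
```

Using an application factory (rather than a module-level app object) makes it straightforward to build a fresh app for each test, for example via `create_app().test_client()`.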

Services may also include worker applications to do asynchronous tasks, which we are implementing using Celery (http://celeryproject.org).

To read more about how services get put together, see Internal architecture & Flask implementation.

Agents

To coordinate system-wide processes, we are using data streams (at the moment, AWS Kinesis). Agents are applications that read event information from these streams and use that information to do work.

For example, we have an agent that listens for updates to e-print metadata, and keeps the search index up to date.
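The shape of such an agent can be sketched as follows. This is an illustration under stated assumptions, not the actual arXiv agent: the class name, the event payload fields (`paper_id`, `metadata`), and the dictionary standing in for the search index are hypothetical. Records are modeled as dictionaries with `SequenceNumber` and `Data` keys, as in Kinesis records, but here they come from a plain Python iterable rather than from a live stream.

```python
import json
from typing import Dict, Iterable, Optional


class SearchIndexAgent:
    """Hypothetical agent that applies metadata-update events to an index."""

    def __init__(self) -> None:
        # Stand-in for a real search index (assumption for this sketch).
        self.index: Dict[str, dict] = {}
        # Sequence number of the last record processed, so that a
        # restarted agent could resume from where it left off.
        self.checkpoint: Optional[str] = None

    def handle(self, record: dict) -> None:
        """Decode one stream record and update the index accordingly."""
        event = json.loads(record["Data"].decode("utf-8"))
        self.index[event["paper_id"]] = event["metadata"]
        self.checkpoint = record["SequenceNumber"]

    def run(self, records: Iterable[dict]) -> None:
        """Process records in the order in which they arrive."""
        for record in records:
            self.handle(record)


def make_record(sequence_number: str, paper_id: str, title: str) -> dict:
    """Build a record with the (hypothetical) payload used in this sketch."""
    payload = {"paper_id": paper_id, "metadata": {"title": title}}
    return {
        "SequenceNumber": sequence_number,
        "Data": json.dumps(payload).encode("utf-8"),
    }
```

Feeding the agent two updates for the same e-print, for example `agent.run([make_record("1", "1234.56789", "First title"), make_record("2", "1234.56789", "Revised title")])`, leaves the later metadata in the index and the checkpoint at the last sequence number processed.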

What happens to the database?

The legacy system is built around one giant MySQL database and a big shared file system. As we replace components with new services, we are also carving off parts of the database along with them.

We are using a range of storage solutions for state, including:

  • MySQL/MariaDB

  • Redis

  • AWS S3

  • EFS

While there are exceptions, generally each service is responsible for its own state. This helps us to achieve better separation of concerns, and gives us more flexibility in implementing backup and recovery strategies.
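One way to picture this in code is a service that reaches its state only through a storage interface that it owns, so the backend can be swapped or backed up without touching other services. The sketch below is illustrative only: `ContentStore`, `InMemoryStore`, and `PlainTextService` are hypothetical names, and the in-memory backend stands in for whichever of the storage options listed above a real service would use.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ContentStore(ABC):
    """Hypothetical storage interface owned by a single service."""

    @abstractmethod
    def save(self, key: str, content: bytes) -> None:
        """Persist content under the given key."""

    @abstractmethod
    def load(self, key: str) -> Optional[bytes]:
        """Return the content stored under key, or None if absent."""


class InMemoryStore(ContentStore):
    """Stand-in backend; a real one might wrap a database or object store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def save(self, key: str, content: bytes) -> None:
        self._data[key] = content

    def load(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class PlainTextService:
    """Hypothetical service; nothing else touches its store directly."""

    def __init__(self, store: ContentStore) -> None:
        self._store = store

    def deposit(self, paper_id: str, text: str) -> None:
        self._store.save(paper_id, text.encode("utf-8"))

    def retrieve(self, paper_id: str) -> Optional[str]:
        content = self._store.load(paper_id)
        return None if content is None else content.decode("utf-8")
```

Because other components would go through the service rather than its store, the storage backend remains an internal detail of that one service.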