Objectives & Requirements

This chapter briefly elaborates on the Guiding Principles enumerated in the introduction, in light of the discussion of the classic arXiv system.

Modernization & best practices

Objective 1

Modernization. The selection of technologies should take advantage of contemporary developments in web application architecture, standards, and best practices, while minimizing lock-in with corporate vendors and proprietary technology.

This objective is related to several connected issues:

Programming languages

Perl (the primary language in which the arXiv classic system is implemented) has been on the losing end of cultural shifts in the web development world; other languages, such as Python and Ruby, have become vastly more popular over the past decade. As a result, not only is it more difficult to attract talented developers, but there is also far less tooling and support for Perl as technology advances. Adopting more mainstream programming language(s) positions arXiv to benefit from those advances.

Infrastructure

arXiv is not well positioned to respond to trends around web infrastructure. The declining cost of cloud-based infrastructure, and rapid advances in the tooling and technology around it, are prompting organizations to migrate away from on-premises server farms in favor of off-site cloud platforms such as Google Cloud, Microsoft Azure, and Amazon Web Services. Although migrating to a cloud-based infrastructure is not a core requirement, arXiv must be prepared for the possibility that on-premises infrastructure becomes unavailable or increasingly expensive. Moving in the direction of cloud infrastructure also positions arXiv to take advantage of tools and technology that would be too expensive to support on-premises, and that reduce the development team’s dependence on external support.

Agility

The preceding issues are part of a broader question about how the existing architecture and development practices affect the overall agility of the arXiv project. Increasing the team’s ability to perform new development, and reducing the complexity of maintaining the system, are paramount. Both can be significantly advanced by taking advantage of improvements in technology and standards that have occurred over the past decade. See Evolvability & agility, below.

Lock-in

The overarching requirement to maintain the stability of arXiv in the long term, combined with the relatively small size of the arXiv development team, accentuates the risks of platform and technology lock-in. Although the architecture and implementation of arXiv must evolve over time, adopting a product that is unlikely to exist or be supported in several years is risky and should be weighed against the ability to change course should that occur. Depending too heavily on proprietary technology can sometimes increase the risk that tools will disappear, e.g. if the company ceases to exist or is purchased by a competitor.

The right mix of technologies will allow the development team to deliver value to arXiv users more efficiently, while avoiding choices that limit our options in the future.

Evolvability & agility

Objective 2

Facilitate both short-term and long-term evolvability. The architecture should enable shorter release cycles and faster turn-around on feature development and bug-fixes. Wherever possible, the arXiv architecture and its implementation should minimize the cost of discarding or reimplementing components or subsystems.

In support of the core arXiv mission, our primary concern is providing value to users as efficiently and effectively as possible. A major limitation of the classic arXiv system has been the complexity of implementing significant new features in the existing architecture. Reducing turnaround time on feature development and bug-fixes is therefore a major requirement for the classic renewal process.

Similarly, it must be possible to maintain the arXiv system at the current or similar funding levels (not including short-term funding for NG). Developer labor is the primary driver of development costs in the classic arXiv system, and this is likely to remain the case for the foreseeable future.

Removing the constraints imposed by the classic “monolithic” architecture will significantly advance this objective. Achieving this goal also entails continuous improvement of development processes, which is already a core practice.

Specific requirements:

  • It must be possible to make internal changes to a subsystem/service with minimal risk of creating regressions or introducing bugs in other systems.

  • It must be possible to develop, test, and release a subsystem/service without coordinating multiple code-bases across separate repositories.

  • The tools and processes used to run a subsystem/service on developer machines, development VMs, Continuous Integration platforms, and in staging and production environments should be as uniform as possible (i.e. minimal variation in setup/configuration).

  • Codebases must be easier to navigate and reason about.

  • API documentation must be more comprehensive.

Data architecture: integrity and flexibility

Objective 3

Support an extensible data architecture that provides both a high level of stability for core scholarly objects as well as the flexibility to keep pace with best practices in scholarly publishing, library science, and data curation.

From an external perspective, arXiv addresses several different data-related concerns that sometimes entail divergent requirements.

arXiv provides (implicitly or explicitly) some strong guarantees about the long-term stability and integrity of scholarly objects and attendant metadata. Users expect that, once announced, papers will continue to be available indefinitely in the form in which they were published. Users also expect that arXiv will preserve and make available some details of the submission history and versions, which may be significant for priority concerns or to commemorate corrections and other revisions. For those data:

  • Long-term stability and integrity are paramount.

  • Re-creating those data would be costly or impossible.

  • The structure of those data is expected to change infrequently, if at all.

  • The vast majority of data-manipulation operations are to add new data.

  • When changes are made to those data (e.g. a revision is published), it is expected that those changes are commemorated as versions.
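The properties listed above (mostly additive writes, with revisions kept as versions) can be illustrated with a minimal append-only store. This is only a sketch under stated assumptions: the class names, the fields, and the identifier "1234.56789" are hypothetical and do not describe the classic or NG implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Version:
    """One immutable version of an e-print's metadata (hypothetical fields)."""
    number: int
    title: str
    created: datetime


class VersionedStore:
    """Append-only store: versions can be added, never changed or removed."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Version]] = {}

    def add_version(self, paper_id: str, title: str) -> Version:
        """Record a revision as a new version; earlier versions are untouched."""
        history = self._versions.setdefault(paper_id, [])
        version = Version(len(history) + 1, title, datetime.now(timezone.utc))
        history.append(version)
        return version

    def history(self, paper_id: str) -> List[Version]:
        """Return a copy so callers cannot alter the stored history."""
        return list(self._versions.get(paper_id, []))

    def latest(self, paper_id: str) -> Version:
        """Return the most recent version of a paper."""
        return self._versions[paper_id][-1]


# Usage: a revision adds a version instead of overwriting the original.
store = VersionedStore()
store.add_version("1234.56789", "A first draft")
store.add_version("1234.56789", "A corrected draft")
```

The store exposes no update or delete operation, so the only way to change a record is to add a version, which matches the expectation that most operations on core data are additions.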

Some areas of the system, e.g. the submission and moderation system, have stronger requirements about recording chronologies of events. For example, the classic arXiv system displays information to moderators and administrators about actions that have been taken on individual submissions.

arXiv also enhances author and reader experience by supplementing arXiv metadata from external sources (e.g. associating external refereed papers with arXiv pre-prints), transforming core data into other forms (e.g. extracting and displaying cited references), and providing mechanisms for discovering arXiv content (e.g. a search index). While not trivial, the relative value of those data (in terms of the cost of reproducing them) is considerably lower than that of the core data described above. For example, it is not a crisis if a record is mysteriously dropped from the search index; re-indexing that paper is relatively simple. Similarly, references can be re-extracted, and external metadata can be retrieved a second time, if the local version of those data is lost. Moreover, those data will often be refreshed intentionally: we may make improvements to how we index papers for search, we may get better at extracting references, etc. In those cases:

  • Long-term stability and integrity are not expected.

  • Re-creating those data is considerably easier and less expensive.

  • The structure of the data is expected to change as arXiv evolves.

  • The data are more likely to be updated/modified.

  • When those data are created, deleted, or updated there is no obvious need to commemorate those changes.
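By contrast, derived data of this kind can be treated as disposable, because it can be regenerated from the core records. The sketch below uses a toy inverted index to make the point; the record IDs, titles, and the build_index function are invented for illustration and are not part of any arXiv system.

```python
from typing import Dict, Iterable, Set, Tuple


def build_index(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a toy inverted index (word -> paper IDs) from (id, title) pairs."""
    index: Dict[str, Set[str]] = {}
    for paper_id, title in records:
        for word in title.lower().split():
            index.setdefault(word, set()).add(paper_id)
    return index


# Hypothetical core records; in practice these would come from the core store.
core_records = [
    ("0001", "Quantum error correction"),
    ("0002", "Error bounds for sparse regression"),
]

index = build_index(core_records)
index.clear()                      # simulate losing the derived data
index = build_index(core_records)  # recovery is simply a rebuild
```

Because the index is a pure function of the core records, changing how papers are indexed amounts to changing build_index and rebuilding, with no history to preserve.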

In addition, SLA requirements and access patterns may vary widely across the system.

The classic arXiv system relies on a single relational database (MySQL server, with replication) to store and query data. In the reader-facing browse and abstract views of the web application, responsibility for queries is offloaded to a flat-file metadata record (the “abs file”). The relational database performs adequately. Over time, however, the schema has become inordinately complicated, with indications of poor normalization in some areas.

Integrating human effort and automated processes

Objective 4

Support complex moderation and administration workflows that integrate human and computational effort. This should increase the efficiency of the moderation system, facilitate long-term scalability, and increase the degree of control that administrators can exercise over the entire system.

The classic arXiv system combines synchronous behavior (in response to user requests) and asynchronous behavior (scheduled tasks). As one arXiv admin put it, “If we all walked away tomorrow, the system would continue to accept and publish new papers on time.” The NG architecture must continue to support both modalities.

arXiv relies on a committed group of volunteer moderators to screen submissions on a daily basis, with support from the admin team. While human moderators will continue to play a crucial role for the foreseeable future, the extent to which some human activities may be replaced or enhanced through better automation is an area of interest related to scalability. The classic arXiv system already incorporates some automation in the moderation process, e.g. the document similarity system and the document classifier.

It should also be noted that automation in arXiv (e.g. classifying submissions) is an active area of research.

With those considerations in mind, the arXiv architecture should provide a framework for incorporating a growing number of automated processes into the moderation process. Building on recent work in this area (e.g. the concept of classification proposals), this system should support configurable mechanisms for integrating the results of those automated processes with human judgment. It should also provide mechanisms to audit both automated and human activities in submission and moderation workflows.
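One way to picture such a framework is a queue in which automated processes and moderators both file proposals, a configurable policy decides which automated proposals need human review, and every action is written to an audit log. The sketch below is an assumption-laden illustration: the Proposal and ModerationQueue classes, the confidence threshold policy, and the example names are all hypothetical and are not drawn from the existing classification-proposal work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Proposal:
    """A proposed classification for a submission (hypothetical model)."""
    submission_id: str
    category: str
    proposed_by: str                    # e.g. "classifier" or a moderator name
    confidence: Optional[float] = None  # only set for automated proposals
    status: str = "pending"


@dataclass
class ModerationQueue:
    """Combines automated proposals with human decisions and audits both."""
    auto_accept_threshold: float = 0.95  # configurable policy (an assumption)
    audit_log: List[str] = field(default_factory=list)

    def propose(self, proposal: Proposal) -> Proposal:
        """File a proposal; accept it automatically only above the threshold."""
        self.audit_log.append(
            f"{proposal.proposed_by} proposed {proposal.category} "
            f"for {proposal.submission_id}"
        )
        if (proposal.confidence is not None
                and proposal.confidence >= self.auto_accept_threshold):
            proposal.status = "accepted"
            self.audit_log.append(
                f"policy accepted {proposal.category} "
                f"for {proposal.submission_id}"
            )
        return proposal

    def decide(self, proposal: Proposal, moderator: str, accept: bool) -> None:
        """Record a human decision on a proposal that is still pending."""
        proposal.status = "accepted" if accept else "rejected"
        self.audit_log.append(
            f"{moderator} {proposal.status} {proposal.category} "
            f"for {proposal.submission_id}"
        )


# Usage: one confident automated proposal, and one left for a human to resolve.
queue = ModerationQueue(auto_accept_threshold=0.9)
confident = queue.propose(Proposal("s1", "hep-th", "classifier", 0.97))
uncertain = queue.propose(Proposal("s1", "math.CO", "classifier", 0.40))
queue.decide(uncertain, moderator="mod1", accept=False)
```

Keeping the threshold as configuration rather than code lets administrators adjust how much is delegated to automation, and the shared audit log covers automated and human actions in the same place.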

API consumer innovation

Objective 5

Promote and support innovative uses of arXiv content and metadata by external researchers and developers by providing rich, modern REST APIs and thorough documentation, and by engaging in ongoing dialogue with API consumers.

There already exists a large ecosystem of researchers and web developers who are generating online tools based on arXiv content. This includes recommender systems, search tools, text mining, social media “alt metrics”, etc. This relieves pressure to add trendy bells and whistles to arXiv itself. At the same time, the popularity of arXiv motivates external groups (e.g. journals, authoring platforms) to collaborate on integrations that add value for users.

In the classic arXiv system, programmatic access is distributed across several APIs, including:

  • RSS/Atom endpoint

  • OAI-PMH endpoint

  • “arXiv API” (XML-based)

  • PDFs available in S3 as (inconsistently packed) archives

  • SWORD endpoint (submission)

We want to make working with arXiv content on a programmatic basis easy and enjoyable. In a survey of over 100 external API consumers, there was strong enthusiasm for:

  • JSON-based APIs.

  • RESTful behavior.

  • A single point of access for arXiv APIs.

  • Better support for querying and filtering arXiv records.

  • Access to additional content, such as full text and cited references.

  • Better access to available content (e.g. PDFs currently provided via S3).

  • Better documentation, with more and varied examples.

  • Submission mechanisms.

  • Improvements to the metadata that we expose (e.g. author identities through ORCIDs).

The arXiv NG architecture should provide tools that allow us to better understand who is using arXiv content programmatically.

Finally, we recognize the importance to developers of their projects being visible and available to end users. The NG platform should help end users discover developer platforms and applications.

Integrations with partners

Objective 6

Provide mechanisms for rich integrations with partner services in the scholarly publishing domain, e.g. through coordinated submission workflows, two-way linkages between e-prints and other scholarly objects (e.g. announcements, multimedia, code, data), and other related products that provide value for scholars who publish in arXiv.

A variety of partners (including scholarly societies, publishers, and content authoring platforms) are eager to collaborate further on integrations that support and enhance author experience while advancing arXiv’s core mission. The classic renewal process should advance those collaborations as much as possible.

This objective relates to: