arxiv.canonical.classic.backfill module
Functions for backfilling the NG record from classic.
In order to ensure a smooth transition from classic to the NG announcement process, we need to be able to initially operate both the classic and NG canonical records in parallel. This means that we need to be able to:
- Backfill the canonical record from the classic record, starting at the beginning of time and running up to the present. See backfill().
- Continuously update the canonical record from data in the classic system. See backfill_today().
This module is implemented on the assumption that its functions will be executed on a machine with access to the classic filesystem, specifically to the abs/source files and the daily.log file. It is agnostic, however, about the target storage medium for the canonical record, so these functions can be used to backfill the canonical record both on local filesystems and in (for example) an S3 bucket.
What version is this?
What the classic record lacks is an unambiguous mapping between announcement events and specific versions of an e-print. For example, if we encounter a replacement event in the daily.log file, there is no explicit indication of whether the resulting version is 2, 3, or some higher value. The abs file does not provide this information either, as only the submission date of each version is preserved (although this could at least be used as a lower bound). So, we need to get creative:
- Start at the beginning of time. Initialize a counter that keeps track of the last version number seen for each e-print identifier.
- Prior to the start of the daily.log (mid-1998): Read the abs file for each e-print, and generate a new and subsequent replace event(s) using the submission date(s) as the announcement date(s).
- Read daily.log in order. Rely on the version number mapping to keep track of where we are with each e-print.
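The counting step above can be sketched as follows. This is only an illustration under stated assumptions: the `(event_type, identifier)` tuples, the `assign_versions` helper, and the `"cross"` event type are hypothetical stand-ins, not the data structures or names used by this module.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape for illustration: (event_type, identifier).
# The real module works with richer event objects.
RawEvent = Tuple[str, str]


def assign_versions(events: Iterable[RawEvent]) -> Iterator[Tuple[str, str, int]]:
    """Attach a version number to each event by counting per identifier."""
    last_seen: Dict[str, int] = defaultdict(int)  # identifier -> last version seen
    for event_type, identifier in events:
        if event_type == "new":
            last_seen[identifier] = 1
        elif event_type == "replace":
            # A replacement always produces the next version.
            last_seen[identifier] += 1
        # Any other event type (assumed here) refers to the current version.
        yield event_type, identifier, last_seen[identifier]


# Events must be fed in chronological order for the counter to be meaningful.
events = [
    ("new", "2001.00001"),
    ("new", "2001.00002"),
    ("replace", "2001.00001"),
    ("cross", "2001.00001"),
    ("replace", "2001.00001"),
]
versioned = list(assign_versions(events))
```

Because the counter depends entirely on event order, the sketch only works if events are processed chronologically, which is why the steps above start at the beginning of time and read daily.log in order.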
- arxiv.canonical.classic.backfill.backfill(register, daily_path, abs_path, ps_cache_path, state_path, limit_to=None, cache_path=None, until=None)
Lazily backfill the canonical record from the classic record.
Note: you must consume this iterator in order for backfilling to occur. This was implemented lazily because there is considerable I/O (including possibly some over the network), and being able to control processing rate at a high level was foreseen as important.
- Parameters
register (IRegisterAPI) – A canonical register instance that will handle events derived from the classic record.
daily_path (str) – Absolute path to the daily.log file.
abs_path (str) – Absolute path of the directory containing abs files and source packages. Specifically, this is the directory that contains the ftp and orig subdirectories.
state_path (str) – Absolute path of a writeable directory where backfill state can be stored. This allows us to persist the backfill state, in case we need to restart after a failure.
limit_to (set) – A set of Identifier objects indicating a subset of e-prints to backfill. If None (default), all events for all e-prints are backfilled.
cache_path (str) – If provided, a writable directory where a cache of events can be maintained. This cuts down on spin-up time considerably.
- Returns
Yields Event instances that have been successfully backfilled.
- Return type
iterator
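Since the iterator is lazy, the caller decides how fast backfilling proceeds. A minimal sketch of one way to throttle consumption is shown below. It does not call backfill() itself: `stand_in_backfill` is a hypothetical generator standing in for it, and `in_batches` is an illustrative helper, not part of this module.

```python
import time
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(events: Iterable[T], size: int, pause: float = 0.0) -> Iterator[List[T]]:
    """Pull events from a lazy iterator ``size`` at a time, pausing between batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))  # work happens only while pulling
        if not batch:
            return
        yield batch
        time.sleep(pause)  # throttle before requesting more events


def stand_in_backfill(n: int) -> Iterator[str]:
    """Hypothetical stand-in for backfill(); yields placeholder event labels."""
    for i in range(n):
        yield f"event-{i}"


batches = list(in_batches(stand_in_backfill(5), size=2))
```

With a real register, the same pattern would wrap the iterator returned by backfill(); the batch size and pause are tuning knobs chosen by the caller.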
- arxiv.canonical.classic.backfill.backfill_today(register, daily_path, abs_path, ps_cache_path, state_path, cache_path=None)
Lazily backfill the canonical record from today’s events in the classic record.
This is intended to be used to keep the canonical record up to date from the classic record on a daily basis, after the initial backfill.
Note: you must consume this iterator in order for backfilling to occur. This was implemented lazily because there is considerable I/O (including possibly some over the network), and being able to control processing rate at a high level was foreseen as important.
- Parameters
register (IRegisterAPI) – A canonical register instance that will handle events derived from the classic record.
daily_path (str) – Absolute path to the daily.log file.
abs_path (str) – Absolute path of the directory containing abs files and source packages. Specifically, this is the directory that contains the ftp and orig subdirectories.
state_path (str) – Absolute path of a writeable directory where backfill state can be stored. This allows us to persist the backfill state, in case we need to restart after a failure.
cache_path (str) – If provided, a writable directory where a cache of events can be maintained. This cuts down on spin-up time considerably.
- Returns
Yields Event instances that have been successfully backfilled.
- Return type
iterator
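For a daily run there is usually no need to inspect each event, but the iterator still has to be exhausted for anything to happen. The sketch below illustrates that point with a hypothetical `stand_in_backfill_today` generator (not this module's function) whose side effect is appending to a list, in place of writing to a register.

```python
from collections import deque
from typing import Iterator, List

handled: List[str] = []  # records the side effects of the stand-in


def stand_in_backfill_today() -> Iterator[str]:
    """Hypothetical stand-in for backfill_today(); work happens only when iterated."""
    for label in ("new", "replace", "cross"):
        handled.append(label)  # side effect, analogous to writing to the register
        yield label


pending = stand_in_backfill_today()  # creating the iterator does no work
handled_before = list(handled)       # snapshot taken before consumption
deque(pending, maxlen=0)             # exhaust the iterator, discarding the events
handled_after = list(handled)        # snapshot taken after consumption
```

A `deque` with `maxlen=0` is a common idiom for draining an iterator without keeping its items; a plain `for` loop works equally well when each event should be logged or counted.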