arxiv.canonical.classic.daily module

Parser for the daily.log file.

The main goal of this implementation is parsing the log file for the purpose of transforming it into the arXiv Canonical format. Specifically, we want to use this legacy data structure to generate Event data that can be serialized in the daily listing files.

From the original arXiv::Updates::DailyLog:

Module to provide information about updates to the archive
over specified periods. This should be the only section
of code that reads the daily.log file.

 Simeon Warner - 6Jan2000...
 25Jan2000 - modified so that undef $startdate or $enddate select
   the beginning or end of time respectively.
 25Jan2000 - modified so that by simply removing the `-' from
   an ISO8601 date we get YYYYMMDD from YYYY-MM-DD
 16Oct2000 - to allow easy resumption in the OAI1 interface and
   because it seems that it might be useful in other contexts the
   number limited behaviour has been changed. query_daily_log() and
   hence all other routines now stop at the end of a day and
   return that day (in the form YYYY-MM-DD) as the value
   if limited, undef otherwise.

Thoughts: If this is to be used on the mirror sites then we will need
to mirror the daily log. This probably means that that file
should be split up.

 [CVS: $Id: DailyLog.pm,v 1.6 2010/03/23 03:53:09 arxiv Exp $]

class arxiv.canonical.classic.daily.DailyLogParser[source]

Bases: object

Parses the daily log file.

parse(path, for_date=None)[source]

Parse the daily log file.

Parameters

path (str) – Path to the daily log file.

Returns

Each item is an EventData from the log file.

Return type

Iterable[EventData]

parse_line(raw)[source]

Parse a single line from the daily log file.

Parameters

raw (str) – A single line.

Returns

Yields EventData instances from the line.

Return type

Iterable[EventData]

class arxiv.canonical.classic.daily.EventData[source]

Bases: tuple

Data about events that can be extracted from the daily log.

property arxiv_id

Alias for field number 0

property categories

Alias for field number 4

property event_date

Alias for field number 1

property event_type

Alias for field number 2

property version

Alias for field number 3
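The field numbers above fix the positional order of the tuple. As a minimal sketch of that layout, the stand-in below uses `typing.NamedTuple`; the class name `EventDataSketch`, the field types, and the sample values are all assumptions made for illustration and are not taken from the module.

```python
from datetime import date
from typing import NamedTuple, Tuple


class EventDataSketch(NamedTuple):
    """Illustrative stand-in for EventData; the field types are assumptions."""

    arxiv_id: str                # field number 0
    event_date: date             # field number 1
    event_type: str              # field number 2
    version: int                 # field number 3
    categories: Tuple[str, ...]  # field number 4


# Made-up values, used only to show named and positional access.
event = EventDataSketch(
    arxiv_id="quant-ph/9902016",
    event_date=date(1999, 2, 8),
    event_type="new",
    version=1,
    categories=("quant-ph",),
)

# Unpacking follows the field numbers listed in the documentation.
arxiv_id, event_date, event_type, version, categories = event
```

Because the class is a tuple, `event[0]` and `event.arxiv_id` refer to the same value.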

arxiv.canonical.classic.daily.IDENTIFIER_RANGE = re.compile('^(?P<start_id>\\d{7})\\-(?P<end_id>\\d{7})$')

The old-style format supported ranges of identifiers, written as two seven-digit identifiers joined by a hyphen, e.g. 9901234-9901238.
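A minimal sketch of how such a range might be expanded is shown below. The pattern is copied from the documentation; the helper `expand_range` is hypothetical, and treating the range as inclusive of both ends is an assumption rather than documented behavior.

```python
import re
from typing import List

IDENTIFIER_RANGE = re.compile(r'^(?P<start_id>\d{7})\-(?P<end_id>\d{7})$')


def expand_range(token: str) -> List[str]:
    """Expand 'NNNNNNN-MMMMMMM' into each identifier (both ends included, by assumption)."""
    match = IDENTIFIER_RANGE.match(token)
    if match is None:
        return [token]  # Not a range; hand the token back unchanged.
    start = int(match.group("start_id"))
    end = int(match.group("end_id"))
    # zfill restores leading zeros lost in the int conversion.
    return [str(number).zfill(7) for number in range(start, end + 1)]


expanded = expand_range("9902001-9902003")
```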

arxiv.canonical.classic.daily.LINE = re.compile('^(?P<event_date>\\d{6})\\|(?P<archive>[a-z-]+)\\|(?P<data>.*)$')

Each line in the log file begins with a date stamp and an archive.
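As a rough illustration, the pattern can be applied to a line to split off the date stamp and archive. The sample line below is hypothetical, and the text after the second pipe is a placeholder, since the layout of the data portion is not described here.

```python
import re

LINE = re.compile(r'^(?P<event_date>\d{6})\|(?P<archive>[a-z-]+)\|(?P<data>.*)$')

# Hypothetical line; the content after the second pipe is a placeholder.
raw = "990208|quant-ph|PLACEHOLDER DATA"

match = LINE.match(raw)
# groupdict() returns the three named groups as a plain dict of strings.
fields = match.groupdict() if match else None
```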

class arxiv.canonical.classic.daily.LineParser[source]

Bases: object

Shared behavior among newstyle and oldstyle line parsing.

parse(e_date, archive, data)[source]

Parse data from a daily log file line.

Return type

Iterable[EventData]

parse_cross(archive, fragment)[source]

Parse entries for cross-list e-prints.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_new(archive, fragment)[source]

Parse entries for new e-prints.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_replace(archive, fragment)[source]

Parse entries for replacements.

Return type

Iterable[Tuple[Identifier, EventType, str]]

arxiv.canonical.classic.daily.NEW_STYLE_CUTOVER_AFTER = datetime.date(2007, 4, 2)

Date after which the new-style format was adopted.
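The sketch below shows one way a date stamp could be compared against this cutover to choose between the two line formats. The helper `style_for` is hypothetical, and reading the six-digit stamp as YYMMDD is an assumption; the module's actual dispatch logic may differ.

```python
from datetime import date, datetime

NEW_STYLE_CUTOVER_AFTER = date(2007, 4, 2)


def style_for(event_date: str) -> str:
    """Return 'new' or 'old' for a six-digit date stamp (assumed to be YYMMDD)."""
    parsed = datetime.strptime(event_date, "%y%m%d").date()
    # "After" is read strictly: the cutover date itself is still old-style.
    return "new" if parsed > NEW_STYLE_CUTOVER_AFTER else "old"
```

Under these assumptions, `style_for("070402")` yields `"old"` and `style_for("070403")` yields `"new"`.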

class arxiv.canonical.classic.daily.NewStyleLineParser[source]

Bases: arxiv.canonical.classic.daily.LineParser

Parses new-style daily log lines.

Starting after 2007-04-02 (NEW_STYLE_CUTOVER_AFTER), the format changed to put all announcement-related events on a given day on the same line. The three original sections of the line are preserved, but within each section are entries for e-prints from all archives.

parse_cross(archive, fragment)[source]

Parse entries for cross-lists.

Parameters
  • archive (str) – Literally just "arxiv"; this is a dummy place-holder, since new-style lines contain entries for all archives for which announcements occurred on a particular day.

  • fragment (str) – Section of the line containing cross-list entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_new(archive, fragment)[source]

Parse entries for new e-prints.

Parameters
  • archive (str) – Literally just "arxiv"; this is a dummy place-holder, since new-style lines contain entries for all archives for which announcements occurred on a particular day.

  • fragment (str) – Section of the line containing new e-print entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_replace(archive, fragment)[source]

Parse entries for replaced e-prints.

Parameters
  • archive (str) – Literally just "arxiv"; this is a dummy place-holder, since new-style lines contain entries for all archives for which announcements occurred on a particular day.

  • fragment (str) – Section of the line containing replacement entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

class arxiv.canonical.classic.daily.OldStyleLineParser[source]

Bases: arxiv.canonical.classic.daily.LineParser

Parses data from old-style log lines.

The original format used a separate line for each archive. The line contained three sections: e-prints newly announced in that archive, e-prints cross-listed to that archive, and e-prints replaced either in that archive or with a new cross-list category in that archive. Thus there may be multiple lines for a given announcement day, one per archive in which announcement activity occurred.

parse_cross(archive, fragment)[source]

Parse entries for cross-list e-prints.

Parameters
  • archive (str) – Archive to which entries on this line apply (to which the e-print has been cross-listed).

  • fragment (str) – Section of the line containing cross-list entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_new(archive, fragment)[source]

Parse entries for new e-prints.

Parameters
  • archive (str) – Archive to which entries on this line apply.

  • fragment (str) – Section of the line containing new e-print entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

parse_replace(archive, fragment)[source]

Parse entries for replacements.

Parameters
  • archive (str) – Archive to which entries on this line apply.

  • fragment (str) – Section of the line containing replacement entries.

Returns

Yields Event instances from this section.

Return type

Iterable[Tuple[Identifier, EventType, str]]

arxiv.canonical.classic.daily.SINGLE_IDENTIFIER = re.compile('^(\\d{7})$')

Numeric part of an old-style arXiv ID.

arxiv.canonical.classic.daily.SQUASHED_IDENTIFIER = re.compile('(?P<archive>[a-z\\-]+)(?P<identifier>\\d{7})')

The old-style format omitted the forward slash in the old identifier.
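A minimal sketch of restoring the slash follows. The pattern is copied from the documentation; the helper `unsquash` is hypothetical, and using `fullmatch` to require that the whole token matches is a choice made for this example, since the pattern itself is unanchored.

```python
import re
from typing import Optional

SQUASHED_IDENTIFIER = re.compile(r'(?P<archive>[a-z\-]+)(?P<identifier>\d{7})')


def unsquash(token: str) -> Optional[str]:
    """Turn e.g. 'quant-ph9902016' into 'quant-ph/9902016', or return None."""
    match = SQUASHED_IDENTIFIER.fullmatch(token)
    if match is None:
        return None
    return f"{match.group('archive')}/{match.group('identifier')}"
```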

arxiv.canonical.classic.daily.WEIRD_INVERTED_ENTRY = re.compile('^(?P<identifier>\\d{7})(?:\\.\\d)?(?P<archive>[a-z\\-]+)(\\.[a-zA-Z\\-]+)?$')

Pattern for a weird edge case not handled in the legacy code.

Here is an example:

quant-ph9902016 9704019.0chao-dyn 9902003.0chao-dyn 9904021.0chao-dyn

quant-ph9902016 is normal, but 9704019.0chao-dyn does not match any pattern in the legacy code. In this particular case (from 1999), we can infer that 9704019 belongs with chao-dyn rather than quant-ph: quant-ph/9704019 was last updated in 1997, whereas chao-dyn/9704019 was last updated in 1999, the year of this entry.

It is not clear what the decimal part is supposed to mean. It does not appear to refer to the e-print version. I also considered the possibility that it is a range of some kind, e.g. 9912003.4solv-int -> solv-int/9912003 and solv-int/9912004, but that entry is in a replacement section and there is only one version of solv-int/9912004.
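Applied to the example entries above, the pattern picks out the inverted tokens and leaves the normal one alone. The sketch below only demonstrates the match; rejoining the pieces as `archive/identifier` is an assumption about how the result might be used, based on the inference described above.

```python
import re

WEIRD_INVERTED_ENTRY = re.compile(
    r'^(?P<identifier>\d{7})(?:\.\d)?(?P<archive>[a-z\-]+)(\.[a-zA-Z\-]+)?$'
)

entries = "quant-ph9902016 9704019.0chao-dyn 9902003.0chao-dyn 9904021.0chao-dyn"

recovered = []
for token in entries.split():
    match = WEIRD_INVERTED_ENTRY.match(token)
    if match:
        # The decimal part is matched by a non-capturing group and discarded.
        recovered.append(f"{match.group('archive')}/{match.group('identifier')}")
```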

arxiv.canonical.classic.daily.parse(path, for_date=None, cache_path=None)[source]

Parse the daily log file.

Parameters

path (str) – Path to the daily log file.

Returns

Each item is an EventData from the log file.

Return type

Iterable[EventData]

arxiv.canonical.classic.daily.scan(path, identifier, cache_path=None)[source]

Return type

Iterable[EventData]