arxiv.canonical.classic.daily module¶
Parser for the daily.log file.
The main goal of this implementation is to parse the log file so that it can
be transformed into the arXiv Canonical format. Specifically, we want to
use this legacy data structure to generate Event data that can be
serialized in the daily listing files.
From the original arXiv::Updates::DailyLog:
Module to provide information about updates to the archive
over specified periods. This should be the only section
of code that reads the daily.log file.
Simeon Warner - 6Jan2000...
25Jan2000 - modified so that undef $startdate or $enddate select
the beginning or end of time respectively.
25Jan2000 - modified so that by simply removing the `-' from
and ISO8601 date we get YYYYMMDD from YYYY-MM-DD
16Oct2000 - to allow easy resumption in the OAI1 interface and
because it seems that it might be useful in other contexts the
number limited behaviour has been changed. query_daily_log() and
hence all other routines now stop at then end of a day and
returns the that day (in the form YYYY-MM-DD) as the value
if limited, undef otherwise.
Thoughts: If this is to be used on the mirror sites then we will need
to mirror the daily log. This probably means that that file
should be split up.
[CVS: $Id: DailyLog.pm,v 1.6 2010/03/23 03:53:09 arxiv Exp $]
-
class
arxiv.canonical.classic.daily.
DailyLogParser
[source]¶ Bases:
object
Parses the daily log file.
-
class
arxiv.canonical.classic.daily.
EventData
[source]¶ Bases:
tuple
Data about events that can be extracted from the daily log.
-
property
arxiv_id
¶ Alias for field number 0
-
property
categories
¶ Alias for field number 4
-
property
event_date
¶ Alias for field number 1
-
property
event_type
¶ Alias for field number 2
-
property
version
¶ Alias for field number 3
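The field numbers listed for EventData imply a fixed tuple order: arxiv_id, event_date, event_type, version, categories. The sketch below mirrors that order in a stand-in class. The name EventDataSketch, the field types, and the sample values are all assumptions made for illustration; they are not taken from the module.

```python
from datetime import date
from typing import NamedTuple, Sequence


class EventDataSketch(NamedTuple):
    """Illustrative stand-in; only the field order comes from the docs."""

    arxiv_id: str               # field 0
    event_date: date            # field 1 (type is an assumption)
    event_type: str             # field 2 (type is an assumption)
    version: int                # field 3 (type is an assumption)
    categories: Sequence[str]   # field 4 (type is an assumption)


# Made-up sample values, used only to show positional and named access.
row = EventDataSketch("chao-dyn/9704019", date(1999, 2, 10), "replace", 2, ["chao-dyn"])

# Each property is an alias for a tuple position.
same_value = row[0] == row.arxiv_id
```

Because the class is a tuple, instances can be unpacked positionally or accessed by name, which is what the "Alias for field number N" entries describe.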
-
arxiv.canonical.classic.daily.
IDENTIFIER_RANGE
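= re.compile('^(?P<start_id>\\d{7})\\-(?P<end_id>\\d{7})$')¶ The old-style format supported ranges of identifiers, e.g. 1234-1238. Note that the pattern requires seven digits on each side of the hyphen, so the four-digit example is schematic.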
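A minimal sketch of how a range token could be expanded into individual identifiers with this pattern. The helper expand_range is hypothetical, and treating the range as inclusive of both endpoints is an assumption. The sample tokens are made up and use seven digits so that they satisfy the pattern.

```python
import re

# Pattern copied from the module documentation.
IDENTIFIER_RANGE = re.compile(r'^(?P<start_id>\d{7})\-(?P<end_id>\d{7})$')


def expand_range(token):
    """Expand 'NNNNNNN-MMMMMMM' into a list of seven-digit identifiers.

    Tokens that are not ranges are returned unchanged in a one-item list.
    Assumption: both endpoints are included.
    """
    match = IDENTIFIER_RANGE.match(token)
    if match is None:
        return [token]
    start = int(match.group('start_id'))
    end = int(match.group('end_id'))
    # Zero-pad so that identifiers with leading zeros keep seven digits.
    return [f"{number:07d}" for number in range(start, end + 1)]


expanded = expand_range("9902001-9902003")
# expanded == ['9902001', '9902002', '9902003']
```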
-
arxiv.canonical.classic.daily.
LINE
= re.compile('^(?P<event_date>\\d{6})\\|(?P<archive>[a-z-]+)\\|(?P<data>.*)$')¶ Each line in the log file begins with a date stamp and an archive.
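As a rough illustration, the pattern can be used to split a raw line into its date stamp, archive, and remaining data. The helper split_line and the sample line are hypothetical, and reading the six digits as YYMMDD is an assumption that the pattern alone does not confirm.

```python
import re
from datetime import datetime

# Pattern copied from the module documentation.
LINE = re.compile(r'^(?P<event_date>\d{6})\|(?P<archive>[a-z-]+)\|(?P<data>.*)$')


def split_line(raw):
    """Return (date, archive, data) for a log line, or None if it does not match."""
    match = LINE.match(raw.strip())
    if match is None:
        return None
    # Assumption: the date stamp is YYMMDD.
    event_date = datetime.strptime(match.group('event_date'), '%y%m%d').date()
    return event_date, match.group('archive'), match.group('data')


# Made-up line; the payload after the second pipe is a placeholder.
parsed = split_line("990210|quant-ph|example payload")
# parsed == (datetime.date(1999, 2, 10), 'quant-ph', 'example payload')
```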
-
class
arxiv.canonical.classic.daily.
LineParser
[source]¶ Bases:
object
Shared behavior among new-style and old-style line parsing.
-
arxiv.canonical.classic.daily.
NEW_STYLE_CUTOVER_AFTER
= datetime.date(2007, 4, 2)¶ Date after which the new-style format was adopted.
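One way the cutover date could be used is to decide which line format applies to a given announcement date. The function uses_new_style is hypothetical; the strict comparison follows the wording "after which" and is otherwise an assumption about how the module applies the constant.

```python
from datetime import date

# Value copied from the module documentation.
NEW_STYLE_CUTOVER_AFTER = date(2007, 4, 2)


def uses_new_style(event_date):
    """Return True if a line dated event_date should be in the new-style format."""
    # Strictly after the cutover date; the cutover date itself is still old-style.
    return event_date > NEW_STYLE_CUTOVER_AFTER


on_cutover = uses_new_style(date(2007, 4, 2))   # False
day_after = uses_new_style(date(2007, 4, 3))    # True
```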
-
class
arxiv.canonical.classic.daily.
NewStyleLineParser
[source]¶ Bases:
arxiv.canonical.classic.daily.LineParser
Parses new-style daily log lines.
Starting after 2007-04-02 (NEW_STYLE_CUTOVER_AFTER), the format changed to put all announcement-related events on a given day on the same line. The three original sections of the line are preserved, but within each section are entries for e-prints from all archives.
-
parse_cross
(archive, fragment)[source]¶ Parse entries for cross-lists.
- Parameters
- Returns
Yields Event instances from this section.
- Return type
iterable
-
parse_new
(archive, fragment)[source]¶ Parse entries for new e-prints.
- Parameters
- Returns
Yields Event instances from this section.
- Return type
iterable
-
-
class
arxiv.canonical.classic.daily.
OldStyleLineParser
[source]¶ Bases:
arxiv.canonical.classic.daily.LineParser
Parses data from old-style log lines.
The original format used a separate line for each archive. The line contained three sections: e-prints newly announced in that archive, e-prints cross-listed to that archive, and e-prints replaced either in that archive or with a new cross-list category in that archive. Thus there may be multiple lines for a given announcement day, one per archive in which announcement activity occurred.
-
parse_cross
(archive, fragment)[source]¶ Parse entries for cross-list e-prints.
- Parameters
- Returns
Yields Event instances from this section.
- Return type
iterable
-
-
arxiv.canonical.classic.daily.
SINGLE_IDENTIFIER
= re.compile('^(\\d{7})$')¶ Numeric part of an old-style arXiv ID.
-
arxiv.canonical.classic.daily.
SQUASHED_IDENTIFIER
= re.compile('(?P<archive>[a-z\\-]+)(?P<identifier>\\d{7})')¶ The old-style format omitted the forward slash in the old identifier.
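A small sketch of restoring the slash with this pattern. The helper unsquash is hypothetical, and matching from the start of the token is an assumption, since the documented pattern is not anchored.

```python
import re

# Pattern copied from the module documentation.
SQUASHED_IDENTIFIER = re.compile(r'(?P<archive>[a-z\-]+)(?P<identifier>\d{7})')


def unsquash(token):
    """Turn 'quant-ph9902016' into 'quant-ph/9902016'; return None if no match."""
    match = SQUASHED_IDENTIFIER.match(token)
    if match is None:
        return None
    return f"{match.group('archive')}/{match.group('identifier')}"


restored = unsquash("quant-ph9902016")
# restored == 'quant-ph/9902016'
```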
-
arxiv.canonical.classic.daily.
WEIRD_INVERTED_ENTRY
= re.compile('^(?P<identifier>\\d{7})(?:\\.\\d)?(?P<archive>[a-z\\-]+)(\\.[a-zA-Z\\-]+)?$')¶ Pattern for a weird edge case not handled in the legacy code.
Here is an example:
quant-ph9902016 9704019.0chao-dyn 9902003.0chao-dyn 9904021.0chao-dyn
quant-ph9902016 is normal. But 9704019.0chao-dyn does not match any patterns in the legacy code. In this particular case (from 1999), we can infer that 9704019 belongs with chao-dyn rather than quant-ph because quant-ph/9704019 was last updated in 1997 and this entry is in 1999 when chao-dyn/9704019 was last updated.
Not sure what the decimal part is supposed to mean. It does not appear to refer to the e-print version. I also considered the possibility that it is a range of some kind, e.g. 9912003.4solv-int -> solv-int/9912003 and solv-int/9912004, but this is in a replacement section and there is only one version of solv-int/9912004.
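The example entries can be run through the two documented patterns to see how the inverted form might be normalized. The helper normalize and the order in which the patterns are tried are assumptions for illustration; the decimal part is simply discarded here, since its meaning is unknown.

```python
import re

# Patterns copied from the module documentation.
SQUASHED_IDENTIFIER = re.compile(r'(?P<archive>[a-z\-]+)(?P<identifier>\d{7})')
WEIRD_INVERTED_ENTRY = re.compile(
    r'^(?P<identifier>\d{7})(?:\.\d)?(?P<archive>[a-z\-]+)(\.[a-zA-Z\-]+)?$'
)


def normalize(token):
    """Return 'archive/identifier' for either entry shape, else None."""
    weird = WEIRD_INVERTED_ENTRY.match(token)
    if weird is not None:
        # Inverted form: identifier first, then the archive it belongs to.
        return f"{weird.group('archive')}/{weird.group('identifier')}"
    squashed = SQUASHED_IDENTIFIER.match(token)
    if squashed is not None:
        return f"{squashed.group('archive')}/{squashed.group('identifier')}"
    return None


entries = "quant-ph9902016 9704019.0chao-dyn 9902003.0chao-dyn 9904021.0chao-dyn".split()
result = [normalize(token) for token in entries]
# result == ['quant-ph/9902016', 'chao-dyn/9704019',
#            'chao-dyn/9902003', 'chao-dyn/9904021']
```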