agent.process.metadata_checks module

Automated metadata checks.

class agent.process.metadata_checks.CheckAbstractForUnicodeAbuse(submission_id, process_id=None)

Bases: agent.process.base.Process

Screen for possible abuse of unicode in abstracts.

We support unicode characters in abstracts, but this can get out of hand. This rule adds a flag if the ratio of non-ASCII to ASCII characters is too high.
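The ratio described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the names `non_ascii_ratio` and `has_too_much_unicode`, the default threshold of 0.5, and the handling of text with no ASCII characters are all assumptions made for the example.

```python
def non_ascii_ratio(text: str) -> float:
    """Return the ratio of non-ASCII to ASCII characters in ``text``."""
    ascii_count = sum(1 for char in text if ord(char) < 128)
    non_ascii_count = len(text) - ascii_count
    if ascii_count == 0:
        # Avoid division by zero. Assumption: text with no ASCII at all is
        # treated as maximally suspicious, and empty text as harmless.
        return float("inf") if non_ascii_count else 0.0
    return non_ascii_count / ascii_count


def has_too_much_unicode(text: str, threshold: float = 0.5) -> bool:
    """Report whether the ratio exceeds a (hypothetical) threshold."""
    return non_ascii_ratio(text) > threshold
```

With these assumptions, a mostly-ASCII abstract containing a few accented names passes, while text dominated by symbols or decorative characters is flagged.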

check_abstract(previous, trigger, emit)

Check abstract for low ASCII content.

Return type

None

steps = [<function CheckAbstractForUnicodeAbuse.check_abstract>]

class agent.process.metadata_checks.CheckForSimilarTitles(submission_id, process_id=None)

Bases: agent.process.base.Process

Check for other submissions with very similar titles.

Ask classic for the titles of papers submitted within the last several months. Add an annotation to the submission if the similarity between any of those titles and the current submission’s title exceeds a configurable threshold.
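The threshold comparison described above might look like the sketch below. Everything here is hypothetical: `find_similar_titles`, the simplified `(id, title)` candidate shape, and the 0.9 default are illustrative, and `difflib.SequenceMatcher` is a stand-in for whatever similarity measure the process really applies (the module's own `jaccard` function is the likelier choice, but this page does not say).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Simplified candidate shape for illustration: (submission id, title).
Candidate = Tuple[int, str]


def char_similarity(title_a: str, title_b: str) -> float:
    """Stand-in similarity measure: case-insensitive character overlap."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()


def find_similar_titles(
    title: str,
    candidates: List[Candidate],
    threshold: float = 0.9,  # hypothetical default, configurable by caller
    similarity: Callable[[str, str], float] = char_similarity,
) -> List[Tuple[int, str, float]]:
    """Return the candidates whose similarity to ``title`` exceeds the threshold."""
    matches = []
    for ident, candidate_title in candidates:
        score = similarity(title, candidate_title)
        if score > threshold:
            matches.append((ident, candidate_title, score))
    return matches
```

Passing the similarity measure as a parameter keeps the threshold logic independent of how titles are compared, so a token-based measure can be swapped in without changing the loop.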

check_for_duplicates(candidates, trigger, emit)

Look for very similar titles, and add flags if appropriate.

Return type

None

get_candidates(previous, trigger, emit)

Get candidate titles from the database.

Return type

List[Tuple[int, str, Agent]]

steps = [<function CheckForSimilarTitles.get_candidates>, <function CheckForSimilarTitles.check_for_duplicates>]

class agent.process.metadata_checks.CheckTitleForUnicodeAbuse(submission_id, process_id=None)

Bases: agent.process.base.Process

Screen for possible abuse of unicode in titles.

We support unicode characters in titles, but this can get out of hand. This rule adds a flag if the ratio of non-ASCII to ASCII characters is too high.

check_title(previous, trigger, emit)

Check title for low ASCII content.

Return type

None

steps = [<function CheckTitleForUnicodeAbuse.check_title>]

agent.process.metadata_checks.intersection(phrase_a, phrase_b)

Calculate the number of tokens shared by two phrases.

Return type

int

agent.process.metadata_checks.jaccard(phrase_a, phrase_b)

Calculate the Jaccard similarity of two phrases.

Return type

float
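For reference, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The sketch below is a minimal standalone version, assuming whitespace tokenization and a result of 0.0 when both phrases are empty. The helper names `tokens` and `jaccard_similarity` are illustrative and are not the module's functions.

```python
from typing import Set


def tokens(phrase: str) -> Set[str]:
    """Split a phrase into a set of lowercase, whitespace-separated tokens."""
    return set(phrase.lower().split())


def jaccard_similarity(phrase_a: str, phrase_b: str) -> float:
    """Return |A & B| / |A | B| for the token sets of the two phrases."""
    tokens_a, tokens_b = tokens(phrase_a), tokens(phrase_b)
    shared = len(tokens_a & tokens_b)  # tokens present in both phrases
    total = len(tokens_a | tokens_b)   # distinct tokens across both phrases
    if total == 0:
        # Assumption: two empty phrases are treated as not similar.
        return 0.0
    return shared / total
```

For example, "deep learning methods" and "deep learning theory" share two of four distinct tokens, giving a similarity of 0.5.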

agent.process.metadata_checks.normalize(phrase)

Prepare a phrase for tokenization.

Return type

str

agent.process.metadata_checks.tokenized(phrase)

Split a phrase into tokens and remove stopwords.

Return type

Set[str]
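A normalization and tokenization pipeline of this kind could be sketched as below. The specific choices are assumptions for illustration only: lowercasing, stripping ASCII punctuation, collapsing whitespace, and a small hand-picked stopword list. The names `normalize_phrase`, `tokenize_phrase`, and `STOPWORDS` are hypothetical.

```python
import re
import string
from typing import Set

# Hypothetical stopword list; the real module may use a different one.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def normalize_phrase(phrase: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    lowered = phrase.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def tokenize_phrase(phrase: str) -> Set[str]:
    """Split a normalized phrase into tokens, dropping stopwords."""
    return {
        token
        for token in normalize_phrase(phrase).split()
        if token not in STOPWORDS
    }
```

Returning a set rather than a list discards word order and repetition, which suits set-based measures such as Jaccard similarity.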

agent.process.metadata_checks.union(phrase_a, phrase_b)

Calculate the total number of tokens in two phrases.

Return type

int

agent.process.metadata_checks.window(days)

Get the datetime from the given number of days ago.

Return type

datetime
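A helper like this is typically a thin wrapper around `timedelta`. The sketch below is an assumption-laden illustration: the name `days_ago`, the optional `now` parameter (added to make the function testable), and the use of timezone-aware UTC datetimes are not confirmed by this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def days_ago(days: int, now: Optional[datetime] = None) -> datetime:
    """Return the datetime ``days`` days before ``now`` (default: current UTC time)."""
    if now is None:
        # Assumption: timezone-aware UTC; the real module may use naive datetimes.
        now = datetime.now(timezone.utc)
    return now - timedelta(days=days)
```

A value produced this way can serve as the lower bound of a "submitted within the last several months" query.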