search.services.index package

Provides integration with an Elasticsearch cluster.

The primary entry point to this module is search(), which handles search.domain.Query instances passed by controllers and returns a DocumentSet containing the search results. get_document() is available for future use, e.g. as part of a search API.

In addition, add_document() and bulk_add_documents() are provided for indexing (e.g. by the search.agent.consumer.MetadataRecordProcessor).

For thread safety, SearchSession encapsulates the configuration parameters and the connection to the Elasticsearch cluster. The module-level functions mentioned above load the appropriate SearchSession instance for the context of the request.
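
The per-context session pattern described here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the package's actual implementation: FakeSession, the context objects, the config dict, and the search_session attribute name are all assumptions made for the example.

```python
from types import SimpleNamespace


class FakeSession:
    """Stand-in for SearchSession; holds configuration only (hypothetical)."""

    def __init__(self, host, index):
        self.host = host
        self.index = index

    def exists(self, paper_id_v):
        # A real session would ask Elasticsearch; this stub knows one paper.
        return paper_id_v == "1234.5678v1"


def current_session(context, config):
    """Return the session stored on ``context``, creating it on first use."""
    session = getattr(context, "search_session", None)
    if session is None:
        session = FakeSession(config["host"], config["index"])
        context.search_session = session
    return session


def exists(context, config, paper_id_v):
    """Module-level wrapper: look up the session, then delegate to it."""
    return current_session(context, config).exists(paper_id_v)


# Each (hypothetical) request context gets its own session object.
config = {"host": "localhost", "index": "papers"}
request_a = SimpleNamespace()
request_b = SimpleNamespace()

same_session = current_session(request_a, config) is current_session(request_a, config)
different_sessions = current_session(request_a, config) is not current_session(request_b, config)
found = exists(request_a, config, "1234.5678v1")
```

Because a session is never shared between contexts in this sketch, two requests handled concurrently do not touch the same connection object.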

class search.services.index.SearchSession(host, index, port=9200, scheme='http', user=None, password=None, mapping=None, verify=True, **extra)[source]

Bases: object

Encapsulates a session with an Elasticsearch host.

add_document(document)[source]

Add a document to the search index.

Uses paper_id_v as the primary identifier for the document. If the document is already indexed, it is silently overwritten.

Parameters:

document (Document) – Must be a valid search document, per schema/DocumentMetadata.json.

Return type:

None

bulk_add_documents(documents, docs_per_chunk=500)[source]

Add documents to the search index using the bulk API.

Parameters:
  • documents (list of Document) – Each must be a valid search document, per schema/DocumentMetadata.json.
  • docs_per_chunk (int) – Number of documents to send to ES in a single chunk.
Raises:
  • IndexConnectionError – Problem communicating with Elasticsearch host.
  • BulkIndexingError – Problem serializing document for indexing.
Return type:

None
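
To illustrate what docs_per_chunk controls, the sketch below splits a sequence of documents into chunks of at most that size. It is only an illustration of chunking in general; how this package (or the Elasticsearch client it uses) actually divides the documents is not shown in this reference, and the chunked helper and the placeholder documents are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], docs_per_chunk: int = 500) -> Iterator[List[T]]:
    """Yield successive lists of at most ``docs_per_chunk`` items."""
    if docs_per_chunk < 1:
        raise ValueError("docs_per_chunk must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, docs_per_chunk))
        if not chunk:
            return
        yield chunk


# 1,200 placeholder documents split into chunks of 500.
documents = [{"paper_id_v": f"doc-{i}"} for i in range(1200)]
chunk_sizes = [len(chunk) for chunk in chunked(documents, docs_per_chunk=500)]
```

With these inputs the last chunk is smaller than the others, since 1,200 is not a multiple of 500.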

cluster_available()[source]

Determine whether or not the ES cluster is available.

Return type:bool
create_index()[source]

Create the search index.

Parameters:mappings (dict) – See elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Return type:None
exists(paper_id_v)[source]

Determine whether a paper exists in the index.

Return type:bool
get_document(document_id)[source]

Retrieve a document from the index by ID.

Uses metadata_id as the primary identifier for the document.

Parameters:

document_id (int) – Value of metadata_id in the original document.

Raises:
  • IndexConnectionError – Problem communicating with the search index.
  • QueryError – Invalid query parameters.
Return type:

Document

get_task_status(task)[source]

Get the status of a running task in ES (e.g. reindex).

Parameters:task (str) – A task ID, e.g. returned in response to an asynchronous reindexing request.
Returns:Response from the Elasticsearch task API.
Return type:dict
index_exists(index_name)[source]

Determine whether or not an index exists.

Parameters:index_name (str) – Name of the index to check.
Return type:bool
reindex(old_index, new_index, wait_for_completion=False)[source]

Create a new index and reindex with the current mappings.

Creating the new index and performing the reindexing operation are two separate actions in the ES API. If creation of the new index succeeds but the request to reindex fails, no attempt is made to clean up. If the new index already exists, the reindex operation is still attempted.

Parameters:
  • old_index (str) – Name of the index to copy from.
  • new_index (str) – Name of the index to create and copy to.
Returns:

Response from the Elasticsearch reindex API. If wait_for_completion is False (default), the response should include a task key with a task ID that can be used to check the status of the reindexing operation.

Return type:

dict
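
The two-step behaviour described above (create the index, then reindex, with no cleanup on failure) can be sketched as follows. FakeIndexClient is an in-memory stand-in invented for this example, including its method names and the made-up task ID; it is not the Elasticsearch client API and not this package's implementation.

```python
class FakeIndexClient:
    """In-memory stand-in for an Elasticsearch client (hypothetical)."""

    def __init__(self, indices):
        self.indices = {name: list(docs) for name, docs in indices.items()}

    def create(self, name):
        if name in self.indices:
            raise ValueError(f"index {name!r} already exists")
        self.indices[name] = []

    def reindex(self, source, dest):
        self.indices[dest] = list(self.indices[source])
        return {"task": "node-1:42"}  # made-up task ID


def reindex(client, old_index, new_index):
    """Create ``new_index``, then copy ``old_index`` into it."""
    try:
        client.create(new_index)
    except ValueError:
        # The new index already exists: carry on with the reindex anyway.
        pass
    # A failure here propagates; the newly created index is not removed.
    return client.reindex(source=old_index, dest=new_index)


client = FakeIndexClient({"papers_v1": [{"paper_id_v": "1234.5678v1"}]})
response = reindex(client, "papers_v1", "papers_v2")
```

Calling reindex a second time with the same arguments does not fail in this sketch, mirroring the note that an existing target index does not prevent the reindex request.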

search(query, highlight=True)[source]

Perform a search.

Parameters:

query (Query) – The search.domain.Query instance to execute.

Raises:
  • IndexConnectionError – Problem communicating with the search index.
  • QueryError – Invalid query parameters.
Return type:

DocumentSet

search.services.index.add_document(self, document)[source]

Add a document to the search index.

Uses paper_id_v as the primary identifier for the document. If the document is already indexed, it is silently overwritten.

Parameters:

document (Document) – Must be a valid search document, per schema/DocumentMetadata.json.

Return type:

None

search.services.index.bulk_add_documents(self, documents, docs_per_chunk=500)[source]

Add documents to the search index using the bulk API.

Parameters:
  • documents (list of Document) – Each must be a valid search document, per schema/DocumentMetadata.json.
  • docs_per_chunk (int) – Number of documents to send to ES in a single chunk.
Raises:
  • IndexConnectionError – Problem communicating with Elasticsearch host.
  • BulkIndexingError – Problem serializing document for indexing.
Return type:

None

search.services.index.cluster_available(self)[source]

Determine whether or not the ES cluster is available.

Return type:bool
search.services.index.create_index(self)[source]

Create the search index.

Parameters:mappings (dict) – See elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Return type:None
search.services.index.current_session()[source]

Get/create SearchSession for this context.

Return type:SearchSession
search.services.index.exists(self, paper_id_v)[source]

Determine whether a paper exists in the index.

Return type:bool
search.services.index.get_document(self, document_id)[source]

Retrieve a document from the index by ID.

Uses metadata_id as the primary identifier for the document.

Parameters:

document_id (int) – Value of metadata_id in the original document.

Raises:
  • IndexConnectionError – Problem communicating with the search index.
  • QueryError – Invalid query parameters.
Return type:

Document

search.services.index.get_session(app=None)[source]

Get a new session with the search index.

Return type:SearchSession
search.services.index.get_task_status(self, task)[source]

Get the status of a running task in ES (e.g. reindex).

Parameters:task (str) – A task ID, e.g. returned in response to an asynchronous reindexing request.
Returns:Response from the Elasticsearch task API.
Return type:dict
search.services.index.handle_es_exceptions()[source]

Handle common Elasticsearch-related exceptions.

Return type:Generator
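
The Generator return type suggests that this function is a generator-based context manager, for example one built with contextlib.contextmanager; that is an inference, not something this reference states. Under that assumption, the general technique looks like the sketch below. The two Backend* exception classes, handle_backend_exceptions, and run_query are hypothetical names for this example; IndexConnectionError and QueryError are redefined locally so the block is self-contained.

```python
from contextlib import contextmanager
from typing import Generator


class IndexConnectionError(Exception):
    """Problem communicating with the search index."""


class QueryError(Exception):
    """Invalid query parameters."""


class BackendUnavailable(Exception):
    """Hypothetical low-level error: the cluster could not be reached."""


class BackendRejectedRequest(Exception):
    """Hypothetical low-level error: the cluster refused the request."""


@contextmanager
def handle_backend_exceptions() -> Generator[None, None, None]:
    """Translate low-level client errors into the service's exceptions."""
    try:
        yield
    except BackendUnavailable as exc:
        raise IndexConnectionError(f"Could not reach the index: {exc}") from exc
    except BackendRejectedRequest as exc:
        raise QueryError(f"Invalid query: {exc}") from exc


def run_query(fail_with=None):
    """Pretend to query the index, optionally raising a low-level error."""
    with handle_backend_exceptions():
        if fail_with is not None:
            raise fail_with
        return "ok"
```

Wrapping each call in one context manager keeps the translation from client errors to service errors in a single place, so callers only need to handle the two exception types listed under Raises.
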
search.services.index.index_exists(self, index_name)[source]

Determine whether or not an index exists.

Parameters:index_name (str) – Name of the index to check.
Return type:bool
search.services.index.init_app(app=None)[source]

Set default configuration parameters for an application instance.

Return type:None
search.services.index.ok()[source]

Health check.

Return type:bool
search.services.index.reindex(self, old_index, new_index, wait_for_completion=False)[source]

Create a new index and reindex with the current mappings.

Creating the new index and performing the reindexing operation are two separate actions in the ES API. If creation of the new index succeeds but the request to reindex fails, no attempt is made to clean up. If the new index already exists, the reindex operation is still attempted.

Parameters:
  • old_index (str) – Name of the index to copy from.
  • new_index (str) – Name of the index to create and copy to.
Returns:

Response from the Elasticsearch reindex API. If wait_for_completion is False (default), the response should include a task key with a task ID that can be used to check the status of the reindexing operation.

Return type:

dict
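
When wait_for_completion is False, the task ID from the response can be passed to get_task_status() until the operation finishes. A minimal polling sketch follows. It assumes the status dict carries a boolean completed key, as responses from the Elasticsearch task API generally do; wait_for_task, its parameters, and the fake status function are hypothetical and exist only for this example.

```python
import time
from typing import Callable


def wait_for_task(
    get_task_status: Callable[[str], dict],
    task: str,
    interval: float = 1.0,
    max_attempts: int = 10,
) -> dict:
    """Poll ``get_task_status`` until the task reports completion."""
    for attempt in range(max_attempts):
        status = get_task_status(task)
        if status.get("completed"):
            return status
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"task {task} did not complete after {max_attempts} checks")


# Fake status function: reports completion on the third call.
calls = []


def fake_get_task_status(task: str) -> dict:
    calls.append(task)
    return {"completed": len(calls) >= 3}


final_status = wait_for_task(fake_get_task_status, "node-1:42", interval=0.0)
```

Passing the status function as an argument keeps the sketch independent of any particular session object; in practice the module-level get_task_status() could be supplied instead of the fake.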

search.services.index.search(self, query, highlight=True)[source]

Perform a search.

Parameters:

query (Query) – The search.domain.Query instance to execute.

Raises:
  • IndexConnectionError – Problem communicating with the search index.
  • QueryError – Invalid query parameters.
Return type:

DocumentSet