bartocsuggest documentation

bartocsuggest is a Python module that suggests vocabularies for a given list of words, using the BARTOC FAST API (https://bartoc-fast.ub.unibas.ch/bartocfast/api).

Documentation available at: https://bartocsuggest.readthedocs.io/en/latest/

Codebase available at: https://github.com/MHindermann/bartocsuggest

Core functionality

class bartocsuggest.Session(words, preload_folder=None, language='und')

Vocabulary suggestion session using the BARTOC FAST API.

Parameters
  • words (Union[List[str], str, _ConceptScheme]) – the input words (list of strings, or path to XLSX file, or JSKOS concept scheme)

  • preload_folder (Optional[str]) – the path to the preload folder, defaults to None

  • language (str) – the language of the words given as RFC 3066 language tag, defaults to “und” (for undefined)

preload(max=100000, min=0, verbose=False)

Preload responses.

For each word in self.words, a query is sent to the BARTOC FAST API, and the response is saved to self.preload_folder. Use this method to process large inputs (more than 100 words in self.words) in batches.

Parameters
  • max (int) – stop with the max-th word in self.words, defaults to 100000

  • min (int) – start with the min-th word in self.words, defaults to 0

  • verbose (bool) – toggle running comment printed to console, defaults to False

Return type

None
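The batchwise handling described above can be sketched as a helper that produces (min, max) pairs. This is a minimal sketch: preload_windows is a hypothetical helper, not part of bartocsuggest, and the batch size of 100 is chosen only for illustration.

```python
from typing import Iterator, Tuple


def preload_windows(total: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield (min, max) index pairs that cover `total` words in steps of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, total, size):
        # The last window is clipped so that it never runs past the final word.
        yield start, min(start + size, total)


# Hypothetical input size: 250 words handled 100 at a time.
windows = list(preload_windows(250, 100))
print(windows)  # [(0, 100), (100, 200), (200, 250)]
```

Each pair could then be passed on as session.preload(min=lo, max=hi). The documentation does not state whether the max-th word itself is included, so adjacent windows may share one word; at worst that word is queried twice.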

suggest(remote=True, sensitivity=1, score_type=<class 'bartocsuggest.Recall'>, verbose=False)

Suggest vocabularies based on self.words.

Parameters
  • remote (bool) – toggle between remote BARTOC FAST querying and preload folder, defaults to True

  • sensitivity (int) – set the maximum allowed Levenshtein distance between word and result, defaults to 1

  • score_type (ScoreType) – set the score type on which the suggestion is based, defaults to bartocsuggest.Recall

  • verbose (bool) – toggle running comment printed to console, defaults to False

Return type

Suggestion
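A typical call sequence, using only the signatures documented here, might look like the sketch below. The function name suggest_for_words, the language tag "en", and the limit of five results are assumptions made for illustration; the call needs the bartocsuggest package and network access to BARTOC FAST.

```python
from typing import List


def suggest_for_words(words: List[str], top: int = 5) -> List[str]:
    """Return up to `top` suggested vocabularies for `words` (hypothetical helper)."""
    # Imported inside the function so the sketch can be loaded without the package installed.
    from bartocsuggest import Session

    session = Session(words, language="en")
    # remote=True queries BARTOC FAST directly instead of reading a preload folder;
    # sensitivity=1 allows a Levenshtein distance of at most 1 between word and result.
    suggestion = session.suggest(remote=True, sensitivity=1, verbose=False)
    # scores=False returns the vocabularies only, sorted from best to worst.
    return suggestion.get(scores=False, max=top)
```

Calling suggest_for_words(["auction", "market", "price"]) would return a list of strings identifying the suggested vocabularies; the actual result depends on the live BARTOC FAST index.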

class bartocsuggest.Suggestion(_scheme, _vocabularies, _sensitivity, _score_type)

A suggestion of vocabularies.

Parameters
  • _scheme (_ConceptScheme) – the input concept scheme

  • _vocabularies (List[_Source]) – the suggested vocabularies

  • _sensitivity (int) – the used sensitivity

  • _score_type (ScoreType) – the used score type

get(scores=False, max=None)

Return the suggested vocabularies sorted from best to worst.

Parameters
  • scores (bool) – toggle returning results and their scores, defaults to False

  • max (Optional[int]) – limit the number of suggestions to max, defaults to None

Return type

Union[List[str], List[Tuple[str, int]]]

get_score_type()

Return the suggestion’s score type.

Return type

ScoreType

get_sensitivity()

Return the suggestion’s sensitivity.

Return type

int

print()

Print the suggestion to the console.

print_concordance(vocabulary_uri=None)

Print the concordance as JSKOS to the console.

The concordance is between the session’s input words from which this suggestion was derived and a vocabulary to be chosen by URI. If no vocabulary URI is selected, the most highly suggested vocabulary is used. To see the suggested vocabularies and their URIs, use the print method of this class. For JSKOS, see https://gbv.github.io/jskos/jskos.html (version 0.4.6).

Parameters
  • vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None

Return type

None

save_concordance(folder, filename=None, vocabulary_uri=None)

Save the concordance as JSKOS in the JSON format.

Parameters
  • folder (str) – the path to the save folder

  • filename (Optional[str]) – the name of the file, defaults to None

  • vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None

Return type

None

save_mappings(folder, filename=None, vocabulary_uri=None)

Save the mappings as JSKOS in the NDJSON format.

Mappings in this format can be used in the Cocoda Mapping Tool, see https://coli-conc.gbv.de/cocoda/app/ (version 1.3.6). For NDJSON, see https://github.com/ndjson/ndjson-spec (version 1.0.0).

Parameters
  • folder (str) – the path to the save folder

  • filename (Optional[str]) – the name of the file, defaults to None

  • vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None

Return type

None
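To illustrate how NDJSON differs from a single JSON document, the sketch below writes one JSON object per line using only the standard library. The mapping records and their example.org URIs are made up and only loosely shaped like JSKOS mappings; they are not the output of save_mappings.

```python
import json
from typing import Dict, Iterable, List


def to_ndjson(records: Iterable[Dict]) -> str:
    """Serialize records as NDJSON: one compact JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def from_ndjson(text: str) -> List[Dict]:
    """Parse NDJSON back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative records only (hypothetical URIs).
mappings = [
    {"from": {"memberSet": [{"uri": "http://example.org/words/a"}]},
     "to": {"memberSet": [{"uri": "http://example.org/voc/1"}]}},
    {"from": {"memberSet": [{"uri": "http://example.org/words/c"}]},
     "to": {"memberSet": [{"uri": "http://example.org/voc/7"}]}},
]

ndjson_text = to_ndjson(mappings)
print(ndjson_text, end="")
```

Because every line is a complete JSON value, such a file can be read and processed record by record, which is not possible with a single JSON array.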

class bartocsuggest.ScoreType

A score type.

All score types are relative to a specific vocabulary and a list of words. There are four score type classes: bartocsuggest.Recall, bartocsuggest.Average, bartocsuggest.Coverage, bartocsuggest.Sum. Use the help method on these classes for more information.

class bartocsuggest.Recall

The number of words over a vocabulary’s coverage.

The lower the better (minimum is 1). See https://en.wikipedia.org/wiki/Precision_and_recall#Recall.

For example, for words [a,b,c] and coverage 2, recall is len(words)/coverage = len([a,b,c])/2 = 1.5.
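The same calculation as a minimal sketch; recall_score is a hypothetical helper written for illustration, not the library's implementation, and rejecting a coverage of zero is an assumption.

```python
from typing import Sequence


def recall_score(words: Sequence[str], coverage: int) -> float:
    """Number of words divided by the vocabulary's coverage (lower is better)."""
    if coverage <= 0:
        # Assumption: a vocabulary without any match has no defined recall.
        raise ValueError("coverage must be positive")
    return len(words) / coverage


print(recall_score(["a", "b", "c"], 2))  # 1.5
```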

class bartocsuggest.Average

The average over a vocabulary’s match scores.

The lower the better (minimum is 0). The score of a match is defined by the Levenshtein distance between word and match.

For example, for scores [1,1,4], the average is sum(scores)/len(scores) = (1+1+4)/3 = 2.
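A minimal sketch of how match scores and their average could be computed. Both functions are illustrative helpers, not the library's own code; levenshtein is the standard dynamic-programming formulation of the edit distance.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def average_score(scores: List[int]) -> float:
    """Arithmetic mean of the match scores (lower is better)."""
    return sum(scores) / len(scores)


print(levenshtein("color", "colour"))  # 1
print(average_score([1, 1, 4]))        # 2.0
```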

class bartocsuggest.Coverage

The number of a vocabulary’s matches in the list of words.

Note that this is dependent on the sensitivity parameter of bartocsuggest.Session.suggest().

For example, for words [a,b,c] and vocabulary matches a,c, the coverage is 2, since two of the three words (a and c) have a match.
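The dependence on sensitivity can be sketched as follows. coverage_score and the best_distance table are hypothetical; the sketch assumes a word counts as matched when its closest result lies within the allowed Levenshtein distance.

```python
from typing import Dict, Optional


def coverage_score(best_distance: Dict[str, Optional[int]], sensitivity: int) -> int:
    """Count the words whose closest result is within `sensitivity` edits."""
    return sum(
        1
        for distance in best_distance.values()
        if distance is not None and distance <= sensitivity
    )


# Hypothetical data: closest result per word in one vocabulary; None means no result at all.
best = {"a": 0, "b": None, "c": 1}

print(coverage_score(best, sensitivity=1))  # 2
print(coverage_score(best, sensitivity=0))  # 1
```

With sensitivity 1 the words a and c are covered; with sensitivity 0 only the exact match a remains.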

class bartocsuggest.Sum

The sum over a vocabulary’s match scores.

The lower the sum the better (minimum is 0). The score of a match is defined by the Levenshtein distance between word and match.

For example, for scores [1,1,4], the sum is (1+1+4) = 6.

Wrappers

class bartocsuggest.AnnifSession(text, project_id, limit=None, threshold=None, preload_folder=None)

Wrapper for the Annif REST API based on the Annif-client module.

Annif indexes the input text based on the project identifier with an optional limit or threshold. Use this Session to get vocabulary suggestions for full texts instead of words. bartocsuggest.AnnifSession inherits its methods preload and suggest from bartocsuggest.Session.

Parameters
  • text (str) – the input text

  • project_id (str) – the project identifier

  • limit (Optional[int]) – the maximum number of results to return, defaults to None

  • threshold (Optional[int]) – the minimum score threshold, defaults to None
