bartocsuggest documentation
bartocsuggest is a Python module that uses the BARTOC FAST API (https://bartoc-fast.ub.unibas.ch/bartocfast/api) to suggest vocabularies for a given list of words.
Documentation available at: https://bartocsuggest.readthedocs.io/en/latest/
Codebase available at: https://github.com/MHindermann/bartocsuggest
Core functionality
- class bartocsuggest.Session(words, preload_folder=None, language='und')
Vocabulary suggestion session using the BARTOC FAST API.
- Parameters
words (Union[List[str], str, _ConceptScheme]) – the input words (list of strings, or path to XLSX file, or JSKOS concept scheme)
preload_folder (Optional[str]) – the path to the preload folder, defaults to None
language (str) – the language of the words given as RFC 3066 language tag, defaults to “und” (for undefined)
- preload(max=100000, min=0, verbose=False)
Preload responses.
For each word in self.words, a query is sent to the BARTOC FAST API. The response is saved to self.preload_folder. Use this method for batchwise handling when self.words is large (more than 100 entries).
- Parameters
max (int) – stop with the max-th word in self.words, defaults to 100000
min (int) – start with the min-th word in self.words, defaults to 0
verbose (bool) – toggle running comment printed to console, defaults to False
- Return type
None
- suggest(remote=True, sensitivity=1, score_type=<class 'bartocsuggest.Recall'>, verbose=False)
Suggest vocabularies based on self.words.
- Parameters
remote (bool) – toggle between remote BARTOC FAST querying and preload folder, defaults to True
sensitivity (int) – set the maximum allowed Levenshtein distance between word and result, defaults to 1
score_type (ScoreType) – set the score type on which the suggestion is based, defaults to bartocsuggest.Recall
verbose (bool) – toggle running comment printed to console, defaults to False
- Return type
- class bartocsuggest.Suggestion(_scheme, _vocabularies, _sensitivity, _score_type)
A suggestion of vocabularies.
- Parameters
_scheme (_ConceptScheme) – the input concept scheme
_vocabularies (List[_Source]) – the suggested vocabularies
_sensitivity (int) – the used sensitivity
_score_type (ScoreType) – the used score type
- get(scores=False, max=None)
Return the suggested vocabularies sorted from best to worst.
- Parameters
scores (bool) – toggle returning results and their scores, defaults to False
max (Optional[int]) – limit the number of suggestions to max, defaults to None
- Return type
Union[List[str], List[Tuple[str, int]]]
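The calls documented above can be chained as in the following minimal sketch. It assumes that the package is installed, that Session and Recall can be imported from the top-level bartocsuggest package, and that suggest() returns a Suggestion object. The helper name suggest_vocabularies and the language tag "en" are illustrative choices, not part of the library. Calling the helper queries the remote BARTOC FAST API and therefore needs network access.

```python
from typing import List


def suggest_vocabularies(words: List[str], top: int = 5) -> List[str]:
    """Return up to `top` suggested vocabularies for `words` (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the package.
    from bartocsuggest import Recall, Session

    session = Session(words, language="en")
    # Assumption: suggest() returns a bartocsuggest.Suggestion.
    suggestion = session.suggest(remote=True, sensitivity=1, score_type=Recall)
    # scores=False returns the vocabularies only, sorted from best to worst.
    return suggestion.get(scores=False, max=top)
```

A call such as suggest_vocabularies(["auction", "market"]) would then return a list of strings, with the best suggestion first.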
- get_sensitivity()
Return the suggestion’s sensitivity.
- Return type
int
- print()
Print the suggestion to the console.
- print_concordance(vocabulary_uri=None)
Print the concordance as JSKOS to the console.
The concordance is between the session’s input words from which this suggestion was derived and a vocabulary to be chosen by URI. If no vocabulary URI is selected, the most highly suggested vocabulary is used. To see the suggested vocabularies and their URIs, use the print method of this class. For JSKOS, see https://gbv.github.io/jskos/jskos.html (version 0.4.6).
- Parameters
vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None
- Return type
None
- save_concordance(folder, filename=None, vocabulary_uri=None)
Save the concordance as JSKOS in the JSON format.
- Parameters
folder (str) – the path to the save folder
filename (Optional[str]) – the name of the file, defaults to None
vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None
- Return type
None
- save_mappings(folder, filename=None, vocabulary_uri=None)
Save the mappings as JSKOS in the NDJSON format.
Mappings in this format can be used in the Cocoda Mapping Tool, see https://coli-conc.gbv.de/cocoda/app/ (version 1.3.6). For NDJSON, see https://github.com/ndjson/ndjson-spec (version 1.0.0).
- Parameters
folder (str) – the path to the save folder
filename (Optional[str]) – the name of the file, defaults to None
vocabulary_uri (Optional[str]) – the URI of the vocabulary, defaults to None
- Return type
None
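To illustrate the NDJSON layout mentioned above (one JSON object per line), here is a small standard-library sketch. It is not the code that save_mappings runs. The helper names to_ndjson and from_ndjson are hypothetical, and the records are placeholders loosely modeled on JSKOS mappings, not real output of the library.

```python
import json
from typing import Dict, Iterable, List


def to_ndjson(records: Iterable[Dict]) -> str:
    """Serialize each record as a single-line JSON object followed by a newline."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_ndjson(text: str) -> List[Dict]:
    """Parse NDJSON text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Placeholder mapping records with example.org URIs (illustrative only).
mappings = [
    {
        "from": {"memberSet": [{"uri": "http://example.org/words/1"}]},
        "to": {"memberSet": [{"uri": "http://example.org/vocabulary/a"}]},
    },
    {
        "from": {"memberSet": [{"uri": "http://example.org/words/2"}]},
        "to": {"memberSet": [{"uri": "http://example.org/vocabulary/b"}]},
    },
]

ndjson_text = to_ndjson(mappings)
```

Because every record occupies exactly one line, such a file can be processed line by line without loading it as a whole.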
- class bartocsuggest.ScoreType
A score type.
All score types are relative to a specific vocabulary and a list of words. There are four score type classes: bartocsuggest.Recall, bartocsuggest.Average, bartocsuggest.Coverage, bartocsuggest.Sum. Use help() on these classes for more information.
- class bartocsuggest.Recall
The number of words divided by a vocabulary’s coverage.
The lower the better (minimum is 1). See https://en.wikipedia.org/wiki/Precision_and_recall#Recall.
For example, for words [a,b,c] and coverage 2, recall is len(words)/coverage = len([a,b,c])/2 = 1.5.
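The formula above can be written as a short standalone function. This is a sketch of the definition as stated here, not the library's implementation. The function name recall and the choice to raise an error for a coverage of zero are assumptions.

```python
from typing import List


def recall(words: List[str], coverage: int) -> float:
    """Number of words divided by the vocabulary's coverage (lower is better)."""
    if coverage <= 0:
        # Assumption: a vocabulary without any match has no defined recall.
        raise ValueError("coverage must be positive")
    return len(words) / coverage


example_recall = recall(["a", "b", "c"], 2)
```

With full coverage (every word matched), the value reaches its minimum of 1.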
- class bartocsuggest.Average
The average over a vocabulary’s match scores.
The lower the better (minimum is 0). The score of a match is defined by the Levenshtein distance between word and match.
For example, for scores [1,1,4], the average is sum(scores)/len(scores) = (1+1+4)/3 = 2.
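As a sketch of the definition above, the average can be computed directly from a list of match scores. The function name average and the error raised for an empty list are assumptions, not library behavior.

```python
from typing import List


def average(scores: List[int]) -> float:
    """Mean Levenshtein distance over a vocabulary's matches (lower is better)."""
    if not scores:
        # Assumption: without matches there is nothing to average.
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


example_average = average([1, 1, 4])
```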
- class bartocsuggest.Coverage
The number of a vocabulary’s matches in the list of words.
Note that this is dependent on the sensitivity parameter of bartocsuggest.Session.suggest().
For example, for words [a,b,c] and vocabulary matches a,c, the coverage is 2, since both a and c occur in [a,b,c].
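The dependence on sensitivity can be sketched with a plain Levenshtein distance: a word counts as covered when at least one vocabulary label lies within the allowed distance. The function names levenshtein and coverage and the exact matching rule (case-sensitive, whole-string comparison) are assumptions made for illustration. The library may match differently.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def coverage(words: List[str], labels: Iterable[str], sensitivity: int = 1) -> int:
    """Count the words with at least one label within `sensitivity` edits."""
    label_list = list(labels)
    return sum(
        1
        for word in words
        if any(levenshtein(word, label) <= sensitivity for label in label_list)
    )


exact = coverage(["a", "b", "c"], ["a", "c"], sensitivity=0)
fuzzy = coverage(["cat", "dog", "bird"], ["cats", "birds"], sensitivity=1)
```

Raising the sensitivity can only keep or increase the count, which is why the same vocabulary may score differently across calls to suggest().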
- class bartocsuggest.Sum
The sum over a vocabulary’s match scores.
The lower the sum the better (minimum is 0). The score of a match is defined by the Levenshtein distance between word and match.
For example, for scores [1,1,4], the sum is (1+1+4) = 6.
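Since lower sums are better, ranking vocabularies by this score amounts to an ascending sort. The sketch below assumes the match scores per vocabulary are already available as a dictionary. The function name rank_by_sum and the vocabulary names are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_sum(match_scores: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Sort vocabularies by the sum of their match scores, best (lowest) first."""
    totals = {name: sum(scores) for name, scores in match_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1])


ranking = rank_by_sum({"vocabulary_a": [1, 1, 4], "vocabulary_b": [0, 1]})
```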
Wrappers
- class bartocsuggest.AnnifSession(text, project_id, limit=None, threshold=None, preload_folder=None)
Wrapper for the Annif REST API based on the Annif-client module.
Annif indexes the input text based on the project identifier with an optional limit or threshold. Use this Session to get vocabulary suggestions for full texts instead of words. bartocsuggest.AnnifSession inherits its methods preload and suggest from bartocsuggest.Session.
- Parameters
text (str) – the input text
project_id (str) – the project identifier
limit (Optional[int]) – the maximum number of results to return, defaults to None
threshold (Optional[int]) – the minimum score threshold, defaults to None
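A full-text workflow might look like the following sketch. It assumes that AnnifSession and Recall can be imported from the top-level bartocsuggest package and that the inherited suggest() method behaves as it does on Session. The helper name suggest_for_text is hypothetical, and valid project identifiers depend on the Annif instance being queried. Running it requires network access.

```python
def suggest_for_text(text: str, project_id: str, limit: int = 10):
    """Index `text` with Annif, then suggest vocabularies (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the package.
    from bartocsuggest import AnnifSession, Recall

    # Annif turns the text into subject terms; limit caps how many are returned.
    session = AnnifSession(text, project_id, limit=limit)
    # suggest() is inherited from bartocsuggest.Session.
    return session.suggest(remote=True, sensitivity=1, score_type=Recall)
```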