whether or not to return all distance values in the metadata.
number of results to return in the metadata after sorting by the last distance calculated
Resolves the ResolverLabel for the given array of TOKEN and WORD_EMBEDDINGS annotations
an array of TOKEN and WORD_EMBEDDINGS Annotation objects coming from ChunkTokenizer and ChunkEmbeddings respectively
an array of Annotation objects, with the result of the entity resolution for each chunk and the following metadata
all_k_results -> Sorted ResolverLabels of the top alternatives that match the distance
all_k_resolutions -> Respective ResolverNormalized strings
all_k_distances -> Respective distance values after aggregation
all_k_wmd_distances -> Respective WMD distance values
all_k_tfidf_distances -> Respective TFIDF Cosine distance values
all_k_jaccard_distances -> Respective Jaccard distance values
all_k_sorensen_distances -> Respective SorensenDice distance values
all_k_jaro_distances -> Respective JaroWinkler distance values
all_k_levenshtein_distances -> Respective Levenshtein distance values
all_k_confidences -> Respective normalized probabilities based on inverse distance values
target_text -> The actual searched string
resolved_text -> The top ResolverNormalized string
confidence -> Top probability
distance -> Top distance value
sentence -> Sentence index
chunk -> Chunk index
token -> Token index
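The metadata fields above can be pictured as a per-chunk dictionary assembled from the sorted top-k candidates. The sketch below is a hypothetical illustration, assuming an inverse-distance confidence normalization and a `:::` separator for the `all_k_*` fields; neither is guaranteed to match the library's exact output format.

```python
# Hypothetical sketch: build the resolution metadata for one chunk from
# pre-computed candidates. The ':::' separator and 1/(1+d) confidence
# weighting are assumptions for illustration only.

def build_metadata(candidates, sentence_idx, chunk_idx, token_idx):
    """candidates: list of (label, normalized_text, distance) tuples."""
    ordered = sorted(candidates, key=lambda c: c[2])  # sort by distance
    inv = [1.0 / (1.0 + d) for _, _, d in ordered]    # inverse-distance weights
    total = sum(inv)
    confidences = [w / total for w in inv]            # normalized probabilities
    top_label, top_norm, top_dist = ordered[0]
    return {
        "all_k_results": ":::".join(label for label, _, _ in ordered),
        "all_k_resolutions": ":::".join(norm for _, norm, _ in ordered),
        "all_k_distances": ":::".join(f"{d:.4f}" for _, _, d in ordered),
        "all_k_confidences": ":::".join(f"{c:.4f}" for c in confidences),
        "resolved_text": top_norm,
        "confidence": f"{confidences[0]:.4f}",
        "distance": f"{top_dist:.4f}",
        "sentence": str(sentence_idx),
        "chunk": str(chunk_idx),
        "token": str(token_idx),
    }
```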
validates the dataset before passing it further down the pipeline
what function to use to calculate confidence: INVERSE or SOFTMAX
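The two confidence functions can be sketched as follows. This is an illustration of the intent, assuming INVERSE normalizes 1/(1+d) weights and SOFTMAX normalizes exp(-d) weights; the library's exact formulas may differ.

```python
import math

# Sketch of the two selectable confidence normalizations over a list of
# candidate distances. Both return probabilities that sum to 1, with
# smaller distances mapped to larger confidences.

def confidences(distances, mode="INVERSE"):
    if mode == "INVERSE":
        weights = [1.0 / (1.0 + d) for d in distances]  # assumed inverse-distance form
    elif mode == "SOFTMAX":
        weights = [math.exp(-d) for d in distances]     # softmax over negative distances
    else:
        raise ValueError(f"unknown confidence function: {mode}")
    total = sum(weights)
    return [w / total for w in weights]
```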
creates a WordEmbeddingsReader based on the DB name and connection
Name of the desired database
Connection to the RocksDB
An instance of the WordEmbeddingsReader class
This cannot hold EMBEDDINGS, since otherwise the entity resolver would try to re-save and re-read the embeddings.
what distance function to use for KNN: 'EUCLIDEAN' or 'COSINE'
distance weights to apply before pooling: [WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein]
whether or not to use Jaccard token distance.
whether or not to use Jaro-Winkler character distance.
whether or not to use Levenshtein character distance.
whether or not to use Sorensen-Dice token distance.
whether or not to use TFIDF token distance.
whether or not to use WMD token distance.
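Two of the toggleable distances above can be sketched concretely: Jaccard over token sets and Levenshtein over characters. These are standard textbook formulations; the library's own implementations may normalize or weight them differently.

```python
# Illustrative implementations of two of the component distances.

def jaccard_distance(a_tokens, b_tokens):
    """Token-level Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def levenshtein(a, b):
    """Character-level edit distance via the classic two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```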
penalty for extra words in the knowledge base match during WMD calculation
Annotator reference id.
whether or not to return an empty annotation on unmatched chunks
number of neighbours to consider in the KNN query to calculate WMD
Inverse Document Frequency of the ngrams.
ngram range for subword vectorization in candidate selection
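Subword vectorization here means sliding character n-grams over the chunk text. A minimal sketch, assuming the range is inclusive on both ends (so `(2, 3)` yields bigrams and trigrams):

```python
# Hypothetical helper: extract character n-grams for an inclusive
# (min_n, max_n) range, as used for subword candidate selection.

def char_ngrams(text, ngram_range=(2, 3)):
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams
```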
pooling strategy to aggregate distances: AVERAGE or SUM
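Pooling combines the six weighted component distances (in the [WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein] order given above) into one aggregated distance. The sketch below illustrates the intent of the two strategies, not the library's exact code path:

```python
# Sketch: apply the per-distance weights, then pool with AVERAGE or SUM.

def pool_distances(distances, weights, strategy="AVERAGE"):
    weighted = [d * w for d, w in zip(distances, weights)]
    if strategy == "SUM":
        return sum(weighted)
    if strategy == "AVERAGE":
        return sum(weighted) / len(weighted)
    raise ValueError(f"unknown pooling strategy: {strategy}")
```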
Inverse Document Frequency of the term.
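For reference, inverse document frequency can be computed as below, assuming the common smoothed variant idf(t) = log(N / (1 + df(t))); the library may use a different formulation:

```python
import math

# Sketch of smoothed IDF over tokenized documents. df(t) counts the
# documents containing the term; the +1 avoids division by zero.

def idf(term, documents):
    df = sum(term in doc for doc in documents)
    return math.log(len(documents) / (1 + df))
```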
threshold value for the aggregated distance