com.johnsnowlabs.nlp.annotators.resolution
whether or not to return all distance values in the metadata. Default: False
number of results to return in the metadata after sorting by last distance calculated
what function to use to calculate confidence: INVERSE or SOFTMAX
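
The two confidence options can be illustrated with a minimal sketch. The exact formulas used by the annotator are not stated here, so both functions below are assumptions: `inverse_confidence` normalizes reciprocal distances, and `softmax_confidence` applies a softmax to negated distances. The function names are hypothetical.

```python
import math


def inverse_confidence(distances):
    """Assumed INVERSE form: normalize 1/distance so the scores sum to 1."""
    eps = 1e-9  # guards against division by zero for exact matches
    inverted = [1.0 / (d + eps) for d in distances]
    total = sum(inverted)
    return [v / total for v in inverted]


def softmax_confidence(distances):
    """Assumed SOFTMAX form: softmax over negated distances."""
    smallest = min(distances)  # shift for numerical stability
    exps = [math.exp(-(d - smallest)) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]
```

In both variants a smaller distance yields a higher confidence, and the scores of all candidates sum to 1.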
what distance function to use for KNN: 'EUCLIDEAN' or 'COSINE'
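
For reference, the two KNN distance options correspond to the standard definitions sketched below. This is a plain-Python illustration on small vectors, not the annotator's own code; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Cosine distance treats vectors pointing in the same direction as identical regardless of length, whereas Euclidean distance is sensitive to magnitude.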
distance weights to apply before pooling: [WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein]
whether or not to use Jaccard token distance. Default: True
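
As a rough illustration of a Jaccard token distance, the sketch below uses the textbook definition on token sets. How the annotator tokenizes and normalizes chunks is not described here, so the inputs are assumed to be pre-tokenized lists, and the function name is hypothetical.

```python
def jaccard_distance(tokens_a, tokens_b):
    """1 - |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # two empty chunks are treated as identical
    return 1.0 - len(a & b) / len(a | b)
```

For example, `["chest", "pain"]` and `["chest", "pain", "acute"]` share two of three distinct tokens, giving a distance of 1/3.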
whether or not to use Jaro-Winkler character distance. Default: False
whether or not to use Levenshtein character distance. Default: False
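
The Levenshtein character distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch of the standard dynamic-programming algorithm follows; it is illustrative only, and any normalization the annotator may apply is not shown.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping one DP row at a time."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```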
whether or not to use Sorensen-Dice token distance. Default: False
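
The Sorensen-Dice token distance can be sketched from its textbook definition on token sets. As with the other sketches, the inputs are assumed to be pre-tokenized lists and the function name is hypothetical.

```python
def sorensen_dice_distance(tokens_a, tokens_b):
    """1 - 2|A ∩ B| / (|A| + |B|) over the two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # two empty chunks are treated as identical
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))
```

Compared with Jaccard, this measure counts the shared tokens twice, so partially overlapping chunks receive a smaller distance.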
whether or not to use TFIDF token distance. Default: True
whether or not to use WMD (Word Mover's Distance) token distance. Default: True
penalty for extra words in the knowledge base match during WMD calculation
column name for the value we are trying to resolve
whether or not to return an empty annotation on unmatched chunks
number of neighbours to consider in the KNN query to calculate WMD
ngram range for subword vectorization in candidate selection
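
To show what an n-gram range means for subword vectorization, the sketch below extracts character n-grams for every size in an inclusive range. The inclusive `(min, max)` convention and the function name are assumptions; the annotator's actual vectorizer is not described here.

```python
def char_ngrams(text, ngram_range=(2, 3)):
    """All character n-grams of text for each n in the inclusive range."""
    low, high = ngram_range
    return [text[i:i + n]
            for n in range(low, high + 1)
            for i in range(len(text) - n + 1)]
```

With a range of `(2, 3)`, the word `pain` yields the bigrams `pa`, `ai`, `in` followed by the trigrams `pai`, `ain`.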
column name for the original, normalized description
pooling strategy to aggregate distances: AVERAGE or SUM
threshold value for the aggregated distance
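
The distance weights, pooling strategy, and threshold can be read together as a small pipeline: weight each distance, pool the weighted values, then compare the result to the threshold. The sketch below is one plausible reading under stated assumptions: AVERAGE is taken as the plain mean of the weighted distances, and a candidate is kept when its aggregated distance does not exceed the threshold. Neither detail is confirmed by this reference, and both function names are hypothetical.

```python
def pool_distances(distances, weights, strategy="AVERAGE"):
    """Apply per-metric weights, then pool with SUM or AVERAGE."""
    weighted = [d * w for d, w in zip(distances, weights)]
    total = sum(weighted)
    if strategy == "SUM":
        return total
    if strategy == "AVERAGE":
        return total / len(weighted)  # assumed: plain mean of weighted values
    raise ValueError(f"unknown pooling strategy: {strategy}")


def within_threshold(aggregated_distance, threshold):
    """Assumed rule: keep a candidate whose pooled distance is <= threshold."""
    return aggregated_distance <= threshold
```

Note that SUM grows with the number of enabled distance metrics while AVERAGE does not, so a threshold tuned for one strategy would likely need adjusting for the other.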
Returns the ChunkEntityResolverModel Transformer, which can be used to transform input datasets
The dataset provided to the fit method should have one chunk per row and contain the following columns: ChunkTokens, ChunkEmbeddings, ResolverLabel, [ResolverNormalized]
The dataset should not exceed 100,000 data points, since searching a KD-tree of that size becomes impractical
This method is called inside the AnnotatorApproach's fit method
a Dataset containing ChunkTokens, ChunkEmbeddings, ClassifierLabel, ResolverLabel, [ResolverNormalized]
a trained ChunkEntityResolverModel