com.johnsnowlabs.nlp.annotators.disambiguation
NerDisambiguator
Companion object NerDisambiguator
class NerDisambiguator extends AnnotatorApproach[NerDisambiguatorModel] with DisambiguatorModelParams
Links words of interest, such as names of persons, locations, and companies, in an input text document to the corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The annotator requires CHUNK and SENTENCE_EMBEDDINGS inputs, for example from NerConverter and SentenceEmbeddings.
Example
Extracting Person identities
First, define the pipeline stages that extract entities and embeddings. The entities are filtered to keep only those of type PER.
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.base._
import spark.implicits._ // assumes an active SparkSession named `spark`

val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
  .toDF("text")

// Turn raw text into document annotations
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Word embeddings feed both the NER model and the sentence embeddings
val word_embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols("sentence", "embeddings")
  .setOutputCol("sentence_embeddings")

val ner_model = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")

// Keep only PER chunks
val ner_converter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("ner_chunk")
  .setWhiteList("PER")
Then the extracted entities can be disambiguated.
import com.johnsnowlabs.nlp.annotators.disambiguation.NerDisambiguator
import org.apache.spark.ml.Pipeline

val disambiguator = new NerDisambiguator()
  .setS3KnowledgeBaseName("i-per")
  .setInputCols("ner_chunk", "sentence_embeddings")
  .setOutputCol("disambiguation")
  .setNumFirstChars(5)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator))

val model = nlpPipeline.fit(data)
val result = model.transform(data)
Show results
result.selectExpr("explode(disambiguation)")
  .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, false)
+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=4848272, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=55907961|
|Christina Aguilera|http://en.wikipedia.org/?curid=144171, http://en.wikipedia.org/?curid=6636454                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+
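In the output above, each result value holds all candidate KB entries for a chunk as one comma-separated string. The following is a minimal sketch of splitting such a string into individual candidates in plain Scala. The helper name splitCandidates is hypothetical and is not part of Spark NLP.

```scala
// Hypothetical helper (not a Spark NLP API): split one `result` string
// from the disambiguation output into its individual candidate URLs.
object CandidateSplit {
  def splitCandidates(result: String): Seq[String] =
    result.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val result =
      "http://en.wikipedia.org/?curid=144171, http://en.wikipedia.org/?curid=6636454"
    // Print one candidate per line
    splitCandidates(result).foreach(println)
  }
}
```

The same split could be applied per row after collecting the chunk and result columns to the driver.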
Instance Constructors
Type Members
-
type
AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
$[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
-
def
$$[T](feature: StructFeature[T]): T
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[K, V](feature: MapFeature[K, V]): Map[K, V]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: SetFeature[T]): Set[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: ArrayFeature[T]): Array[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
_fit(dataset: Dataset[_], recursiveStages: Option[PipelineModel]): NerDisambiguatorModel
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
beforeTraining(spark: SparkSession): Unit
- Definition Classes
- NerDisambiguator → AnnotatorApproach
-
final
def
checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
clear(param: Param[_]): NerDisambiguator.this.type
- Definition Classes
- Params
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
copy(extra: ParamMap): Estimator[NerDisambiguatorModel]
- Definition Classes
- AnnotatorApproach → Estimator → PipelineStage → Params
-
def
copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
val
description: String
- Definition Classes
- NerDisambiguator → AnnotatorApproach
-
val
embeddingTypeParam: Param[String]
Can be 'bow' for word embeddings or 'sentence' for sentences (Default: sentence)
- Definition Classes
- DisambiguatorModelParams
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
explainParam(param: Param[_]): String
- Definition Classes
- Params
-
def
explainParams(): String
- Definition Classes
- Params
-
final
def
extractParamMap(): ParamMap
- Definition Classes
- Params
-
final
def
extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
-
val
features: ArrayBuffer[Feature[_, _, _]]
- Definition Classes
- HasFeatures
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
fit(dataset: Dataset[_]): NerDisambiguatorModel
- Definition Classes
- AnnotatorApproach → Estimator
-
def
fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[NerDisambiguatorModel]
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], paramMap: ParamMap): NerDisambiguatorModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): NerDisambiguatorModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" ) @varargs()
-
def
get[T](feature: StructFeature[T]): Option[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: SetFeature[T]): Option[Set[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: ArrayFeature[T]): Option[Array[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getEmbeddingType: String
Can be 'bow' for word embeddings or 'sentence' for sentences (Default: sentence)
- Definition Classes
- DisambiguatorModelParams
-
def
getInputCols: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
def
getLazyAnnotator: Boolean
- Definition Classes
- CanBeLazy
-
def
getLevenshteinDistanceThresholdParam: Double
Levenshtein distance threshold to narrow results from prefix search (Default: 0.1)
- Definition Classes
- DisambiguatorModelParams
-
def
getNarrowWithApproximateMatching: Boolean
Whether to narrow prefix search results with Levenshtein-distance-based matching (Default: true)
- Definition Classes
- DisambiguatorModelParams
-
def
getNearMatchingGapParam: Int
Limits candidate string length during Levenshtein-distance-based narrowing: candidate chunks are trimmed when len(candidate) - len(entity chunk) > nearMatchingGap (Default: 4).
- Definition Classes
- DisambiguatorModelParams
-
def
getNumFirstChars: Int
Number of leading characters to use for the initial prefix search in the knowledge base
- Definition Classes
- DisambiguatorModelParams
-
final
def
getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
-
final
def
getOutputCol: String
- Definition Classes
- HasOutputAnnotationCol
-
def
getParam(paramName: String): Param[Any]
- Definition Classes
- Params
-
def
getPredictionLimit: Int
Limit N on the number of predictions returned for top-N predictions (Default: 100)
- Definition Classes
- DisambiguatorModelParams
-
def
getTokenSearch: Boolean
Whether to search the knowledge base by token (true) or by chunk (false) (Default: true)
- Definition Classes
- DisambiguatorModelParams
-
final
def
hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
-
def
hasParam(paramName: String): Boolean
- Definition Classes
- Params
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
inputAnnotatorTypes: Array[String]
Input annotator types: CHUNK, SENTENCE_EMBEDDINGS
- Definition Classes
- NerDisambiguator → HasInputAnnotationCols
-
final
val
inputCols: StringArrayParam
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isSet(param: Param[_]): Boolean
- Definition Classes
- Params
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
val
knowledgeBase: Param[String]
Knowledge base path
-
val
lazyAnnotator: BooleanParam
- Definition Classes
- CanBeLazy
-
val
levenshteinDistanceThresholdParam: DoubleParam
Levenshtein distance threshold to narrow results from prefix search (Default: 0.1)
- Definition Classes
- DisambiguatorModelParams
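To illustrate how a distance threshold can narrow prefix search results, here is a minimal plain-Scala sketch. It assumes that the threshold is compared against the edit distance divided by the length of the longer string and that matching is case-insensitive; the source does not state either, so treat both as assumptions. The names levenshtein, normalizedDistance, and narrow are hypothetical and do not come from Spark NLP.

```scala
// A sketch only: NOT the library's implementation.
// Assumption: threshold applies to (edit distance / longer string length).
object LevenshteinNarrowingSketch {
  // Classic two-row dynamic-programming Levenshtein distance
  def levenshtein(a: String, b: String): Int = {
    var previous = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val current = new Array[Int](b.length + 1)
      current(0) = i
      for (j <- 1 to b.length) {
        val substitution = previous(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        current(j) = math.min(math.min(current(j - 1) + 1, previous(j) + 1), substitution)
      }
      previous = current
    }
    previous(b.length)
  }

  // Distance scaled into [0, 1] by the longer string's length
  def normalizedDistance(a: String, b: String): Double = {
    val longest = math.max(a.length, b.length)
    if (longest == 0) 0.0 else levenshtein(a, b).toDouble / longest
  }

  // Keep only candidates whose normalized distance to the mention is within the threshold
  def narrow(mention: String, candidates: Seq[String], threshold: Double): Seq[String] =
    candidates.filter(c => normalizedDistance(mention.toLowerCase, c.toLowerCase) <= threshold)

  def main(args: Array[String]): Unit = {
    val candidates = Seq("Donald Trump", "Donald Trumpp", "Donald Duck")
    narrow("Donald Trump", candidates, 0.1).foreach(println)
  }
}
```

Under these assumptions, a lower threshold keeps only near-identical surface forms, while a higher one lets more loosely matching candidates through.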
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
msgHelper(schema: StructType): String
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
val
narrowWithApproximateMatching: BooleanParam
Whether to narrow prefix search results with Levenshtein-distance-based matching (Default: true)
- Definition Classes
- DisambiguatorModelParams
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
val
nearMatchingGapParam: IntParam
Limits candidate string length during Levenshtein-distance-based narrowing: candidate chunks are trimmed when len(candidate) - len(entity chunk) > nearMatchingGap (Default: 4).
- Definition Classes
- DisambiguatorModelParams
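The length condition in the description can be sketched in plain Scala as follows. The condition len(candidate) - len(entity chunk) > nearMatchingGap comes from the description; the choice to trim the candidate to len(entity chunk) + nearMatchingGap characters is an assumption made for illustration, and trimCandidate is a hypothetical name, not a Spark NLP function.

```scala
// A sketch only: NOT the library's implementation.
// Assumption: an over-long candidate is cut to (mention length + nearMatchingGap) characters.
object NearMatchingGapSketch {
  def trimCandidate(candidate: String, mention: String, nearMatchingGap: Int): String =
    if (candidate.length - mention.length > nearMatchingGap)
      candidate.take(mention.length + nearMatchingGap)
    else
      candidate

  def main(args: Array[String]): Unit = {
    val mention = "Donald Trump"
    // 4 characters longer than the mention: not above the gap, left unchanged
    println(trimCandidate("Donald Trump Jr.", mention, 4))
    // Much longer than the mention: trimmed before distance comparison
    println(trimCandidate("Donald Trump presidential campaign", mention, 4))
  }
}
```

Trimming in this way keeps long KB titles that begin with the mention from being penalized for their extra trailing text during distance-based narrowing.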
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
val
numFirstChars: IntParam
Number of leading characters to use for the initial prefix search in the knowledge base
- Definition Classes
- DisambiguatorModelParams
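As a rough illustration of a prefix search driven by numFirstChars, the sketch below filters an in-memory list of names by the first characters of a mention. The in-memory Seq standing in for the knowledge base, the case-insensitive comparison, and the name prefixCandidates are all assumptions for illustration; the actual KB lookup is not described in the source.

```scala
// A sketch only: NOT the library's implementation.
// Assumption: the KB is a plain list of names and matching ignores case.
object PrefixSearchSketch {
  def prefixCandidates(mention: String, kbNames: Seq[String], numFirstChars: Int): Seq[String] = {
    val prefix = mention.take(numFirstChars).toLowerCase
    kbNames.filter(_.toLowerCase.startsWith(prefix))
  }

  def main(args: Array[String]): Unit = {
    val kb = Seq("Donald Trump", "Donald Trump Jr.", "Donald Duck", "Donna Summer")
    // With numFirstChars = 5 the prefix is "donal"
    prefixCandidates("Donald Trump", kb, 5).foreach(println)
  }
}
```

A smaller numFirstChars widens the initial candidate set, leaving more work for the later narrowing steps; a larger value makes the first pass stricter.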
-
def
onTrained(model: NerDisambiguatorModel, spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
val
optionalInputAnnotatorTypes: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
val
outputAnnotatorType: AnnotatorType
Output annotator types: DISAMBIGUATION
- Definition Classes
- NerDisambiguator → HasOutputAnnotatorType
-
final
val
outputCol: Param[String]
- Attributes
- protected
- Definition Classes
- HasOutputAnnotationCol
-
lazy val
params: Array[Param[_]]
- Definition Classes
- Params
-
val
predictionsLimit: IntParam
Limit N on the number of predictions returned for top-N predictions (Default: 100)
- Definition Classes
- DisambiguatorModelParams
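A top-N limit of this kind can be pictured as ranking candidates by a score and keeping the first N. The sketch below ranks by cosine similarity between a context embedding and candidate embeddings; that scoring choice is an assumption made for illustration, since the source only states that sentence embeddings are an input. The names cosine and topN are hypothetical.

```scala
// A sketch only: NOT the library's implementation.
// Assumption: candidates are scored by cosine similarity to a context embedding.
object TopNSketch {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Sort by descending similarity and keep at most `limit` candidate ids
  def topN(context: Array[Double], candidates: Seq[(String, Array[Double])], limit: Int): Seq[String] =
    candidates
      .map { case (id, vector) => (id, cosine(context, vector)) }
      .sortBy(-_._2)
      .take(limit)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val context = Array(1.0, 0.0)
    val candidates = Seq(
      "a" -> Array(1.0, 0.0),
      "b" -> Array(0.0, 1.0),
      "c" -> Array(1.0, 1.0))
    topN(context, candidates, 2).foreach(println)
  }
}
```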
-
def
resolveStorageName(database: String): String
-
val
s3KnowledgeBaseName: Param[String]
Knowledge base name in S3
-
def
save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since( "1.6.0" ) @throws( ... )
-
def
set[T](feature: StructFeature[T], value: T): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[K, V](feature: MapFeature[K, V], value: Map[K, V]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: SetFeature[T], value: Set[T]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: ArrayFeature[T], value: Array[T]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
set(paramPair: ParamPair[_]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set(param: String, value: Any): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set[T](param: Param[T], value: T): NerDisambiguator.this.type
- Definition Classes
- Params
-
def
setDefault[T](feature: StructFeature[T], value: () ⇒ T): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
setDefault(paramPairs: ParamPair[_]*): NerDisambiguator.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
setDefault[T](param: Param[T], value: T): NerDisambiguator.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
-
def
setEmbeddingType(v: String): NerDisambiguator.this.type
Can be 'bow' for word embeddings or 'sentence' for sentences (Default: sentence)
- Definition Classes
- DisambiguatorModelParams
-
final
def
setInputCols(value: String*): NerDisambiguator.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setInputCols(value: Array[String]): NerDisambiguator.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setKnowledgeBase(path: String): NerDisambiguator.this.type
Knowledge base path
-
def
setLazyAnnotator(value: Boolean): NerDisambiguator.this.type
- Definition Classes
- CanBeLazy
-
def
setLevenshteinDistanceThresholdParam(v: Double): NerDisambiguator.this.type
Levenshtein distance threshold to narrow results from prefix search (Default: 0.1)
- Definition Classes
- DisambiguatorModelParams
-
def
setNarrowWithApproximateMatching(v: Boolean): NerDisambiguator.this.type
Whether to narrow prefix search results with Levenshtein-distance-based matching (Default: true)
- Definition Classes
- DisambiguatorModelParams
-
def
setNearMatchingGapParam(v: Int): NerDisambiguator.this.type
Limits candidate string length during Levenshtein-distance-based narrowing: candidate chunks are trimmed when len(candidate) - len(entity chunk) > nearMatchingGap (Default: 4).
- Definition Classes
- DisambiguatorModelParams
-
def
setNumFirstChars(v: Int): NerDisambiguator.this.type
Number of leading characters to use for the initial prefix search in the knowledge base
- Definition Classes
- DisambiguatorModelParams
-
final
def
setOutputCol(value: String): NerDisambiguator.this.type
- Definition Classes
- HasOutputAnnotationCol
-
def
setPredictionLimit(v: Int): NerDisambiguator.this.type
Limit N on the number of predictions returned for top-N predictions (Default: 100)
- Definition Classes
- DisambiguatorModelParams
-
def
setS3KnowledgeBaseName(path: String): NerDisambiguator.this.type
Knowledge base name in S3
-
def
setTokenSearch(v: Boolean): NerDisambiguator.this.type
Whether to search the knowledge base by token (true) or by chunk (false) (Default: true)
- Definition Classes
- DisambiguatorModelParams
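The setters listed on this page can be chained on a single instance. The following configuration fragment is a sketch that uses only setters documented here; the KB name "i-per" and the column names are taken from the example at the top of the page, and the explicit values for the tuning parameters simply restate the documented defaults. It assumes the licensed library that provides this package is on the classpath.

```scala
import com.johnsnowlabs.nlp.annotators.disambiguation.NerDisambiguator

// Configuration sketch: values for the tuning parameters repeat the documented defaults.
val disambiguator = new NerDisambiguator()
  .setS3KnowledgeBaseName("i-per")                 // KB name from the page's example
  .setInputCols("ner_chunk", "sentence_embeddings") // CHUNK and SENTENCE_EMBEDDINGS inputs
  .setOutputCol("disambiguation")
  .setEmbeddingType("sentence")                    // 'bow' or 'sentence'
  .setNumFirstChars(5)                             // value used in the page's example
  .setTokenSearch(true)
  .setNarrowWithApproximateMatching(true)
  .setLevenshteinDistanceThresholdParam(0.1)
  .setNearMatchingGapParam(4)
  .setPredictionLimit(100)
```

Because every setter returns NerDisambiguator.this.type, the calls can appear in any order.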
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
-
val
tokenSearch: BooleanParam
Whether to search the knowledge base by token (true) or by chunk (false) (Default: true)
- Definition Classes
- DisambiguatorModelParams
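The difference between searching by token and searching by chunk can be pictured as a choice of lookup keys. In the sketch below, token search yields one key per whitespace-separated token and chunk search yields the whole chunk as a single key. Whitespace tokenization and the name searchKeys are assumptions for illustration and are not taken from Spark NLP.

```scala
// A sketch only: NOT the library's implementation.
// Assumption: tokens are obtained by splitting the chunk on whitespace.
object TokenSearchSketch {
  def searchKeys(chunk: String, tokenSearch: Boolean): Seq[String] =
    if (tokenSearch) chunk.split("\\s+").filter(_.nonEmpty).toSeq
    else Seq(chunk)

  def main(args: Array[String]): Unit = {
    // One key per token
    println(searchKeys("Christina Aguilera", tokenSearch = true))
    // The full chunk as a single key
    println(searchKeys("Christina Aguilera", tokenSearch = false))
  }
}
```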
-
def
train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): NerDisambiguatorModel
- Definition Classes
- NerDisambiguator → AnnotatorApproach
-
final
def
transformSchema(schema: StructType): StructType
- Definition Classes
- AnnotatorApproach → PipelineStage
-
def
transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
-
val
uid: String
- Definition Classes
- NerDisambiguator → Identifiable
-
def
validate(schema: StructType): Boolean
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
write: MLWriter
- Definition Classes
- DefaultParamsWritable → MLWritable
Inherited from DisambiguatorModelParams
Inherited from HasFeatures
Inherited from AnnotatorApproach[NerDisambiguatorModel]
Inherited from CanBeLazy
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from HasOutputAnnotatorType
Inherited from HasOutputAnnotationCol
Inherited from HasInputAnnotationCols
Inherited from Estimator[NerDisambiguatorModel]
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Parameters
Annotator types
Required input and expected output annotator types