Named Entity Recognizer - Conditional random field based

Description

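This named entity recognizer is based on a conditional random field (CRF) algorithm. It takes the document, tokens, part-of-speech tags, and word embeddings as input and assigns an entity tag to each token.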


How to use



    # Python
    import pandas as pd
    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, PerceptronModel, WordEmbeddingsModel, NerCrfModel

    spark = sparknlp.start()

    # Each stage reads the columns produced by the stages before it.
    documentAssembler     = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer             = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    posTagger             = PerceptronModel.pretrained().setInputCols(["token", "document"]).setOutputCol("pos")
    embeds                = WordEmbeddingsModel.pretrained().setInputCols(["token", "document"]).setOutputCol("embeddings")
    nerCrf                = NerCrfModel.pretrained().setInputCols(["document", "token", "pos", "embeddings"]).setOutputCol("ner")

    pipeline = Pipeline(stages=[documentAssembler, tokenizer, posTagger, embeds, nerCrf])
    df = spark.createDataFrame(pd.DataFrame({'text': ["Donald Trump and Angela Merkel dont share many oppinions"]}))
    result = pipeline.fit(df).transform(df)
    result.select("ner.result").show(truncate=False)  # tags only
    result.select("ner").show(truncate=False)         # full annotations

    // Scala
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    import spark.implicits._  // needed for toDF

    val documentAssembler     = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val tokenizer             = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")
    val posTagger             = PerceptronModel.pretrained().setInputCols(Array("token", "document")).setOutputCol("pos")
    val embeds                = WordEmbeddingsModel.pretrained().setInputCols(Array("token", "document")).setOutputCol("embeddings")
    val nerCrf                = NerCrfModel.pretrained().setInputCols(Array("document", "token", "pos", "embeddings")).setOutputCol("ner")

    val pipeline              = new Pipeline().setStages(Array(documentAssembler, tokenizer, posTagger, embeds, nerCrf))
    val df = Seq("Donald Trump and Angela Merkel dont share many oppinions").toDF("text")
    val result = pipeline.fit(df).transform(df)
    result.select("ner.result").show(false)  // tags only
    result.select("ner").show(false)         // full annotations

    # NLU
    import nlu
    nlu.load('ner.crf').predict("Donald Trump and Angela Merkel dont share many oppinions")

Results

+-------------------------------------------+
|result                                     |
+-------------------------------------------+
|[I-PER, I-PER, O, I-PER, I-PER, O, O, O, O]|
+-------------------------------------------+




+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[named_entity, 0, 5, I-PER, [word -> Donald], []], [named_entity, 7, 11, I-PER, [word -> Trump], []], [named_entity, 13, 15, O, [word -> and], []], [named_entity, 17, 22, I-PER, [word -> Angela], []], [named_entity, 24, 29, I-PER, [word -> Merkel], []], [named_entity, 31, 34, O, [word -> dont], []], [named_entity, 36, 40, O, [word -> share], []], [named_entity, 42, 45, O, [word -> many], []], [named_entity, 47, 55, O, [word -> oppinions], []]]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
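
The model emits one tag per token, so multi-token names such as "Donald Trump" appear as runs of consecutive `I-PER` tags. The following is a minimal sketch, in plain Python and independent of Spark, of how such runs could be merged into entity mentions. The helper name `group_entities` is hypothetical and not part of Spark NLP; the tokens and tags are copied from the result shown above.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that carry the same non-O entity type.

    Hypothetical helper for illustration; not a Spark NLP function.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "I-PER" -> "PER"; a plain "O" tag has no entity type.
        ent_type = tag.split("-", 1)[1] if "-" in tag else None

        # Close the open span when the type changes or a B- tag starts a new one.
        if current_tokens and (ent_type != current_type or tag.startswith("B-")):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []

        if ent_type is not None:
            current_tokens.append(token)
        current_type = ent_type

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Tokens and tags taken from the Results section.
tokens = ["Donald", "Trump", "and", "Angela", "Merkel", "dont", "share", "many", "oppinions"]
tags = ["I-PER", "I-PER", "O", "I-PER", "I-PER", "O", "O", "O", "O"]

entities = group_entities(tokens, tags)
print(entities)
```

With the tags above, this groups the sentence into two person mentions, "Donald Trump" and "Angela Merkel". Note that with `I-` tags alone, two adjacent entities of the same type cannot be told apart and would be merged into one span.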


Model Information

Model Name: ner_crf
Compatibility: Spark NLP 3.0.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token, pos, embeddings]
Output Labels: [ner]
Language: en