Description
The Named Entity Recognition annotator trains a generic NER model using a deep learning architecture (character CNNs, BiLSTM, CRF, and word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Portuguese) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. The model is trained on a combination of custom datasets and data augmentation techniques.
Predicted Entities
CONTACT, NAME, DATE, ID, SEX, LOCATION, PROFESSION, AGE
How to use
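from johnsnowlabs import nlp, medical

# Start a Spark session with the licensed Healthcare NLP libraries loaded
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")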
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\
.setInputCols(["sentence","token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic", "pt", "clinical/models")\
.setInputCols(["sentence","token","word_embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter])
text = ['''
Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos
''']
data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "pt", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner,
ner_converter))
val text = """Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""
val df = Seq(text).toDF("text")
val results = pipeline.fit(df).transform(df)
import nlu
nlu.load("pt.med_ner.deid.generic").predict("""
Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos
""")
Results
+-----------------+---------+
|chunk |ner_label|
+-----------------+---------+
|Pedro Gonçalves |NAME |
|2569870 |ID |
|Rua Das Flores 23|LOCATION |
|Porto |LOCATION |
|21754-987 |LOCATION |
|10/10/1963 |DATE |
|53 |AGE |
|17/06/2016 |DATE |
|Maria Santos |NAME |
+-----------------+---------+
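The detected chunks are what a de-identification step would later mask or replace. As a minimal sketch of that idea, the snippet below substitutes each chunk with a placeholder built from its label. It is plain Python and does not call the model: the (chunk, label) pairs are copied from the table above, and `mask_entities` is a hypothetical helper introduced only for illustration.

```python
def mask_entities(text, entities):
    """Replace every detected chunk in `text` with a <LABEL> placeholder.

    Hypothetical helper for illustration; not part of any library.
    """
    # Process the longest chunks first, so a short chunk (e.g. "53")
    # cannot alter part of a longer chunk before that one is masked.
    for chunk, label in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        text = text.replace(chunk, f"<{label}>")
    return text


# (chunk, ner_label) pairs copied from the Results table.
entities = [
    ("Pedro Gonçalves", "NAME"),
    ("2569870", "ID"),
    ("53", "AGE"),
]

sample = "Nome do paciente: Pedro Gonçalves\nNHC: 2569870.\nIdade: 53 anos"
masked = mask_entities(sample, entities)
print(masked)
```

Plain string replacement is deliberately naive: it masks every occurrence of a chunk, including ones the model did not flag. A real pipeline would more likely rely on the begin and end character offsets carried by each annotation.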
Model Information
Model Name: | ner_deid_generic |
Compatibility: | Healthcare NLP 3.4.2+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | pt |
Size: | 15.0 MB |
References
- Custom John Snow Labs datasets
- Data augmentation techniques
Benchmarking
label tp fp fn total precision recall f1
CONTACT 191.0 2.0 2.0 193.0 0.9896 0.9896 0.9896
NAME 2640.0 82.0 52.0 2692.0 0.9699 0.9807 0.9752
DATE 1316.0 24.0 5.0 1321.0 0.9821 0.9962 0.9891
ID 54.0 3.0 9.0 63.0 0.9474 0.8571 0.9
SEX 669.0 9.0 8.0 677.0 0.9867 0.9882 0.9875
LOCATION 5784.0 149.0 206.0 5990.0 0.9749 0.9656 0.9702
PROFESSION 249.0 17.0 27.0 276.0 0.9361 0.9022 0.9188
AGE 536.0 14.0 10.0 546.0 0.9745 0.9817 0.9781
macro - - - - - - 0.9636
micro - - - - - - 0.9736
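The two summary rows can be cross-checked against the per-label counts. The sketch below recomputes the averaged F1 scores in plain Python, assuming the standard definitions: the macro average is the unweighted mean of the per-label F1 scores, and the micro average is the F1 of the pooled tp, fp, and fn counts. The helper name `prf` and the dictionary layout are illustrative choices, not part of the original benchmark tooling.

```python
# Per-label (tp, fp, fn) counts copied from the benchmark table.
counts = {
    "CONTACT": (191, 2, 2),
    "NAME": (2640, 82, 52),
    "DATE": (1316, 24, 5),
    "ID": (54, 3, 9),
    "SEX": (669, 9, 8),
    "LOCATION": (5784, 149, 206),
    "PROFESSION": (249, 17, 27),
    "AGE": (536, 14, 10),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Macro F1: unweighted mean of the per-label F1 scores.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Micro F1: pool the counts across labels, then compute a single F1.
tp, fp, fn = (sum(column) for column in zip(*counts.values()))
micro_f1 = prf(tp, fp, fn)[2]

print(f"macro F1: {macro_f1:.4f}")
print(f"micro F1: {micro_f1:.4f}")
```

Both values should land close to the reported averages. Small differences in the last decimal place can come from rounding or truncation in the published table.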