Extract textual entities in biomedical texts

Description

This model is based on the NERO corpus and extracts general biomedical entities. It was trained to refute the claims about Spark NLP’s performance made in https://www.nature.com/articles/s41540-021-00200-x and to show that Spark NLP can achieve better results than that paper reports. For that reason, this model is not meant to be used in production.

Predicted Entities

Organismpart, Chromosome, Physicalphenomenon, Abstractconcept, Gene, Meas, Machineactivity, Warfarin, Gen, Aminoacidpeptide, Language, P, Quantityormeasurement, Disease, Process, Propernamedgeographicallocation, Duration, Medicalprocedureordevice, Citation, Geographicnotproper, Atom, Gp, Medicaldevice, Namedentity, Unpropernamedgeographicallocation, Persongroup, Unit, Bodypart, Unconjugated, Timepoint, Protein, Publishedsourceofinformation, Quantity, Dr, Organism, Nonproteinornucleicacidchemical, G, Researchactivity, Drug, Measurement, Cells, Journal, Relationshipphrase, Medicalprocedure, Geographiclocation, Groupofpeople, Person, Tissue, Mentalprocess, Facility, Chemical, Geneorproteingroup, Ion, Food, Aminoacid, N, Biologicalprocess, Cell, Researchactivty, Publicationorcitation, Molecularprocess, Experimentalfactor, Medicalfinding, Nucleicacid, Laboratoryexperimentalfactor, Relationship, Geographicallocation, Geneorprotein, Smallmolecule, Partofprotein, Thing, Quantityormeasure, Environmentalfactor, Intellectualproduct, R, Molecule, Time, Anatomicalpart, Cellcomponent, Nucleicacidsubstance


How to use

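# Python
# Imports assume the open-source Spark NLP and licensed Spark NLP for Healthcare
# Python packages; `spark` is an already started session with both loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
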
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_nature_nero_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical,  clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"]))
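
// Scala
// `spark` is assumed to be an active SparkSession; the implicits are needed for .toDS below.
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
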
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_nature_nero_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU
import nlu
nlu.load("en.med_ner.nero_clinical.nature").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")

Results

|    | chunk                                        | entity                |
|---:|:---------------------------------------------|:----------------------|
|  0 | perioral cyanosis                            | Medicalfinding        |
|  1 | One day                                      | Duration              |
|  2 | mom                                          | Namedentity           |
|  3 | tactile temperature                          | Quantityormeasurement |
|  4 | patient Tylenol                              | Chemical              |
|  5 | decreased p.o. intake                        | Medicalprocedure      |
|  6 | normal breast-feeding                        | Medicalfinding        |
|  7 | 20 minutes q.2h                              | Timepoint             |
|  8 | 5 to 10 minutes                              | Duration              |
|  9 | respiratory congestion                       | Medicalfinding        |
| 10 | past 2 days                                  | Duration              |
| 11 | parents                                      | Persongroup           |
| 12 | improvement                                  | Process               |
| 13 | albuterol treatments                         | Medicalprocedure      |
| 14 | ER                                           | Bodypart              |
| 15 | urine output                                 | Quantityormeasurement |
| 16 | 8 to 10 wet and 5 dirty diapers per 24 hours | Measurement           |
| 17 | 4 wet diapers per 24 hours                   | Measurement           |
| 18 | Mom                                          | Person                |
| 19 | diarrhea                                     | Medicalfinding        |
| 20 | bowel movements                              | Biologicalprocess     |
| 21 | soft in nature                               | Biologicalprocess     |

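The table above lists merged chunks, while the Benchmarking section reports scores per B-/I- tag. As a rough illustration of how the two relate, here is a minimal sketch of merging IOB-tagged tokens into chunks. It is not Spark NLP's NerConverter implementation; the function name `iob_to_chunks` is hypothetical, and the token-level tags in the example are assumed for illustration, since this page shows only the final chunks.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity) pairs."""
    chunks = []
    current_tokens, current_entity = [], None

    def flush():
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_entity))

    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_entity):
            # A B- tag, or an I- tag that does not continue the open chunk,
            # starts a new chunk.
            flush()
            current_tokens, current_entity = [token], entity
        elif prefix == "I":
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity, so close the open chunk.
            flush()
            current_tokens, current_entity = [], None
    flush()
    return chunks


# Tokens from the example sentence; the tags are assumed, not model output.
tokens = ["negative", "for", "any", "perioral", "cyanosis", ".", "One", "day", "ago"]
tags = ["O", "O", "O", "B-Medicalfinding", "I-Medicalfinding", "O",
        "B-Duration", "I-Duration", "O"]

for chunk, entity in iob_to_chunks(tokens, tags):
    print(f"{chunk}\t{entity}")
```

Under these assumed tags the sketch yields the chunks shown in rows 0 and 1 of the table.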

Model Information

Model Name: ner_nature_nero_clinical
Compatibility: Healthcare NLP 3.3.4+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 15.1 MB

References

This model is based on https://www.nature.com/articles/s41540-021-00200-x and is a response to: https://static-content.springer.com/esm/art%3A10.1038%2Fs41540-021-00200-x/MediaObjects/41540_2021_200_MOESM1_ESM.pdf

Benchmarking

label	 tp	 fp	 fn	 prec	 rec	 f1
B-Atom	 11	 7	 48	 0.6111111	 0.18644068	 0.2857143
I-Laboratoryexperimentalfactor	 0	 3	 55	 0.0	 0.0	 0.0
B-Disease	 489	 232	 251	 0.6782247	 0.6608108	 0.6694045
B-Partofprotein	 77	 61	 67	 0.557971	 0.5347222	 0.54609925
B-Nonproteinornucleicacidchemical	 45	 58	 190	 0.4368932	 0.19148937	 0.2662722
I-Propernamedgeographicallocation	 60	 33	 31	 0.6451613	 0.6593407	 0.6521739
B-Bodypart	 648	 336	 343	 0.6585366	 0.65388495	 0.6562025
B-Protein	 832	 504	 440	 0.6227545	 0.6540881	 0.63803685
I-Unit	 0	 0	 4	 0.0	 0.0	 0.0
B-Chemical	 1390	 1066	 926	 0.5659609	 0.6001727	 0.5825649
B-Publicationorcitation	 2	 10	 9	 0.16666667	 0.18181819	 0.17391303
I-Smallmolecule	 34	 201	 353	 0.14468086	 0.087855294	 0.10932476
I-Abstractconcept	 0	 0	 4	 0.0	 0.0	 0.0
I-Nucleicacid	 449	 213	 249	 0.67824775	 0.6432665	 0.6602941
B-Drug	 306	 170	 213	 0.64285713	 0.5895954	 0.6150754
B-Thing	 0	 0	 19	 0.0	 0.0	 0.0
I-Citation	 6	 14	 17	 0.3	 0.26086956	 0.27906975
I-Aminoacid	 80	 43	 95	 0.6504065	 0.45714286	 0.53691274
B-Medicalprocedureordevice	 1	 0	 2	 1.0	 0.33333334	 0.5
I-Nucleicacidsubstance	 3	 3	 14	 0.5	 0.1764706	 0.26086956
I-Gp	 1244	 860	 427	 0.5912548	 0.7444644	 0.6590729
I-Geographicallocation	 0	 18	 17	 0.0	 0.0	 0.0
I-Molecule	 15	 114	 58	 0.11627907	 0.20547946	 0.14851485
B-R	 0	 0	 1	 0.0	 0.0	 0.0
I-Measurement	 1700	 893	 546	 0.6556113	 0.75690114	 0.7026245
I-Intellectualproduct	 197	 251	 311	 0.43973213	 0.38779527	 0.41213387
B-Anatomicalpart	 0	 22	 38	 0.0	 0.0	 0.0
B-Gp	 3597	 1426	 818	 0.71610594	 0.81472254	 0.7622378
B-Person	 105	 42	 80	 0.71428573	 0.5675676	 0.63253015
I-Aminoacidpeptide	 55	 38	 35	 0.5913978	 0.6111111	 0.6010929
B-Environmentalfactor	 24	 30	 40	 0.44444445	 0.375	 0.40677968
B-Cellcomponent	 188	 146	 191	 0.56287426	 0.49604222	 0.5273493
I-Groupofpeople	 1	 1	 10	 0.5	 0.09090909	 0.15384614
I-Chromosome	 39	 27	 12	 0.59090906	 0.7647059	 0.6666667
B-G	 0	 0	 1	 0.0	 0.0	 0.0
I-Publishedsourceofinformation	 122	 100	 131	 0.5495495	 0.48221344	 0.5136842
I-Disease	 710	 313	 239	 0.69403714	 0.74815595	 0.72008115
I-Time	 19	 32	 86	 0.37254903	 0.18095239	 0.24358974
I-Relationship	 41	 18	 33	 0.69491524	 0.5540541	 0.6165413
I-Nonproteinornucleicacidchemical	 32	 87	 257	 0.26890758	 0.11072665	 0.15686275
I-Molecularprocess	 1257	 1057	 589	 0.5432152	 0.68093175	 0.6043269
I-Persongroup	 587	 199	 233	 0.7468193	 0.71585363	 0.7310087
B-Laboratoryexperimentalfactor	 0	 2	 42	 0.0	 0.0	 0.0
I-Mentalprocess	 24	 30	 97	 0.44444445	 0.1983471	 0.2742857
B-Aminoacidpeptide	 33	 26	 44	 0.55932206	 0.42857143	 0.48529413
B-Food	 63	 29	 54	 0.6847826	 0.53846157	 0.6028708
B-Journal	 0	 0	 3	 0.0	 0.0	 0.0
I-Quantityormeasure	 0	 2	 4	 0.0	 0.0	 0.0
I-Cell	 1035	 212	 252	 0.829992	 0.8041958	 0.81689036
B-Tissue	 57	 41	 53	 0.5816327	 0.5181818	 0.5480769
I-Medicaldevice	 51	 58	 53	 0.4678899	 0.4903846	 0.47887325
B-Mentalprocess	 57	 49	 156	 0.5377358	 0.26760563	 0.35736677
I-Bodypart	 659	 406	 366	 0.61877936	 0.6429268	 0.630622
I-Researchactivity	 1073	 568	 410	 0.65386957	 0.7235334	 0.6869398
I-Atom	 11	 2	 51	 0.84615386	 0.17741935	 0.29333335
B-Namedentity	 173	 423	 504	 0.29026845	 0.25553915	 0.2717989
B-Quantityormeasure	 2	 2	 6	 0.5	 0.25	 0.33333334
B-Citation	 0	 2	 3	 0.0	 0.0	 0.0
I-Cellcomponent	 183	 166	 180	 0.5243553	 0.5041322	 0.51404494
B-Unit	 0	 0	 3	 0.0	 0.0	 0.0
I-Person	 41	 50	 33	 0.45054945	 0.5540541	 0.49696973
I-Quantityormeasurement	 202	 418	 557	 0.32580644	 0.26613966	 0.2929659
B-Organismpart	 23	 54	 41	 0.2987013	 0.359375	 0.32624114
B-Cell	 723	 206	 231	 0.7782562	 0.7578616	 0.76792353
I-Chemical	 1898	 1128	 832	 0.62723064	 0.6952381	 0.65948576
I-Medicalfinding	 1749	 1267	 1362	 0.5799072	 0.56219864	 0.5709156
B-Process	 1522	 1421	 1714	 0.51715934	 0.47033376	 0.49263635
I-Food	 75	 39	 60	 0.65789473	 0.5555556	 0.60240966
I-Duration	 344	 269	 169	 0.5611746	 0.6705653	 0.61101246
I-Experimentalfactor	 59	 173	 200	 0.25431034	 0.22779922	 0.24032587
I-Quantity	 742	 670	 750	 0.52549577	 0.49731904	 0.5110193
B-Physicalphenomenon	 1	 2	 11	 0.33333334	 0.083333336	 0.13333334
I-Medicalprocedureordevice	 3	 0	 3	 1.0	 0.5	 0.6666667
B-Aminoacid	 86	 41	 109	 0.6771653	 0.44102564	 0.5341615
B-Quantity	 613	 554	 645	 0.5252785	 0.4872814	 0.5055671
B-Cells	 0	 0	 2	 0.0	 0.0	 0.0
I-Gene	 134	 44	 95	 0.752809	 0.58515286	 0.65847665
B-Medicalfinding	 1913	 1389	 1296	 0.5793458	 0.59613585	 0.5876209
I-Tissue	 70	 58	 42	 0.546875	 0.625	 0.5833333
B-Molecule	 20	 48	 62	 0.29411766	 0.24390244	 0.26666665
I-Organism	 959	 373	 282	 0.71997	 0.7727639	 0.74543333
I-Medicalprocedure	 775	 475	 370	 0.62	 0.6768559	 0.64718163
B-Unpropernamedgeographicallocation	 7	 5	 25	 0.5833333	 0.21875	 0.3181818
I-Timepoint	 3	 9	 32	 0.25	 0.08571429	 0.12765957
I-Organismpart	 15	 47	 27	 0.24193548	 0.35714287	 0.28846154
I-Biologicalprocess	 798	 787	 835	 0.50347	 0.48867115	 0.49596024
B-Time	 47	 49	 97	 0.48958334	 0.3263889	 0.39166668
B-Experimentalfactor	 40	 100	 136	 0.2857143	 0.22727273	 0.2531646
B-Nucleicacid	 364	 234	 337	 0.6086956	 0.5192582	 0.5604311
B-Propernamedgeographicallocation	 112	 38	 22	 0.74666667	 0.8358209	 0.7887324
B-Publishedsourceofinformation	 252	 95	 125	 0.7262248	 0.66843504	 0.69613266
I-Unpropernamedgeographicallocation	 5	 4	 21	 0.5555556	 0.1923077	 0.2857143
I-Protein	 1114	 717	 449	 0.6084107	 0.7127319	 0.65645254
B-Molecularprocess	 907	 696	 609	 0.5658141	 0.59828496	 0.5815967
B-Quantityormeasurement	 171	 293	 438	 0.36853448	 0.28078818	 0.31873253
B-Intellectualproduct	 159	 175	 215	 0.4760479	 0.42513368	 0.44915253
B-Persongroup	 783	 245	 223	 0.76167315	 0.77833	 0.7699115
I-Cells	 0	 0	 2	 0.0	 0.0	 0.0
B-Researchactivity	 869	 448	 373	 0.65983295	 0.69967794	 0.67917156
I-Environmentalfactor	 20	 18	 35	 0.5263158	 0.36363637	 0.43010756
B-Gene	 64	 37	 105	 0.63366336	 0.37869823	 0.47407413
B-Groupofpeople	 3	 2	 10	 0.6	 0.23076923	 0.33333334
B-Geneorproteingroup	 241	 191	 203	 0.5578704	 0.5427928	 0.5502283
B-Facility	 119	 134	 180	 0.47035572	 0.3979933	 0.43115944
B-Timepoint	 10	 15	 30	 0.4	 0.25	 0.30769232
B-Organism	 869	 290	 297	 0.7497843	 0.745283	 0.7475268
B-Duration	 274	 186	 137	 0.59565216	 0.6666667	 0.6291619
I-Facility	 182	 212	 226	 0.46192893	 0.44607842	 0.45386532
I-Process	 589	 880	 1114	 0.40095302	 0.34586024	 0.3713745
B-Language	 0	 0	 2	 0.0	 0.0	 0.0
B-Medicaldevice	 29	 44	 56	 0.39726028	 0.34117648	 0.36708862
B-Ion	 54	 23	 37	 0.7012987	 0.5934066	 0.64285713
I-Partofprotein	 126	 131	 153	 0.49027237	 0.4516129	 0.47014928
B-Gen	 0	 0	 1	 0.0	 0.0	 0.0
B-Geneorprotein	 0	 1	 64	 0.0	 0.0	 0.0
I-Thing	 0	 0	 16	 0.0	 0.0	 0.0
I-Gen	 0	 0	 3	 0.0	 0.0	 0.0
I-Geneorproteingroup	 398	 345	 286	 0.5356662	 0.58187133	 0.5578136
B-Abstractconcept	 0	 0	 7	 0.0	 0.0	 0.0
B-Chromosome	 37	 24	 24	 0.60655737	 0.60655737	 0.60655737
B-Relationship	 133	 64	 59	 0.6751269	 0.6927083	 0.6838046
B-Smallmolecule	 19	 62	 229	 0.2345679	 0.076612905	 0.115501516
I-Physicalphenomenon	 0	 2	 7	 0.0	 0.0	 0.0
I-Ion	 102	 41	 126	 0.7132867	 0.4473684	 0.5498652
I-Drug	 173	 133	 169	 0.5653595	 0.50584793	 0.5339506
I-Anatomicalpart	 0	 63	 48	 0.0	 0.0	 0.0
B-Measurement	 639	 451	 298	 0.5862385	 0.68196374	 0.6304884
I-Publicationorcitation	 10	 46	 52	 0.17857143	 0.16129032	 0.16949153
B-Geographicallocation	 4	 16	 20	 0.2	 0.16666667	 0.18181819
I-Journal	 0	 0	 11	 0.0	 0.0	 0.0
B-Relationshipphrase	 0	 0	 1	 0.0	 0.0	 0.0
B-Nucleicacidsubstance	 2	 2	 16	 0.5	 0.11111111	 0.18181819
B-Biologicalprocess	 779	 703	 831	 0.525641	 0.48385093	 0.503881
I-Geneorprotein	 0	 2	 33	 0.0	 0.0	 0.0
B-Medicalprocedure	 820	 450	 346	 0.6456693	 0.703259	 0.6732348
I-Namedentity	 140	 447	 381	 0.23850085	 0.268714	 0.25270757
Macro-average	 41221	 28282	 28209	 0.4370514	 0.3767836	 0.40468597
Micro-average	 41221	 28282	 28209	 0.5930823	 0.5937059	 0.5933939
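
The prec, rec and f1 columns can be reproduced from the tp, fp and fn counts. The sketch below recomputes a few rows and the micro-average from the totals, using the usual definitions of precision, recall and F1. The helper names are hypothetical, and treating undefined ratios as 0.0 is an assumption that matches rows such as I-Unit. It also checks that the reported macro F1 is consistent with the harmonic mean of the macro precision and macro recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_mean(precision, recall)
    return precision, recall, f1


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if a + b else 0.0


# (tp, fp, fn) counts copied from the benchmark table.
rows = {
    "B-Disease": (489, 232, 251),
    "B-Cell": (723, 206, 231),
    "I-Unit": (0, 0, 4),
}

for label, counts in rows.items():
    p, r, f = precision_recall_f1(*counts)
    print(f"{label}\t{p:.7f}\t{r:.7f}\t{f:.7f}")

# Micro-average: apply the same formulas to the summed counts.
micro_p, micro_r, micro_f1 = precision_recall_f1(41221, 28282, 28209)
print(f"Micro-average\t{micro_p:.7f}\t{micro_r:.7f}\t{micro_f1:.7f}")

# Macro F1 recomputed from the macro precision and recall in the table.
macro_f1 = harmonic_mean(0.4370514, 0.3767836)
print(f"Macro F1 from macro prec/rec\t{macro_f1:.7f}")
```

The macro precision and recall themselves are averages over all labels, so they need the full table rather than the totals.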