NLU Version 4.2.2
- support for Medical Summarizers
New Medical Summarizers:
- ‘en.summarize.clinical_jsl’
- ‘en.summarize.clinical_jsl_augmented’
- ‘en.summarize.biomedical_pubmed’
- ‘en.summarize.generic_jsl’
- ‘en.summarize.clinical_questions’
- ‘en.summarize.radiology’
- ‘en.summarize.clinical_guidelines_large’
- ‘en.summarize.clinical_laymen’
NLU Version 4.2.1
Bugfixes for saving and reloading pipelines on Databricks.
NLU Version 4.2.0
Support for Speech2Text, Image Classification, Tabular Data and Zero-Shot NER via Wav2Vec2, TAPAS and ViT, plus 4000+ new models in 90+ languages, in John Snow Labs NLU 4.2.0
We are incredibly excited to announce that NLU 4.2.0 has been released with 4000+ new models in 90+ languages and support for 8 new Deep Learning architectures. 4 new tasks are included for the very first time: Zero-Shot NER, Automatic Speech Recognition, Image Classification and Table Question Answering, powered by Wav2Vec 2.0, HuBERT, TAPAS, ViT, Swin and Zero-Shot-NER.
Additionally, CamemBERT-based architectures are available for Sequence and Token Classification, powered by Spark NLP’s CamemBertForSequenceClassification and CamemBertForTokenClassification.
Automatic Speech Recognition (ASR)
Demo Notebook
Wav2Vec 2.0 and HuBERT enable ASR for the very first time in NLU. Wav2Vec2 is a transformer model for speech recognition that uses unsupervised pre-training on large amounts of unlabeled speech data to improve the accuracy of automatic speech recognition (ASR) systems. It is based on a self-supervised learning approach that learns to predict masked portions of the speech signal, and has shown promising results in reducing the amount of labeled training data required for ASR tasks.
These models are powered by Spark NLP’s Wav2Vec2ForCTC annotator.
HuBERT models match or surpass the SOTA approaches for speech representation learning for speech recognition, generation, and compression. The Hidden-Unit BERT (HuBERT) approach was proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
These models are powered by Spark NLP’s HubertForCTC annotator.
Usage
You just need an audio file on disk. Pass the path to it, or to a folder of audio files, to predict().
import nlu
# Let's download an audio file
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav
# Let's listen to it
from IPython.display import Audio, display
FILE_PATH = "ngm_12484_01067234848.wav"
display(Audio(filename=FILE_PATH))
# Transcribe the audio file
asr_df = nlu.load('en.speech2text.wav2vec2.v2_base_960h').predict(FILE_PATH)
asr_df
text |
---|
PEOPLE WHO DIED WHILE LIVING IN OTHER PLACES |
To test out HuBERT you just need to update the parameter for load()
asr_df = nlu.load('en.speech2text.hubert').predict('ngm_12484_01067234848.wav')
asr_df
Image Classification
For the first time ever, NLU introduces state-of-the-art image classifiers based on ViT and Swin, giving you access to hundreds of image classifiers for various domains.
Inspired by the scaling successes of Transformers in NLP, the ViT researchers applied a standard Transformer directly to images with the fewest possible modifications. Images are split into patches, and the sequence of linear embeddings of these patches is provided as input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. The image classification models are trained in a supervised fashion.
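To make the patch idea concrete, here is a minimal sketch in plain Python. It is not how Spark NLP or NLU implement ViT preprocessing; the function name `split_into_patches` and the toy 4x4 grayscale image are assumptions made only for illustration, and the linear embedding step is left out.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel values


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into non-overlapping square patches, each flattened to a vector."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row, so it plays the role of one "token"
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 toy image with pixel values 0..15
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
print(len(patches), patches[0])
```

With a 224x224 input and 16x16 patches, the same splitting would produce 14 x 14 = 196 patch tokens per image.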
You can check the article Scale Vision Transformers (ViT) Beyond Hugging Face to learn more about how ViT works and how it is implemented in Spark NLP. This is powered by Spark NLP’s VitForImageClassification annotator.
Swin is a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities make the Swin Transformer compatible with a broad range of vision tasks. This is powered by Spark NLP’s SwinForImageClassification annotator and based on Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
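The linear-complexity claim can be checked with simple arithmetic. The sketch below only counts query-key pairs (it ignores channels, heads and the shifting step), and the feature-map sizes 56 and 112 and the window size 7 are illustrative assumptions rather than values taken from the models listed here.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Number of query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Number of query-key pairs when attention is limited to non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    tokens = height * width
    # Each token only attends to the window * window tokens inside its own window
    return tokens * window * window


for side in (56, 112):
    print(side, global_attention_pairs(side, side), windowed_attention_pairs(side, side, 7))
```

Doubling the side length quadruples the number of tokens. The windowed count then grows by 4x (linear in the number of tokens), while the global count grows by 16x.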
Usage:
import os
import nlu
# Download an image
os.system('wget https://raw.githubusercontent.com/JohnSnowLabs/nlu/release/4.2.0/tests/datasets/ocr/vit/ox.jpg')
# Load ViT model and predict on image file
vit = nlu.load('en.classify_image.base_patch16_224').predict('ox.jpg')
Let’s download a folder of images and predict on it:
!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/images/images.zip
import shutil
shutil.unpack_archive("images.zip", "images", "zip")
! ls /content/images/images/
Once we have image data it’s easy to label it: we just pass the folder with images to predict(), and NLU will return a pandas DataFrame with one row per image detected.
nlu.load('en.classify_image.base_patch16_224').predict('/content/images/images')
To use Swin we just update the parameter to load():
nlu.load('en.classify_image.swin.tiny').predict('/content/images/images')
Visual Table Question Answering
TapasForQuestionAnswering can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data.
Powered by TAPAS: Weakly Supervised Table Parsing via Pre-training
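As a rough illustration of how a cell-selection result and an aggregation operator combine into an answer, consider the sketch below. It is not the Spark NLP implementation. The operator names follow the aggregation labels commonly used with WTQ-style TAPAS models (NONE, SUM, AVERAGE, COUNT), and the helpers `parse_number` and `aggregate` are hypothetical.

```python
from typing import List, Union


def parse_number(cell: str) -> float:
    """Parse a table cell such as '$100,000,000' or '75' into a float."""
    return float(cell.replace("$", "").replace(",", ""))


def aggregate(operator: str, selected_cells: List[str]) -> Union[List[str], float, int]:
    """Combine the cells picked by the cell-selection head using the predicted operator."""
    if operator == "NONE":
        # No aggregation: the selected cells themselves are the answer
        return selected_cells
    if operator == "COUNT":
        return len(selected_cells)
    numbers = [parse_number(cell) for cell in selected_cells]
    if operator == "SUM":
        return sum(numbers)
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers)
    raise ValueError(f"unknown aggregation operator: {operator}")


print(aggregate("NONE", ["Elon Musk"]))
print(aggregate("SUM", ["$100,000,000"]))
```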
Usage:
First we need a pandas DataFrame about which we want to ask questions, the so-called “context”:
import pandas as pd
context_df = pd.DataFrame({
'name':['Donald Trump','Elon Musk'],
'money': ['$100,000,000','$20,000,000,000,000'],
'married': ['yes','no'],
'age' : ['75','55'] })
context_df
Then we create an array of questions
questions = [
"Who earns less than 200,000,000?",
"Who earns more than 200,000,000?",
"Who earns 100,000,000?",
"How much money has Donald Trump?",
"Who is the youngest?",
]
questions
Now combine the data, pass it to NLU and get answers for your questions:
import nlu
# Now we combine both to a tuple and we are done! We can now pass this to the .predict() method
tapas_data = (context_df, questions)
# Let's load a TAPAS QA model and predict on (context, questions).
# It will give us an answer for every question in the questions array, based on the context in context_df
answers = nlu.load('en.answer_question.tapas.wtq.large_finetuned').predict(tapas_data)
answers
sentence | tapas_qa_UNIQUE_aggregation | tapas_qa_UNIQUE_answer | tapas_qa_UNIQUE_cell_positions | tapas_qa_UNIQUE_cell_scores | tapas_qa_UNIQUE_origin_question |
---|---|---|---|---|---|
Who earns less than 200,000,000? | NONE | Donald Trump | [0, 0] | 1 | Who earns less than 200,000,000? |
Who earns more than 200,000,000? | NONE | Elon Musk | [0, 1] | 1 | Who earns more than 200,000,000? |
Who earns 100,000,000? | NONE | Donald Trump | [0, 0] | 1 | Who earns 100,000,000? |
How much money has Donald Trump? | SUM | SUM($100,000,000) | [1, 0] | 1 | How much money has Donald Trump? |
Who is the youngest? | NONE | Elon Musk | [0, 1] | 1 | Who is the youngest? |
Zero-Shot NER
Demo Notebook
Based on John Snow Labs Enterprise-NLP ZeroShotNerModel.
This architecture is based on RoBertaForQuestionAnswering.
Zero-shot models excel at generalization, meaning that the model can accurately predict entities in very different datasets without the need to fine-tune the model or train from scratch for each different domain.
Even though a model trained to solve a specific problem can achieve better accuracy than a zero-shot model in this specific task,
it probably won’t be useful in a different task.
That is where zero-shot models show their usefulness, by being able to achieve good results in various domains.
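Because the architecture builds on question answering, one way to picture zero-shot NER is: ask each label's questions against the text and keep the best-scoring answer per label. The sketch below follows that idea under loud assumptions. `zero_shot_entities`, the `min_confidence` threshold and the keyword-based `toy_qa` stand-in are all hypothetical and do not reflect how ZeroShotNerModel merges predictions internally.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answer = Tuple[str, float]  # (answer text, confidence)
QaFn = Callable[[str, str], Optional[Answer]]


def zero_shot_entities(
    text: str,
    entity_definitions: Dict[str, List[str]],
    qa_fn: QaFn,
    min_confidence: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Ask every question of every label and keep the best answer per label."""
    entities = []
    for label, questions in entity_definitions.items():
        answers = [qa_fn(question, text) for question in questions]
        answers = [answer for answer in answers if answer is not None]
        if not answers:
            continue
        chunk, confidence = max(answers, key=lambda answer: answer[1])
        if confidence >= min_confidence:
            entities.append((chunk, label, confidence))
    return entities


def toy_qa(question: str, context: str) -> Optional[Answer]:
    """Hypothetical stand-in for a QA model: keyword rules instead of a neural network."""
    if "drug" in question.lower() and "Majezik" in context:
        return ("Majezik", 0.9)
    if "problem" in question.lower() and "headache" in context:
        return ("severe headache", 0.8)
    return None


definitions = {
    "DRUG": ["Which drug?"],
    "PROBLEM": ["What is the problem?"],
    "PATIENT_AGE": ["How old is the patient?"],
}
found = zero_shot_entities(
    "The doctor prescribed Majezik for my severe headache.", definitions, toy_qa
)
print(found)
```

Adding a new entity type then only requires adding a new label with its questions; no retraining step appears anywhere in the loop.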
Usage:
We just need to load the zero-shot NER model and configure a set of entity definitions.
import nlu
# load zero-shot ner model
enterprise_zero_shot_ner = nlu.load('en.zero_shot.ner_roberta')
# Configure entity definitions
enterprise_zero_shot_ner['zero_shot_ner'].setEntityDefinitions(
{
"PROBLEM": [
"What is the disease?",
"What is his symptom?",
"What is her disease?",
"What is his disease?",
"What is the problem?",
"What does a patient suffer",
"What was the reason that the patient is admitted to the clinic?",
],
"DRUG": [
"Which drug?",
"Which is the drug?",
"What is the drug?",
"Which drug does he use?",
"Which drug does she use?",
"Which drug do I use?",
"Which drug is prescribed for a symptom?",
],
"ADMISSION_DATE": ["When did patient admitted to a clinic?"],
"PATIENT_AGE": [
"How old is the patient?",
"What is the gae of the patient?",
],
}
)
Then we can already use this pipeline to predict labels
# Predict entities
df = enterprise_zero_shot_ner.predict(
[
"The doctor pescribed Majezik for my severe headache.",
"The patient was admitted to the hospital for his colon cancer.",
# Note the trailing space after "Dr." so the two parts join as "Dr. X"
"27 years old patient was admitted to clinic on Sep 1st by Dr. "+
"X for a right-sided pleural effusion for thoracentesis.",
]
)
df
document | entities_zero_shot | entities_zero_shot_class | entities_zero_shot_confidence | entities_zero_shot_origin_chunk | entities_zero_shot_origin_sentence |
---|---|---|---|---|---|
The doctor pescribed Majezik for my severe headache. | Majezik | DRUG | 0.646716 | 0 | 0 |
The doctor pescribed Majezik for my severe headache. | severe headache | PROBLEM | 0.552635 | 1 | 0 |
The patient was admitted to the hospital for his colon cancer. | colon cancer | PROBLEM | 0.88985 | 0 | 0 |
27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis. | 27 years old | PATIENT_AGE | 0.694308 | 0 | 0 |
27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis. | Sep 1st | ADMISSION_DATE | 0.956461 | 1 | 0 |
27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis. | a right-sided pleural effusion for thoracentesis | PROBLEM | 0.500266 | 2 | 0 |
New Notebooks
New Models Overview
Supported languages are: ab, am, ar, ba, bem, bg, bn, ca, co, cs, da, de, dv, el, en, es, et, eu, fa, fi, fon, fr, fy, ga, gam, gl, gu, ha, he, hi, hr, hu, id, ig, is, it, ja, jv, kin, kn, ko, kr, ku, ky, la, lg, lo, lt, lu, luo, lv, lwt, ml, mn, mr, ms, mt, nb, nl, no, pcm, pl, pt, ro, ru, rw, sg, si, sk, sl, sq, st, su, sv, sw, swa, ta, te, th, ti, tl, tn, tr, tt, tw, uk, unk, ur, uz, vi, wo, xx, yo, yue, zh, zu
Automatic Speech Recognition Models Overview
Image Classification Models Overview
NLU Version 4.1.0
Approximately 1000 new state-of-the-art transformer models for Question Answering (QA) for over 10 languages, up to 700% speedup on GPU, 100+ Embeddings such as Bert, Bert Sentence, CamemBert, DistilBert, Roberta, Roberta Sentence, Universal Sentence Encoder, Word, XLM Roberta, XLM Roberta Sentence, 40 sequence classification models, 400+ token classification models for over 10 languages, various Spark NLP helper methods and much more in 1 line of code with John Snow Labs NLU 4.1.0
NLU 4.1.0 Core Overview
- On the NLU core side we have over 1000 new state-of-the-art models in over 10 languages.
- Additionally, there is up to 700% speedup for transformer-based Word Embeddings on GPU and up to 97% speedup on CPU for TensorFlow operations. On top of this, we now support Apple M1-based architectures and every Pyspark 3.X version, including 3.2 and 3.3, while deprecating support for Pyspark 2.X.
- Finally, NLU-Core features various new helper methods for working with Spark NLP and now covers the entire universe of annotators defined by Spark NLP.
NLU captures every Annotator of Spark NLP
The entire universe of annotators in Spark NLP is now covered by NLU components, which internally use generalizable annotation extractor methods and configs to enable the new NLU util methods. The following annotator classes are newly captured:
- BertEmbeddings
- BertForQuestionAnswering
- BertForSequenceClassification
- BertForTokenClassification
- BertSentenceEmbeddings
- CamemBertEmbeddings
- ClassifierDLModel
- ContextSpellCheckerModel
- DistilBertEmbeddings
- DistilBertForSequenceClassification
- DistilBertForTokenClassification
- LemmatizerModel
- LongformerForTokenClassification
- NerCrfModel
- NerDLModel
- PerceptronModel
- RoBertaEmbeddings
- RoBertaForQuestionAnswering
- RoBertaForSequenceClassification
- RoBertaForTokenClassification
- RoBertaSentenceEmbeddings
- SentenceDetectorDLModel
- StopWordsCleaner
- T5Transformer
- UniversalSentenceEncoder
- WordEmbeddingsModel
- XlmRoBertaEmbeddings
- XlmRoBertaForTokenClassification
- XlmRoBertaSentenceEmbeddings
Embeddings
Embeddings provide dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. On the NLU core side we have over 150 new embedding models. We have new BertEmbeddings, BertSentenceEmbeddings, CamemBertEmbeddings, DistilBertEmbeddings, RoBertaEmbeddings, UniversalSentenceEncoder, XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings models in different languages.
- German BertEmbeddings
nlu.load("de.embed.electra.base").predict("""Ich liebe Spark NLP""")
token | word_embedding_electra |
---|---|
Ich | -0.09518987685441971, -0.016133345663547516 |
liebe | -0.07025116682052612, -0.35387516021728516 |
Spark | -0.33390265703201294, 0.08874476701021194 |
NLP | -0.2969835698604584, 0.1980721354484558 |
- English BertEmbeddings
text = ["I love NLP"]
df = nlu.load('en.embed_sentence.bert.pubmed').predict(text, output_level='token')
df
token | sentence_embedding_bert |
---|---|
I | -0.06332794576883316, -0.5097940564155579 |
love | -0.06332794576883316, -0.5097940564155579 |
NLP | -0.06332794576883316, -0.5097940564155579 |
- Japanese BertEmbeddings
nlu.load("ja.embed.bert.base").predict("""私はSpark NLPを愛しています""")
token | word_embedding_bert |
---|---|
私はSpark | 0.3989057242870331, -0.20664098858833313 |
NLPを愛しています | 0.05264343321323395, -0.19963961839675903 |
- XLM RoBerta Embeddings MultiLanguage
text = ["I love NLP", "Me encanta usar SparkNLP"]
embeddings_df = nlu.load('xx.embed.xlmr_roberta.base_v2').predict(text, output_level='sentence')
embeddings_df
sentence | word_embedding_xlmr_roberta |
---|---|
I love NLP | -0.07450243085622787, 0.022609828040003777 |
Me encanta usar SparkNLP | 0.0961054190993309, 0.03734250366687775 |
- RoBerta Embeddings English
text = ["""I love Spark NLP"""]
embeddings_df = nlu.load('en.embed.roberta').predict(text, output_level='token')
embeddings_df
token | word_embedding_roberta |
---|---|
I | -0.06406927853822708, 0.16723069548606873 |
love | -0.06369957327842712, 0.21014901995658875 |
Spark | -0.1004200279712677, 0.03312099352478981 |
NLP | -0.09467814117670059, -0.02236207202076912 |
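A common use of the embedding vectors shown above is comparing texts by cosine similarity. The following is a minimal sketch in plain Python; the function name `cosine_similarity` and the toy vectors are assumptions for illustration, and real embeddings have hundreds of dimensions rather than the truncated pairs displayed in the tables.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embedding columns
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```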
Question Answering
Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. On the NLU core side we have over 200 new question answering models.
- Bert For Question Answering
nlu.load("answer_question.bert.base_uncased.by_ksabeh").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
answer_confidence | context | question |
---|---|---|
0.3143375 | “My name is Clara and I live in Berkeley. | What is my name? |
Sequence Classification
Sequence classification is the task of predicting a class label given a sequence of observations. On the NLU core side we have over 40 new sequence classification models.
- Bert For Sequence Classification
nlu.load("classify.bert.by_mrm8488").predict("""Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days.""")
classified_sequence | classified_sequence_confidence | sentence |
---|---|---|
1 | 0.89954 | Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. |
0 | 0.93745 | Delivery within 28 days. |
- DistilBert For Sequence Classification
nlu.load("de.classify.distil_bert.base").predict("Natürlich kann ich von zuwanderern mehr erwarten. muss ich sogar. sie müssen die sprache lernen, sie müssen die gepflogenheiten lernen und sich in die gesellschaft einfügen. dass muss ich nicht weil ich mich schon in die gesellschaft eingefügt habe. egal wo du hin ziehst, nirgendwo wird dir soviel zucker in den arsch geblasen wie in deutschland.")
classified_sequence | classified_sequence_confidence | sentence |
---|---|---|
non_toxic | 0.955292 | Natürlich kann ich von zuwanderern mehr erwarten. |
non_toxic | 0.968591 | muss ich sogar. |
non_toxic | 0.841958 | sie müssen die sprache lernen, sie müssen die gepflogenheiten lernen und sich in die gesellschaft einfügen. |
non_toxic | 0.934119 | dass muss ich nicht weil ich mich schon in die gesellschaft eingefügt habe. |
non_toxic | 0.771795 | egal wo du hin ziehst, nirgendwo wird dir soviel zucker in den arsch geblasen wie in deutschland. |
- RoBerta For Sequence Classification
nlu.load("en.classify.roberta.finetuned").predict("I love you very much!")
classified_sequence | classified_sequence_confidence | sentence |
---|---|---|
LABEL_0 | 0.597792 | I love you very much! |
Lemmatizer
Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. On the NLU core side we have over 30 new lemmatizer models.
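The idea of mapping inflected forms to one dictionary form can be sketched with a lookup table. This is only a toy illustration, not a description of how the NLU lemmatizer models work internally; the `LEMMAS` dictionary and the `lemmatize` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical mini lemma dictionary: inflected form -> lemma
LEMMAS: Dict[str, str] = {
    "analysed": "analyse",
    "grouping": "group",
    "forms": "form",
    "was": "be",
    "mice": "mouse",
}


def lemmatize(tokens: List[str], lemma_dict: Dict[str, str]) -> List[str]:
    """Replace each token by its lemma when known, otherwise keep the token unchanged."""
    return [lemma_dict.get(token.lower(), token) for token in tokens]


print(lemmatize(["The", "mice", "was", "grouping"], LEMMAS))
```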
ClassifierDLModel
ClassifierDL for generic Multi-class Text Classification. ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes. On the NLU core side we have over 5 new ClassifierDLModel models.
ContextSpellCheckerModel
Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:
- Different correction candidates for each word (word level).
- The surrounding text of each word, i.e. its context (sentence level).
- The relative cost of different correction candidates according to the edit operations at the character level they require (subword level).
On the NLU core side we have over 5 new ContextSpellCheckerModel models.
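The subword-level criterion above can be illustrated with the classic Levenshtein edit distance. This sketch uses unit costs for insertions, deletions and substitutions and ignores the word-level and sentence-level signals entirely, so it is a simplification and not the ContextSpellChecker algorithm; `levenshtein` and `rank_candidates` are hypothetical helper names.

```python
from typing import List


def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(
                min(
                    previous[j] + 1,  # delete source_char
                    current[j - 1] + 1,  # insert target_char
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def rank_candidates(word: str, candidates: List[str]) -> List[str]:
    """Order correction candidates by edit distance to the misspelled word."""
    return sorted(candidates, key=lambda candidate: (levenshtein(word, candidate), candidate))


print(rank_candidates("pescribed", ["subscribed", "prescribed", "perched"]))
```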
Token Classification
Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks. We have 463 new XlmRoBertaForTokenClassification, BertForTokenClassification, DistilBertForTokenClassification, DistilBertEmbeddings, LongformerForTokenClassification and RoBertaForTokenClassification models in different languages.
- BertForTokenClassification English
nlu.load("en.ner.bc5cdr.biobert.disease").predict("I love you very much!")
index | document | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | entities_wikiner_glove_840B_300_origin_chunk | entities_wikiner_glove_840B_300_origin_sentence | word_embedding_glove |
---|---|---|---|---|---|---|---|
0 | I love you very much! | I love you very much! | MISC | 0.66433334 | 0 | 0 | [ 0.19410001 0.22603001 -0.43764001 ] |
- BertForTokenClassification German
nlu.load("de.ner.distil_bert.base_cased").predict("Ich liebe Spark NLP")
index | classified_token | document | entities_distil_bert | entities_distil_bert_class | entities_distil_bert_origin_chunk | entities_distil_bert_origin_sentence |
---|---|---|---|---|---|---|
0 | O,O,B-OTHderiv,O | Ich liebe Spark NLP | Spark | OTHderiv | 0 | 0 |
- XlmRoBertaForTokenClassification Igbo
nlu.load("ig.ner.xlmr_roberta.base").predict("Ahụrụ m n'anya na-atọ m ụtọ")
index | classified_token | document | entities_xlmr_roberta | entities_xlmr_roberta_class | entities_xlmr_roberta_origin_chunk | entities_xlmr_roberta_origin_sentence |
---|---|---|---|---|---|---|
0 | B-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG | Ahụrụ m n’anya na-atọ m ụtọ | Ahụrụ m n’anya na-atọ m ụtọ | ORG | 0 | 0 |
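The classified_token column above holds one BIO tag per token, while the entities columns hold merged chunks. A minimal sketch of that grouping step, assuming the standard B-/I-/O tagging convention, is shown below; `bio_to_entities` is a hypothetical helper, not an NLU function, and it simply drops I- tags that do not continue a matching B- tag.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (entity text, label) chunks."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any
        if current_tokens and current_label is not None:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


print(bio_to_entities(["Ich", "liebe", "Spark", "NLP"], ["O", "O", "B-OTHderiv", "O"]))
```

Applied to the German example, this yields the single chunk Spark with label OTHderiv, matching the entities_distil_bert columns in the table.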
NerCrfModel
This Named Entity Recognizer is based on a CRF Algorithm. Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering “neighbouring” samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, “linear chain” CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
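To show how a linear-chain model lets neighbouring predictions influence each other, here is a small Viterbi decoding sketch. It is an illustration of the general technique, not the NerCrfModel code, and all emission and transition scores are hand-picked hypothetical numbers.

```python
from typing import Dict, List


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][label] scores label at position t;
    transitions[prev][label] scores moving from prev to label.
    """
    labels = list(emissions[0])
    # best_score[label]: best total score of any path ending in label so far
    best_score = dict(emissions[0])
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for label in labels:
            best_prev = max(labels, key=lambda prev: best_score[prev] + transitions[prev][label])
            new_score[label] = best_score[best_prev] + transitions[best_prev][label] + scores[label]
            pointers[label] = best_prev
        best_score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda label: best_score[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the tokens "Angela", "Merkel", "spoke"
emissions = [
    {"PER": 2.0, "O": 0.5},
    {"PER": 0.9, "O": 1.0},  # on its own, this token slightly prefers "O"
    {"PER": 0.1, "O": 2.0},
]
transitions = {
    "PER": {"PER": 1.0, "O": 0.0},
    "O": {"PER": -1.0, "O": 0.5},
}
decoded = viterbi(emissions, transitions)
print(decoded)
```

A per-token classifier would label the second token O, whereas the chain-level decoding keeps it inside the PER span because of the favourable PER to PER transition.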
- NerCrfModel
nlu.load('en.ner.ner.crf').predict("Donald Trump and Angela Merkel dont share many oppinions")
index | document | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | entities_wikiner_glove_840B_300_origin_chunk | entities_wikiner_glove_840B_300_origin_sentence | word_embedding_glove |
---|---|---|---|---|---|---|---|
0 | Donald Trump and Angela Merkel dont share many oppinions | Donald Trump | PER | 0.78524995 | 0 | 0 | [-0.074014 -0.23684999 0.17772 ] |
0 | Donald Trump and Angela Merkel dont share many oppinions | Angela Merkel | PER | 0.7701 | 1 | 0 | [-0.074014 -0.23684999 0.17772 ] |
NerDLModel
This Named Entity Recognition annotator is a generic NER model based on neural networks. The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. This is the instantiated model of the NerDLApproach. For training your own model, please see the documentation of that class. We have 6 new models.
- NerDLModel Japanese
nlu.load('ja.ner.ner.base').predict("宮本茂氏は、日本の任天堂のゲームプロデューサーです。")
index | document | entities_xtreme_glove_840B_300 | word_embedding_glove |
---|---|---|---|
0 | 宮本茂氏は、日本の任天堂のゲームプロデューサーです。 | NaN | [0. 0. ] |
- NerDLModel English
text = ["My name is John!"]
nlu.load('en.ner.conll.ner.large').predict(text, output_level='token')
index | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | entities_wikiner_glove_840B_300_origin_chunk | entities_wikiner_glove_840B_300_origin_sentence | token | word_embedding_glove |
---|---|---|---|---|---|---|---|
0 | My name is John! | MISC | 0.63266003 | 0 | 0 | My | [-2.19990000e-01 2.57800013e-01 -4.25859988e-01 ] |
0 | My name is John! | MISC | 0.63266003 | 0 | 0 | name | [ 2.32309997e-01 -2.41020005e-02] |
0 | My name is John! | MISC | 0.63266003 | 0 | 0 | is | [-8.49609971e-02 5.01999974e-01 2.38230010e-03] |
0 | My name is John! | MISC | 0.63266003 | 0 | 0 | John | [-2.96090007e-01 -8.18260014e-02 9.67490021e-03 ] |
0 | My name is John! | MISC | 0.63266003 | 0 | 0 | ! | [-2.65540004e-01 3.35310012e-01 2.18600005e-01 ] |
PerceptronModel
We have 26 new models.
StopWordsCleaner
This model removes ‘stop words’ from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. We have 33 new models.
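Stop word removal itself is a simple filtering step, sketched below in plain Python. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical; the pretrained StopWordsCleaner models ship their own, much larger, language-specific lists.

```python
from typing import Iterable, List

# Hypothetical mini stop word list for illustration only
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "is", "to"})


def remove_stop_words(tokens: List[str], stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Keep only the tokens that are not in the stop word list (case-insensitive)."""
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]


print(remove_stop_words("The patient was admitted to the hospital".split()))
```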
NLU Version 4.0.0
OCR Visual Tables into Pandas DataFrames from PDF/DOC(X)/PPT files, 1000+ new state-of-the-art transformer models for Question Answering (QA) for over 30 languages, up to 700% speedup on GPU, 20 Biomedical models for over 8 languages, 50+ Terminology Code Mappers between RXNORM, NDC, UMLS, ICD10, ICDO, SNOMED and MESH, Deidentification in Romanian, various Spark NLP helper methods and much more in 1 line of code with John Snow Labs NLU 4.0.0
NLU 4.0 for OCR Overview
On the OCR side, we now support extracting tables from PDF/DOC(X)/PPT files into structured pandas dataframe, making it easier than ever before to analyze bulks of files visually!
Check out the OCR Tutorial for extracting Tables from Image/PDF/DOC(X) files to see this in action.
These models grab all table data from the files detected and return a list of Pandas DataFrames, containing one Pandas DataFrame for every table detected.
NLU Spell | Transformer Class |
---|---|
nlu.load(pdf2table) | PdfToTextTable |
nlu.load(ppt2table) | PptToTextTable |
nlu.load(doc2table) | DocToTextTable |
This is powered by the John Snow Labs Spark OCR annotators PdfToTextTable, DocToTextTable and PptToTextTable.
NLU 4.0 Core Overview
- On the NLU core side we have 1000+ new state-of-the-art models in over 30 languages for modern extractive transformer-based Question Answering problems, powered by the ALBERT/BERT/DistilBERT/DeBERTa/RoBERTa/Longformer Spark NLP annotators trained on various SQUAD-like QA datasets for domains like Twitter, Tech, News, Biomedical COVID-19 and in various model subflavors like sci_bert, electra, mini_lm, covid_bert, bio_bert, indo_bert, muril, sapbert, bioformer, link_bert, mac_bert.
- Additionally, there is up to 700% speedup for transformer-based Word Embeddings on GPU and up to 97% speedup on CPU for TensorFlow operations. On top of this, we now support Apple M1-based architectures and every Pyspark 3.X version, including 3.2 and 3.3, while deprecating support for Pyspark 2.X.
- Finally, NLU-Core features various new helper methods for working with Spark NLP and now covers the entire universe of annotators defined by Spark NLP and Spark NLP for Healthcare.
NLU 4.0 for Healthcare Overview
- On the healthcare side, NLU features 20 Biomedical models for over 8 languages (English, French, Italian, Portuguese, Romanian, Catalan and Galician) that detect entities like HUMAN and SPECIES, based on the LivingNER corpus.
- Romanian models for Deidentification and for extracting Medical entities like Measurements, Form, Symptom, Route, Procedure, Disease_Syndrome_Disorder, Score, Drug_Ingredient, Pulse, Frequency, Date, Body_Part, Drug_Brand_Name, Time, Direction, Dosage, Medical_Device, Imaging_Technique, Test, Imaging_Findings, Imaging_Test, Test_Result, Weight, Clinical_Dept and Units, with SPELL and SPELL respectively.
- English NER models for parsing entities in Clinical Trial Abstracts like Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value using en.med_ner.clinical_trials_abstracts.pipe, and also Pathogen NER models for Pathogen, MedicalCondition, Medicine with en.med_ner.pathogen and GENE_PROTEIN with en.med_ner.biomedical_bc2gm.pipeline.
- First Public Health model for Emotional Stress classification. It is a PHS-BERT-based model trained with the Dreaddit dataset, available as en.classify.stress.
- 50+ new Entity Mappers for problems like:
  - Extract section headers in scientific articles and normalize them with en.map_entity.section_headers_normalized
  - Map medical abbreviations to their definitions with en.map_entity.abbreviation_to_definition
  - Map drugs to action and treatments with en.map_entity.drug_to_action_treatment
  - Map drug brands to their National Drug Code (NDC) with en.map_entity.drug_brand_to_ndc
  - Convert between terminologies using en.<START_TERMINOLOGY>_to_<TARGET_TERMINOLOGY>
    - This works for the terminologies rxnorm, ndc, umls, icd10cm, icdo, snomed, mesh
      - snomed_to_icdo
      - snomed_to_icd10cm
      - rxnorm_to_umls
  - Powered by Spark NLP for Healthcare’s ChunkMapper annotator
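The terminology conversion template above can be filled in programmatically. The sketch below only builds the reference string from the en.<START_TERMINOLOGY>_to_<TARGET_TERMINOLOGY> pattern; `mapper_reference` is a hypothetical helper, and it is an assumption that any particular source and target pair actually exists as a model, so check the models overview before loading one.

```python
TERMINOLOGIES = ("rxnorm", "ndc", "umls", "icd10cm", "icdo", "snomed", "mesh")


def mapper_reference(source: str, target: str, lang: str = "en") -> str:
    """Build an NLU reference of the form <lang>.<source>_to_<target>."""
    for name in (source, target):
        if name not in TERMINOLOGIES:
            raise ValueError(f"unknown terminology: {name}")
    if source == target:
        raise ValueError("source and target terminology must differ")
    # Note: this only formats the string; the pair may not exist as a model
    return f"{lang}.{source}_to_{target}"


print(mapper_reference("snomed", "icdo"))
print(mapper_reference("rxnorm", "umls"))
```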
Extract Tables from PDF files as Pandas DataFrames
Sample PDF:
nlu.load('pdf2table').predict('/path/to/sample.pdf')
Output of PDF Table OCR :
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear |
---|---|---|---|---|---|---|---|---|---|
21 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 |
21 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 |
22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 |
21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 |
18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 |
13.3 | 8 | 350 | 245 | 3.73 | 3.84 | 15.41 | 0 | 0 | 3 |
19.2 | 8 | 400 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 |
27.3 | 4 | 79 | 66 | 4.08 | 1.935 | 18.9 | 1 | 1 | 4 |
26 | 4 | 120.3 | 91 | 4.43 | 2.14 | 16.7 | 0 | 1 | 5 |
30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 |
15.8 | 8 | 351 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 |
19.7 | 6 | 145 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 |
15 | 8 | 301 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 |
21.4 | 4 | 121 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 |
Extract Tables from DOC/DOCX files as Pandas DataFrames
Sample DOCX:
nlu.load('doc2table').predict('/path/to/sample.docx')
Output of DOCX Table OCR :
Screen Reader | Responses | Share |
---|---|---|
JAWS | 853 | 49% |
NVDA | 238 | 14% |
Window-Eyes | 214 | 12% |
System Access | 181 | 10% |
VoiceOver | 159 | 9% |
Extract Tables from PPT files as Pandas DataFrame
Sample PPT with two tables:
nlu.load('ppt2table').predict('/path/to/sample.ppt')
Output of PPT Table OCR :
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
and
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.7 | 3.3 | 5.7 | 2.5 | virginica |
6.7 | 3 | 5.2 | 2.3 | virginica |
6.3 | 2.5 | 5 | 1.9 | virginica |
6.5 | 3 | 5.2 | 2 | virginica |
6.2 | 3.4 | 5.4 | 2.3 | virginica |
5.9 | 3 | 5.1 | 1.8 | virginica |
Span Classifiers for question answering
Albert, Bert, DeBerta, DistilBert, LongFormer, RoBerta and XlmRoBerta based Transformer architectures are now available for question answering, with almost 1000 models available for 35 unique languages, powered by their corresponding Spark NLP XXXForQuestionAnswering annotator classes and in various tuning and dataset flavours.
<lang>.answer_question.<domain>.<datasets>.<annotator_class><tune_info>.by_<username>
If multiple datasets or tune parameters are defined , they are connected with a _
.
These substrings define up the <domain>
part of the NLU reference
- Legal cuad
- COVID 19 Biomedical biosaq
- Biomedical Literature pubmed
- Twitter tweet
- Wikipedia wiki
- News news
- Tech tech
These substrings define up the <dataset>
part of the NLU reference
- Arabic SQUAD ARCD
- Turkish TQUAD
- German GermanQuad
- Indonesian AQG
- Korean KLUE, KORQUAD
- HindiCHAI
- Multi-LingualMLQA
- Multi-Lingualtydiqa
- Multi-Lingualxquad
These substrings define up the <dataset>
part of the NLU reference
- Alternative Eval method reqa
- Synthetic Data synqa
- Benchmark / Eval Method ABSA-Bench roberta_absa
- Arabic architecture type soqaol
These substrings define the <annotator_class> substring, if it does not map to a sparknlp annotator
These substrings define the <tune_info> substring, if it does not map to a sparknlp annotator
- Train tweaks: multilingual, mini_lm, xtremedistiled, distilled, xtreme, augmented, zero_shot
- Size tweaks: xl, xxl, large, base, medium, small, tiny, cased, uncased
- Dimension tweaks: 1024d, 768d, 512d, 256d, 128d, 64d, 32d
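The naming scheme above can be sketched as a small helper. This is only an illustration, not part of NLU: the function name `build_qa_reference` and the username `some_user` are hypothetical, leaving out unset parts is an assumption, and so is the `.` between the annotator class and its tune info. A reference built this way is not guaranteed to exist as a model.

```python
def build_qa_reference(lang, annotator_class, domain=None, datasets=(),
                       tune_info=(), username=None):
    """Assemble a question answering reference string from its parts.

    Hypothetical helper. Assumption: optional parts that are not given
    are simply left out of the reference.
    """
    parts = [lang, "answer_question"]
    if domain:
        parts.append(domain)
    if datasets:
        # Multiple datasets are connected with "_"
        parts.append("_".join(datasets))
    parts.append(annotator_class)
    if tune_info:
        # Multiple tune parameters are connected with "_".
        # Assumption: tune info forms its own "."-separated segment.
        parts.append("_".join(tune_info))
    if username:
        parts.append(f"by_{username}")
    return ".".join(parts)


print(build_qa_reference("en", "deberta", datasets=["squadv2"]))
# en.answer_question.squadv2.deberta

print(build_qa_reference("en", "bert", domain="wiki",
                         datasets=["tydiqa", "xquad"],
                         tune_info=["base", "cased"],
                         username="some_user"))
# en.answer_question.wiki.tydiqa_xquad.bert.base_cased.by_some_user
```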
QA DataFormat
You need to use one of the Data formats below to pass context and question correctly to the model.
import nlu
import pandas as pd

# use ||| to separate question|||context
data = 'What is my name?|||My name is Clara and I live in Berkeley'
# pass a tuple (question,context)
data = ('What is my name?','My name is Clara and I live in Berkeley')
# use pandas Dataframe, one column = question, one column=context
data = pd.DataFrame({
'question': ['What is my name?'],
'context': ["My name is Clara and I live in Berkely"]
})
# Get your answers with any of above formats
nlu.load("en.answer_question.squadv2.deberta").predict(data)
returns :
answer | answer_confidence | context | question |
---|---|---|---|
Clara | 0.994931 | My name is Clara and I live in Berkely | What is my name? |
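To make the relationship between the string and tuple formats concrete, here is a minimal sketch that converts between them. The helper names `to_question_context` and `to_separator_string` are hypothetical and this is not how NLU parses its input internally; it only illustrates the `|||` convention described above.

```python
SEPARATOR = "|||"


def to_question_context(data):
    """Return a (question, context) tuple from a 'question|||context'
    string or from an existing 2-tuple."""
    if isinstance(data, str):
        question, sep, context = data.partition(SEPARATOR)
        if not sep:
            raise ValueError(f"expected '{SEPARATOR}' between question and context")
        return question.strip(), context.strip()
    question, context = data
    return question, context


def to_separator_string(question, context):
    """Build the 'question|||context' string form from its two parts."""
    return f"{question}{SEPARATOR}{context}"


pair = to_question_context("What is my name?|||My name is Clara and I live in Berkeley")
print(pair)
# ('What is my name?', 'My name is Clara and I live in Berkeley')
print(to_separator_string(*pair))
# What is my name?|||My name is Clara and I live in Berkeley
```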
New NLU helper Methods
You can see all features showcased in the notebook or on the new docs page for Spark NLP utils
nlu.viz(pipe,data)
Visualize input data with an already configured Spark NLP pipeline for algorithms of type NER, Assertion, Relation, Resolution, or Dependency, using Spark NLP Display.
The applicable viz type and the output columns to use for the visualization are inferred automatically.
Example:
import nlu
from sparknlp.pretrained import PretrainedPipeline

# works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline, List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu.viz(ade_pipeline, text)
returns:
If a pipeline has multiple candidate models that can be used for a viz, the first annotator that can be visualized will be used to create it.
You can specify which type of viz to create with the viz_type parameter.
Output columns to use for the viz are automatically deduced from the pipeline, by using the first annotator that provides the correct output type for a specific viz.
You can specify which columns to use for a viz with the corresponding ner_col, pos_col, dep_untyped_col, dep_typed_col, resolution_col, relation_col, and assertion_col parameters.
nlu.autocomplete_pipeline(pipe)
Auto-completes a pipeline or single annotator into a runnable pipeline by harnessing NLU’s DAG autocompletion algorithm and returns it as an NLU pipeline.
The standard Spark pipeline is available on the .vanilla_transformer_pipe attribute of the returned NLU pipe.
Every annotator and pipeline of annotators defines a DAG of tasks, with various dependencies that must be satisfied in topological order.
NLU enables the completion of an incomplete DAG by finding or creating a path between the very first input node, which is almost always DocumentAssembler/MultiDocumentAssembler, and the very last node(s), which are given by the topological sorting of the iterable annotators parameter.
Paths are created by resolving input features of annotators to the corresponding providers with matching storage references.
Example:
# Let's autocomplete the pipeline for a RelationExtractionModel, which has many input columns and sub-dependencies.
import nlu
from sparknlp_jsl.annotator import RelationExtractionModel
re_model = RelationExtractionModel.pretrained("re_ade_clinical", "en", 'clinical/models').setOutputCol('relation')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu_pipe = nlu.autocomplete_pipeline(re_model)
nlu_pipe.predict(text)
returns :
relation | relation_confidence | relation_entity1 | relation_entity2 | relation_entity2_class |
---|---|---|---|---|
1 | 1 | allergic reaction | vancomycin | Drug_Ingredient |
1 | 1 | skin | itchy | Symptom |
1 | 0.99998 | skin | sore throat/burning/itchy | Symptom |
1 | 0.956225 | skin | numbness | Symptom |
1 | 0.999092 | skin | tongue | External_body_part_or_region |
0 | 0.942927 | skin | gums | External_body_part_or_region |
1 | 0.806327 | itchy | sore throat/burning/itchy | Symptom |
1 | 0.526163 | itchy | numbness | Symptom |
1 | 0.999947 | itchy | tongue | External_body_part_or_region |
0 | 0.994618 | itchy | gums | External_body_part_or_region |
0 | 0.994162 | sore throat/burning/itchy | numbness | Symptom |
1 | 0.989304 | sore throat/burning/itchy | tongue | External_body_part_or_region |
0 | 0.999969 | sore throat/burning/itchy | gums | External_body_part_or_region |
1 | 1 | numbness | tongue | External_body_part_or_region |
1 | 1 | numbness | gums | External_body_part_or_region |
1 | 1 | tongue | gums | External_body_part_or_region |
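The idea of resolving input features to providers and running them in topological order can be illustrated with a small standalone sketch. This is not NLU's actual algorithm: the annotator names and feature names below are simplified, hypothetical stand-ins, and storage references are not modelled at all. It only shows how missing inputs can be detected and how a dependency order falls out of the input/output declarations.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified annotators:
# name -> (required input features, provided output feature)
ANNOTATORS = {
    "ner": (["sentence", "token", "embeddings"], "ner"),
    "word_embeddings": (["sentence", "token"], "embeddings"),
    "tokenizer": (["sentence"], "token"),
    "sentence_detector": (["document"], "sentence"),
    "document_assembler": ([], "document"),
}


def missing_features(annotators):
    """Features that some annotator requires but no annotator provides."""
    provided = {output for _, output in annotators.values()}
    required = {f for inputs, _ in annotators.values() for f in inputs}
    return required - provided


def execution_order(annotators):
    """Order annotators so that every input is produced before it is used."""
    provider_of = {output: name for name, (_, output) in annotators.items()}
    graph = {
        name: {provider_of[f] for f in inputs}
        for name, (inputs, _) in annotators.items()
    }
    return list(TopologicalSorter(graph).static_order())


# A single annotator on its own is an incomplete DAG
incomplete = {"ner": ANNOTATORS["ner"]}
print(sorted(missing_features(incomplete)))
# ['embeddings', 'sentence', 'token']

# Once every input has a provider, a valid execution order exists
print(execution_order(ANNOTATORS))
# ['document_assembler', 'sentence_detector', 'tokenizer', 'word_embeddings', 'ner']
```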
nlu.to_pretty_df(pipe,data)
Annotates a Pandas DataFrame, Pandas Series, NumPy array, Spark DataFrame, Python list of strings, or Python string with the given Spark NLP pipeline, which is assumed to be complete and runnable, and returns the result as a pythonic Pandas DataFrame.
Example:
import nlu
from sparknlp.pretrained import PretrainedPipeline

# works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline, List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
# output is the same as nlu.to_nlu_pipe(ade_pipeline).predict(text)
nlu.to_pretty_df(ade_pipeline,text)
returns :
assertion | asserted_entitiy | entitiy_class | assertion_confidence |
---|---|---|---|
present | allergic reaction | ADE | 0.998 |
present | itchy | ADE | 0.8414 |
present | sore throat/burning/itchy | ADE | 0.9019 |
present | numbness in tongue and gums | ADE | 0.9991 |
Annotators are grouped internally by NLU into the output levels token, sentence, document, chunk and relation.
The output columns of same-level annotators are zipped and exploded together to create the final output df.
Additionally, most keys from the metadata dictionary in the result annotations will be collected and expanded into their own columns in the resulting DataFrame, with special handling for annotators that encode multiple metadata fields inside of one, separated by strings like ||| or :::.
Some columns are omitted from metadata to reduce the total number of output columns; these can be re-enabled by setting metadata=True.
For a given pipeline, the output level is automatically set to the last annotator's output level by default.
This can be changed by calling to_pretty_df(pipe,text,output_level='my_level') with one of the levels token, sentence, document, chunk or relation.
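The "zip and explode" step for same-level columns can be pictured with plain Python. This is a simplified sketch, not NLU's implementation: the function name `explode_same_level`, the column names, and the tag values are made up for illustration.

```python
def explode_same_level(row, columns):
    """Zip the list-valued columns of one row and emit one output row
    per zipped element; all other columns are repeated unchanged."""
    values = [row[col] for col in columns]
    other = {key: value for key, value in row.items() if key not in columns}
    return [{**other, **dict(zip(columns, items))} for items in zip(*values)]


# One document-level row holding two token-level columns
row = {
    "document": "Hello world",
    "token": ["Hello", "world"],
    "pos": ["UH", "NN"],
}

for out in explode_same_level(row, ["token", "pos"]):
    print(out)
# {'document': 'Hello world', 'token': 'Hello', 'pos': 'UH'}
# {'document': 'Hello world', 'token': 'world', 'pos': 'NN'}
```

Columns from a different output level (here `document`) are left alone, which is why only annotators on the same level are zipped together.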
nlu.to_nlu_pipe(pipe)
Convert a pipeline or list of annotators into an NLU pipeline, making .predict() and .viz() available for every Spark NLP pipeline.
Assumes the pipeline is already runnable.
import nlu
from sparknlp.pretrained import PretrainedPipeline

# works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline, List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu_pipe = nlu.to_nlu_pipe(ade_pipeline)
# Same output as nlu.to_pretty_df(pipe,text)
nlu_pipe.predict(text)
# same output as nlu.viz(pipe,text)
nlu_pipe.viz(text)
# Access the auto-completed Spark NLP big data pipeline
nlu_pipe.vanilla_transformer_pipe.transform(spark_df)
returns :
assertion | asserted_entitiy | entitiy_class | assertion_confidence |
---|---|---|---|
present | allergic reaction | ADE | 0.998 |
present | itchy | ADE | 0.8414 |
present | sore throat/burning/itchy | ADE | 0.9019 |
present | numbness in tongue and gums | ADE | 0.9991 |
4 new Demo Notebooks
These notebooks showcase some of the latest classifier models for banking queries, intents in text, questions, and news classification
- Notebook for Classification of Banking Queries
- Notebook for Classification of Intent in Texts
- Notebook for classification of Similar Questions
- Notebook for Classification of Questions vs Statements
- Notebook for Classification of News into 4 classes
NLU captures every Annotator of Spark NLP and Spark NLP for healthcare
The entire universe of annotators in Spark NLP and Spark NLP for Healthcare is now wrapped by NLU components, which internally use generalizable annotation extractor methods and configs to enable the new NLU util methods. The following annotator classes are newly captured:
- AssertionFilterer
- ChunkConverter
- ChunkKeyPhraseExtraction
- ChunkSentenceSplitter
- ChunkFiltererApproach
- ChunkFilterer
- ChunkMapperApproach
- ChunkMapperFilterer
- DocumentLogRegClassifierApproach
- DocumentLogRegClassifierModel
- ContextualParserApproach
- ReIdentification
- NerDisambiguator
- NerDisambiguatorModel
- AverageEmbeddings
- EntityChunkEmbeddings
- ChunkMergeApproach
- IOBTagger
- NerChunker
- NerConverterInternalModel
- DateNormalizer
- PosologyREModel
- RENerChunksFilter
- ResolverMerger
- AnnotationMerger
- Router
- Word2VecApproach
- WordEmbeddings
- EntityRulerApproach
- EntityRulerModel
- TextMatcherModel
- BigTextMatcher
- BigTextMatcherModel
- DateMatcher
- MultiDateMatcher
- RegexMatcher
- TextMatcher
- NerApproach
- NerCrfApproach
- NerOverwriter
- DependencyParserApproach
- TypedDependencyParserApproach
- SentenceDetectorDLApproach
- SentimentDetector
- ViveknSentimentApproach
- ContextSpellCheckerApproach
- NorvigSweetingApproach
- SymmetricDeleteApproach
- ChunkTokenizer
- ChunkTokenizerModel
- RecursiveTokenizer
- RecursiveTokenizerModel
- Token2Chunk
- WordSegmenterApproach
- GraphExtraction
- Lemmatizer
- Normalizer
All NLU 4.0 for Healthcare Models
Some examples:
en.rxnorm.umls.mapping
Code:
nlu.load('en.rxnorm.umls.mapping').predict('1161611 315677')
mapped_entity_umls_code_origin_entity | mapped_entity_umls_code |
---|---|
1161611 | C3215948 |
315677 | C0984912 |
en.ner.clinical_trials_abstracts
Code:
nlu.load('en.ner.clinical_trials_abstracts').predict('A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes.')
Results:
| entities_clinical_trials_abstracts | entities_clinical_trials_abstracts_class | entities_clinical_trials_abstracts_confidence |
---|---|---|---|
0 | randomised | CTDesign | 0.9996 |
0 | multicentre | CTDesign | 0.9998 |
0 | insulin glargine | Drug | 0.99135 |
0 | NPH insulin | Drug | 0.96875 |
0 | type 2 diabetes | DisorderOrSyndrome | 0.999933 |
Code:
nlu.load('en.ner.clinical_trials_abstracts').viz('A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes.')
Results:
en.med_ner.pathogen
Code:
nlu.load('en.med_ner.pathogen').predict('Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.')
Results:
| entities_pathogen | entities_pathogen_class | entities_pathogen_confidence |
---|---|---|---|
0 | Racecadotril | Medicine | 0.9468 |
0 | loperamide | Medicine | 0.9987 |
0 | Diarrhea | MedicalCondition | 0.9848 |
0 | dehydration | MedicalCondition | 0.6307 |
0 | rabies virus | Pathogen | 0.95685 |
0 | Lyssavirus | Pathogen | 0.9694 |
0 | Ephemerovirus | Pathogen | 0.6917 |
Code:
nlu.load('en.med_ner.pathogen').viz('Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.')
Results:
es.med_ner.living_species.roberta
Code:
nlu.load('es.med_ner.living_species.roberta').predict('Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.')
Results:
| entities_living_species | entities_living_species_class | entities_living_species_confidence |
---|---|---|---|
0 | Lactante varón | HUMAN | 0.93175 |
0 | familiares | HUMAN | 1 |
0 | personales | HUMAN | 1 |
0 | neonatal | HUMAN | 0.9997 |
0 | legumbres | SPECIES | 0.9962 |
0 | lentejas | SPECIES | 0.9988 |
0 | garbanzos | SPECIES | 0.9901 |
0 | legumbres | SPECIES | 0.9976 |
0 | madre | HUMAN | 1 |
0 | Cacahuete | SPECIES | 0.998 |
0 | padres | HUMAN | 1 |
Code:
nlu.load('es.med_ner.living_species.roberta').viz('Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.')
Results:
All healthcare models added in NLU 4.0 :
All NLU 4.0 Core Models
All core models added in NLU 4.0 can be found on the NLU website; the full list is omitted here because of GitHub limitations
Minor Improvements
- IOB schema detection for token classifiers, adding NER conversion in those cases
- Tweaks in column name generation of most annotators
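As a rough picture of what IOB schema detection and the follow-up NER conversion involve, here is a minimal sketch. It is not NLU's internal logic: the helper names `uses_iob_schema` and `iob_to_chunks` are hypothetical, and the labels are invented for the example.

```python
def uses_iob_schema(labels):
    """True if all non-'O' labels start with 'B-' or 'I-' and at least
    one such label exists."""
    tagged = [label for label in labels if label != "O"]
    return bool(tagged) and all(l.startswith(("B-", "I-")) for l in tagged)


def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk text, entity class) tuples."""
    chunks, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # A new entity starts: close the running chunk first
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [token], entity
        elif prefix == "I":
            current.append(token)
        else:
            # "O" tag: close the running chunk, if any
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        chunks.append((" ".join(current), current_type))
    return chunks


tokens = ["NPH", "insulin", "in", "type", "2", "diabetes"]
tags = ["B-Drug", "I-Drug", "O", "B-Disorder", "I-Disorder", "I-Disorder"]

if uses_iob_schema(set(tags)):
    print(iob_to_chunks(tokens, tags))
# [('NPH insulin', 'Drug'), ('type 2 diabetes', 'Disorder')]
```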
Bug Fixes
- fixed bug in multi lang parsing
- fixed bug for Normalizers
- fixed bug in fetching metadata for resolvers
- fixed bug in deducing output level and inferring output columns
- fixed broken nlp_refs
NLU Version 3.4.4
600 new models with over 75 new languages including Ancient, Dead and Extinct languages, 155 languages total covered, 400% Tokenizer Speedup, 18x USE-Embeddings GPU speedup in John Snow Labs NLU 3.4.4
We are very excited to announce NLU 3.4.4 has been released with over 600 new models, over 75 new languages and 155 languages covered in total,
400% speedup for tokenizers and 18x speedup of UniversalSentenceEncoder on GPU.
On the general NLP side we have transformer-based Embeddings and Token Classifiers powered by state-of-the-art CamemBertEmbeddings and DeBertaForTokenClassification based architectures, as well as various new models for Historical, Ancient, Dead, Extinct, Genetic and Constructed languages like Old Church Slavonic, Latin, Sanskrit, Esperanto, Volapük, Coptic, Nahuatl, Ancient Greek (to 1453), and Old Russian.
On the healthcare side we have Portuguese De-identification models, NER models for gene detection, and finally an RxNorm sentence resolution model for mapping and extracting pharmaceutical actions (e.g. analgesic, hypoglycemic) as well as treatments (e.g. backache, diabetes).
General NLP Models
All general NLP models
First time language models covered
The languages for these models are covered for the very first time ever by NLU.
Number | Language Name(s) | NLU Reference | Spark NLP Reference | Task | Annotator Class | ISO-639-1 | ISO-639-2/639-5 | ISO-639-3 | Scope | Language Type |
---|---|---|---|---|---|---|---|---|---|---|
0 | Sanskrit | sa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | sa | san | san | Individual | Ancient |
1 | Sanskrit | sa.lemma | lemma_vedic | Lemmatization | LemmatizerModel | sa | san | san | Individual | Ancient |
2 | Sanskrit | sa.pos | pos_vedic | Part of Speech Tagging | PerceptronModel | sa | san | san | Individual | Ancient |
3 | Sanskrit | sa.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | sa | san | san | Individual | Ancient |
4 | Volapük | vo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | vo | vol | vol | Individual | Constructed |
5 | Nahuatl languages | nah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nah | nan | Collective | Genetic |
6 | Aragonese | an.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | an | arg | arg | Individual | Living |
7 | Assamese | as.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | as | asm | asm | Individual | Living |
8 | Asturian, Asturleonese, Bable, Leonese | ast.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | ast | ast | Individual | Living |
9 | Bashkir | ba.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | ba | bak | bak | Individual | Living |
10 | Bavarian | bar.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | bar | Individual | Living |
11 | Bishnupriya | bpy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | bpy | Individual | Living |
12 | Burmese | my.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | my | 639-2/T: mya, 639-2/B: bur | mya | Individual | Living |
13 | Cebuano | ceb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | ceb | ceb | Individual | Living |
14 | Central Bikol | bcl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | bcl | Individual | Living |
15 | Chechen | ce.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | ce | che | che | Individual | Living |
16 | Chuvash | cv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | cv | chv | chv | Individual | Living |
17 | Corsican | co.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | co | cos | cos | Individual | Living |
18 | Dhivehi, Divehi, Maldivian | dv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | dv | div | div | Individual | Living |
19 | Egyptian Arabic | arz.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | arz | Individual | Living |
20 | Emiliano-Romagnolo | eml.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | eml | nan | nan | Individual | Living |
21 | Erzya | myv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | myv | myv | Individual | Living |
22 | Georgian | ka.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | ka | 639-2/T: kat, 639-2/B: geo | kat | Individual | Living |
23 | Goan Konkani | gom.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | gom | Individual | Living |
24 | Javanese | jv.embed.distilbert | distilbert_embeddings_javanese_distilbert_small | Embeddings | DistilBertEmbeddings | jv | jav | jav | Individual | Living |
25 | Javanese | jv.embed.javanese_distilbert_small_imdb | distilbert_embeddings_javanese_distilbert_small_imdb | Embeddings | DistilBertEmbeddings | jv | jav | jav | Individual | Living |
26 | Javanese | jv.embed.javanese_roberta_small | roberta_embeddings_javanese_roberta_small | Embeddings | RoBertaEmbeddings | jv | jav | jav | Individual | Living |
27 | Javanese | jv.embed.javanese_roberta_small_imdb | roberta_embeddings_javanese_roberta_small_imdb | Embeddings | RoBertaEmbeddings | jv | jav | jav | Individual | Living |
28 | Javanese | jv.embed.javanese_bert_small_imdb | bert_embeddings_javanese_bert_small_imdb | Embeddings | BertEmbeddings | jv | jav | jav | Individual | Living |
29 | Javanese | jv.embed.javanese_bert_small | bert_embeddings_javanese_bert_small | Embeddings | BertEmbeddings | jv | jav | jav | Individual | Living |
30 | Kirghiz, Kyrgyz | ky.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | ky | kir | kir | Individual | Living |
31 | Letzeburgesch, Luxembourgish | lb.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | lb | ltz | ltz | Individual | Living |
32 | Letzeburgesch, Luxembourgish | lb.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | lb | ltz | ltz | Individual | Living |
33 | Letzeburgesch, Luxembourgish | lb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | lb | ltz | ltz | Individual | Living |
34 | Ligurian | lij.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | nan | nan | lij | Individual | Living |
35 | Lombard | lmo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | lmo | Individual | Living |
36 | Low German, Low Saxon | nds.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nds | nds | Individual | Living |
37 | Macedonian | mk.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | mk | 639-2/T: mkd, 639-2/B: mac | mkd | Individual | Living |
38 | Macedonian | mk.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | mk | 639-2/T: mkd, 639-2/B: mac | mkd | Individual | Living |
39 | Macedonian | mk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | mk | 639-2/T: mkd, 639-2/B: mac | mkd | Individual | Living |
40 | Maithili | mai.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | mai | mai | Individual | Living |
41 | Manx | gv.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | gv | glv | glv | Individual | Living |
42 | Mazanderani | mzn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | mzn | Individual | Living |
43 | Minangkabau | min.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | min | min | Individual | Living |
44 | Mingrelian | xmf.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | xmf | Individual | Living |
45 | Mirandese | mwl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | mwl | mwl | Individual | Living |
46 | Neapolitan | nap.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nap | nap | Individual | Living |
47 | Nepal Bhasa, Newari | new.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | new | new | Individual | Living |
48 | Northern Frisian | frr.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | frr | frr | Individual | Living |
49 | Northern Sami | sme.lemma | lemma_giella | Lemmatization | LemmatizerModel | se | sme | sme | Individual | Living |
50 | Northern Sami | sme.pos | pos_giella | Part of Speech Tagging | PerceptronModel | se | sme | sme | Individual | Living |
51 | Northern Sotho, Pedi, Sepedi | nso.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nso | nso | Individual | Living |
52 | Occitan (post 1500) | oc.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | oc | oci | oci | Individual | Living |
53 | Ossetian, Ossetic | os.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | os | oss | oss | Individual | Living |
54 | Pfaelzisch | pfl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | pfl | Individual | Living |
55 | Piemontese | pms.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | pms | Individual | Living |
56 | Romansh | rm.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | rm | roh | roh | Individual | Living |
57 | Scots | sco.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | sco | sco | Individual | Living |
58 | Sicilian | scn.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | scn | scn | Individual | Living |
59 | Sinhala, Sinhalese | si.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | si | sin | sin | Individual | Living |
60 | Sinhala, Sinhalese | si.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | si | sin | sin | Individual | Living |
61 | Sundanese | su.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | su | sun | sun | Individual | Living |
62 | Sundanese | su.embed.sundanese_roberta_base | roberta_embeddings_sundanese_roberta_base | Embeddings | RoBertaEmbeddings | su | sun | sun | Individual | Living |
63 | Tagalog | tl.lemma | lemma_spacylookup | Lemmatization | LemmatizerModel | tl | tgl | tgl | Individual | Living |
64 | Tagalog | tl.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | tl | tgl | tgl | Individual | Living |
65 | Tagalog | tl.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | tl | tgl | tgl | Individual | Living |
66 | Tagalog | tl.embed.roberta_tagalog_large | roberta_embeddings_roberta_tagalog_large | Embeddings | RoBertaEmbeddings | tl | tgl | tgl | Individual | Living |
67 | Tagalog | tl.embed.roberta_tagalog_base | roberta_embeddings_roberta_tagalog_base | Embeddings | RoBertaEmbeddings | tl | tgl | tgl | Individual | Living |
68 | Tajik | tg.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | tg | tgk | tgk | Individual | Living |
69 | Tatar | tt.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | tt | tat | tat | Individual | Living |
70 | Tatar | tt.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | tt | tat | tat | Individual | Living |
71 | Tigrinya | ti.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | ti | tir | tir | Individual | Living |
72 | Tosk Albanian | als.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | als | Individual | Living |
73 | Tswana | tn.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | tn | tsn | tsn | Individual | Living |
74 | Turkmen | tk.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | tk | tuk | tuk | Individual | Living |
75 | Upper Sorbian | hsb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | hsb | hsb | Individual | Living |
76 | Venetian | vec.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | vec | Individual | Living |
77 | Vlaams | vls.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | vls | Individual | Living |
78 | Walloon | wa.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | wa | wln | wln | Individual | Living |
79 | Waray (Philippines) | war.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | war | war | Individual | Living |
80 | Western Armenian | hyw.pos | pos_armtdp | Part of Speech Tagging | PerceptronModel | nan | nan | hyw | Individual | Living |
81 | Western Armenian | hyw.lemma | lemma_armtdp | Lemmatization | LemmatizerModel | nan | nan | hyw | Individual | Living |
82 | Western Frisian | fy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | fy | fry | fry | Individual | Living |
83 | Western Panjabi | pnb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | pnb | Individual | Living |
84 | Yakut | sah.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | sah | sah | Individual | Living |
85 | Zeeuws | zea.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | nan | nan | zea | Individual | Living |
86 | Albanian | sq.stopwords | stopwords_iso | Stop Words Removal | StopWordsCleaner | sq | 639-2/T: sqi, 639-2/B: alb | sqi | Macrolanguage | Living |
87 | Albanian | sq.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel | sq | 639-2/T: sqi, 639-2/B: alb | sqi | Macrolanguage | Living |
88 | Azerbaijani |