BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018.
The weights of this model are those released by the original BERT authors. This model has been pre-trained for Chinese on Wikipedia. For training, random input masking has been applied independently to word pieces (as in the original BERT paper).
## How to use
```python
embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```
```scala
val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
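The snippets above reference `document_assembler`, `sentence_detector`, and `tokenizer` without defining them. Below is a minimal end-to-end sketch in Python showing one way to build those upstream stages, assuming a Spark session started via `sparknlp.start()`; the sample sentence is illustrative only.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Raw text -> document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Document -> sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Sentences -> tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Tokens + sentences -> one BERT vector per token (768 dimensions for BERT base)
embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

# Illustrative sample text, not from the model's training data
data = spark.createDataFrame([["你好,世界!自然语言处理很有趣。"]]).toDF("text")
result = pipeline.fit(data).transform(data)

# One row per token: the token text and its embedding vector
result.selectExpr("explode(embeddings) as emb") \
      .selectExpr("emb.result as token", "emb.embeddings as vector") \
      .show(truncate=80)
```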
|Compatibility:|Spark NLP 3.1.0+|
|Input Labels:|[sentence, token]|
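For quick, single-document inference without building a DataFrame for every input, the fitted pipeline can also be wrapped in Spark NLP's `LightPipeline`. A short sketch, reusing the `pipeline` and `data` objects from the Python example above:

```python
from sparknlp.base import LightPipeline

# Wrap the fitted PipelineModel for in-memory annotation of plain strings
light = LightPipeline(pipeline.fit(data))

# fullAnnotate returns one dict of annotations per input string
annotations = light.fullAnnotate("自然语言处理很有趣。")
for ann in annotations[0]["embeddings"]:
    print(ann.result, len(ann.embeddings))
```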