Chinese BERT Base

Description

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018.

The weights of this model are those released by the original BERT authors. The model was pre-trained on Chinese Wikipedia text; during training, random input masking was applied independently to word pieces, as in the original BERT paper.

How to use

from sparknlp.annotator import BertEmbeddings
from pyspark.ml import Pipeline

# Upstream stages (document_assembler, sentence_detector, tokenizer) are defined in the full sketch below
embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

// Upstream stages (document_assembler, sentence_detector, tokenizer) are assumed to be defined as in the sketch below
val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
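
The snippets above reference document_assembler, sentence_detector, and tokenizer without defining them. The following is a minimal end-to-end sketch in Python showing one way to set up those upstream stages and run the pipeline; the sample sentence and DataFrame are illustrative and not part of the original card.

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Raw text -> document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Document -> sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Sentences -> tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Token-level BERT embeddings from the pretrained Chinese model
embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

# Illustrative input; any single-column "text" DataFrame works
data = spark.createDataFrame([["你好,世界。"]]).toDF("text")
result = nlp_pipeline.fit(data).transform(data)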

Model Information

Model Name: bert_base_chinese
Compatibility: Spark NLP 3.1.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: zh
Case sensitive: true
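
Each value in the embeddings output column listed above is an array of token-level annotations, with the vector stored in the annotation's embeddings field. A small inspection sketch, assuming the result DataFrame produced by the pipeline sketch in the How to use section:

# Explode the token-level annotations and pair each token with its 768-dimensional vector
result.selectExpr("explode(embeddings) AS emb") \
      .selectExpr("emb.result AS token", "emb.embeddings AS vector") \
      .show(truncate=80)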

Data Source

https://huggingface.co/bert-base-chinese