Legal Embeddings BGE Base

Description

This model is a legal-domain version of the BGE base embedding model, fine-tuned on in-house curated datasets. Reference: Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). C-Pack: Packaged Resources to Advance General Chinese Embedding. arXiv preprint arXiv:2309.07597.

How to use

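from johnsnowlabs import nlp

# Start (or attach to) a Spark session with the licensed libraries loaded
spark = nlp.start()

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Load the pretrained legal BGE embeddings; the output holds one vector per token
BGE_loaded = nlp.BertEmbeddings.pretrained("legembeddings_bge_base", "en", "legal/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("BGE")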

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        BGE_loaded
    ])

data = spark.createDataFrame([['''Receiving Party shall not use any Confidential Information for any purpose other than the purposes stated in Agreement.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)
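# Show one row per token embedding, matching the layout of the Results table
result.selectExpr("explode(BGE.embeddings) AS embeddings").show(truncate=100)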

Results

+----------------------------------------------------------------------------------------------------+
|                                                                                          embeddings|
+----------------------------------------------------------------------------------------------------+
|[-0.060075462, -0.26741037, 0.32553613, 0.13449538, 0.22019976, -0.35624868, 1.1038424, 0.8212698...|
|[-0.10228735, -0.3738884, 0.27723783, 0.17312518, 0.26656383, -0.24942908, 1.1518378, 0.7217457, ...|
|[-0.38215938, -0.5851373, 0.35209915, -0.30132422, -0.9744857, 0.5976255, 0.86980593, 0.5825193, ...|
|[-0.8023102, -0.1705234, 0.4355616, -0.16370925, -0.99943596, -0.13651904, 1.0603938, 0.76027215,...|
|[0.17291568, -0.74328834, 0.43998405, -0.1694346, -0.7754292, -0.025751337, 1.1425712, 0.43741557...|
|[-0.27675575, -0.17631046, 0.09160468, -0.22860324, -0.6295841, -0.11335259, 1.0146872, 0.6610859...|
|[-0.11538671, -0.31234437, 0.21929267, 0.10618421, 0.2265009, -0.37587893, 1.1389759, 0.7971325, ...|
|[0.009457495, -0.33288023, 0.2432522, 0.12458266, 0.2707794, -0.36873063, 1.0906105, 0.70786965, ...|
|[-0.295701, -0.61499435, 0.07829141, -0.74933016, -0.531358, -0.18479005, 1.1679127, 0.5615579, 0...|
|[-0.67664135, 0.12311895, 0.08994642, -0.07882077, -0.6767479, -0.16962644, 1.0955209, 0.6912421,...|
|[-0.33884412, -0.26324403, -0.03943791, 0.12610006, -0.6458304, -0.3981361, 0.6717623, 0.5545144,...|
|[-0.84253764, -0.18777902, -0.0011436939, -0.29669517, -0.008230045, -0.19728595, 0.9491053, 0.67...|
|[-0.70816183, -0.22422114, -0.07173601, -0.18688664, -0.1930152, -0.30726036, 0.8886021, 0.789013...|
|[-0.18011564, 0.055544622, 0.061416026, -0.110076465, -0.028466597, -0.27377772, 0.98722064, 0.91...|
|[-0.4780874, -0.28484517, -0.105963364, 0.060177833, -0.75987476, -0.36107045, 0.6527582, 0.53413...|
|[-0.39539725, -0.6021485, -0.018175352, -0.12834826, -0.71462053, -0.17749298, 0.8468195, 0.59975...|
|[-0.095429584, -0.8838102, 0.5930538, -0.33268213, 0.010708451, 0.06336981, 1.2200518, 0.9934566,...|
|[0.06960945, -0.17862234, 0.36319345, 0.28421152, 0.22127056, -0.4145783, 1.0451053, 1.0578575, 0...|
|[-0.07706641, -0.09056446, 0.47557953, -0.14709732, 0.37253422, -0.39098266, 1.2081625, 1.2230319...|
+----------------------------------------------------------------------------------------------------+
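The table above has one vector per token (19 rows for the 19 tokens of the example clause). To compare whole clauses, the token vectors are often reduced to a single vector and scored with cosine similarity. The following is a minimal sketch of that idea in plain Python. Mean pooling is an assumption made here for illustration, not something this model card prescribes, and the helper names `mean_pool` and `cosine_similarity` are hypothetical. The toy 3-dimensional vectors stand in for real embeddings.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the token embeddings of two clauses
clause_a = mean_pool([[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]])
clause_b = mean_pool([[4.0, 0.0, 2.0]])
print(cosine_similarity(clause_a, clause_b))
```

With real pipeline output, the input to `mean_pool` would be the list of per-token vectors collected from the `embeddings` field of the output annotations for one document.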

Model Information

Model Name:	legembeddings_bge_base
Compatibility:	Legal NLP 1.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[bert]
Language:	en
Size:	1.2 GB
Case sensitive:	true

References

In-house curated legal datasets.
