Benchmarks

 

Cluster Speed Benchmarks

NER (BiLSTM-CNN-Char Architecture) Benchmark Experiment

  • Dataset: 1000 Clinical Texts from MTSamples Oncology Dataset, approx. 500 tokens per text.
  • Driver : Standard_D4s_v3 - 16 GB Memory - 4 Cores
  • Enable Autoscaling : False
  • Cluster Mode : Standard
  • Worker :
    • Standard_D4s_v3 - 16 GB Memory - 4 Cores
    • Standard_D4s_v2 - 28 GB Memory - 8 Cores
  • Versions:
    • Databricks Runtime Version : 8.3 (Scala 2.12, Spark 3.1.1)
    • spark-nlp Version: v5.4.1
    • spark-nlp-jsl Version : v5.4.1
    • Spark Version : v3.1.1
  • Spark NLP Pipeline:
# NER Pipeline
nlpPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,  
      embeddings_clinical,  
      clinical_ner,  
      ner_converter
      ])

# Multi (2) NER Pipeline
nlpPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,  
      embeddings_clinical,  
      clinical_ner,  
      ner_converter,
      clinical_ner,  
      ner_converter,
      ])

# Multi (4) NER Pipeline
nlpPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,  
      embeddings_clinical,  
      clinical_ner,  
      ner_converter,
      clinical_ner,  
      ner_converter,
      clinical_ner,  
      ner_converter,
      clinical_ner,  
      ner_converter
      ])

# NER & RE Pipeline
nlpPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,  
      embeddings_clinical,  
      clinical_ner,  
      ner_converter,
      pos_tagger,
      dependency_parser,
      re_model
      ])
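
Each configuration below was timed by fitting the pipeline, transforming the input, and writing the output; a minimal sketch under assumed names (df is the input data frame, the path is a placeholder):

model = nlpPipeline.fit(df)
result = model.transform(df)

# The partition count was varied per run (4, 8, 16, ..., 1000).
result.repartition(32).write.mode("overwrite").parquet("/benchmarks/ner_results_parquet")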

NOTES:

  • In the first experiment, run on 5 different cluster configurations, the ner_chunk column of the Spark NLP pipeline (ner_converter) output data frame was exploded (lazy evaluation) into ner_chunk and ner_label columns; the results were then written in parquet and delta formats.

  • In the second experiment, run on 2 different cluster configurations, the Spark NLP pipeline output data frame (except the word_embeddings column) was written in parquet and delta formats.

  • In the first experiment, with the most basic driver node and worker configuration (1 worker x 4 cores), writing 4-partition data took 4.64 mins in parquet format and 4.53 mins in delta format.

  • With the basic driver node and 8 workers (x 8 cores), writing 1000-partition data took 40 seconds in parquet format and 22 seconds in delta format.

  • In the second experiment, with the basic driver node and 4 workers (x 4 cores), writing 16-partition (exploded) results took 1.41 mins in parquet format and 1.42 mins in delta format. Without explode, writing the data frame took 1.08 mins in parquet and 1.12 mins in delta format.

  • Although these durations depend heavily on the driver and worker node configurations as well as the number of partitions, the results show that the explode step increases duration by roughly 10-30% on the chosen configurations (a sketch of the explode step follows below).
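
A minimal sketch of the explode step described above (result is assumed to be the transformed output data frame, and the paths are placeholders):

from pyspark.sql import functions as F

# Explode the ner_chunk annotations into one row per chunk,
# keeping the chunk text and its NER label as separate columns.
exploded = result.select(F.explode("ner_chunk").alias("chunk")) \
                 .select(F.expr("chunk.result").alias("ner_chunk"),
                         F.expr("chunk.metadata['entity']").alias("ner_label"))

exploded.write.mode("overwrite").parquet("/benchmarks/ner_chunks_parquet")
exploded.write.format("delta").mode("overwrite").save("/benchmarks/ner_chunks_delta")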

NER Benchmark Tables

  • 4 Cores setup:
    • Driver: Standard_D4s_v3, 4 core, 16 GB memory
    • Worker: Standard_D4s_v3, 4 core, 16 GB memory, total worker number: 1
    • Input Data Count: 1000
| action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
|---|---|---|---|---|---|
| write_parquet | 4 | 4 min 47 sec | 8 min 37 sec | 19 min 34 sec | 7 min 20 sec |
| write_deltalake | 4 | 4 min 36 sec | 8 min 50 sec | 20 min 54 sec | 7 min 49 sec |
| write_parquet | 8 | 4 min 14 sec | 8 min 32 sec | 19 min 43 sec | 7 min 27 sec |
| write_deltalake | 8 | 4 min 45 sec | 8 min 31 sec | 20 min 42 sec | 7 min 54 sec |
| write_parquet | 16 | 4 min 20 sec | 8 min 31 sec | 19 min 13 sec | 7 min 24 sec |
| write_deltalake | 16 | 4 min 45 sec | 8 min 56 sec | 19 min 53 sec | 7 min 35 sec |
| write_parquet | 32 | 4 min 26 sec | 8 min 16 sec | 19 min 39 sec | 7 min 22 sec |
| write_deltalake | 32 | 4 min 37 sec | 8 min 32 sec | 20 min 11 sec | 7 min 35 sec |
| write_parquet | 64 | 4 min 25 sec | 8 min 19 sec | 18 min 57 sec | 7 min 37 sec |
| write_deltalake | 64 | 4 min 45 sec | 8 min 43 sec | 19 min 26 sec | 7 min 46 sec |
| write_parquet | 100 | 4 min 37 sec | 8 min 40 sec | 19 min 22 sec | 7 min 50 sec |
| write_deltalake | 100 | 4 min 48 sec | 8 min 57 sec | 20 min 1 sec | 7 min 53 sec |
| write_parquet | 1000 | 5 min 32 sec | 9 min 49 sec | 22 min 41 sec | 8 min 46 sec |
| write_deltalake | 1000 | 5 min 38 sec | 9 min 55 sec | 22 min 32 sec | 8 min 42 sec |
  • 8 Cores setup:
    • Driver: Standard_D4s_v3, 4 core, 16 GB memory
    • Worker: Standard_D4s_v3, 4 core, 16 GB memory, total worker number: 2
    • Input Data Count: 1000
| action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
|---|---|---|---|---|---|
| write_parquet | 4 | 3 min 28 sec | 6 min 9 sec | 13 min 46 sec | 5 min 32 sec |
| write_deltalake | 4 | 3 min 19 sec | 6 min 18 sec | 14 min 12 sec | 5 min 34 sec |
| write_parquet | 8 | 2 min 58 sec | 4 min 56 sec | 11 min 31 sec | 4 min 37 sec |
| write_deltalake | 8 | 2 min 38 sec | 5 min 11 sec | 11 min 50 sec | 4 min 41 sec |
| write_parquet | 16 | 2 min 43 sec | 5 min 12 sec | 11 min 27 sec | 4 min 35 sec |
| write_deltalake | 16 | 2 min 53 sec | 5 min 5 sec | 11 min 46 sec | 4 min 41 sec |
| write_parquet | 32 | 2 min 42 sec | 4 min 55 sec | 11 min 15 sec | 4 min 25 sec |
| write_deltalake | 32 | 2 min 45 sec | 5 min 14 sec | 11 min 41 sec | 4 min 41 sec |
| write_parquet | 64 | 2 min 39 sec | 5 min 7 sec | 11 min 22 sec | 4 min 29 sec |
| write_deltalake | 64 | 2 min 45 sec | 5 min 11 sec | 11 min 31 sec | 4 min 30 sec |
| write_parquet | 100 | 2 min 41 sec | 5 min 0 sec | 11 min 26 sec | 4 min 37 sec |
| write_deltalake | 100 | 2 min 42 sec | 5 min 0 sec | 11 min 43 sec | 4 min 48 sec |
| write_parquet | 1000 | 3 min 10 sec | 5 min 36 sec | 13 min 3 sec | 5 min 10 sec |
| write_deltalake | 1000 | 3 min 20 sec | 5 min 44 sec | 12 min 55 sec | 5 min 14 sec |
  • 16 Cores setup:
    • Driver: Standard_D4s_v3, 4 core, 16 GB memory
    • Worker: Standard_D4s_v3, 4 core, 16 GB memory, total worker number: 4
    • Input Data Count: 1000
| action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
|---|---|---|---|---|---|
| write_parquet | 4 | 3 min 13 sec | 5 min 35 sec | 12 min 8 sec | 4 min 57 sec |
| write_deltalake | 4 | 3 min 26 sec | 6 min 8 sec | 12 min 46 sec | 5 min 12 sec |
| write_parquet | 8 | 1 min 55 sec | 3 min 35 sec | 8 min 19 sec | 3 min 8 sec |
| write_deltalake | 8 | 2 min 3 sec | 4 min 9 sec | 8 min 35 sec | 3 min 15 sec |
| write_parquet | 16 | 1 min 36 sec | 3 min 11 sec | 7 min 14 sec | 2 min 35 sec |
| write_deltalake | 16 | 1 min 41 sec | 3 min 2 sec | 6 min 58 sec | 2 min 39 sec |
| write_parquet | 32 | 1 min 42 sec | 3 min 16 sec | 7 min 22 sec | 2 min 41 sec |
| write_deltalake | 32 | 1 min 42 sec | 3 min 13 sec | 7 min 14 sec | 2 min 38 sec |
| write_parquet | 64 | 1 min 24 sec | 2 min 32 sec | 5 min 57 sec | 2 min 22 sec |
| write_deltalake | 64 | 1 min 21 sec | 2 min 42 sec | 5 min 43 sec | 2 min 25 sec |
| write_parquet | 100 | 1 min 24 sec | 2 min 39 sec | 5 min 59 sec | 2 min 16 sec |
| write_deltalake | 100 | 1 min 28 sec | 2 min 56 sec | 5 min 48 sec | 2 min 43 sec |
| write_parquet | 1000 | 1 min 41 sec | 2 min 44 sec | 6 min 12 sec | 2 min 27 sec |
| write_deltalake | 1000 | 1 min 40 sec | 2 min 53 sec | 6 min 18 sec | 2 min 34 sec |
  • 32 Cores setup:
    • Driver: Standard_D4s_v3, 4 core, 16 GB memory
    • Worker: Standard_D4s_v3, 4 core, 16 GB memory, total worker number: 8
    • Input Data Count: 1000
| action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
|---|---|---|---|---|---|
| write_parquet | 4 | 3 min 24 sec | 5 min 24 sec | 16 min 50 sec | 8 min 17 sec |
| write_deltalake | 4 | 3 min 5 sec | 4 min 15 sec | 12 min 7 sec | 4 min 45 sec |
| write_parquet | 8 | 1 min 47 sec | 2 min 57 sec | 6 min 19 sec | 2 min 42 sec |
| write_deltalake | 8 | 1 min 32 sec | 2 min 52 sec | 6 min 12 sec | 2 min 32 sec |
| write_parquet | 16 | 1 min 0 sec | 1 min 57 sec | 4 min 23 sec | 1 min 38 sec |
| write_deltalake | 16 | 1 min 4 sec | 1 min 55 sec | 4 min 18 sec | 1 min 40 sec |
| write_parquet | 32 | 49 sec | 1 min 42 sec | 3 min 32 sec | 1 min 21 sec |
| write_deltalake | 32 | 54 sec | 1 min 36 sec | 3 min 41 sec | 1 min 45 sec |
| write_parquet | 64 | 1 min 13 sec | 1 min 45 sec | 3 min 42 sec | 1 min 28 sec |
| write_deltalake | 64 | 53 sec | 1 min 30 sec | 3 min 29 sec | 1 min 39 sec |
| write_parquet | 100 | 1 min 4 sec | 1 min 27 sec | 3 min 23 sec | 1 min 23 sec |
| write_deltalake | 100 | 46 sec | 1 min 22 sec | 3 min 27 sec | 1 min 22 sec |
| write_parquet | 1000 | 54 sec | 1 min 31 sec | 3 min 18 sec | 1 min 20 sec |
| write_deltalake | 1000 | 57 sec | 1 min 30 sec | 3 min 20 sec | 1 min 20 sec |
  • 64 Cores setup:
    • Driver: Standard_D4s_v3, 4 core, 16 GB memory
    • Worker: Standard_D4s_v2, 8 core, 28 GB memory, total worker number: 8
    • Input Data Count: 1000
| action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
|---|---|---|---|---|---|
| write_parquet | 4 | 1 min 36 sec | 3 min 1 sec | 6 min 32 sec | 3 min 12 sec |
| write_deltalake | 4 | 1 min 38 sec | 3 min 2 sec | 6 min 30 sec | 3 min 18 sec |
| write_parquet | 8 | 48 sec | 1 min 32 sec | 3 min 21 sec | 1 min 38 sec |
| write_deltalake | 8 | 51 sec | 1 min 36 sec | 3 min 26 sec | 1 min 43 sec |
| write_parquet | 16 | 28 sec | 1 min 16 sec | 2 min 2 sec | 56 sec |
| write_deltalake | 16 | 31 sec | 57 sec | 2 min 2 sec | 58 sec |
| write_parquet | 32 | 20 sec | 39 sec | 1 min 22 sec | 50 sec |
| write_deltalake | 32 | 22 sec | 41 sec | 1 min 45 sec | 35 sec |
| write_parquet | 64 | 17 sec | 31 sec | 1 min 8 sec | 27 sec |
| write_deltalake | 64 | 17 sec | 32 sec | 1 min 11 sec | 29 sec |
| write_parquet | 100 | 18 sec | 33 sec | 1 min 13 sec | 30 sec |
| write_deltalake | 100 | 20 sec | 33 sec | 1 min 32 sec | 32 sec |
| write_parquet | 1000 | 22 sec | 36 sec | 1 min 12 sec | 31 sec |
| write_deltalake | 1000 | 23 sec | 34 sec | 1 min 33 sec | 52 sec |

Clinical Bert For Token Classification Benchmark Experiment

  • Dataset : 7537 Clinical Texts from PubMed Dataset
  • Driver : Standard_DS3_v2 - 14GB Memory - 4 Cores
  • Enable Autoscaling : True
  • Cluster Mode : Standard
  • Worker :
    • Standard_DS3_v2 - 14GB Memory - 4 Cores
  • Versions :
    • Databricks Runtime Version : 10.0 (Apache Spark 3.2.0, Scala 2.12)
    • spark-nlp Version: v3.4.0
    • spark-nlp-jsl Version : v3.4.0
    • Spark Version : v3.2.0
  • Spark NLP Pipeline :
nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        ner_jsl_slim_tokenClassifier,
        ner_converter,
        finisher])

NOTES:

  • In this experiment, the bert_token_classifier_ner_jsl_slim model was used to measure the inference time of clinical BERT-based token classification models in the Databricks environment.
  • In the first experiment, the data read from a parquet file was saved as parquet after processing.

  • In the second experiment, the data read from a delta table was written back to a delta table after processing (a sketch of both variants follows below).
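
A minimal sketch of the two I/O variants described above (the paths are placeholders; nlpPipeline is the pipeline defined earlier):

# Experiment 1: parquet in -> pipeline -> parquet out
df = spark.read.parquet("/data/pubmed_texts_parquet")
model = nlpPipeline.fit(df)
model.transform(df).write.mode("overwrite").parquet("/results/token_classifier_parquet")

# Experiment 2: delta in -> pipeline -> delta out
df_delta = spark.read.format("delta").load("/data/pubmed_texts_delta")
model.transform(df_delta).write.format("delta").mode("overwrite").save("/results/token_classifier_delta")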

Bert For Token Classification Benchmark Table

| Source | Repartition | Time |
|---|---|---|
| Read data from parquet | 2 | 26.03 mins |
| | 64 | 10.84 mins |
| | 128 | 7.53 mins |
| | 1000 | 8.93 mins |
| Read data from delta table | 2 | 40.50 mins |
| | 64 | 11.84 mins |
| | 128 | 6.79 mins |
| | 1000 | 6.92 mins |

NER speed benchmarks across various Spark NLP and PySpark versions

This experiment compares the ClinicalNER runtime across different versions of PySpark and Spark NLP. Each report was run through the pipeline 10 times per execution, and each execution was repeated 5 times, so every report was processed 50 times and the timings averaged: %timeit -r 5 -n 10 run_model(spark, model).
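
A hypothetical reconstruction of the timed call (report_text stands for the content of one report):

def run_model(spark, model):
    df = spark.createDataFrame([[report_text]]).toDF("text")
    model.transform(df).collect()  # force full evaluation of the pipeline

# In a notebook cell:
# %timeit -r 5 -n 10 run_model(spark, model)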

  • Driver: Standard Google Colab environment
  • Spark NLP Pipeline:
    nlpPipeline = Pipeline(
        stages=[
            documentAssembler,
            sentenceDetector,
            tokenizer,
            word_embeddings,
            clinical_ner,
            ner_converter
            ])
    
  • Dataset: File sizes:
    • report_1: ~5.34kb
    • report_2: ~8.51kb
    • report_3: ~11.05kb
    • report_4: ~15.67kb
    • report_5: ~35.23kb
| report | Spark NLP 4.0.0 (PySpark 3.1.2) | Spark NLP 4.2.1 (PySpark 3.3.1) | Spark NLP 4.2.1 (PySpark 3.1.2) | Spark NLP 4.2.2 (PySpark 3.1.2) | Spark NLP 4.2.2 (PySpark 3.3.1) | Spark NLP 4.2.3 (PySpark 3.3.1) | Spark NLP 4.2.3 (PySpark 3.1.2) |
|---|---|---|---|---|---|---|---|
| report_1 | 2.36066 | 3.33056 | 2.23723 | 2.27243 | 2.11513 | 2.19655 | 2.23915 |
| report_2 | 2.2179 | 3.31328 | 2.15578 | 2.23432 | 2.07259 | 2.07567 | 2.16776 |
| report_3 | 2.77923 | 2.6134 | 2.69023 | 2.76358 | 2.55306 | 2.4424 | 2.72496 |
| report_4 | 4.41064 | 4.07398 | 4.66656 | 4.59879 | 3.98586 | 3.92184 | 4.6145 |
| report_5 | 9.54389 | 7.79465 | 9.25499 | 9.42764 | 8.02252 | 8.11318 | 9.46555 |

Results show some variance in execution time across versions, but the differences are not significant.

ChunkMapper and Sentence Entity Resolver Benchmark Experiment

  • Dataset: 100 Clinical Texts from MTSamples, approx. 705 tokens and 11 chunks per text.

  • Versions:
    • Databricks Runtime Version: 12.2 LTS(Scala 2.12, Spark 3.3.2)
    • spark-nlp Version: v5.2.0
    • spark-nlp-jsl Version: v5.2.0
    • Spark Version: v3.3.2
  • Spark NLP Pipelines:

ChunkMapper Pipeline:

    mapper_pipeline = Pipeline().setStages([
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        chunkerMapper])

Sentence Entity Resolver Pipeline:

    resolver_pipeline = Pipeline(
        stages = [
            document_assembler,
            sentenceDetectorDL,
            tokenizer,
            word_embeddings,
            ner_model,
            ner_converter,
            c2doc,
            sbert_embedder,
            rxnorm_resolver
      ])

ChunkMapper and Sentence Entity Resolver Pipeline:

mapper_resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        chunkerMapper,
        cfModel,
        chunk2doc,
        sbert_embedder,
        rxnorm_resolver,
        resolverMerger
])

NOTES:

  • 3 different pipelines: The first pipeline with ChunkMapper, the second with Sentence Entity Resolver, and the third pipeline with ChunkMapper and Sentence Entity Resolver together.

  • 3 different configurations: Driver and worker types were kept the same in all cluster configurations. The number of workers was increased gradually, set to 2, 4, and 8 on Databricks. We chose 3 AWS EC2 machine configurations with the same number of cores as the Databricks clusters.

  • The NER model was kept the same in all pipelines: the pretrained ner_posology_greedy model was used in each pipeline.

Benchmark Tables

These figures might differ based on the size of the mapper and resolver models; the larger the models, the higher the inference times. Depending on the mapper's hit rate (how many incoming chunks the mapper catches successfully), the combined mapper-and-resolver timing can be lower than the resolver-only timing.

If the resolver-only timing is equal or very close to the combined mapper-and-resolver timing, the mapper is not catching/mapping any chunks. In that case, try tuning the mapper's parameters or retrain/augment the mapper.

  • DataBricks Config: 8 CPU Core, 32GiB RAM (2 worker, Standard_DS3_v2)
  • AWS Config: 8 CPU Cores, 14GiB RAM (c6a.2xlarge)
| Partition | Databricks mapper timing | AWS mapper timing | Databricks resolver timing | AWS resolver timing | Databricks mapper and resolver timing | AWS mapper and resolver timing |
|---|---|---|---|---|---|---|
| 4 | 23 sec | 11 sec | 4.36 mins | 3.02 mins | 2.40 mins | 1.58 mins |
| 8 | 15 sec | 9 sec | 3.21 mins | 2.27 mins | 1.48 mins | 1.35 mins |
| 16 | 18 sec | 10 sec | 2.52 mins | 2.14 mins | 2.04 mins | 1.25 mins |
| 32 | 13 sec | 11 sec | 2.22 mins | 2.26 mins | 1.38 mins | 1.35 mins |
| 64 | 14 sec | 12 sec | 2.36 mins | 2.11 mins | 1.50 mins | 1.26 mins |
| 100 | 14 sec | 30 sec | 2.21 mins | 2.07 mins | 1.36 mins | 1.34 mins |
| 1000 | 21 sec | 21 sec | 2.23 mins | 2.08 mins | 1.43 mins | 1.40 mins |
  • DataBricks Config: 16 CPU Core,64GiB RAM (4 worker, Standard_DS3_v2)
  • AWS Config: 16 CPU Cores, 27GiB RAM (c6a.4xlarge)
| Partition | Databricks mapper timing | AWS mapper timing | Databricks resolver timing | AWS resolver timing | Databricks mapper and resolver timing | AWS mapper and resolver timing |
|---|---|---|---|---|---|---|
| 4 | 32.5 sec | 11 sec | 4.19 mins | 2.53 mins | 2.58 mins | 1.48 mins |
| 8 | 15.1 sec | 7 sec | 2.25 mins | 1.43 mins | 1.38 mins | 1.04 mins |
| 16 | 9.52 sec | 6 sec | 1.50 mins | 1.28 mins | 1.15 mins | 1.00 mins |
| 32 | 9.16 sec | 6 sec | 1.47 mins | 1.24 mins | 1.09 mins | 59 sec |
| 64 | 9.32 sec | 7 sec | 1.36 mins | 1.23 mins | 1.03 mins | 57 sec |
| 100 | 9.97 sec | 20 sec | 1.48 mins | 1.34 mins | 1.11 mins | 1.02 mins |
| 1000 | 12.5 sec | 13 sec | 1.31 mins | 1.26 mins | 1.03 mins | 58 sec |
  • DataBricks Config: 32 CPU Core, 128GiB RAM (8 worker, Standard_DS3_v2)
  • AWS Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
| Partition | Databricks mapper timing | AWS mapper timing | Databricks resolver timing | AWS resolver timing | Databricks mapper and resolver timing | AWS mapper and resolver timing |
|---|---|---|---|---|---|---|
| 4 | 37.3 sec | 12 sec | 4.46 mins | 2.37 mins | 2.52 mins | 1.47 mins |
| 8 | 26.7 sec | 7 sec | 2.46 mins | 1.39 mins | 1.37 mins | 1.04 mins |
| 16 | 8.85 sec | 7 sec | 1.27 mins | 1.30 mins | 1.06 mins | 1.02 mins |
| 32 | 7.74 sec | 7 sec | 1.38 mins | 1.00 mins | 54.5 sec | 43 sec |
| 64 | 7.22 sec | 7 sec | 1.23 mins | 1.07 mins | 55.6 sec | 48 sec |
| 100 | 6.32 sec | 10 sec | 1.16 mins | 1.08 mins | 50.9 sec | 45 sec |
| 1000 | 8.37 sec | 10 sec | 59.6 sec | 1.02 mins | 49.3 sec | 41 sec |

ONNX and Base Embeddings in Resolver

  • Dataset: 100 Custom Clinical Texts, approx. 595 tokens per text
  • Versions:
    • spark-nlp Version: v5.2.2
    • spark-nlp-jsl Version : v5.2.1
    • Spark Version : v3.2.1
  • Instance Type:
    • 8 CPU Cores 52GiB RAM (Colab Pro - High RAM)
nlp_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
  ])

embedding_pipeline = PipelineModel(
    stages = [
        c2doc,
        sbiobert_embeddings # base or onnx version
  ])

resolver_pipeline = PipelineModel(
    stages = [
        rxnorm_resolver
  ])
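
A sketch of how the three stages can be chained and timed separately (df and the count() calls used to force evaluation are assumptions), matching the preprocessing / embeddings / resolver columns in the table below:

import time

t0 = time.time()
preprocessed = nlp_pipeline.fit(df).transform(df)
preprocessed.count()                                   # force evaluation
print("preprocessing:", time.time() - t0)

t0 = time.time()
embedded = embedding_pipeline.transform(preprocessed)  # base or ONNX embeddings
embedded.count()
print("embeddings:", time.time() - t0)

t0 = time.time()
resolved = resolver_pipeline.transform(embedded)
resolved.count()
print("resolver:", time.time() - t0)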

Results Table

| Partition | preprocessing | embeddings | resolver | onnx_embeddings | resolver_with_onnx_embeddings |
|---|---|---|---|---|---|
| 4 | 25 sec | 25 sec | 7 min 46 sec | 9 sec | 8 min 29 sec |
| 8 | 21 sec | 25 sec | 5 min 12 sec | 9 sec | 4 min 53 sec |
| 16 | 21 sec | 25 sec | 4 min 41 sec | 9 sec | 4 min 30 sec |
| 32 | 20 sec | 24 sec | 5 min 4 sec | 9 sec | 4 min 34 sec |
| 64 | 21 sec | 24 sec | 4 min 44 sec | 9 sec | 5 min 2 sec |
| 128 | 20 sec | 25 sec | 5 min 4 sec | 10 sec | 4 min 51 sec |
| 256 | 22 sec | 26 sec | 4 min 34 sec | 10 sec | 5 min 13 sec |
| 512 | 24 sec | 27 sec | 4 min 46 sec | 12 sec | 4 min 22 sec |
| 1024 | 29 sec | 30 sec | 4 min 24 sec | 14 sec | 4 min 29 sec |

Deidentification Benchmarks

Deidentification Comparison Experiment on Clusters

  • Dataset: 1000 Clinical Texts from MTSamples, approx. 503 tokens and 6 chunks per text.

  • Versions:
    • spark-nlp Version: v5.2.0
    • spark-nlp-jsl Version : v5.2.0
    • Spark Version: v3.3.2
    • DataBricks Config: 32 CPU Core, 128GiB RAM (8 worker)
    • AWS Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
    • Colab Config: 8 CPU Cores 52GiB RAM (Colab Pro - High RAM)
  • Spark NLP Pipelines:

Deidentification Pipeline:

deid_pipeline = Pipeline().setStages([
      document_assembler,
      sentence_detector,
      tokenizer,
      word_embeddings,
      deid_ner,
      ner_converter,
      deid_ner_enriched,
      ner_converter_enriched,
      chunk_merge,
      ssn_parser,
      account_parser,
      dln_parser,
      plate_parser,
      vin_parser,
      license_parser,
      country_parser,
      age_parser,
      date_parser,
      phone_parser1,
      phone_parser2,
      ids_parser,
      zip_parser,
      med_parser,
      email_parser,
      chunk_merge1,
      chunk_merge2,
      deid_masked_rgx,
      deid_masked_char,
      deid_masked_fixed_char,
      deid_obfuscated,
      finisher])

Dataset: 1000 Clinical Texts from MTSamples, approx. 503 tokens and 21 chunks per text.

| Partition | AWS result timing | Databricks result timing | Colab result timing |
|---|---|---|---|
| 1024 | 1 min 3 sec | 1 min 55 sec | 5 min 45 sec |
| 512 | 56 sec | 1 min 26 sec | 5 min 15 sec |
| 256 | 50 sec | 1 min 20 sec | 5 min 4 sec |
| 128 | 45 sec | 1 min 21 sec | 5 min 11 sec |
| 64 | 46 sec | 1 min 31 sec | 5 min 3 sec |
| 32 | 46 sec | 1 min 26 sec | 5 min 0 sec |
| 16 | 56 sec | 1 min 43 sec | 5 min 3 sec |
| 8 | 1 min 21 sec | 2 min 33 sec | 5 min 3 sec |
| 4 | 2 min 26 sec | 4 min 53 sec | 6 min 3 sec |

Deidentification Pipelines Speed Comparison

  • Deidentification Pipelines Benchmarks

    This benchmark provides valuable insights into the efficiency and scalability of deidentification pipelines in different computational environments.

    • Dataset: 100000 Clinical Texts from MTSamples, approx. 508 tokens and 26.44 chunks per text.
    • Versions: [May-2024]
      • spark-nlp Version: v5.3.2
      • spark-nlp-jsl Version: v5.3.2
      • Spark Version: v3.4.0
    • Instance Type:
      • DataBricks Config:
        • 32 CPU Core, 128GiB RAM (8 workers) ($2.7/hr)

        | data_count | partition | Databricks |
        |---|---|---|
        | 100000 | 512 | 1h 42m 55s |
      • AWS EC2 instance Config:
        • 32 CPU cores, 64GiB RAM (c6a.8xlarge $1.224/h)
        | data_count | partition | AWS |
        |---|---|---|
        | 100000 | 512 | 1h 9m 56s |
  • Deidentification Pipelines Speed Comparison

    This benchmark presents a detailed comparison of various deidentification pipelines applied to a dataset of 10,000 custom clinical texts, aiming to anonymize sensitive information for research and analysis. The comparison evaluates the elapsed time and processing stages of different deidentification pipelines. Each pipeline is characterized by its unique combination of Named Entity Recognition (NER), deidentification methods, rule-based NER, clinical embeddings, and chunk merging processes.

    • Dataset: 10K Custom Clinical Texts with 1024 partitions, approx. 500 tokens and 14 chunks per text.
    • Versions:
      • spark-nlp Version: v5.3.1
      • spark-nlp-jsl Version: v5.3.1
      • Spark Version: v3.4.0
    • Instance Type:
      • 8 CPU Cores 52GiB RAM (Colab Pro - High RAM)
| Deidentification Pipeline Name | Elapsed Time | Stages |
|---|---|---|
| clinical_deidentification_subentity_optimized | 67 min 44 seconds | 1 NER, 1 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger |
| clinical_deidentification_generic_optimized | 68 min 31 seconds | 1 NER, 1 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger |
| clinical_deidentification_generic | 86 min 24 seconds | 1 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger |
| clinical_deidentification_subentity | 99 min 41 seconds | 1 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger |
| clinical_deidentification | 117 min 44 seconds | 2 NER, 1 Deidentification, 13 Rule-based NER, 1 clinical embedding, 3 chunk merger |
| clinical_deidentification_nameAugmented | 134 min 27 seconds | 2 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 3 chunk merger |
| clinical_deidentification_glove | 146 min 51 seconds | 2 NER, 4 Deidentification, 8 Rule-based NER, 1 clinical embedding, 3 chunk merger |
| clinical_deidentification_obfuscation_small | 147 min 06 seconds | 1 NER, 1 Deidentification, 2 Rule-based NER, 1 clinical embedding, 1 chunk merger |
| clinical_deidentification_slim | 154 min 37 seconds | 2 NER, 4 Deidentification, 15 Rule-based NER, 1 glove embedding, 3 chunk merger |
| clinical_deidentification_multi_mode_output | 154 min 50 seconds | 2 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 3 chunk merger |
| clinical_deidentification_obfuscation_medium | 205 min 40 seconds | 2 NER, 1 Deidentification, 2 Rule-based NER, 1 clinical embedding, 1 chunk merger |

PS: Pipelines with the same stages can have different costs because of differences in the NER models' layer counts and the hardcoded regexes in Deidentification.

| partition | clinical_deidentification | clinical_deidentification_zeroshot_medium | clinical_deidentification_docwise_medium_wip | clinical_deidentification_zeroshot_large | clinical_deidentification_docwise_large_wip |
|---|---|---|---|---|---|
| 4 | 295.8 | 520.8 | 862.7 | 1537.9 | 1832.4 |
| 8 | 195.0 | 345.6 | 577.0 | 1013.9 | 1228.3 |
| 16 | 133.3 | 227.2 | 401.8 | 666.2 | 835.2 |
| 32 | 109.5 | 160.9 | 305.3 | 456.9 | 614.7 |
| 64 | 92.0 | 166.8 | 291.5 | 465.0 | 584.9 |
| 100 | 79.3 | 174.1 | 274.8 | 495.3 | 587.8 |
| 1000 | 56.3 | 181.4 | 270.7 | 502.4 | 556.4 |

Deidentification Pipelines Cost Benchmarks

  • Versions: [March-2024]
    • spark-nlp Version: v5.2.2
    • spark-nlp-jsl Version : v5.2.1
    • Spark Version : v3.4.1
  • EMR
    • Dataset: 10K Custom Clinical Texts, approx. 500 tokens & 15 chunks per text
    • EMR Version: emr-6.15.0
    • Instance Type:
      • Primary: c5.4xlarge, 16 vCore, 32 GiB memory
      • Worker: m5.16xlarge, 64 vCore, 256 GiB memory, 4 workers
    • Price 12.97 $/hr
  • EC2 instance
    • Dataset: 10K Custom Clinical Texts, approx. 500 tokens & 15 chunks per text
    • Instance Type: c5.18xlarge, 72 vCore, 144 GiB memory, Single Instance
    • Price 3.06 $/hr
  • Databricks
    • Dataset: 1K Clinical Texts from MT Samples, approx. 503 tokens & 21 chunks per text
    • Instance Type: 32 CPU Core, 128GiB RAM , 8 workers
    • Price 2.7 $/hr

Utilized Pretrained DEID Pipelines:

Optimized Pipeline:

from sparknlp.pretrained import PretrainedPipeline

pipeline_optimized = PretrainedPipeline("clinical_deidentification_subentity_optimized", "en", "clinical/models")

# The stage list inside the pretrained pipeline:
pipeline_optimized = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    deid_ner,
    ner_converter,
    ssn_parser,
    account_parser,
    dln_parser,
    plate_parser,
    vin_parser,
    license_parser,
    country_extracter,
    age_parser,
    date_matcher,
    phone_parser,
    zip_parser,
    med_parser,
    email_parser,
    merger_parser,
    merger_chunks,
    deid_ner_obfus,
    finisher])

Base Pipeline:

from sparknlp.pretrained import PretrainedPipeline

pipeline_base = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

pipeline_base = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    deid_ner,
    ner_converter,
    deid_ner_enriched,
    ner_converter_enriched,
    chunk_merge,
    ssn_parser,
    account_parser,
    dln_parser,
    plate_parser,
    vin_parser,
    license_parser,
    country_parser,
    age_parser,
    date_parser,
    phone_parser1,
    phone_parser2,
    ids_parser,
    zip_parser,
    med_parser,
    email_parser,
    chunk_merge1,
    chunk_merge2,
    deid_masked_rgx,
    deid_masked_char,
    deid_masked_fixed_char,
    deid_obfuscated,
    finisher])
| Partition | EMR Base Pipeline | EMR Optimized Pipeline | EC2 Instance Base Pipeline | EC2 Instance Optimized Pipeline | Databricks Base Pipeline | Databricks Optimized Pipeline |
|---|---|---|---|---|---|---|
| 1024 | 5 min 1 sec | 2 min 45 sec | 7 min 6 sec | 3 min 26 sec | 10 min 10 sec | 6 min 2 sec |
| 512 | 4 min 52 sec | 2 min 30 sec | 6 min 56 sec | 3 min 41 sec | 10 min 16 sec | 6 min 11 sec |
| 256 | 4 min 50 sec | 2 min 30 sec | 9 min 10 sec | 5 min 18 sec | 10 min 22 sec | 6 min 14 sec |
| 128 | 4 min 55 sec | 2 min 30 sec | 14 min 30 sec | 7 min 51 sec | 10 min 21 sec | 5 min 53 sec |
| 64 | 6 min 24 sec | 3 min 8 sec | 18 min 59 sec | 9 min 9 sec | 12 min 42 sec | 6 min 50 sec |
| 32 | 7 min 15 sec | 3 min 43 sec | 18 min 47.2 sec | 9 min 18 sec | 12 min 55 sec | 7 min 40 sec |
| 16 | 11 min 6 sec | 4 min 57 sec | 12 min 47.5 sec | 6 min 14 sec | 15 min 59 sec | 9 min 18 sec |
| 8 | 19 min 13 sec | 8 min 8 sec | 16 min 52 sec | 8 min 48 sec | 22 min 40 sec | 13 min 26 sec |

Estimated Minimum Costs:

  • EMR Base Pipeline: partition number: 256, 10K cost:$1.04, 1M cost:$104.41
  • EMR Optimized Pipeline: partition number: 256, 10K cost:$0.54, 1M cost:$54.04
  • EC2 Instance Base Pipeline: partition number: 512, 10K cost:$0.36, 1M cost:$35.70
  • EC2 Instance Optimized Pipeline: partition number: 1024, 10K cost:$0.18, 1M cost:$17.85
  • DataBricks Base Pipeline: partition number: 1024, 10K cost:$0.46, 1M cost:$45.76
  • DataBricks Optimized Pipeline: partition number: 1024, 10K cost:$0.27, 1M cost:$27.13
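
These estimates follow directly from runtime x hourly price; a worked check of the EMR Base Pipeline figure using the numbers above:

runtime_hours = (4 * 60 + 50) / 3600  # best EMR Base run: 4 min 50 sec ~= 0.0806 h
cost_10k = runtime_hours * 12.97      # ~= $1.04 for 10K notes at $12.97/hr
cost_1m = cost_10k * 100              # ~= $104, matching the $104.41 estimate above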

RxNorm Benchmark: Healthcare NLP & GPT-4 & Amazon

Motivation

Accurately mapping medications to RxNorm codes is crucial for several reasons like safer patient care, improved billing and reimbursement, enhanced research, etc. In this benchmark, you can find these tools’ performance and cost comparisons.

Ground Truth

To ensure a fair comparison of these tools, we enlisted the assistance of human annotators. Medical annotation experts from John Snow Labs utilized the Generative AI Lab to annotate 79 clinical in-house documents.

Benchmark Tools

  • Healthcare NLP: Two distinct RxNorm models within the library were used.
  • GPT-4: GPT-4 (Turbo) and GPT-4o models.

  • Amazon: Amazon Comprehend Medical service

Evaluation Notes

  • Healthcare NLP returns up to 25 closest results and Amazon Comprehend Medical returns up to five, both sorted starting from the closest match. In contrast, GPT-4 returns only one result, so its scores are reflected identically in both charts.
  • Since the performance of GPT-4 and GPT-4o is almost identical according to the official announcement, we used both versions for the accuracy calculation. Additionally, because GPT-4 returns only one result, you will see the same results in both evaluation approaches.
  • Two approaches were adopted for evaluating these tools, given that the model outputs may not precisely match the annotations:
    • Top-3: Compare the annotations to see if they appear in the first three results.
    • Top-5: Compare the annotations to see if they appear in the first five results.

Accuracy Results

  • Top-3 Results:

[Chart: Top-3 accuracy results]

  • Top-5 Results:

[Chart: Top-5 accuracy results]

Price Analysis Of The Tools

Since real-world workloads are far larger than this small dataset, we calculated the price of these tools for 1M clinical notes.

  • Open AI Pricing: We created a prompt to achieve better results; processing the 79 documents cost $3.476 with the GPT-4 model and $1.738 with the GPT-4o model. This means that for processing 1 million notes, the estimated cost would be $44,000 for GPT-4 and $22,000 for GPT-4o.

  • Amazon Comprehend Medical Pricing: According to the price calculator, obtaining RxNorm predictions for 1M documents, with an average of 9,700 characters per document, costs $24,250.

  • Healthcare NLP Pricing: When using the John Snow Labs Healthcare NLP prepaid product on an EC2 32-CPU machine (c6a.8xlarge at $1.2 per hour), obtaining the RxNorm codes for medications (excluding the NER stage) from approximately 80 documents takes around 2 minutes. Based on this, processing 1M documents and extracting RxNorm codes would take about 25,000 minutes (416 hours, or 18 days), costing $500 for infrastructure and $4,000 for the license (considering a 1-month license price of $7,000). Thus, the total cost for Healthcare NLP is approximately $4,500.
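
A back-of-envelope check of the Healthcare NLP estimate above (the 2 minutes per ~80 documents rate is taken from the text):

total_minutes = 1_000_000 / 80 * 2  # 25,000 minutes for 1M documents
total_hours = total_minutes / 60    # ~= 416 hours (~18 days)
infra_cost = total_hours * 1.2      # ~= $500 at $1.2/hr
license_cost = 7_000 * 18 / 30      # ~= $4,200 for a pro-rated month, quoted as ~$4,000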

Conclusion

Based on the evaluation results:

  • The sbiobertresolve_rxnorm_augmented model of Spark NLP for Healthcare consistently provides the most accurate results in each top_k comparison.
  • The biolordresolve_rxnorm_augmented model of Spark NLP for Healthcare outperforms Amazon Comprehend Medical and GPT-4 models in mapping terms to their RxNorm codes.
  • The GPT-4 model could only return one result, which is reflected identically in both charts, and it has proven to be the least accurate.

If you want to process 1M documents and extract RxNorm codes for medication entities (excluding the NER stage), the total cost is:

  • about $4,500 with Healthcare NLP, including the infrastructure costs;
  • $24,250 with Amazon Comprehend Medical;
  • $44,000 with GPT-4 and $22,000 with GPT-4o.

Therefore, Healthcare NLP is almost 5 times cheaper than its closest alternative, not to mention the accuracy differences (Top 3: Healthcare NLP 82.7% vs Amazon 55.8% vs GPT-4 8.9%).

Accuracy & Cost Table

| Tool | Top-3 Accuracy | Top-5 Accuracy | Cost |
|---|---|---|---|
| Healthcare NLP | 82.7% | 84.6% | $4,500 |
| Amazon Comprehend Medical | 55.8% | 56.2% | $24,250 |
| GPT-4 (Turbo) | 8.9% | 8.9% | $44,000 |
| GPT-4o | 8.9% | 8.9% | $22,000 |

AWS EMR Cluster Benchmark

  • Dataset: 340 Custom Clinical Texts, approx. 235 tokens per text
  • Versions:
    • EMR Version: emr-6.15.0
    • spark-nlp Version: v5.2.2
    • spark-nlp-jsl Version : v5.2.1
    • Spark Version : v3.4.1
  • Instance Type:
    • Primary: m4.4xlarge, 16 vCore, 64 GiB memory
    • Worker : m4.4xlarge, 16 vCore, 64 GiB memory

Spark NLP Pipeline:

ner_pipeline = Pipeline(stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl,
        ner_jsl_converter])

resolver_pipeline = Pipeline(stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl,
        ner_jsl_converter,
        chunk2doc,
        sbert_embeddings,
        snomed_resolver]) 

NOTES:

The ner_jsl model was used as the NER model. The inference time was measured from the model.transform(df) call to the completion of writing the results in parquet format.

The sbiobertresolve_snomed_findings model was used as the resolver model. Its inference time was measured the same way: the timer started with model.transform(df) and ended once the results (snomed_code and snomed_code_definition) were written in parquet format; 722 entities were saved.
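
A minimal sketch of the timing procedure described above (the path and the fitted resolver_model are assumptions):

import time

start = time.time()
result = resolver_model.transform(df)
result.select("snomed_code", "snomed_code_definition") \
      .write.mode("overwrite").parquet("/results/snomed_resolutions_parquet")
print(f"elapsed: {time.time() - start:.1f} sec")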

Results Table

| Partition | NER Timing | NER and Resolver Timing |
|---|---|---|
| 4 | 24.7 sec | 1 min 8.5 sec |
| 8 | 23.6 sec | 1 min 7.4 sec |
| 16 | 22.6 sec | 1 min 6.9 sec |
| 32 | 23.2 sec | 1 min 5.7 sec |
| 64 | 22.8 sec | 1 min 6.7 sec |
| 128 | 23.7 sec | 1 min 7.4 sec |
| 256 | 23.9 sec | 1 min 6.1 sec |
| 512 | 23.8 sec | 1 min 8.4 sec |
| 1024 | 25.9 sec | 1 min 10.2 sec |

CPU NER Benchmarks

NER (BiLSTM-CNN-Char Architecture) CPU Benchmark Experiment

  • Dataset: 1000 Clinical Texts from MTSamples Oncology Dataset, approx. 500 tokens per text.
  • Versions:
    • spark-nlp Version: v3.4.4
    • spark-nlp-jsl Version: v3.5.2
    • Spark Version: v3.1.2
  • Spark NLP Pipeline:
  nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,  
        embeddings_clinical,  
        clinical_ner,  
        ner_converter
        ])

NOTE:

  • The Spark NLP Pipeline output data frame (except the word_embeddings column) was written in parquet format in the transform benchmarks (see the sketch below).
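
In the table below, "LP (fullAnnotate)" presumably refers to LightPipeline inference on the driver, while "Transform (parquet)" is a distributed transform followed by a parquet write; a sketch under those assumptions (texts and the path are placeholders):

from sparknlp.base import LightPipeline

model = nlpPipeline.fit(df)

# LP (fullAnnotate): in-memory inference on the driver
light_model = LightPipeline(model)
annotations = light_model.fullAnnotate(texts)  # texts: list of the 1000 strings

# Transform (parquet): distributed inference plus a parquet write
model.transform(df.repartition(100)) \
     .drop("word_embeddings") \
     .write.mode("overwrite").parquet("/results/ner_cpu_benchmark_parquet")
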
| Platform | Process | Repartition | Time |
|---|---|---|---|
| 2 CPU cores, 13 GB RAM (Google Colab) | LP (fullAnnotate) | - | 16min 52s |
| | Transform (parquet) | 10 | 4min 47s |
| | Transform (parquet) | 100 | 4min 16s |
| | Transform (parquet) | 1000 | 5min 4s |
| 16 CPU cores, 27 GB RAM (AWS EC2 machine) | LP (fullAnnotate) | - | 14min 28s |
| | Transform (parquet) | 10 | 1min 5s |
| | Transform (parquet) | 100 | 1min 1s |
| | Transform (parquet) | 1000 | 1min 19s |

GPU vs CPU benchmark

This section includes a benchmark for MedicalNerApproach(), comparing its performance when running on an m5.8xlarge CPU machine vs a Tesla V100 SXM2 GPU, as described in the Machine Specs section below.

Big improvements were introduced in version 3.3.4, so please make sure you use at least that version to fully leverage Spark NLP capabilities on GPU.

Machine specs

CPU

An AWS m5.8xlarge machine was used for the CPU benchmarking. This machine consists of 32 vCPUs and 128 GB of RAM, as listed in the official specification page.

GPU

A Tesla V100 SXM2 GPU with 32GB of memory was used to calculate the GPU benchmarking.

Versions

The benchmarking was carried out with the following Spark NLP versions:

Spark version: 3.0.2

Hadoop version: 3.2.0

SparkNLP version: 3.3.4

SparkNLP for Healthcare version: 3.3.4

Spark nodes: 1

Benchmark on MedicalNerDLApproach()

This experiment consisted of training a Named Entity Recognition model (token-level) with our NerDLApproach() class, using Bert word embeddings and a Char-CNN-BiLSTM neural network. Only 1 Spark node was used for the training.

We used the Spark NLP class MedicalNer and its method Approach() as described in the documentation.

The pipeline looks as follows: [Figure: MedicalNerDLApproach pipeline]

Dataset

The size of the dataset was small (17K rows), consisting of:

Training (rows): 14041

Test (rows): 3250

Training params

Different batch sizes were tested to demonstrate how GPU performance improves with bigger batches compared to CPU, for a constant number of epochs and learning rate.

Epochs: 10

Learning rate: 0.003

Batch sizes: 32, 64, 256, 512, 1024, 2048
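
A minimal sketch of this training setup (the column names and standard setter calls are assumptions; only the hyperparameters above come from the experiment):

from sparknlp_jsl.annotator import MedicalNerApproach

ner_approach = (
    MedicalNerApproach()
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
    .setLabelColumn("label")
    .setMaxEpochs(10)    # epochs: 10
    .setLr(0.003)        # learning rate: 0.003
    .setBatchSize(512)   # one of 32, 64, 256, 512, 1024, 2048
)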

Results

Even with this small dataset, the GPU beat the CPU machine by 62% in training time and 68% in inference time. It's important to mention that batch size is very relevant when using a GPU, since CPU scales much worse than GPU with bigger batch sizes.

Training times depending on batch (in minutes)


| Batch size | CPU | GPU |
|---|---|---|
| 32 | 9.5 | 10 |
| 64 | 8.1 | 6.5 |
| 256 | 6.9 | 3.5 |
| 512 | 6.7 | 3 |
| 1024 | 6.5 | 2.5 |
| 2048 | 6.5 | 2.5 |

Inference times (in minutes)

Although CPU inference times remain more or less constant regardless of batch size, GPU times improve considerably as the batch size grows.

CPU times: ~29 min

| Batch size | GPU |
|---|---|
| 32 | 10 |
| 64 | 6.5 |
| 256 | 3.5 |
| 512 | 3 |
| 1024 | 2.5 |
| 2048 | 2.5 |


Performance metrics

A macro F1-score of about 0.92 (0.90 in micro) was achieved, with the following charts extracted from the MedicalNerApproach() logs:

[Charts: performance metrics from the MedicalNerApproach() logs]

Takeaways: How to get the best of the GPU

You will see big GPU improvements in the following cases:

  1. Embeddings and transformers are used in your pipeline. Take into consideration that the GPU performs very well on embeddings / transformer components, but other components of your pipeline may not leverage GPU capabilities as well;
  2. Bigger batch sizes get the best out of the GPU, while CPU does not scale with bigger batch sizes;
  3. Bigger datasets get the best out of the GPU, while they may become a bottleneck on CPU and lead to performance drops;

MultiGPU Inference on Databricks

In this part, we give you an idea of how to choose appropriate hardware specifications for Databricks. Below are a few different hardware options, their prices, and their performance:

[Chart: Databricks hardware options, prices, and performance]

GPU hardware is the cheapest among them even though it performs the best. Let's see what the overall performance looks like:

[Chart: overall performance comparison]

The figure above clearly shows that GPU should be our first option.

In conclusion, please find the best specifications for your use case since these benchmarks might depend on dataset size, inference batch size, quickness, pricing and so on.

Please refer to this video for further info: https://events.johnsnowlabs.com/webinar-speed-optimization-benchmarks-in-spark-nlp-3-making-the-most-of-modern-hardware?hsCtaTracking=a9bb6358-92bd-4cf3-b97c-e76cb1dfb6ef%7C4edba435-1adb-49fc-83fd-891a7506a417

MultiGPU training

Currently, we don’t support multiGPU training, meaning training 1 model in different GPUs in parallel. However, you can train different models in different GPUs.

MultiGPU inference

Spark NLP can carry out MultiGPU inference if the GPUs are on different cluster nodes. For example, if you have a cluster with several GPU nodes, you can repartition your data to match the number of GPU nodes and then coalesce to retrieve the results back on the master node (see the sketch below).

Currently, inference on multiple GPUs on the same machine is not supported.
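
A minimal sketch of the repartition/coalesce pattern described above (model, df, and the node count are assumptions):

num_gpu_nodes = 4  # e.g., a cluster with 4 single-GPU worker nodes

# One partition per GPU node so each node runs inference on its own share.
result = model.transform(df.repartition(num_gpu_nodes))

# Coalesce to bring the results back together on the master node.
result = result.coalesce(1)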

Where to look for more information about Training

Please take a look at the Spark NLP and Spark NLP for Healthcare Training sections, and feel free to reach out to us if you want to maximize the performance on your GPU.

Spark NLP vs Spacy Pandas UDF with Arrow Benchmark

This benchmarking report aims to provide a comprehensive comparison between two NLP frameworks on Spark clusters: Spark NLP and SpaCy, specifically in the context of Pandas UDF with Arrow optimization.

Spark NLP is a distributed NLP library built on top of Apache Spark, designed to handle large-scale NLP tasks efficiently. On the other hand, SpaCy is a popular NLP library in single-machine environments.

In this benchmark, we evaluate the performance of both frameworks using Pandas UDFs with Arrow, a feature that makes data transfer between the JVM and pandas DataFrames more efficient and can lead to significant performance gains. We use SpaCy as a UDF in Spark to compare the performance of both frameworks.

The benchmark covers a range of common NLP tasks, including Named Entity Recognition (NER) and getting Roberta sentence embeddings.

We measured the time for each task with Arrow both enabled and disabled for the pandas UDFs (see the snippet below). We restarted the notebook before each task to ensure that the results are not affected by the previous task.
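
For reference, Arrow-backed pandas UDF transfer is toggled with the standard Spark 3.x configuration key:

# Enable Arrow-based columnar data transfers for pandas UDFs
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# ... run the task, restart the notebook, then repeat the run with:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")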

Machine specs

Azure Databricks Standard_DS3_v2 machine (6 workers + 1 driver) was used for the CPU benchmarking. This machine consists of 4 CPUs and 14 GB of RAM.

Versions

The benchmarking was carried out with the following versions:

Spark version: 3.1.2

SparkNLP version: 5.1.0

spaCy version: 3.6.1

Spark nodes: 7 (1 driver, 6 workers)

Dataset

The dataset consists of 120K rows of news articles.

Benchmark on Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, etc. In this benchmark, we compare the performance of Spark NLP and SpaCy in recognizing named entities in a text column.

The following pipeline shows how to recognize named entities in a text column using Spark NLP:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

public_ner = NerDLModel.pretrained("ner_dl", 'en') \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[document_assembler,
                            tokenizer,
                            glove_embeddings,
                            public_ner
                           ])

SpaCy uses the following pandas UDF to recognize named entities in a text column. We exclude the tagger, parser, attribute ruler, and lemmatizer components to make the comparison fair.

import spacy
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

nlp_ner = spacy.load("en_core_web_sm", exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

# Define a UDF to perform NER
@pandas_udf(ArrayType(StringType()))
def ner_with_spacy(text_series: pd.Series) -> pd.Series:
    entities_list = []
    for text in text_series:
        doc = nlp_ner(text)
        entities = [f"{ent.text}:::{ent.label_}" for ent in doc.ents]
        entities_list.append(entities)
    return pd.Series(entities_list)
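
The UDF can then be applied to the text column (the column name is an assumption):

from pyspark.sql.functions import col

df_entities = df.withColumn("entities", ner_with_spacy(col("text")))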

Benchmark on Getting Roberta Sentence Embeddings

In this benchmark, we compare the performance of Spark NLP and SpaCy in getting Roberta sentence embeddings for a text column.

The following pipeline shows how to get Roberta embeddings for a text column using Spark NLP:

from sparknlp.annotator import RoBertaSentenceEmbeddings

embeddings = RoBertaSentenceEmbeddings.pretrained("sent_roberta_base", "en") \
      .setInputCols("document") \
      .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler,
                            embeddings
                           ])

SpaCy uses the following pandas UDF.

import spacy
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

nlp_embeddings = spacy.load("en_core_web_trf")

# Define a UDF to get sentence embeddings
@pandas_udf(ArrayType(FloatType()))
def embeddings_with_spacy(text_series: pd.Series) -> pd.Series:
    embeddings_list = []
    for text in text_series:
        doc = nlp_embeddings(text)
        embeddings = doc._.trf_data.tensors[-1][0]
        embeddings_list.append(embeddings)
    return pd.Series(embeddings_list)

Results

Both frameworks were tested on a dataset of 120K rows. SpaCy was tested with and without Arrow enabled. Both frameworks utilized distributed computing to process the data in parallel.

The following table shows the time taken by each framework to perform the tasks mentioned above:

| Task | Spark NLP | Spacy UDF with Arrow | Spacy UDF without Arrow |
|---|---|---|---|
| NER extract | 3min 35sec | 4min 49sec | 5min 4sec |
| Roberta Embeddings | 22min 16sec | 29min 27sec | 29min 30sec |

Conclusions

In our analysis, we delved into the performance of two Natural Language Processing (NLP) libraries: Spark NLP and SpaCy. While Spark NLP, seamlessly integrated with Apache Spark, excels in managing extensive NLP tasks on distributed systems and large datasets, SpaCy is used particularly in single-machine environments.

The results of our evaluation highlight clear disparities in processing times across the assessed tasks. In NER extraction, Spark NLP demonstrated exceptional efficiency, completing the task in a mere 3 minutes and 35 seconds. In contrast, Spacy UDF with Arrow and Spacy UDF without Arrow took 4 minutes and 49 seconds, and 5 minutes and 4 seconds, respectively. Moving on to the generation of Roberta embeddings, Spark NLP once again proved its prowess, completing the task in 22 minutes and 16 seconds. Meanwhile, Spacy UDF with Arrow and Spacy UDF without Arrow required 29 minutes and 27 seconds, and 29 minutes and 30 seconds, respectively.

These findings unequivocally affirm Spark NLP’s superiority for NER extraction tasks, and its significant time advantage for tasks involving Roberta embeddings.

Additional Comments

  • Scalability:

Spark NLP: Built on top of Apache Spark, Spark NLP is inherently scalable and distributed. It is designed to handle large-scale data processing with distributed computing resources. It is well-suited for processing vast amounts of data across multiple nodes.

SpaCy with pandas UDFs: Using SpaCy within a pandas UDF (User-Defined Function) and Arrow for efficient data transfer can bring SpaCy’s abilities into the Spark ecosystem. However, while Arrow optimizes the serialization and deserialization between JVM and Python processes, the scalability of this approach is still limited by the fact that the actual NLP processing is single-node (by SpaCy) for each partition of your Spark DataFrame.

  • Performance:

Spark NLP: Since it’s natively built on top of Spark, it is optimized for distributed processing. The performance is competitive, especially when you are dealing with vast amounts of data that need distributed processing.

SpaCy with pandas UDFs: SpaCy is fast for single-node processing. The combination of SpaCy with Arrow-optimized UDFs can be performant for moderate datasets or tasks. However, you might run into bottlenecks when scaling to very large datasets unless you have a massive Spark cluster.

  • Ecosystem Integration:

Spark NLP: Being a Spark-native library, Spark NLP integrates seamlessly with other Spark components, making it easier to build end-to-end data processing pipelines.

SpaCy with pandas UDFs: While the integration with Spark is possible, it’s a bit more ‘forced.’ It requires careful handling, especially if you’re trying to ensure optimal performance.

  • Features & Capabilities:

Spark NLP: Offers a wide array of NLP functionalities, including some that are tailored for the healthcare domain. It’s continuously evolving and has a growing ecosystem.

SpaCy: A popular library for NLP with extensive features, optimizations, and pre-trained models. However, certain domain-specific features in Spark NLP might not have direct counterparts in SpaCy.

  • Development & Maintenance:

Spark NLP: As with any distributed system, development and debugging might be more complex. You have to consider factors inherent to distributed systems.

SpaCy with pandas UDFs: Development might be more straightforward since you’re essentially working with Python functions. However, maintaining optimal performance with larger datasets and ensuring scalability can be tricky.
