Several third-party projects integrate with Spark NLP. These packages must be installed separately before they can be used.
If you’d like to integrate your application with Spark NLP, please send us a message!
Comet is a meta machine learning platform designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By leveraging Comet, users can track, compare, explain and reproduce their machine learning experiments.
Comet can easily be integrated into the Spark NLP workflow with the dedicated
CometLogger, which logs training and evaluation metrics,
pipeline parameters, and NER visualizations made with sparknlp-display.
Python API: CometLogger
```python
# Metrics while training an annotator can be logged with, for example:
import sparknlp
from pyspark.ml import Pipeline  # imported explicitly rather than relying on the star imports
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.logging.comet import CometLogger

spark = sparknlp.start()

OUTPUT_LOG_PATH = "./run"
logger = CometLogger()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
embds = (
    UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)
multiClassifier = (
    MultiClassifierDLApproach()
    .setInputCols("sentence_embeddings")
    .setOutputCol("category")
    .setLabelColumn("labels")
    .setBatchSize(128)
    .setLr(1e-3)
    .setThreshold(0.5)
    .setShufflePerEpoch(False)
    .setEnableOutputLogs(True)
    .setOutputLogsPath(OUTPUT_LOG_PATH)
    .setMaxEpochs(1)
)

# Watch the training log directory and forward metrics to Comet
logger.monitor(logdir=OUTPUT_LOG_PATH, model=multiClassifier)
trainDataset = spark.createDataFrame(
    [("Nice.", ["positive"]), ("That's bad.", ["negative"])],
    schema=["text", "labels"],
)

pipeline = Pipeline(stages=[document, embds, multiClassifier])
pipeline.fit(trainDataset)
logger.end()

# If you are using a jupyter notebook, it is possible to display the live web
# interface with
logger.experiment.display(tab='charts')
```
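The example above covers training metrics only. Pipeline parameters and evaluation metrics could be sent to Comet along the following lines. This is a minimal sketch, not the official workflow: the helper name `log_run_details` is hypothetical, and it assumes the logger exposes `log_pipeline_parameters` and `log_metrics` as described in the CometLogger Python API reference.

```python
def log_run_details(logger, pipeline_model, metrics=None):
    """Send pipeline parameters and optional evaluation metrics to Comet.

    `logger` is assumed to be a CometLogger (or any object with the same
    two methods); `pipeline_model` is a fitted Spark ML pipeline;
    `metrics` is an optional dict of metric name to value.
    """
    # Assumed CometLogger method: records the parameters of every stage.
    logger.log_pipeline_parameters(pipeline_model)
    if metrics:
        # Assumed CometLogger method: records a dict of scalar metrics.
        logger.log_metrics(metrics)
```

A call such as `log_run_details(logger, pipeline.fit(trainDataset))` would need to run before `logger.end()`, since ending the logger closes the experiment.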
Spark NLP uses Spark MLlib Pipelines, which are natively supported by MLflow. MLflow is, as stated on its official webpage, an open source platform for the machine learning lifecycle that includes:
- MLflow Tracking: Record and query experiments: code, data, config, and results
- MLflow Projects: Package data science code in a format to reproduce runs on any platform
- MLflow Models: Deploy machine learning models in diverse serving environments
- Model Registry: Store, annotate, discover, and manage models in a central repository
For more information, please see the complete guide at Experiment Tracking.
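Because a fitted Spark NLP pipeline is an ordinary Spark MLlib `PipelineModel`, it can be logged with MLflow's Spark flavor. The following is a minimal sketch under stated assumptions: the helper name `log_spark_nlp_pipeline`, the default artifact path, and the contents of `params` are illustrative and not prescribed by Spark NLP or MLflow. It also assumes the `mlflow` package is installed and a tracking location is configured.

```python
def log_spark_nlp_pipeline(pipeline_model, params, artifact_path="spark-nlp-pipeline"):
    """Log a fitted Spark ML PipelineModel and its parameters to MLflow.

    Returns the ID of the MLflow run that holds the logged model.
    """
    # Imported lazily so the helper can be defined without MLflow installed.
    import mlflow
    import mlflow.spark

    with mlflow.start_run() as run:
        # MLflow Tracking: record the training configuration.
        mlflow.log_params(params)
        # MLflow Models: store the pipeline using the Spark flavor.
        mlflow.spark.log_model(pipeline_model, artifact_path)
        return run.info.run_id
```

The returned run ID can then be used to locate the model in the tracking UI or to register it in the Model Registry.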