sparknlp.logging.comet.CometLogger

class sparknlp.logging.comet.CometLogger(workspace=None, project_name=None, comet_mode=None, experiment_id=None, tags=None, **experiment_kwargs)

Bases: object

Logger class for Comet integration

Comet is a meta machine learning platform designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By leveraging Comet, users can track, compare, explain and reproduce their machine learning experiments.

To log a Spark NLP annotator, the annotator needs an output log path (set with setOutputLogsPath, as in the example below), because CometLogger reads the log file generated during training.

For more examples see the Spark NLP Workshop.

Parameters
workspace : str, optional

Name of the workspace in Comet, by default None

project_name : str, optional

Name of the project in Comet, by default None

comet_mode : str, optional

Mode of logging, by default None. If set to “offline” then offline mode will be used, otherwise online.

experiment_id : str, optional

ID of an existing experiment to reuse, by default None

tags : List[str], optional

List of tags for the experiment, by default None

Raises
ImportError

If the package comet-ml is not installed

Examples

Metrics produced while training an annotator can be logged as follows:

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.logging.comet import CometLogger
>>> spark = sparknlp.start()

To run an online experiment, define the logger as follows:

>>> OUTPUT_LOG_PATH = "./run"
>>> logger = CometLogger()

Then start the experiment:

>>> document = DocumentAssembler() \
...     .setInputCol("text")\
...     .setOutputCol("document")
>>> embds = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> multiClassifier = MultiClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("labels") \
...     .setBatchSize(128) \
...     .setLr(1e-3) \
...     .setThreshold(0.5) \
...     .setShufflePerEpoch(False) \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath(OUTPUT_LOG_PATH) \
...     .setMaxEpochs(1)
>>> logger.monitor(logdir=OUTPUT_LOG_PATH, model=multiClassifier)
>>> trainDataset = spark.createDataFrame(
...     [("Nice.", ["positive"]), ("That's bad.", ["negative"])],
...     schema=["text", "labels"],
... )
>>> pipeline = Pipeline(stages=[document, embds, multiClassifier])
>>> pipeline.fit(trainDataset)
>>> logger.end()

If you are using a Jupyter notebook, you can display the live web interface with:

>>> logger.experiment.display(tab='charts')

Attributes
experiment : comet_ml.Experiment

Object representing the Comet experiment

Methods

__init__([workspace, project_name, ...])

end()

Ends the experiment and the logger.

log_asset(asset_path[, metadata, step])

Uploads an asset to Comet.

log_asset_data(asset, name[, overwrite, ...])

Uploads the given data to Comet (str, binary, or JSON).

log_completed_run(log_file_path)

Submits logs of training metrics after a run has completed.

log_metrics(metrics[, step, epoch, prefix])

Submits logs of evaluation metrics.

log_parameters(parameters[, step])

Logs a dictionary (or dictionary-like object) of multiple parameters.

log_pipeline_parameters(pipeline[, stages])

Iterates over the different stages in a pyspark PipelineModel object and logs the parameters to Comet.

log_visualization(html[, name])

Uploads a NER visualization from Spark NLP Display to Comet.

monitor(logdir, model[, interval])

Monitors the training of the model and submits logs to Comet at the given interval.

end()

Ends the experiment and the logger. Submits all outstanding logs to Comet.

log_asset(asset_path, metadata=None, step=None)

Uploads an asset to Comet.

Parameters
asset_path : str

Path to the asset

metadata : dict, optional

Additional data to attach to the asset. Must be a JSON-encodable dict, by default None

step : int, optional

Used to associate the asset with a specific step, by default None

log_asset_data(asset, name, overwrite=False, metadata=None, step=None)

Uploads the given data to Comet (str, binary, or JSON).

Parameters
asset : str or bytes or dict

Data to be saved as asset

name : str

A custom file name to be displayed

overwrite : bool, optional

If True, overwrites all existing assets with the same name, by default False

metadata : dict, optional

Additional data to attach to the asset data. Must be a JSON-encodable dict, by default None

step : int, optional

Used to associate the asset data with a specific step, by default None
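
Because metadata must be a JSON-encodable dict, it can be worth validating it before uploading. The following is a minimal sketch using only the standard library. The helper name is_valid_metadata is hypothetical and is not part of CometLogger.

```python
import json
from typing import Any


def is_valid_metadata(metadata: Any) -> bool:
    """Return True if `metadata` is None or a dict that json.dumps can encode.

    Hypothetical helper for illustration; CometLogger does not provide it.
    """
    if metadata is None:
        return True  # metadata is optional
    if not isinstance(metadata, dict):
        return False
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        # TypeError: unsupported value type (for example a set)
        # ValueError: circular reference
        return False
    return True


good = {"split": "test", "epoch": 3}
bad = {"labels": {"positive", "negative"}}  # a set is not JSON-encodable

good_ok = is_valid_metadata(good)
bad_ok = is_valid_metadata(bad)
```

Here good_ok is True and bad_ok is False, so only the first dictionary would be suitable as the metadata argument.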

log_completed_run(log_file_path)

Submits logs of training metrics after a run has completed.

Parameters
log_file_path : str

Path to log file containing training metrics
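
To illustrate what turning a finished training log into metrics can look like, the sketch below parses epoch lines into dictionaries. The line format shown in the docstring and the function name parse_epoch_line are assumptions made for this example. They are not taken from the CometLogger implementation, and the actual log format of an annotator may differ.

```python
from typing import Dict, Optional, Tuple


def parse_epoch_line(line: str) -> Optional[Tuple[int, Dict[str, float]]]:
    """Parse one log line into (epoch, metrics), or None if it is not an epoch line.

    Assumed (hypothetical) format:
        "Epoch 1/2 - 0.50s - loss: 1.25 - acc: 0.75"
    """
    parts = [part.strip() for part in line.strip().split(" - ")]
    header = parts[0].split()
    if len(header) != 2 or header[0] != "Epoch":
        return None
    try:
        epoch = int(header[1].split("/")[0])
    except ValueError:
        return None
    metrics: Dict[str, float] = {}
    for part in parts[1:]:
        name, sep, value = part.partition(":")
        if not sep:
            continue  # fields without "name: value", such as the elapsed time
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # skip values that are not numeric
    return epoch, metrics


# Hypothetical log content, written by hand for this example.
sample_log = """Training started
Epoch 1/2 - 0.50s - loss: 1.25 - acc: 0.75
Epoch 2/2 - 0.48s - loss: 0.5 - acc: 1.0
"""
parsed = [result for result in map(parse_epoch_line, sample_log.splitlines()) if result]
```

Each (epoch, metrics) pair in parsed could then be submitted with log_metrics(metrics, epoch=epoch).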

log_metrics(metrics, step=None, epoch=None, prefix=None)

Submits logs of evaluation metrics.

Parameters
metrics : dict

Dictionary with key-value pairs corresponding to the measured metric and its value

step : int, optional

Used to associate the metrics with a specific step, by default None

epoch : int, optional

Used to associate the metrics with a specific epoch, by default None

prefix : str, optional

Name prefix for the metrics, by default None. This can be used, for example, to distinguish different features by name.

Examples

In this example, sklearn is used to retrieve the metrics.

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from sklearn.metrics import classification_report
>>> prediction = model.transform(testDataset)
>>> preds_df = prediction.select('labels', 'category.result').toPandas()
>>> mlb = MultiLabelBinarizer()
>>> y_true = mlb.fit_transform(preds_df['labels'])
>>> # Reuse the classes fitted on y_true so both matrices share the same columns
>>> y_pred = mlb.transform(preds_df['result'])
>>> report = classification_report(y_true, y_pred, output_dict=True)

Iterate over the report and log the metrics:

>>> for key, value in report.items():
...     logger.log_metrics(value, prefix=key)
>>> logger.end()

If you are using Spark NLP in a notebook, you can display the metrics directly with:

>>> logger.experiment.display(tab='metrics')
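
The loop above passes every value of the report to log_metrics as a dictionary. For single-label classifiers, the dictionary returned by classification_report can also contain a scalar entry such as "accuracy". The following sketch shows one way to handle both cases. The helper iter_metric_groups, the RecordingLogger stand-in, and the hand-written report are all hypothetical; only the log_metrics signature is taken from this page.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple


def iter_metric_groups(
    report: Dict[str, Any],
) -> Iterator[Tuple[Optional[str], Dict[str, float]]]:
    """Yield (prefix, metrics) pairs suitable for log_metrics(metrics, prefix=prefix).

    Nested dictionaries are yielded under their own key as prefix.
    Scalar entries are collected and yielded once, without a prefix.
    """
    scalars: Dict[str, float] = {}
    for key, value in report.items():
        if isinstance(value, dict):
            yield key, value
        else:
            scalars[key] = value
    if scalars:
        yield None, scalars


class RecordingLogger:
    """Hypothetical stand-in for CometLogger that only records its calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Optional[str], Dict[str, float]]] = []

    def log_metrics(self, metrics, step=None, epoch=None, prefix=None):
        self.calls.append((prefix, dict(metrics)))


# Hand-written report in the shape produced with output_dict=True.
report = {
    "positive": {"precision": 1.0, "recall": 0.5, "f1-score": 0.67, "support": 2},
    "negative": {"precision": 0.5, "recall": 1.0, "f1-score": 0.67, "support": 1},
    "accuracy": 0.75,
}

logger = RecordingLogger()
for prefix, metrics in iter_metric_groups(report):
    logger.log_metrics(metrics, prefix=prefix)
```

With a real CometLogger in place of RecordingLogger, the per-label groups would be logged under their label name as prefix and the scalar entries without a prefix.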

log_parameters(parameters, step=None)

Logs a dictionary (or dictionary-like object) of multiple parameters.

Parameters
parameters : dict

Parameters in a key : value form

step : int, optional

Used to associate the parameters with a specific step, by default None

log_pipeline_parameters(pipeline, stages=None)

Iterates over the different stages in a pyspark PipelineModel object and logs the parameters to Comet.

Parameters
pipeline : pyspark.ml.PipelineModel

PipelineModel object

stages : List[str], optional

Names of the stages of the pipeline to include, by default None (logs all)

Examples

The pipeline model contains the Spark NLP annotators that were fitted to a DataFrame.

>>> logger.log_pipeline_parameters(pipeline_model)

log_visualization(html, name='viz.html')

Uploads a NER visualization from Spark NLP Display to Comet.

Parameters
html : str

HTML of the Spark NLP Display visualization

name : str, optional

Name for the visualization in Comet, by default “viz.html”

Examples

This example assumes that NER chunks (entities extracted by, e.g., NerDLModel and converted by a NerConverter) are available in the column “ner_chunk”.

>>> from sparknlp_display import NerVisualizer
>>> logger = CometLogger()
>>> for idx, result in enumerate(results.collect()):
...     viz = NerVisualizer().display(
...         result=result,
...         label_col='ner_chunk',
...         document_col='document',
...         return_html=True
...     )
...     logger.log_visualization(viz, name=f'viz-{idx}.html')

monitor(logdir, model, interval=10)

Monitors the training of the model and submits logs to Comet at the given interval.

To log a Spark NLP annotator, the annotator needs an output log path (set with setOutputLogsPath), because CometLogger reads the log file generated during training.

If you are not able to monitor the live training, you can still log the training at the end with log_completed_run().

Parameters
logdir : str

Path to the output of the logs

model : AnnotatorApproach

Annotator to monitor

interval : int, optional

Interval for refreshing, by default 10
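
The general idea of monitoring a log file at a fixed interval can be sketched with the standard library alone. This is not the CometLogger implementation: the function follow_file, its parameters, and the file name training.log are hypothetical, and the callback simply collects lines instead of submitting them to Comet.

```python
import os
import tempfile
import threading
from typing import Callable, List


def follow_file(
    path: str,
    on_line: Callable[[str], None],
    interval: float,
    stop: threading.Event,
) -> None:
    """Poll `path` every `interval` seconds and pass each new complete line to `on_line`.

    Returns after `stop` is set, following one final read that drains the file.
    """
    position = 0
    while True:
        finished = stop.is_set()  # read the flag first so the last pass still drains
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                handle.seek(position)
                while True:
                    line = handle.readline()
                    if not line.endswith("\n"):
                        break  # end of file or partial line; retry on the next poll
                    on_line(line.rstrip("\n"))
                    position = handle.tell()
        if finished:
            return
        stop.wait(interval)


log_path = os.path.join(tempfile.mkdtemp(), "training.log")
seen: List[str] = []
stop = threading.Event()
watcher = threading.Thread(target=follow_file, args=(log_path, seen.append, 0.05, stop))
watcher.start()

# Simulate a training process that appends to its log file.
with open(log_path, "a", encoding="utf-8") as log:
    log.write("first line\n")
    log.flush()
    log.write("second line\n")

stop.set()
watcher.join()
```

After the watcher thread has finished, seen holds both lines in the order they were written. Running the watcher in a background thread is what allows it to be started before the (blocking) training call.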