Annotation

The basic result of a Spark NLP operation is an annotation. Its structure includes:

  • annotatorType: the type of annotator that generated the current annotation

  • begin: the start offset of the matched content, relative to the raw text

  • end: the end offset of the matched content, relative to the raw text (inclusive: in the example output in this section, "We" spans 0 to 1)

  • result: the main output of the annotation

  • metadata: a key-value map holding the matched content and additional information

  • embeddings: (new in 2.0) contains vector mappings if required

This object is generated automatically by annotators after a transform process. No manual work is required. However, it is important to understand the structure of an annotation clearly in order to use it efficiently.
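To make the field list above concrete, here is a minimal plain-Python sketch of the same shape. The class name AnnotationSketch is hypothetical and is not the Spark NLP class; the Python types are assumptions chosen for illustration, and the values are copied from the first row of the example output in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in mirroring the documented fields (not the library class)."""
    annotatorType: str                       # type of annotator that produced it
    begin: int                               # start offset in the raw text
    end: int                                 # end offset in the raw text
    result: str                              # main output of the annotation
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)  # assumed type


# Values taken from the first "pos" row of the example output in this section.
first = AnnotationSketch("pos", 0, 1, "PRP", {"word": "We"})
print(first.annotatorType, first.result, first.metadata["word"])
```

In a real pipeline these objects come from the annotators themselves; the sketch only helps when reading the columns of the output.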

For example, the annotations produced by a pretrained pipeline could look like this:

>>> from sparknlp.pretrained import PretrainedPipeline
>>> explain_document_pipeline = PretrainedPipeline("explain_document_ml")
explain_document_ml download started this may take some time.
Approx size to download 9.1 MB
[OK!]
>>> data = spark.createDataFrame([["We are very happy about Spark NLP"]]).toDF("text")
>>> result = explain_document_pipeline.model.transform(data).selectExpr("explode(pos)")
>>> result.show(truncate=False)
+---------------------------------------+
|col                                    |
+---------------------------------------+
|[pos, 0, 1, PRP, [word -> We], []]     |
|[pos, 3, 5, VBP, [word -> are], []]    |
|[pos, 7, 10, RB, [word -> very], []]   |
|[pos, 12, 16, JJ, [word -> happy], []] |
|[pos, 18, 22, IN, [word -> about], []] |
|[pos, 24, 28, NNP, [word -> Spark], []]|
|[pos, 30, 32, NNP, [word -> NLP], []]  |
+---------------------------------------+
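The begin and end values in the output above can be checked against the input sentence. The following plain-Python sketch needs no Spark session; the rows list is a hand-copied version of the output above, and the variable names are illustrative assumptions. It suggests that end is an inclusive offset, so slicing needs end + 1.

```python
text = "We are very happy about Spark NLP"

# (begin, end, result, word) copied by hand from the output above.
rows = [
    (0, 1, "PRP", "We"),
    (3, 5, "VBP", "are"),
    (7, 10, "RB", "very"),
    (12, 16, "JJ", "happy"),
    (18, 22, "IN", "about"),
    (24, 28, "NNP", "Spark"),
    (30, 32, "NNP", "NLP"),
]

for begin, end, tag, word in rows:
    # end is inclusive in this output, so the Python slice stops at end + 1.
    span = text[begin:end + 1]
    assert span == word
    print(f"{span:<6} {tag}")
```

Each slice matches the word stored in the metadata column, which is a quick sanity check when mapping annotations back to the original text.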