Licensed Release Notes

 

2.4.6

Overview

We release Spark NLP Enterprise 2.4.6 to fix some minor bugs.

Bugfixes

  • Updated the IDF value calculation to be probability-based, log[(N - df_t) / df_t + 1], as opposed to log[N / df_t]
  • Fixed the TFIDF cosine distance, which was being calculated with the rooted norms rather than with the original squared norms
  • Validation of label cols is now performed at the beginning of EnsembleEntityResolver
  • The environment variable for the license value is now named jsl.settings.license
  • DocumentLogRegClassifier can now be serialized from Python (fixes a bug introduced with the implementation of RecursivePipelines and the LazyAnnotator attribute)
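
As a rough illustration of the IDF change listed above, the following Python sketch compares the classic form log[N / df_t] with the textbook probabilistic form log[(N - df_t) / df_t]. The function names are hypothetical and this is not the library's implementation. The +1 smoothing term from the release note is deliberately left out, because its exact grouping is not clear from the text.

```python
import math


def idf_classic(n_docs: int, df: int) -> float:
    """Classic IDF: log(N / df_t)."""
    if not 0 < df <= n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return math.log(n_docs / df)


def idf_probabilistic(n_docs: int, df: int) -> float:
    """Textbook probabilistic IDF: log((N - df_t) / df_t).

    Assumption: no smoothing term is applied, so the value is
    undefined when the term occurs in every document (df == N).
    """
    if not 0 < df < n_docs:
        raise ValueError("df must satisfy 0 < df < n_docs")
    return math.log((n_docs - df) / df)


if __name__ == "__main__":
    # Compare both variants for a rare, a common and a very common term.
    for df in (10, 50, 90):
        print(df, idf_classic(100, df), idf_probabilistic(100, df))
```

The probabilistic form reaches zero for a term that appears in half of the documents and becomes negative beyond that point, whereas the classic form never drops below zero.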

2.4.5

Overview

We are glad to announce Spark NLP Enterprise 2.4.5. As a new feature, we are happy to introduce our new EnsembleEntityResolver, which allows our Entity Resolution architecture to scale up by multiple orders of magnitude and handle datasets of millions of records with a sub-logarithmic increase in computation. We also enhanced our ChunkEntityResolverModel with 5 new distance calculations, along with weighting-array and aggregation-strategy params, which give more levers to fine-tune its performance against a given dataset.

New Features

  • EnsembleEntityResolver consisting of an integrated TFIDF-Logreg classifier in the first layer + multiple ChunkEntityResolvers in the second layer (one per class)
  • Five (5) new distance calculations for ChunkEntityResolver, namely:
    • Token Based: TFIDF-Cosine, Jaccard, SorensenDice
    • Character Based: JaroWinkler and Levenshtein
  • Weight parameter that works as a multiplier for each distance result to be considered during their aggregation
  • Three (3) aggregation strategies for the distances enabled in a particular instance, namely: AVERAGE, MAX and MIN
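
To make the new options more concrete, here is a minimal Python sketch of two of the token-based distances, one character-based distance, and the weight plus aggregation step. All function names and the dictionary-based interface are assumptions made for illustration. The actual ChunkEntityResolver parameters and internals may differ, and TFIDF-Cosine and JaroWinkler are omitted for brevity.

```python
from typing import Dict


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over whitespace-separated token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def sorensen_dice_distance(a: str, b: str) -> float:
    """1 - 2|A & B| / (|A| + |B|) over whitespace-separated token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - 2 * len(ta & tb) / (len(ta) + len(tb))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def aggregate(distances: Dict[str, float],
              weights: Dict[str, float],
              strategy: str = "AVERAGE") -> float:
    """Multiply each distance by its weight (default 1.0), then combine."""
    if not distances:
        raise ValueError("at least one distance is required")
    weighted = [value * weights.get(name, 1.0)
                for name, value in distances.items()]
    if strategy == "AVERAGE":
        return sum(weighted) / len(weighted)
    if strategy == "MAX":
        return max(weighted)
    if strategy == "MIN":
        return min(weighted)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    chunk, candidate = "chest pain", "acute chest pain"
    enabled = {
        "jaccard": jaccard_distance(chunk, candidate),
        "sorensen_dice": sorensen_dice_distance(chunk, candidate),
    }
    print(aggregate(enabled, {"jaccard": 2.0}, "MAX"))
```

In this sketch a weight above 1.0 makes a distance count for more in the aggregated value, and a weight below 1.0 makes it count for less.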

Enhancements

  • ChunkEntityResolver can now compute distances over all the neighbours found and return the metadata just for the best alternatives that meet the threshold; before it would calculate them over the neighbours and return them all in the metadata
  • ChunkEntityResolver now has an extramassPenalty parameter to account for penalization of token-length differences between compared strings
  • Metadata for the ChunkEntityResolver has been updated accordingly to reflect all new features
  • StringDistances class has been included in utils to aid in the calculation and organization of different types of distances for Strings
  • HasFeaturesJsl trait has been included to support the serialization of Features including [T] <: AnnotatorModel[T] types

Bugfixes

  • Frequency calculation for WMD in ChunkEntityResolver has been adjusted to account for real word count representation
  • AnnotatorType for DocumentLogRegClassifier has been changed to CATEGORY to align with classifiers in Open Source library

Deprecations

  • Legacy EntityResolver{Approach, Model} classes have been deprecated in favor of ChunkEntityResolver classes
  • ChunkEntityResolverSelector classes have been deprecated in favor of EnsembleEntityResolver

2.4.2

Overview

We are glad to announce Spark NLP Enterprise 2.4.2. As a new feature, we are happy to introduce our new Disambiguation Annotator, which lets users resolve different kinds of entities based on knowledge bases provided in the form of records in a RocksDB database. We also enhanced and fixed the DocumentLogRegClassifier, ChunkEntityResolverModel and ChunkEntityResolverSelector Annotators.

New Features

  • Disambiguation Annotator (NerDisambiguator and NerDisambiguatorModel) which accepts annotator types CHUNK and SENTENCE_EMBEDDINGS and returns DISAMBIGUATION annotator type. This output annotation type includes all the matches in the result and their similarity scores in the metadata.

Enhancements

  • ChunkEntityResolver Annotator now supports both EUCLIDEAN and COSINE distance for the KNN search and WMD calculation.
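
The choice between the two distances can change which neighbour is considered closest. The sketch below, in plain Python with hypothetical helper names, is only meant to show that effect on toy vectors and does not reflect how the annotator is implemented.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean_distance(u: Vector, v: Vector) -> float:
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; depends only on the angle between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query: Vector,
            candidates: List[Vector],
            distance: Callable[[Vector, Vector], float]) -> Vector:
    """Return the candidate with the smallest distance to the query."""
    return min(candidates, key=lambda c: distance(query, c))


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [[10.0, 0.0], [1.0, 1.0]]
    # Same direction but far away vs. close by but in a different direction.
    print(nearest(query, candidates, euclidean_distance))
    print(nearest(query, candidates, cosine_distance))
```

With these toy vectors the Euclidean distance favours the candidate that is close in absolute position, while the cosine distance favours the one pointing in the same direction regardless of its length.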

Bugfixes

  • Fixed a bug in DocumentLogRegClassifier Annotator to support its serialization to disk.
  • Fixed a bug in ChunkEntityResolverSelector Annotator to group by both SENTENCE and CHUNK at the time of forwarding tokens and embeddings to the lazy annotators.
  • Fixed a bug in ChunkEntityResolverModel in which identical embeddings were not included in the neighbours.

2.4.1

Overview

Introducing Spark NLP Enterprise 2.4.1 after all the feedback we received in the form of issues and suggestions on our different communication channels. Even though 2.4.0 was very stable, version 2.4.1 is here to address minor bug fixes that we summarize in the following lines.

Bugfixes

  • Changed the license Spark property key to “jsl” instead of “sparkjsl”, as the latter generated inconsistencies
  • Fixed the alignment logic for tokens and chunks in the ChunkEntityResolverSelector; when tokens and chunks did not have the same begin-end indexes, the resolution was not executed

2.4.0

Overview

We are glad to announce Spark NLP Enterprise 2.4.0. This is an important release because of several refactorings in the core library, plus the introduction of several state-of-the-art algorithms, new features and enhancements. We have included several architecture and performance improvements that aim to make the library more robust in terms of storage handling for Big Data. On the NLP side, we have introduced a ContextualParser, a DocumentLogRegClassifier and a ChunkEntityResolverSelector. The last two Annotators also target execution time and memory consumption: when designed following a hierarchical pattern, they lower the order of computation and the amount of data loaded into memory in each step. We have put a big effort into this one, so please enjoy and share your comments. Thank you very much for your questions, bug reports and feedback; they are always welcome and much appreciated through all our different channels.

New Features

  • BigChunkEntityResolver Annotator: New experimental approach to reduce memory consumption at the expense of disk IO.
  • ContextualParser Annotator: New entity parser that works based on context parameters defined in a JSON file.
  • ChunkEntityResolverSelector Annotator: New AnnotatorModel that takes advantage of the RecursivePipelineModel + LazyAnnotator pattern to annotate with different LazyAnnotators at runtime.
  • DocumentLogRegClassifier Annotator: New Annotator that provides a wrapped TFIDF Vectorizer + LogReg Classifier for TOKEN AnnotatorTypes (either at Document level or Chunk level)

Enhancements

  • normalizedColumn Param is no longer required in ChunkEntityResolver Annotator (defaults to the labelCol Param value).
  • ChunkEntityResolverMetadata now has more data to infer whether the match is meaningful or not.

Bugfixes

  • Fixed a bug in the ContextSpellChecker Annotator where tokens not in the vocabulary would cause an exception.
  • Fixed a bug in the ChunkEntityResolver Annotator where undetermined results were coming out of negligible confidence scores for matches.
  • Fixed a bug in the ChunkEntityResolver Annotator where the search would fail if the neighbours Param was greater than the number of nodes in the tree. Now it returns up to the number of nodes in the tree.
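
The last fix amounts to capping the requested number of neighbours at the size of the search structure. Below is a minimal brute-force sketch of that behaviour, using a hypothetical k_nearest helper over a flat list of points rather than the tree the annotator uses.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_nearest(nodes: List[Point], query: Sequence[float], k: int) -> List[Point]:
    """Return up to k nodes closest to the query, nearest first.

    If k exceeds the number of nodes, all nodes are returned
    instead of raising an error.
    """
    if k <= 0:
        return []
    k = min(k, len(nodes))  # cap at the number of available nodes
    return heapq.nsmallest(k, nodes, key=lambda node: math.dist(node, query))


if __name__ == "__main__":
    nodes = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    # Asking for 10 neighbours from 3 nodes yields all 3, ordered by distance.
    print(k_nearest(nodes, (0.9, 0.9), 10))
```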

Deprecations

  • OCR moves to its own JSL Spark OCR project.

Infrastructure

  • A Spark NLP License is now required to use the library. Please follow the instructions in the shared email.