Additional Resources
Browse through our collection of videos and blogs to deepen your knowledge and experience with Spark NLP.
Spark NLP FAQ
What are the requirements of Spark NLP?
The library works mainly on top of Apache Spark; over time we have also incorporated the free Amazon AWS API, Google TensorFlow, and Facebook RocksDB.
Make sure you have a working Spark environment and supply Spark NLP as a jar to the SparkSession classpath. The rest should be handled automatically.
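As a minimal sketch of what that looks like in Python (the Maven coordinate and version below are placeholder assumptions; check the README for the one matching your Spark and Scala versions):

```python
# Sketch: supplying Spark NLP to the SparkSession classpath via
# spark.jars.packages. The coordinate/version is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-nlp-example")
         .config("spark.jars.packages",
                 "com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.0")
         .getOrCreate())
```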
Does Spark NLP rely on any other NLP library?
No, Spark NLP is self-contained and all algorithms are implemented within its code base.
What do I need to learn in order to use the library?
Either Scala or Python, and then, mostly, Spark and Spark ML. Spark NLP uses the same logic and syntax as any other machine learning transformer in Spark, and can be included in the same pipelines. After reviewing a few examples you should be good to go.
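For instance, here is a small sketch of a Spark NLP annotator sitting inside an ordinary Spark ML Pipeline (column and variable names are arbitrary; `spark` is the session from the snippet above):

```python
# Sketch: Spark NLP stages plug into a regular Spark ML Pipeline
# like any other estimator or transformer.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])

data = spark.createDataFrame(
    [("Spark NLP uses plain Spark ML pipelines.",)], ["text"])
model = pipeline.fit(data)
model.transform(data).show()
```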
Can I save trained models or pipelines?
Yes, the same way you would do it for any other Spark ML component.
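A quick sketch, reusing the fitted `model` from above (the path is an arbitrary example and can point to local disk, HDFS, S3, etc.):

```python
# Standard Spark ML persistence works unchanged for Spark NLP pipelines.
from pyspark.ml import PipelineModel

model.write().overwrite().save("/tmp/my_nlp_pipeline")
restored = PipelineModel.load("/tmp/my_nlp_pipeline")
```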
Is Spark NLP in Python as good as its Scala counterpart?
This question boils down to Spark in Scala vs. Python: we rely on the exact same implementation underneath. Spark NLP doesn't use Python UDFs, so serialization happens on the JVM rather than through inter-process communication. Feature-wise, it is about the same, with a few exceptions in some utility classes (such as the ResourceHelper or the Annotation implicit functions).
Can I contribute?
Yes! Any kind of contribution is welcome: feedback, ideas, management, documentation, testing, corpora for training and testing, development, or even code review. Refer to the contribute page for more information.
Installation and Infrastructure
Does Spark NLP work with Jupyter Notebooks / Apache Zeppelin?
Yes, check out our Get Started page or the README on our GitHub repository.
Does Spark NLP work with Databricks, Cloudera, Azure, |insert-your-preferred-framework-here|?
Spark NLP has been tested on various frameworks and should work as expected. Sometimes a few tricks here and there are required, but feel free to report any issue or jump into our Slack channel for help.
When using Spark NLP in Python, I get a "'JavaPackage' object is not callable" error.
This means your Python setup is not finding the Spark NLP jar on the JVM classpath. Make sure you pass it correctly via --jars or --packages for your framework of choice.
Library Concepts
What are annotator types?
Each annotator has a type that may be shared with other annotators. Whenever an annotator requires another annotator by type, it means you can pass to inputCols any annotator's output column of that type. For instance, Normalizer and SpellChecker are both token-type annotators, so either or both may be used to feed a sentiment analysis model.
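A brief sketch of that wiring (column names are arbitrary; "token" is the Tokenizer output column from the pipeline example above):

```python
# Sketch: Normalizer consumes TOKEN-type annotations, so any column
# produced by a token-type annotator can be passed to its inputCols.
from sparknlp.annotator import Normalizer

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")
```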
When should I use LightPipelines?
Sometimes you need to test only a few sentences, or run against a very small dataset that fits on a single machine. For these scenarios, LightPipelines are much faster than Spark's Pipeline, since there is no driver-executor relationship involved. LightPipelines rely on multi-threaded parallel computation, which makes them much faster, particularly for real-time or streaming demands. Rough estimates show that LightPipelines are worth it below roughly 50k rows (depending on the size of the sentences and the pipeline itself).
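A minimal sketch, wrapping the fitted `model` from the earlier example:

```python
# Sketch: a LightPipeline annotates plain strings on the driver,
# bypassing the executors entirely.
from sparknlp.base import LightPipeline

light = LightPipeline(model)  # `model` is the fitted PipelineModel above
result = light.annotate("This runs locally, without a Spark job.")
print(result["token"])        # keys are the annotators' output columns
```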
Should I always use RecursivePipelines instead of Spark ML Pipelines?
As of this writing there isn't much benefit to RecursivePipelines, but ideally, yes. Even though only a few annotators utilize RecursivePipelines internally, the default behavior falls back to that of the common Pipeline, so there is nothing to lose.
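A minimal sketch, assuming the RecursivePipeline class exposed in sparknlp.base; it is used as a drop-in replacement for Spark ML's Pipeline (reusing the stages and data defined earlier):

```python
# Sketch: RecursivePipeline has the same interface as Pipeline; when no
# stage needs the recursive behavior, it simply acts like a Pipeline.
from sparknlp.base import RecursivePipeline

recursive_pipeline = RecursivePipeline(stages=[document_assembler, tokenizer])
recursive_model = recursive_pipeline.fit(data)
```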
How do word embeddings work?
We use an internal, cluster-supported technique to index a database of vectors, backed by RocksDB. The index is then injected into all Spark workers so that every word-embedding lookup can be resolved locally at high speed.
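As a rough sketch of what this looks like from the user's side (the class and model name below follow the Spark NLP 2.x API and should be treated as assumptions; older releases configure embeddings differently):

```python
# Hypothetical usage sketch (Spark NLP 2.x-style API): a pretrained
# embeddings model resolves each token to its vector locally on every
# worker via the RocksDB-backed index described above.
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```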
Models and Support
Are there any pretrained models to work with?
Yes, there is a pretrained API that automatically downloads pre-trained models from our S3 servers.
Links for downloading them for offline use are available in the GitHub README.
Note that these models are not meant for production use, but merely as examples for experimentation. Only English models are available.
Finally, aside from Pipelines, each downloadable model is a specific stage of a pipeline; there is no pretrained model "for everything".
Spark NLP is a modular library.
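For example, fetching a pretrained pipeline by name (a sketch; the pipeline name and language code are typical examples, so check the README for the current list):

```python
# Sketch: the pretrained API downloads a named pipeline from S3 on
# first use, then runs it like any other pipeline.
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_ml", lang="en")
annotations = pipeline.annotate("Pretrained pipelines bundle specific stages.")
```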
I need help, is there any support or chat?
We use Slack; check our homepage to request an invite. John Snow Labs offers commercial support for the library, with data, pretrained models, and infrastructure help (Big Data and Security). Find out more at https://www.johnsnowlabs.com
Troubleshooting
I am getting a Java core dump when running an OCR transformation using Tesseract
Set the LC_ALL=C environment variable before launching your job.
Getting 'org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed' when running an OCR transformation
Add '--packages com.github.jai-imageio:jai-imageio-jpeg2000:1.3.0' to your Spark job. This library is non-free, so we can't include it as a Spark NLP dependency by default.