Johnsnowlabs provides the following nodes, which can be used inside the Haystack framework for scalable pre-processing and embedding on Spark clusters. With them you can build scalable, production-grade LLM and RAG applications. See the Haystack with Johnsnowlabs Tutorial Notebook.
JohnSnowLabsHaystackProcessor
Pre-process your documents in a scalable fashion in Haystack. The processor is based on Spark NLP's DocumentCharacterTextSplitter and supports all of its parameters.
# Create Pre-Processor which is connected to spark-cluster
from johnsnowlabs.llm import embedding_retrieval
processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
chunk_overlap=2,
chunk_size=20,
explode_splits=True,
keep_seperators=True,
patterns_are_regex=False,
split_patterns=["\n\n", "\n", " ", ""],
trim_whitespace=True,
)
# Process document distributed on a spark-cluster
processor.process(some_documents)
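To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal plain-Python sketch of the sliding-window idea behind character splitting. This is an illustration only, not Spark NLP's actual implementation, which additionally honors `split_patterns`, separator handling, and whitespace trimming:

```python
# Illustration of chunk_size / chunk_overlap semantics (not the
# actual Spark NLP splitter): consecutive chunks share
# `chunk_overlap` characters.
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks(
    "John Snow Labs provides scalable NLP.",
    chunk_size=20,
    chunk_overlap=2,
)
# Each chunk is at most 20 characters, and the last 2 characters of
# one chunk are repeated at the start of the next.
```

With `chunk_size=20` and `chunk_overlap=2` as in the processor above, each chunk holds at most 20 characters and repeats the trailing 2 characters of its predecessor, so no split point loses surrounding context entirely.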
JohnSnowLabsHaystackEmbedder
Scalable embedding computation in Haystack with any sentence embedding model from John Snow Labs. You must provide the NLU reference of a sentence embedding model to load it. If you want to use a GPU with the embedding model, set use_gpu=True; on localhost this starts a Spark session with GPU jars. For clusters, you must set up the cluster environment correctly; using nlp.install_to_databricks() is recommended.
from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore
# Write some processed data to Doc store, so we can retrieve it later
document_store = InMemoryDocumentStore(embedding_dim=512)
document_store.write_documents(some_documents)
# Create Embedder which is connected to spark-cluster
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
embedding_model='en.embed_sentence.bert_base_uncased',
document_store=document_store,
use_gpu=False,
)
# Compute Embeddings distributed in a cluster
document_store.update_embeddings(retriever)
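Once embeddings are written to the store, a query is answered by embedding it with the same model and ranking stored documents by vector similarity. The following is a conceptual plain-Python sketch of that ranking step (not the Haystack or johnsnowlabs API), using cosine similarity:

```python
import math

# Conceptual sketch of embedding retrieval: rank stored document
# embeddings by cosine similarity to the query embedding.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, doc_embs, top_k=1):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_embs)),
        key=lambda i: cosine(query_emb, doc_embs[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

In practice the document store performs this search over the 512-dimensional vectors computed by the embedder above; the sketch only shows the scoring principle.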