Utilities for Haystack


Johnsnowlabs provides the following nodes, which can be used inside the Haystack framework for scalable pre-processing and embedding on Spark clusters. With these you can build scalable, production-grade LLM and RAG applications. See the Haystack with Johnsnowlabs Tutorial Notebook.

JohnSnowLabsHaystackProcessor

Pre-process your documents in a scalable fashion in Haystack. The processor is based on Spark NLP's DocumentCharacterTextSplitter and supports all of its parameters.

# Create Pre-Processor which is connected to spark-cluster
from johnsnowlabs.llm import embedding_retrieval
processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process document distributed on a spark-cluster
processor.process(some_documents)
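The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified, pure-Python sketch. This is only an illustration of the sliding-window idea; the actual DocumentCharacterTextSplitter additionally splits recursively on the `split_patterns` and handles separators and whitespace trimming:

```python
def chunk(text, chunk_size, chunk_overlap):
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("abcdefghij", chunk_size=5, chunk_overlap=2)
print(chunks)  # first chunk is "abcde", second starts 3 chars later: "defgh"
```

A larger `chunk_overlap` preserves more context across chunk boundaries at the cost of redundant text in the document store.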

JohnSnowLabsHaystackEmbedder

Scalable embedding computation with any sentence embedding model from John Snow Labs in Haystack. You must provide the NLU reference of a sentence embedding model to load it. If you want to use a GPU with the embedding model, set use_gpu=True; on localhost this starts a Spark session with GPU jars. For clusters, you must set up the cluster environment correctly; using nlp.install_to_databricks() is recommended.

from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore

# Write some processed data to Doc store, so we can retrieve it later
document_store = InMemoryDocumentStore(embedding_dim=512)
document_store.write_documents(some_documents)

# Create Embedder which is connected to the spark-cluster
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model='en.embed_sentence.bert_base_uncased',
    document_store=document_store,
    use_gpu=False,
)

# Compute Embeddings distributed in a cluster
document_store.update_embeddings(retriever)
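Once embeddings are written to the store, a retriever ranks documents by vector similarity between the query embedding and each document embedding. A minimal sketch of that underlying idea, using cosine similarity over hypothetical 2-dimensional vectors (the real embedder above produces 512-dimensional vectors, matching embedding_dim):

```python
def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings, as a retriever would read from the store
doc_embeddings = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8]}
query_embedding = [1.0, 0.0]

# Rank documents by similarity to the query, most similar first
ranked = sorted(doc_embeddings,
                key=lambda d: cosine(query_embedding, doc_embeddings[d]),
                reverse=True)
print(ranked)  # "doc_a" ranks first: it points in the same direction as the query
```

In practice you never compute this by hand; Haystack's retriever performs the similarity search against the document store for you.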
