SparkNLP - Notebook


Sentiment Analysis using John Snow Labs Spark-NLP

The following code builds a complete machine learning pipeline that performs sentiment analysis on a target distributed DataFrame (or Dataset). It uses the annotators required for such an analysis plus an optional one: a spell checker (whose annotation type is Token), injected between the tokenizer and the sentiment stage. The sentiment annotator is ViveknSentimentApproach, trained on two corpora, one containing positive examples and the other containing negative ones. A few additional parameters fine-tune the pipeline, and the Finisher transformer ensures the final result is exactly the sentiment of every row.

# Imports
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher
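The cells below assume an active SparkSession bound to the name spark, with the Spark-NLP package on the classpath. A minimal sketch for creating one; the package coordinates are an assumption and should match your installed Spark-NLP version:

from pyspark.sql import SparkSession

# Assumption: replace the coordinates with the Spark-NLP version you actually use.
spark = SparkSession.builder \
    .appName("spark-nlp-sentiment") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:1.2.3") \
    .getOrCreate()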
# Load the input data to be annotated
data = spark.read.parquet("../sentiment.parquet").limit(1000)
data.cache()
data.count()
data.show()
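If sentiment.parquet is not at hand, any DataFrame with a text column will do. A hypothetical stand-in:

# Hypothetical stand-in for the parquet file: a tiny DataFrame with a "text" column.
data = spark.createDataFrame(
    [("I really enjoyed this movie",), ("this was a terrible experience",)],
    ["text"]
)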

### Document assembler: turn the raw text column into document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
### Transform input to appropriate schema
#assembled = document_assembler.transform(data)

### Sentence detector
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
#sentence_data = sentence_detector.transform(assembled)
### Tokenizer
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
#tokenized = tokenizer.transform(sentence_data)
### Spell Checker
spell_checker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")
#checked = spell_checker.fit(tokenized).transform(tokenized)
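The spell checker can also be trained against a custom wordlist. A sketch, assuming a newer Spark-NLP release where NorvigSweetingApproach exposes setDictionary, and a hypothetical words.txt with one word per line:

# Assumption: setDictionary is available in newer Spark-NLP releases;
# "../words.txt" is a hypothetical one-word-per-line dictionary file.
spell_checker_with_dict = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("../words.txt")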
### Sentiment detector, trained on a positive and a negative corpus
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment") \
    .setPositiveSource("../positive") \
    .setNegativeSource("../negative") \
    .setPruneCorpus(False)
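To experiment without the original training data, setPositiveSource and setNegativeSource can point at small plain-text files. The sketch below writes hypothetical one-example-per-line corpora; the paths and contents are placeholders, not the original data:

# Hypothetical sample corpora, one example per line; replace with real data.
with open("../positive", "w") as f:
    f.write("I love this movie\nwhat a great day\n")
with open("../negative", "w") as f:
    f.write("I hate waiting in line\nthis food is awful\n")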

### Finisher: flatten the sentiment annotations into plain strings
finisher = Finisher() \
    .setInputCols(["sentiment"]) \
    .setIncludeKeys(True) \
    .setCleanAnnotations(False)
### Assemble and run the full pipeline
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    spell_checker,
    sentiment_detector,
    finisher
])

sentiment_data = pipeline.fit(data).transform(data)
sentiment_data.show()
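Because setCleanAnnotations(False) keeps the intermediate annotation columns, the flattened result typically lands in a finished_sentiment column; that default naming is an assumption here, so check sentiment_data.printSchema() if your version differs:

# "finished_sentiment" is the Finisher's default output name for the
# "sentiment" input column; verify with sentiment_data.printSchema().
sentiment_data.select("text", "finished_sentiment").show(truncate=False)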
 
