sparknlp.training.PubTator

class sparknlp.training.PubTator[source]

Bases: sparknlp.internal.ExtendedJavaWrapper

The PubTator format stores medical papers' titles, abstracts, and tagged entity chunks.

For more information see PubTator Docs and MedMentions Docs.

readDataset() is used to create a Spark DataFrame from a PubTator text file.

Input File Format:

25763772        0       5       DCTN4   T116,T123       C4308010
25763772        23      63      chronic Pseudomonas aeruginosa infection        T047    C0854135
25763772        67      82      cystic fibrosis T047    C0010674
25763772        83      120     Pseudomonas aeruginosa (Pa) infection   T047    C0854135
25763772        124     139     cystic fibrosis T047    C0010674
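Each annotation line is tab-separated: document ID, start offset, end offset, entity chunk, semantic type(s) (comma-separated), and concept ID. As a rough illustration of that layout only (this is plain Python, not part of Spark NLP, and the names below are hypothetical), a single line can be split into its fields like this:

```python
# Illustration of the PubTator annotation layout; Spark NLP's readDataset()
# parses these files internally, so this sketch is for understanding only.
from typing import List, NamedTuple


class PubTatorAnnotation(NamedTuple):  # hypothetical helper type
    doc_id: str
    start: int
    end: int
    chunk: str
    semantic_types: List[str]  # e.g. ["T116", "T123"]
    concept_id: str            # e.g. "C4308010"


def parse_annotation(line: str) -> PubTatorAnnotation:
    # Columns are tab-separated in the order shown in the sample above.
    doc_id, start, end, chunk, types, cid = line.rstrip("\n").split("\t")
    return PubTatorAnnotation(
        doc_id, int(start), int(end), chunk, types.split(","), cid
    )


ann = parse_annotation("25763772\t0\t5\tDCTN4\tT116,T123\tC4308010")
print(ann.chunk, ann.semantic_types)  # DCTN4 ['T116', 'T123']
```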

Examples

>>> from sparknlp.training import PubTator
>>> pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
>>> pubTatorDataSet = PubTator().readDataset(spark, pubTatorFile)
>>> pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|  doc_id|      finished_token|        finished_pos|        finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...|   [[sentence, 0], [...| [[word, DCTN4], [...|   [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+

Methods

__init__()

apply()

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0.

new_java_array_integer(pylist)

new_java_array_string(pylist)

new_java_obj(java_class, *args)

readDataset(spark, path[, isPaddedToken])

Reads the dataset from an external resource.

new_java_array(pylist, java_class)

ToDo: Inspired by Spark 2.0. Review if Spark changes.

readDataset(spark, path, isPaddedToken=True)[source]

Reads the dataset from an external resource.

Parameters
spark : pyspark.sql.SparkSession

Initiated Spark session with Spark NLP

path : str

Path to the resource

isPaddedToken : bool, optional

Whether tokens are padded, by default True

Returns
pyspark.sql.DataFrame

Spark DataFrame with the data