sparknlp.training.pub_tator

Contains helper classes for PubTator datasets.

Module Contents

Classes

PubTator

The PubTator format includes medical papers' titles, abstracts, and tagged chunks.

class PubTator

The PubTator format includes medical papers’ titles, abstracts, and tagged chunks.

For more information see PubTator Docs and MedMentions Docs.

readDataset() is used to create a Spark DataFrame from a PubTator text file.

Input File Format:

25763772        0       5       DCTN4   T116,T123       C4308010
25763772        23      63      chronic Pseudomonas aeruginosa infection        T047    C0854135
25763772        67      82      cystic fibrosis T047    C0010674
25763772        83      120     Pseudomonas aeruginosa (Pa) infection   T047    C0854135
25763772        124     139     cystic fibrosis T047    C0010674

Examples

>>> from sparknlp.training import PubTator
>>> pubTatorFile = "./src/test/resources/corpus_pubtator_sample.txt"
>>> pubTatorDataSet = PubTator().readDataset(spark, pubTatorFile)
>>> pubTatorDataSet.show(1)
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|  doc_id|      finished_token|        finished_pos|        finished_ner|finished_token_metadata|finished_pos_metadata|finished_label_metadata|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
|25763772|[DCTN4, as, a, mo...|[NNP, IN, DT, NN,...|[B-T116, O, O, O,...|   [[sentence, 0], [...| [[word, DCTN4], [...|   [[word, DCTN4], [...|
+--------+--------------------+--------------------+--------------------+-----------------------+---------------------+-----------------------+
readDataset(self, spark, path, isPaddedToken=True)

Reads the dataset from an external resource.

Parameters:
spark : pyspark.sql.SparkSession

Initialized Spark session with Spark NLP

path : str

Path to the resource

isPaddedToken : bool, optional

Whether tokens are padded, by default True

Returns:
pyspark.sql.DataFrame

Spark DataFrame with the data
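The `finished_token` and `finished_ner` columns of the returned DataFrame hold positionally aligned arrays, so IOB-style token/tag pairs can be recovered by zipping them. A sketch using abbreviated values from the example output above (the literal lists stand in for a row collected from the DataFrame):

```python
# Abbreviated from the sample output of pubTatorDataSet.show(1);
# a real row would be obtained with e.g. pubTatorDataSet.first().
tokens = ["DCTN4", "as", "a"]
tags = ["B-T116", "O", "O"]

# The arrays are index-aligned, so zip pairs each token with its NER tag.
pairs = list(zip(tokens, tags))
```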