pubmed_parser: Possible error while parsing structured abstracts.

Hi, first of all big thanks for this life-saver of a package.

I think there is some problem with parsing XML for structured abstracts. Consider the following example:

        <Abstract>
          <AbstractText Label="" NlmCategory="UNASSIGNED">
            <b>Patient: Female, 16</b>
            <b>Final Diagnosis: Pelvic mass</b>
            <b>Symptoms: None</b>
            <b>Medication: None</b>
            <b>Clinical Procedure: CT • MRI</b>
            <b>Specialty: Diagnostic radiology • pediatrics.</b>
          </AbstractText>
          <AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE">
            <b>Unusual presentation of unknown etiology, Rare disease, Mistake in diagnosis.</b>
          </AbstractText>
          <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Müllerian anomalies encompass a wide variety of malformations in the female genital tract, usually associated with renal and anorectal malformations. Of these anomalies, approximately 11% are uterus didelphys, which occurs when midline fusion of the müllerian ducts is arrested to a variable extent.</AbstractText>
          <AbstractText Label="CASE REPORT" NlmCategory="METHODS">We report the case of a 16-year-old female with uterine didelphys, jejunal malrotation, hematometra, hematosalpinx, and bilateral subcentimeter homogenous circular cystic-like renal lesions, who initially presented with left lower quadrant abdominal pain, non-bloody vomiting, and a history of irregular menstrual periods. Initial CT was confusing for an adnexal cystic mass, but further imaging disclosed the above müllerian anomalies.</AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Müllerian anomalies may mimic other, more common, adnexal lesions; thus, adequate evaluation of suspicious cystic adnexal masses with multiple and advanced imaging modalities such as MRI is essential for adequate diagnosis and management.</AbstractText>
        </Abstract>

The parse returned by medline_parser is as follows:

'Patient: Female, 16\n            Final Diagnosis: Pelvic mass\n            Symptoms: None\n            Medication: None\n            Clinical Procedure: CT \u2022 MRI\n            Specialty: Diagnostic radiology \u2022 pediatrics.'

As you can see, the output contains only the first, unlabeled section and misses the rest of the abstract. I wonder whether this happens for all structured abstracts or only for some of them. For reference, the file I’m using is ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0763.xml.gz and the PMID of the abstract is 23826455.

Thanks!

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 23 (12 by maintainers)

Most upvoted comments

Okay, that works too 😃

The check for UNASSIGNED is still failing. Replacing ‘is not’ with != fixes it, since ‘is not’ compares object identity rather than string value. Also, we want to drop the content under the UNASSIGNED section, right? In that case, we may want to move abstract_list.append(stringify_children(abstract).strip()) inside the if condition.
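A minimal sketch of what that change could look like, not the package's actual code. It uses the standard library's xml.etree.ElementTree rather than the parser pubmed_parser uses, the function name collect_abstract_text is hypothetical, and "".join(section.itertext()) is only a stand-in for stringify_children. The sample is a shortened version of the abstract above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the abstract from this issue (text trimmed for brevity).
SAMPLE = """<Abstract>
  <AbstractText Label="" NlmCategory="UNASSIGNED"><b>Patient: Female, 16</b></AbstractText>
  <AbstractText Label="CASE REPORT" NlmCategory="METHODS">We report the case of a 16-year-old female with uterine didelphys.</AbstractText>
  <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">MRI is essential for adequate diagnosis and management.</AbstractText>
</Abstract>"""


def collect_abstract_text(abstract_xml):
    """Join the text of every AbstractText section except UNASSIGNED ones.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(abstract_xml)
    abstract_list = []
    for section in root.findall("AbstractText"):
        category = section.attrib.get("NlmCategory", "")
        # `!=` compares string values; `is not` compares object identity,
        # which is not a reliable test for strings read from a parsed file.
        if category != "UNASSIGNED":
            # Appending inside the `if` is what drops the UNASSIGNED content.
            # Stand-in for stringify_children(section).strip().
            abstract_list.append("".join(section.itertext()).strip())
    return "\n".join(abstract_list)


result = collect_abstract_text(SAMPLE)
print(result)
```

With the append inside the condition, the patient-details block is skipped and the remaining sections are kept in document order.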

No problem! Thanks for being so prompt on this. Let me know if I can contribute in any other way.

Yes, there’s a dilemma here. Maybe we should simply use Label, because that’s how it’s displayed on the website as well?

Or do something like NlmCategory - Label: and collapse it to just one of them when the two are identical, as in OBJECTIVE - OBJECTIVE:

What do you think?
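To make the second option concrete, here is a rough sketch. The function name section_prefix is hypothetical and not part of pubmed_parser. It shows one possible way to combine the two attributes and collapse repetitions.

```python
def section_prefix(label, nlm_category):
    """Build a heading prefix from the Label and NlmCategory attributes.

    Hypothetical helper: returns "NlmCategory - Label: " when the two differ,
    and a single name when they match or when one of them is empty.
    """
    label = (label or "").strip()
    nlm_category = (nlm_category or "").strip()
    if not label and not nlm_category:
        return ""
    # Collapse repetitions such as OBJECTIVE - OBJECTIVE to a single name.
    if not label or not nlm_category or label.upper() == nlm_category.upper():
        return f"{label or nlm_category}: "
    return f"{nlm_category} - {label}: "


same = section_prefix("OBJECTIVE", "OBJECTIVE")
different = section_prefix("CASE REPORT", "METHODS")
print(repr(same))       # 'OBJECTIVE: '
print(repr(different))  # 'METHODS - CASE REPORT: '
```

Using Label alone, as suggested first, would amount to ignoring nlm_category here.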