pubmed_parser: Possible error while parsing structured abstracts.
Hi, first of all, big thanks for this life-saver of a package.
I think there is some problem with parsing XML for structured abstracts. Consider the following example:
<Abstract>
<AbstractText Label="" NlmCategory="UNASSIGNED">
<b>Patient: Female, 16</b>
<b>Final Diagnosis: Pelvic mass</b>
<b>Symptoms: None</b>
<b>Medication: None</b>
<b>Clinical Procedure: CT • MRI</b>
<b>Specialty: Diagnostic radiology • pediatrics.</b>
</AbstractText>
<AbstractText Label="OBJECTIVE" NlmCategory="OBJECTIVE">
<b>Unusual presentation of unknown etiology, Rare disease, Mistake in diagnosis.</b>
</AbstractText>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Müllerian anomalies encompass a wide variety of malformations in the female genital tract, usually associated with renal and anorectal malformations. Of these anomalies, approximately 11% are uterus didelphys, which occurs when midline fusion of the müllerian ducts is arrested to a variable extent.</AbstractText>
<AbstractText Label="CASE REPORT" NlmCategory="METHODS">We report the case of a 16-year-old female with uterine didelphys, jejunal malrotation, hematometra, hematosalpinx, and bilateral subcentimeter homogenous circular cystic-like renal lesions, who initially presented with left lower quadrant abdominal pain, non-bloody vomiting, and a history of irregular menstrual periods. Initial CT was confusing for an adnexal cystic mass, but further imaging disclosed the above müllerian anomalies.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Müllerian anomalies may mimic other, more common, adnexal lesions; thus, adequate evaluation of suspicious cystic adnexal masses with multiple and advanced imaging modalities such as MRI is essential for adequate diagnosis and management.</AbstractText>
</Abstract>
The parse returned by medline_parser is as follows:
'Patient: Female, 16\n Final Diagnosis: Pelvic mass\n Symptoms: None\n Medication: None\n Clinical Procedure: CT \u2022 MRI\n Specialty: Diagnostic radiology \u2022 pediatrics.'
As you can see, it completely misses a major portion of the text: only the first, unlabeled section is returned, and the BACKGROUND, CASE REPORT, and CONCLUSIONS sections are dropped. I wonder if this is the case for all structured abstracts or only some of them. As additional info, the file I’m using is ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0763.xml.gz and the PMID of the abstract is 23826455.
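For anyone trying to reproduce the symptom without the full baseline file, here is a minimal sketch using only the standard library. It does not call pubmed_parser and makes no claim about how the parser is implemented; it only shows how reading the first `AbstractText` element gives the same kind of truncated result as above, while iterating over all of them recovers every section. The shortened `SAMPLE` string and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the abstract shown above.
SAMPLE = """<Abstract>
<AbstractText Label="" NlmCategory="UNASSIGNED"><b>Patient: Female, 16</b></AbstractText>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Müllerian anomalies encompass a wide variety of malformations.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Müllerian anomalies may mimic other adnexal lesions.</AbstractText>
</Abstract>"""

root = ET.fromstring(SAMPLE)

# find() returns only the first matching child, so later sections are lost.
first_only = "".join(root.find("AbstractText").itertext()).strip()

# findall() returns every AbstractText child, in document order.
all_sections = [
    "".join(node.itertext()).strip() for node in root.findall("AbstractText")
]

print(first_only)
print(len(all_sections))
```

If the real parser behaves like the `find()` line, that would explain why only the patient summary comes back, but that is a guess until someone checks the source.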
Thanks!
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 23 (12 by maintainers)
Okay, that works too 😃
The check for UNASSIGNED is still failing. Replacing `is not` with `!=` fixes it. Also, we want to remove the content under the UNASSIGNED section, right? In that case, we may want to put `abstract_list.append(stringify_children(abstract).strip())` inside the if condition.
No problem! Thanks for being so prompt on this. Let me know if I can contribute in any other way.
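To make the `is not` versus `!=` point concrete: `is not` compares object identity, and an attribute string read from XML is generally not the same object as a string literal in the source, so the identity check can be true even when the text is equal. `!=` compares the contents. Below is a rough sketch of the loop with the comparison fixed and the append moved inside the condition. It is not the actual pubmed_parser code: `collect_sections` and `stringify_text` are hypothetical names (the latter only stands in for `stringify_children`), and the standard-library `xml.etree.ElementTree` is used to keep it self-contained.

```python
import xml.etree.ElementTree as ET


def stringify_text(node):
    """Hypothetical stand-in for stringify_children: all text under node."""
    return "".join(node.itertext()).strip()


def collect_sections(abstract_xml):
    """Return the text of every AbstractText section except UNASSIGNED ones."""
    root = ET.fromstring(abstract_xml)
    abstract_list = []
    for abstract in root.findall("AbstractText"):
        category = abstract.attrib.get("NlmCategory", "")
        # Compare by value with !=, not by identity with `is not`.
        if category != "UNASSIGNED":
            # Appending inside the condition drops the UNASSIGNED content.
            abstract_list.append(stringify_text(abstract))
    return abstract_list


example = """<Abstract>
<AbstractText Label="" NlmCategory="UNASSIGNED"><b>Patient: Female, 16</b></AbstractText>
<AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Background text.</AbstractText>
<AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">Conclusion text.</AbstractText>
</Abstract>"""

print(collect_sections(example))
```

Whether the UNASSIGNED block should be dropped or kept without a label is still a design decision for the maintainers; the sketch only shows the variant discussed in this comment.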
Yes, there’s a dilemma here. Maybe we should simply use `Label`, because that’s the way it’s displayed on the website as well? Or do something like `NLMCategory - Label:` and reduce it to just one of them in case of repetitions like `OBJECTIVE - OBJECTIVE:`? What do you think?
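A small sketch of the second option, in case it helps the discussion. The helper name `format_section_prefix` and its behavior for empty values are assumptions for illustration only, not an existing pubmed_parser function: it emits `NLMCategory - Label:` when the two differ, collapses to a single name when they repeat, and falls back to whichever one is present.

```python
def format_section_prefix(nlm_category, label):
    """Hypothetical helper: build the heading shown before a section's text."""
    category = (nlm_category or "").strip()
    label = (label or "").strip()
    # Both present and different: show both, category first.
    if category and label and category.upper() != label.upper():
        return f"{category} - {label}:"
    # Repetition or only one present: show a single name, preferring the label.
    name = label or category
    return f"{name}:" if name else ""


print(format_section_prefix("OBJECTIVE", "OBJECTIVE"))
print(format_section_prefix("METHODS", "CASE REPORT"))
```

Using only `Label`, as in the first option, would amount to returning `f"{label}:"` and skipping the category entirely.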