Just a quick post – no bells or whistles – I needed a simple example for a document I’m writing, so I decided to post it here as well.
I grabbed the main textual part of this PubMed Central article, Functional Analysis of a Breast Cancer-Associated FGFR2 Single Nucleotide Polymorphism Using Zinc Finger Mediated Genome Editing, from the web (pasted here raw/verbatim, with no preprocessing):
Last year (2017) I wrote a biomedical sentence splitter, which I used to “chunk” the biomedical text above, giving:
For clarity, and to emphasize that the sentences were split from one another, I added blank lines between the sentences in the textarea above (using Neovim, with the s/\n/\r\r/g substitution; see my StackExchange answer).
The chunked file contains 272 sentences:
    cat pmc_3827080.chunked | wc -l
    272
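For readers without Neovim at hand, the same two steps (counting the lines and adding blank lines between sentences) can be sketched in Python. This is only an illustration: the function names and the three-sentence sample string are hypothetical stand-ins, not the actual chunked file. Note that Vim writes a line break as `\r` in the replacement part of a substitution, whereas Python uses `\n`.

```python
import re


def count_lines(text: str) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def add_blank_lines(text: str) -> str:
    """Turn every newline into two, leaving a blank line after each sentence."""
    return re.sub(r"\n", "\n\n", text)


# Hypothetical stand-in for a chunked file (one sentence per line).
chunked = "First sentence.\nSecond sentence.\nThird sentence.\n"

print(count_lines(chunked))  # 3
print(add_blank_lines(chunked))
```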
I also wrote several variants of a sklearn-based TF-IDF classifier in Python 3, for various end uses. The chunked text, evaluated against one of those classifiers (trained against a small biomedical corpus), gives the following keywords (n=43 and 49, respectively):
    TF-IDF with custom stopwords    TF-IDF with default stopwords
    ----------------------------    -----------------------------
    cell            0.0587962963    cell             0.0215363744
    allele          0.0305555556    allele           0.0111921316
    rs2981578       0.0240740741    rs2981578        0.0088180431
    fgfr2           0.0226851852    snp              0.0088180431
    clone           0.0222222222    fgfr2            0.0083093098
    breast          0.0203703704    clone            0.0081397321
    cancer          0.0203703704    control          0.0075684193
    risk            0.0185185185    breast           0.0074614211
    mcf7            0.0171296296    cancer           0.0074614211
    snp             0.0171296296    line             0.0072918433
    erα             0.0148148148    risk             0.0067831101
    foxa1           0.0138888889    heterozygous     0.0066135323
    expression      0.0129629630    mcf7             0.0062743768
    sample          0.0115740741    using            0.0057656435
    dna             0.0106481481    binding          0.0054264880
    locus           0.0101851852    erα              0.0054264880
    genome          0.0097222222    used             0.0054264880
    factor          0.0078703704    foxa1            0.0050873325
    gene            0.0078703704    zfn              0.0050873325
    mrna            0.0076883801    expression       0.0047481770
    site            0.0076883801    g                0.0042394438
    difference      0.0064814815    sample           0.0042394438
    population      0.0064814815    fig              0.0040698660
    patient         0.0060185185    dna              0.0039002883
    intron          0.0055555556    study            0.0039002883
    genotype        0.0046296296    data             0.0037307105
    sequence        0.0046296296    locus            0.0037307105
    target          0.0046296296    genome           0.0035611328
    value           0.0046296296    medium           0.0033915550
    runx2           0.0042944664    editing          0.0030523995
    cells           0.0042400211    level            0.0030523995
    analysis        0.0041666667    positive         0.0030523995
    cycle           0.0041666667    factor           0.0028828218
    haplotype       0.0041666667    gene             0.0028828218
    marker          0.0041666667    showed           0.0028828218
    proliferation   0.0041666667    mrna             0.0028288877
    stimulation     0.0041666667    site             0.0028184683
    transcription   0.0041666667    frequency        0.0025436663
    allelic         0.0037037037    assay            0.0023740885
    ase             0.0037037037    difference       0.0023740885
    effect          0.0037037037    population       0.0023740885
    oestrogen       0.0037037037    associated       0.0022045108
    result          0.0037037037    patient          0.0022045108
                                    copy             0.0020349330
                                    increased        0.0020349330
                                    intron           0.0020349330
                                    performed        0.0020349330
                                    relative         0.0020349330
                                    sigma            0.0020349330
- In case you’re wondering, “ase” (above) is not a suffix but rather an acronym: “allele-specific expression”.
- rs2981578 is a single-nucleotide polymorphism (SNP), a variant of which (ibid.) is associated with breast cancer (see also).
- To remove “noise” from those lists, I use a custom stopwords list (currently ~282,060 entries) that is automatically called by my TF-IDF scripts (and elsewhere, as needed).
- While at first glance the TF-IDF results with the default stopwords list (third column, above) don’t look too bad, many of the returned keywords (g; used; using; fig; increased; showed; …) are adjectives, verbs, adverbs and other parts of speech that are of limited use in biomedical information extraction and in downstream tasks such as named entity recognition, or relation extraction via noun-phrase chunking.
Additionally, when other parts of the article are included (authors; affiliations; materials and methods; references; etc.), these keyword lists become cluttered with non-biomedical and, depending on the end use, irrelevant words. One approach to relation extraction, for example, is to first enrich the keyword/phrase lists for named entities.
[Conclusions from other, preliminary work on the subject.]
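To illustrate how the stopwords list shapes the keyword ranking, here is a minimal, standard-library sketch of TF-IDF scoring. It is not the sklearn-based classifier described above: the function names, the toy corpus, the two tiny stopword sets and the smoothed IDF formula are all assumptions made for the example, so the scores are not comparable to the ones in the table.

```python
import math
import re
from collections import Counter


def tokenize(text, stopwords):
    """Lowercase, split into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]


def tfidf_keywords(document, corpus, stopwords=frozenset(), top_n=10):
    """Rank the terms of `document` by TF-IDF, with IDF estimated from `corpus`."""
    tokens = tokenize(document, stopwords)
    if not tokens:
        return []
    term_counts = Counter(tokens)
    corpus_terms = [set(tokenize(doc, stopwords)) for doc in corpus]
    n_docs = len(corpus_terms)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(term in terms for terms in corpus_terms)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (assumed)
        scores[term] = (count / len(tokens)) * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy data, standing in for a training corpus and an article.
corpus = [
    "The cell line was used in the assay.",
    "Gene expression was measured using the assay.",
    "The allele is associated with breast cancer risk.",
]
document = "The risk allele showed increased FGFR2 expression using the cell assay."

default_stop = {"the", "was", "in", "is", "with"}
custom_stop = default_stop | {"used", "using", "showed", "increased"}

for label, stop in (("default", default_stop), ("custom", custom_stop)):
    print(label, [term for term, _ in tfidf_keywords(document, corpus, stop)])
```

With the larger (custom) set, verbs such as “using” and “showed” never reach the ranking, which is the effect described in the list above.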
I also ran that chunked text through a Python 3-based TextRank algorithm that I wrote. Here I return the top 10 ranked sentences (non-optimized; once again, for ease of reading, I added blank lines between the sentences):
Not bad, but we can do better, e.g. with abstractive summarization!
[Text summarization largely falls into two algorithmic approaches: extractive summarization (above), which summarizes text by copying parts of the input, and abstractive summarization (below), which generates new phrases, possibly rephrasing or using words that were not in the original text.]
Here is a particularly good example of abstractive summarization: arXiv:1705.04304
Source (Richard Socher | SalesForce.com):
Your TLDR by an AI: A Deep Reinforced Model for Abstractive Summarization
This is the source for the demo, above (n=10 sentences),
… and this is the output (as shown in the graphic, above):
The bottleneck is no longer access to information; now it's our ability to keep up. AI can be trained on a variety of different types of texts and summary lengths. A model that can generate long, coherent , and meaningful summaries remains an open research problem.
I ran that source text through my TF-IDF classifier, yielding these keywords:
    model            0.1111111111111111
    algorithm        0.08333333333333333
    information      0.08333333333333333
    summarization    0.08333333333333333
    bottleneck       0.027777777777777776
    combination      0.027777777777777776
    contribution     0.027777777777777776
    generation       0.027777777777777776
    improvement      0.027777777777777776
    language         0.027777777777777776
    method           0.027777777777777776
    multi-sentence   0.027777777777777776
    reinforcement    0.027777777777777776
    research         0.027777777777777776
    result           0.027777777777777776
    variety          0.027777777777777776
Better! Even this crude example of extractive summarization does an adequate job: the five top-ranked sentences provide a reasonable summary.
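For the curious, the ranking step behind this kind of extractive summarization can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the TextRank script used in this post: the function names, the damping factor of 0.85, the fixed iteration count and the four sample sentences are assumptions, and the similarity measure is the word-overlap one from the original TextRank paper.

```python
import math
import re


def similarity(a, b):
    """Shared words, normalized by the log-lengths of both sentences."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank over the similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=5):
    """Return the top_k highest-scoring sentences, in their original order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:top_k]
    return [sentences[i] for i in sorted(best)]


# Hypothetical sentences; the last one shares no words with the others.
sentences = [
    "The risk allele increases FGFR2 expression in breast cancer cells.",
    "FGFR2 expression in cells depends on the risk allele.",
    "Breast cancer risk is linked to the FGFR2 allele.",
    "Samples were stored overnight.",
]
print(summarize(sentences, top_k=2))
```

Sentences that share vocabulary with many others accumulate score, while an isolated sentence (the last one here) stays at the baseline and drops out of the summary.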
- The Google PageRank Algorithm and How It Works (local copy) | provides the background for “Discussing TextRank …” (below)
- Discussing TextRank – A Unsupervised Algorithm for Extracting Meaning from Text (local copy) | discussion (reddit)