Just a quick post – no bells or whistles – I needed a simple example for a document I’m writing, so I decided to post it here as well.   

I grabbed the main textual part of this PubMed Central article, Functional Analysis of a Breast Cancer-Associated FGFR2 Single Nucleotide Polymorphism Using Zinc Finger Mediated Genome Editing, from the web (pasted here raw/verbatim, with no preprocessing):

Last year (2017) I wrote a biomedical sentence splitter, which I used to “chunk” the biomedical text above, giving

For clarity, and to emphasize that the sentences were split from one another, I added blank lines between the sentences in the textarea above (using Neovim, with the substitution s/\n/\r\r/g; see my StackExchange answer).
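For readers who want to try the chunk-then-space-out step themselves, here is a minimal Python 3 sketch. It is not the splitter used above: the abbreviation list and the split rule are simplifying assumptions, and real biomedical text needs far more care (gene names, citations, decimal numbers, and so on).

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = ("Fig.", "e.g.", "i.e.", "et al.", "vs.")


def split_sentences(text):
    """Naively split text after ., ! or ? when a capital letter or digit follows."""
    # Shield the periods inside known abbreviations so they do not trigger a split.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text.strip())
    # Restore the shielded periods and drop empty fragments.
    return [p.replace("<DOT>", ".") for p in parts if p]


def double_space(sentences):
    """Join sentences with a blank line between them, for easier reading."""
    return "\n\n".join(sentences)


SAMPLE = (
    "FGFR2 is a risk locus (Fig. 1). The SNP rs2981578 lies in intron 2. "
    "Binding was increased, e.g. for FOXA1."
)
chunks = split_sentences(SAMPLE)
print(len(chunks))           # number of sentences found
print(double_space(chunks))  # one sentence per paragraph
```

The `double_space` helper gives the same visual result as the Neovim substitution on a one-sentence-per-line file; counting `len(chunks)` plays the role of `wc -l`.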

The chunked file contains 272 sentences:

cat pmc_3827080.chunked | wc -l

I also wrote several variants of a sklearn-based TF-IDF classifier in Python 3, for various end uses. The chunked text, evaluated against one of those classifiers (trained against a small biomedical corpus), gives the following keywords (n=43 and n=49, respectively):

TF-IDF with custom stopwords    TF-IDF with default stopwords
----------------------------    -----------------------------
cell            0.0587962963    cell            0.0215363744
allele          0.0305555556    allele          0.0111921316
rs2981578       0.0240740741    rs2981578       0.0088180431
fgfr2           0.0226851852    snp             0.0088180431
clone           0.0222222222    fgfr2           0.0083093098
breast          0.0203703704    clone           0.0081397321
cancer          0.0203703704    control         0.0075684193
risk            0.0185185185    breast          0.0074614211
mcf7            0.0171296296    cancer          0.0074614211
snp             0.0171296296    line            0.0072918433
erα             0.0148148148    risk            0.0067831101
foxa1           0.0138888889    heterozygous    0.0066135323
expression      0.0129629630    mcf7            0.0062743768
sample          0.0115740741    using           0.0057656435
dna             0.0106481481    binding         0.0054264880
locus           0.0101851852    erα             0.0054264880
genome          0.0097222222    used            0.0054264880
factor          0.0078703704    foxa1           0.0050873325
gene            0.0078703704    zfn             0.0050873325
mrna            0.0076883801    expression      0.0047481770
site            0.0076883801    g               0.0042394438
difference      0.0064814815    sample          0.0042394438
population      0.0064814815    fig             0.0040698660
patient         0.0060185185    dna             0.0039002883
intron          0.0055555556    study           0.0039002883
genotype        0.0046296296    data            0.0037307105
sequence        0.0046296296    locus           0.0037307105
target          0.0046296296    genome          0.0035611328
value           0.0046296296    medium          0.0033915550
runx2           0.0042944664    editing         0.0030523995
cells           0.0042400211    level           0.0030523995
analysis        0.0041666667    positive        0.0030523995
cycle           0.0041666667    factor          0.0028828218
haplotype       0.0041666667    gene            0.0028828218
marker          0.0041666667    showed          0.0028828218
proliferation   0.0041666667    mrna            0.0028288877
stimulation     0.0041666667    site            0.0028184683
transcription   0.0041666667    frequency       0.0025436663
allelic         0.0037037037    assay           0.0023740885
ase             0.0037037037    difference      0.0023740885
effect          0.0037037037    population      0.0023740885
oestrogen       0.0037037037    associated      0.0022045108
result          0.0037037037    patient         0.0022045108
                                copy            0.0020349330
                                increased       0.0020349330
                                intron          0.0020349330
                                performed       0.0020349330
                                relative        0.0020349330
                                sigma           0.0020349330


  • In case you’re wondering, “ase” – above – is not a suffix but rather an acronym: “allele specific expression”.
  • rs2981578 is a single-nucleotide polymorphism (SNP), a variant of which (ibid.) is associated with breast cancer (see also).
  • To remove “noise” from those lists, I use a custom stopwords list (currently ~282,060 entries) that is automatically called by my TF-IDF scripts (and elsewhere, as needed).
  • While at first glance the TF-IDF results with the default stopwords list (right-hand columns, above) don’t look too bad, many of the keywords returned (g; used; using; fig; increased; showed; …) are adjectives, verbs, adverbs and other parts of speech that are of limited use in biomedical information extraction and in downstream tasks such as named entity recognition, or relation extraction via noun-phrase chunking.
    Additionally, when other texts are included – authors; affiliations; materials and methods; references; etc. – these keyword lists become cluttered with non-biomedical and (depending on the end use) irrelevant words. One approach to relation extraction, for example, is to first enrich keyword/phrase lists for named entities.
    [Conclusions from other, preliminary work on the subject.]
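To make the effect of the stopwords list concrete, here is a small pure-Python sketch of TF-IDF keyword ranking with a pluggable stopwords set. It is not the sklearn-based script used for the table above: the two stopword sets, the target text and the background documents are toy stand-ins invented for illustration, and the scoring (relative term frequency times a smoothed IDF) is one common variant among several.

```python
import math
import re
from collections import Counter

# Toy stand-ins (assumptions): the real lists are far larger.
DEFAULT_STOPWORDS = {"the", "a", "of", "in", "was", "and", "to", "is"}
CUSTOM_STOPWORDS = DEFAULT_STOPWORDS | {"used", "using", "showed", "fig", "increased"}


def tokenize(text, stopwords):
    """Lowercase, keep word tokens, drop pure numbers and stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords and not t.isdigit()]


def tfidf_keywords(target, background, stopwords, top_n=5):
    """Rank the target's terms by TF-IDF against a list of background documents."""
    docs = [tokenize(d, stopwords) for d in [target] + list(background)]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    tf = Counter(docs[0])
    total = len(docs[0])
    scores = {
        # Relative term frequency times smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


TARGET = (
    "The rs2981578 allele was used in MCF7 cells. "
    "Using the allele showed increased FOXA1 binding (Fig. 2)."
)
BACKGROUND = [
    "The model was trained using the corpus.",
    "Expression of the gene was increased in cells.",
]

for name, stops in (("default", DEFAULT_STOPWORDS), ("custom", CUSTOM_STOPWORDS)):
    print(name, [term for term, _ in tfidf_keywords(TARGET, BACKGROUND, stops, top_n=20)])
```

With the smaller set, words such as “used”, “showed” and “fig” survive into the keyword list; with the extended set they are filtered out before scoring, which is the same clean-up the custom stopwords list performs on the table above.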

I also ran that chunked text through a Python 3-based TextRank implementation that I wrote. Here are the top 10 ranked sentences (non-optimized; once again, for ease of reading, I added blank lines between the sentences):

Not bad, but we can do better, e.g. with abstractive summarization!

[Text summarization largely falls into two algorithmic approaches: extractive summarization (above), which summarizes text by copying parts of the input, and abstractive summarization (below), which generates new phrases, possibly rephrasing or using words that were not in the original text.]

Here is a particularly good example of abstractive summarization: arXiv:1705.04304

[Figure: abstractive summarization demo.
Source (Richard Socher | SalesForce.com): Your TLDR by an AI: a deep reinforced model for abstractive summarization]

This is the source for the demo, above (n=10 sentences),

… and this is the output (as shown in the graphic, above):

The bottleneck is no longer access to information; now it's our ability to keep up.
AI can be trained on a variety of different types of texts and summary lengths.
A model that can generate long, coherent, and meaningful summaries remains an
open research problem.

I ran that source text through my TF-IDF classifier, yielding these keywords:

model           0.1111111111111111
algorithm       0.08333333333333333
information     0.08333333333333333
summarization   0.08333333333333333
bottleneck      0.027777777777777776
combination     0.027777777777777776
contribution    0.027777777777777776
generation      0.027777777777777776
improvement     0.027777777777777776
language        0.027777777777777776
method          0.027777777777777776
multi-sentence  0.027777777777777776
reinforcement   0.027777777777777776
research        0.027777777777777776
result          0.027777777777777776
variety         0.027777777777777776


I also ran that text through my TextRank script (a non-optimized keyword/keyphrase extractor and extractive summarizer), giving the following top 10 ranked sentences:

Better! Even this crude example of extractive summarization does an adequate job: the top five ranked sentences provide a reasonable summary.
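For anyone curious how a TextRank-style extractive ranker works, here is a bare-bones Python 3 sketch. It is not the script used above: it assumes the text is already split into sentences, uses plain word overlap as the similarity measure, does no stopword removal, and runs a fixed number of PageRank-style iterations rather than testing for convergence. The example sentences are invented.

```python
import math
import re


def textrank_sentences(sentences, top_n=3, damping=0.85, iterations=50):
    """Return the top_n sentences ranked by a TextRank-style score."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(a, b):
        # Shared words, normalised by the (log) lengths of both sentences.
        denom = math.log(len(a) or 1) + math.log(len(b) or 1)
        return len(a & b) / denom if denom > 0 else 0.0

    # Weighted, undirected sentence graph without self-loops.
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Power iteration: each sentence passes its score to its neighbours,
    # in proportion to the edge weights.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    # Highest score first; ties keep the original sentence order.
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in ranked[:top_n]]


SENTENCES = [
    "Summarization models compress long text.",
    "Extractive summarization copies sentences from the text.",
    "Abstractive summarization models generate new text.",
    "Unrelated weather report yesterday.",
]
for sentence in textrank_sentences(SENTENCES, top_n=3):
    print(sentence)
```

Sentences that share vocabulary with many others accumulate score, while an isolated sentence (the weather line here) stays at the baseline and drops out of the summary.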