Background

I am evaluating some contextual language models for biomedical natural language processing (BioNLP). Several platforms support these models, including …


Visualizations

Several of the “general-use” packages mentioned above provide opportunities for visualizing natural language tags and embeddings. For example, spaCy visualizers [note this one] allow the visualization of color-coded entities, as I describe here (I also describe those CRAFT corpus entity labels here).
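As a quick illustration of that API (a minimal sketch, assuming spaCy and its general en_core_web_sm model are installed; the CRAFT-trained model used in my screenshots below is not used here):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

## displacy.render() returns the markup; displacy.serve() starts a small local
## web server (default: http://localhost:5000) displaying color-coded entities:
displacy.serve(doc, style='ent')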

[ Screenshots: spaCy tagged-text browser (2019-11-26). ]

Likewise, I was intrigued by this example, Visualizing spaCy vectors in TensorBoard, on the spaCy examples page. It’s apparently possible to view those embeddings (tensors) in the TensorFlow Embedding Projector [example]!

I was looking at Flair embeddings at the time (2019-11-27; awaiting the anticipated release of a BioFlair pretrained model), so I thought I’d try viewing those embeddings in TensorFlow’s Projector.

  • Note: there is currently (late Nov 2019) a bug in Torch / PyTorch (used by Flair) that prevents installing that code under Python 3.8. Consequently, as my Arch Linux environment is Py3.8 and there is no Py3.8 install wheel for Torch / PyTorch [screenshot; Py3.8 support expected ~mid-Dec 2019], I cannot currently install Flair in that environment.

    My solution was to create a Py3.7 venv (which I describe here on StackOverflow), then do the various installations.

Having installed Flair, Torch / PyTorch, TensorFlow, etc. in that Py3.7 venv, I proceeded to figure out how to load the Flair embeddings in TF Projector. The following code provides a step-by-step explanation.
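In outline, the pipeline is quite short. Here is a condensed preview of the code detailed step-by-step below (it uses token.text, which is equivalent to the str(token) parsing shown later):

import numpy as np
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from torch.utils.tensorboard import SummaryWriter

## 1. Tokenize a test sentence and attach stacked Flair embeddings to each token:
sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')
stacked = StackedEmbeddings([FlairEmbeddings('pubmed-forward'),
                             FlairEmbeddings('pubmed-backward')])
stacked.embed(sentence)

## 2. Convert the tokens (labels) and embeddings (vectors) to NumPy arrays:
tokens_array = np.array([token.text for token in sentence])
embeddings_array = np.array([token.embedding.tolist() for token in sentence])

## 3. Write a TensorBoard "run" that the Embedding Projector can display:
writer = SummaryWriter()
writer.add_embedding(embeddings_array, tokens_array)
writer.close()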


Flair Embeddings (Tensors) → TensorFlow TensorBoard Embedding Projector

[ Click here to read the following code as a single (monochromatic, plain text) file in the browser. ]


Install TensorBoard (Py3.7 venv):


# Install Python 3.7 in Python 3.8 env:
#   https://stackoverflow.com/a/58964629/1904943

# Test (in terminal):

  [victoria@victoria ~]$ date
    Wed 20 Nov 2019 04:25:38 PM PST

  [victoria@victoria ~]$ p37    ## ~/.bashrc alias
    [Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

  (py3.7) [victoria@victoria ~]$ env | grep -i virtual
    VIRTUAL_ENV=/home/victoria/venv/py3.7

  (py3.7) [victoria@victoria ~]$ python --version
    Python 3.7.4

  (py3.7) [victoria@victoria ~]$ pip install --upgrade pip
    ...
    Successfully installed pip-19.3.1

  ## https://github.com/lanpa/tensorboardX
  ## Also installs (if I recall correctly) tensorflow and other dependencies:
  (py3.7) [victoria@victoria ~]$ pip install tensorboardX    ## << note: capital X
    ...
    ## If needed:  pip install moviepy

  (py3.7) [victoria@victoria ~]$ pip install flair
    ...
    Successfully installed
      Cython-0.29.14
      SudachiPy-0.4.0
      attrs-19.3.0
      backcall-0.1.0
      boto-2.49.0
      boto3-1.10.23
      botocore-1.13.23
      bpemb-0.3.0
      certifi-2019.9.11
      cffi-1.13.2
      chardet-3.0.4
      click-7.0
      cloudpickle-1.2.2
      cycler-0.10.0
      dartsclone-0.6
      decorator-4.4.1
      deprecated-1.2.7
      docutils-0.15.2
      flair-0.4.4
      future-0.18.2
      gensim-3.8.1
      hyperopt-0.2.2
      idna-2.8
      importlib-metadata-0.23
      ipython-7.6.1
      ipython-genutils-0.2.0
      jedi-0.15.1
      jmespath-0.9.4
      joblib-0.14.0
      kiwisolver-1.1.0
      kytea-0.1.4
      langdetect-1.0.7
      matplotlib-3.1.1
      more-itertools-7.2.0
      mpld3-0.3
      natto-py-0.9.0
      networkx-2.2
      numpy-1.17.4
      packaging-19.2
      parso-0.5.1
      pexpect-4.7.0
      pickleshare-0.7.5
      pillow-6.2.1
      pluggy-0.13.0
      prompt-toolkit-2.0.10
      ptyprocess-0.6.0
      py-1.8.0
      pycparser-2.19
      pygments-2.4.2
      pymongo-3.9.0
      pyparsing-2.4.5
      pytest-5.3.0
      python-dateutil-2.8.1
      regex-2019.11.1
      requests-2.22.0
      s3transfer-0.2.1
      sacremoses-0.0.35
      scikit-learn-0.21.3
      scipy-1.3.2
      segtok-1.5.7
      sentencepiece-0.1.83
      six-1.13.0
      sklearn-0.0
      smart-open-1.9.0
      sortedcontainers-2.1.0
      sqlitedict-1.6.0
      tabulate-0.8.6
      tiny-tokenizer-3.0.1
      torch-1.3.1
      torchvision-0.4.2
      tqdm-4.38.0
      traitlets-4.3.3
      transformers-2.1.1
      urllib3-1.24.3
      wcwidth-0.1.7
      wrapt-1.11.2
      zipp-0.6.0

  (py3.7) [victoria@victoria ~]$ python
    Python 3.7.4 (default, Nov 20 2019, 11:36:53) 
    [GCC 9.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.

  >>> import flair    ## works, yea!!  :-D
  >>> 

Start TensorBoard:


[victoria@victoria tensorflow]$ cd /mnt/Vancouver/apps/tensorflow/

[victoria@victoria tensorflow]$ date; pwd; echo; ls -l

  Thu 28 Nov 2019 10:50:19 AM PST
  /mnt/Vancouver/apps/tensorflow

  total 928
  -rw-------  1 victoria victoria  19305 Nov 28 10:49 _readme-tensorflow-victoria.txt
  drwxr-xr-x 11 victoria victoria   4096 Nov 26 16:45 runs

[victoria@victoria tensorflow]$ tensorboard --logdir runs/
  Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
  TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit)
  ...

Obtain Flair embeddings for test sentence:


from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

tagger = SequenceTagger.load('ner')
tagger.predict(sentence)

embeddings_f = FlairEmbeddings('pubmed-forward')
embeddings_b = FlairEmbeddings('pubmed-backward')

stacked_embeddings = StackedEmbeddings([
    embeddings_f,
    embeddings_b,
])

stacked_embeddings.embed(sentence)
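
## StackedEmbeddings concatenates the forward and backward embeddings for each
## token, yielding the 2300-dimension tensors shown in the output below.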

tokens = [str(token).split()[2] for token in sentence]
print(tokens)
'''
  ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
'''
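
## Note: Flair tokens also expose their text directly, so an equivalent,
## simpler form is:  tokens = [token.text for token in sentence]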

for token in sentence:
    print(token)
    print(token.embedding)
    print(token.embedding.shape)

'''
  Token: 1 The
  tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028])
  torch.Size([2300])
  Token: 2 RAS-MAPK
  tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042])
  torch.Size([2300])
  Token: 3 signalling
  tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02,
          -9.4445e-05,  1.0025e-02])
  torch.Size([2300])
  Token: 4 cascade
  tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274])
  torch.Size([2300])
  Token: 5 serves
  tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004])
  torch.Size([2300])
  Token: 6 as
  tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03,
          -4.4556e-04,  5.6909e-05])
  torch.Size([2300])
  Token: 7 a
  tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006])
  torch.Size([2300])
  Token: 8 central
  tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012])
  torch.Size([2300])
  Token: 9 node
  tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02,
          2.3646e-04,  1.0505e-02])
  torch.Size([2300])
  Token: 10 in
  tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016])
  torch.Size([2300])
  Token: 11 transducing
  tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005])
  torch.Size([2300])
  Token: 12 signals
  tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072])
  torch.Size([2300])
  Token: 13 from
  tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056])
  torch.Size([2300])
  Token: 14 membrane
  tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016])
  torch.Size([2300])
  Token: 15 receptors
  tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04,
          -1.4646e-04,  6.6120e-03])
  torch.Size([2300])
  Token: 16 to
  tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102])
  torch.Size([2300])
  Token: 17 the
  tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069])
  torch.Size([2300])
  Token: 18 nucleus.
  tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])
  torch.Size([2300])
'''

## The embeddings above are PyTorch tensors (Flair depends on Torch/PyTorch).

## https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list
## https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist

## https://stackoverflow.com/questions/29895602/how-to-save-output-from-python-like-tsv
## https://stackoverflow.com/a/29896136/1904943
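
## A minimal check (using the sentence embedded above): tolist() converts one
## token's embedding -- a torch.Tensor -- to a plain Python list:

first = sentence.tokens[0].embedding.tolist()
print(type(first), len(first))
'''
  <class 'list'> 2300
'''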

[ optional ]

Write Python output to files:

In an earlier iteration of this effort I saved the Flair tokens as metadata and the embeddings (tensors) as a list. While those files are not needed here, I leave this code for future reference.


import csv

metadata_f = 'metadata.tsv'
tensors_f = 'tensors.tsv'

with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in tokens:
        ## Assign to a dummy variable ( _ ) to suppress character counts;
        ## Using (token), rather than ([token]), prints spaces between all characters:
        _ = tsv_writer.writerow([token])


'''
[victoria@victoria tensorflow]$ cat metadata.tsv
  The
  RAS-MAPK
  signalling
  cascade
  serves
  as
  a
  central
  node
  in
  transducing
  signals
  from
  membrane
  receptors
  to
  the
  nucleus.
'''

import torch    ## not strictly needed here: tolist() is a torch.Tensor method, and torch is already loaded via flair

with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
    for token in sentence:
        embedding = token.embedding
        ## https://stackoverflow.com/questions/12770213/writerow-csv-returns-a-number-instead-of-writing-rows
        ## assign to a dummy variable ( _ ) to suppress character counts
        ## tolist() is a PyTorch method that converts tensors to lists:
        _ = tsv_writer.writerow(embedding.tolist())
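
## Each row of tensors.tsv holds one token's embedding values, so its rows
## correspond line-for-line with the tokens in metadata.tsv.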

## CAUTION: even for the single, short sentence used in this example, the
## following `cat` statement generates an ENORMOUS list!

'''
  [victoria@victoria tensorflow]$ cat tensors.tsv 
      0.007691788021475077	-0.02268664352595806	-0.0004340760060586035	...
'''
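
## Aside: tensors.tsv and metadata.tsv are the two file formats that the
## standalone Embedding Projector (https://projector.tensorflow.org) accepts
## via its 'Load' button, so these data can also be visualized there directly.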

Transform Flair tokens and tensors to NumPy arrays:


##  https://stackoverflow.com/questions/40849116/how-to-use-tensorboard-embedding-projector/41177133
##  https://stackoverflow.com/a/41177133/1904943

[victoria@victoria tensorflow]$ p37
   [Python 3.7 venv (source ~/venv/py3.7/bin/activate)]

(py3.7) [victoria@victoria tensorflow]$ python
  Python 3.7.4 (default, Nov 20 2019, 11:36:53) 
  [GCC 9.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.


## TEST:

>>> import numpy as np
>>> from torch.utils.tensorboard import SummaryWriter

>>> vectors = np.array([[0,0,1], [0,1,0], [1,0,0], [1,1,1]])
>>> metadata = ['001', '010', '100', '111']  # labels

>>> print(metadata)
  ['001', '010', '100', '111']

>>> print(vectors)
  [[0 0 1]
  [0 1 0]
  [1 0 0]
  [1 1 1]]

>>> writer = SummaryWriter()
>>> writer.add_embedding(vectors, metadata)
>>> writer.close()
>>>

## That (Nov 28, 2019: ~11:08 am) generated a new run, "Nov28_11-08-09_victoria",
## visible in the TensorFlow TensorBoard.  When I clicked that link, those data
## opened in the Projector!
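
## Note: by default, SummaryWriter() writes each run to ./runs/<datetime>_<hostname>,
## which is why the new run appears under the runs/ directory that TensorBoard
## (started above with --logdir runs/) is watching.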

# ----------------------------------------------------------------------------

>>> tokens = [str(token).split()[2] for token in sentence]

>>> print(tokens)
'''
  ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central',
   'node', 'in', 'transducing', 'signals', 'from', 'membrane', 'receptors',
   'to', 'the', 'nucleus.']
'''

>>> tokens_array = np.array(tokens)

>>> print(tokens_array)
'''
  ['The' 'RAS-MAPK' 'signalling' 'cascade' 'serves' 'as' 'a' 'central'
  'node' 'in' 'transducing' 'signals' 'from' 'membrane' 'receptors'
  'to' 'the' 'nucleus.']
'''

>>> for token in tokens_array:
...     print(token)
... 
'''
  The
  RAS-MAPK
  signalling
  cascade
  serves
  as
  a
  central
  node
  in
  transducing
  signals
  from
  membrane
  receptors
  to
  the
  nucleus.
'''

>>> embeddings = [token.embedding for token in sentence]

>>> print(embeddings)
'''
  [tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
   tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
   tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
   tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
   tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
   tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
   tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
   tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
   tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02, 2.3646e-04,  1.0505e-02]),
   tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
   tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
   tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
   tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
   tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
   tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
   tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
   tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
   tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
'''

import torch    ## not strictly needed here: tolist() is a torch.Tensor method, and torch is already loaded via flair

>>> embeddings = [token.embedding.tolist() for token in sentence]

##  ***  CAUTION -- EVEN FOR THIS ONE SENTENCE THIS IS AN ENORMOUS LIST!!  ***

>>> print(embeddings)
'''
  [[0.007691788021475077, -0.02268664352595806, ..., -0.0004157265357207507, 0.014170931652188301]]
'''

>>> embeddings_array = np.array(embeddings)

>>> print(embeddings_array)
'''
  [[ 7.69178802e-03 -2.26866435e-02 -4.34076006e-04 ...  1.37687057e-01 -3.07319278e-04  2.84141395e-03]
  [-7.38183910e-04 -1.60104632e-01 -2.73584425e-02 ...  1.98223457e-01 1.31987268e-03  4.19976842e-03]
  [ 4.25336510e-03 -3.10180396e-01 -3.96601588e-01 ...  5.93362860e-02 -9.44453641e-05  1.00254947e-02]
  ...
  [ 3.82626243e-03 -3.53914015e-02 -1.33689731e-01 ...  5.97812422e-03 -3.52837233e-04  1.01681864e-02]
  [ 1.86223574e-02 -1.51006011e-02 -6.41461909e-02 ...  1.87926367e-02 3.90900113e-02  6.87920302e-03]
  [ 2.52505066e-04 -4.60800231e-02  4.34845686e-03 ... -1.26084751e-02 -4.15726536e-04  1.41709317e-02]]
'''
>>> 
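
## Sanity check: 18 tokens, each with a 2300-dimension embedding (matching the
## torch.Size([2300]) shapes printed earlier):

>>> embeddings_array.shape
  (18, 2300)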

Write a new TensorBoard run & load those data:

OK, we now have everything needed to visualize those tensors (reformatted as NumPy arrays) in TensorFlow’s Embedding Projector! :-D


>>> from torch.utils.tensorboard import SummaryWriter
>>> writer = SummaryWriter()

## Load those data:

>>> writer.add_embedding(embeddings_array, tokens_array)
>>> writer.close()
>>> 

## Wait a few seconds for TensorBoard, http://localhost:6006/#projector
## to refresh in Firefox (manually reload the browser, if needed).
## My new "run" appears!  "Nov28_11-54-28_victoria":
##    /mnt/Vancouver/apps/tensorflow/runs/Nov28_11-54-28_victoria

## Yea: works!! :-D
## SimpleScreenRecorder video screen capture below.  :-)

Projector:


[ Video: SimpleScreenRecorder screen capture of these Flair embeddings in the TensorBoard Embedding Projector. ]