Education

  • B.Sc., Honours Biochemistry (Polynucleotide Chemistry), 1983, Dalhousie University
  • M.Sc., Occupational Hygiene (Genotoxicity), 1995, University of British Columbia
  • Ph.D., Biology (Molecular Genetics), 2000, University of Victoria
  • Postdoctoral (Molecular Genetics), 2001-2008, National Institute of Environmental Health Sciences

I have a lifelong interest in molecular genetics, with a focus on functional genomics: the phenotypic expression of the information encoded in our genomes.

Central to those interests is how information is encoded, retrieved and utilized.

My interests span biology, genetics, genomics, pathways, networks, bioinformatics, graphical models, natural language processing and machine learning.


Research: 1980 - 2008

Biochemistry; Biology; Molecular Genetics

  • polynucleotide chemistry
    • site-directed mutagenesis
    • DNA repair pathways
  • microbial genetics | transgenic rodent models
    • spontaneous mutations (ageing)
    • dietary mutagens and carcinogens
    • DNA damage and repair
    • mitochondrial genetics
  • bioinformatics
    • gene expression profiles
    • DNA, protein interactomes

Research: 2008 - present

Computational Analyses

My research at NIEHS (Durham, N.C.) involved genetic and bioinformatic analyses of DNA damage, repair and metabolism in yeast. Upon my return to Vancouver I continued to focus on bioinformatic approaches to leveraging molecular genomics data for a better understanding of metabolism, molecular genetics, functional genomics and clinical science.

To better address this Vision, I acquired expertise in Python (a powerful general purpose programming language), relational databases, textual knowledge stores, web programming, natural language processing (NLP), and graphical models.

I utilize Python as my primary programming language, and Postgres as my primary data store. While I have invested some time in Neo4j as a graphical store, I am keen to implement more flexible, custom solutions.

One of my longtime Aims is building a high-quality relational knowledge store from information extracted from PubMed and other biomedical data (metabolome; metabolic networks and pathways; cellular signaling networks; …) for use in recommendation, summarization, question answering, and biomedical knowledge discovery.

Coincident with my focus on NLP-based methods (late 2015) were breakthrough advances in machine learning (ML), including ML-based advances in NLP, a field that until then had relied largely on cumbersome, domain-specific, rules-based approaches. Accordingly, I spent nearly two years fully immersing myself in ML-related background, theory and hands-on programming: see my Personal Projects page for a few examples. Satisfied with that effort, later in 2017 I shifted my focus back to my overall strategy.

Recent Work

In more recent work, after a period of activity around

  • information retrieval / storage: PostgreSQL, PSQL;
  • graphical models: Neo4j, Cypher; and
  • adding all human genomic and metabolome data (NCBI; HMDB) to PostgreSQL

my focus returned to the most recent advances in NLP and ML.

Analogous to earlier, stunning advances in ML-based computer vision and other ML domains (reinforcement learning; deep network models; generative adversarial models; etc.), early in 2018 equally stunning advances in pretrained language models emerged, providing unparalleled opportunities in NLP and language understanding. For example, recent pretrained language models enable highly performant natural language processing that includes the processing of out-of-vocabulary words, and domain adaptation (transfer and multitask learning).

2018 also saw significant advances in graph-based machine learning, and graph signal processing. Recent advances in graphical models (embeddings, representations, attention and convolutions) and knowledge graph completion facilitate large-scale link prediction (relation extraction) and latent knowledge discovery, while recent advances in graph signal processing can be used to infer global properties based on sampling, noise reduction, etc.

Integration of these technologies provides a rational and comprehensive approach to implementing my Aims and Vision.

My next steps include retrieving a high-quality subset of PubMed and PubMed Central and extracting relationships from those documents for incorporation into my knowledge graph. Knowledge graphs complement Solr and Postgres for their utility in visualizing complex datasets, establishing and mining relationships, and rapid, complex queries.

  • Applications include:

    • addressing information overload through classification, attentional models, and summarization; and
    • question answering and recommendation.
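As a minimal sketch of how extracted relationships could be stored and queried relationally, the example below loads subject-predicate-object triples into a table and follows two-hop paths with a self-join. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table layout, the function names `build_store` and `two_hop`, the predicate labels, and the `doc-N` source ids are all assumptions made for illustration, not a description of the actual knowledge store.

```python
import sqlite3


def build_store(triples):
    """Load (subject, predicate, object, source) rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?, ?)", triples)
    conn.commit()
    return conn


def two_hop(conn, start):
    """Return (start, predicate1, middle, predicate2, end) paths of length two."""
    sql = """
        SELECT a.subject, a.predicate, a.object, b.predicate, b.object
        FROM triples AS a
        JOIN triples AS b ON a.object = b.subject
        WHERE a.subject = ?
        ORDER BY a.object, b.object
    """
    return conn.execute(sql, (start,)).fetchall()


# Toy rows for illustration; the source ids are placeholders,
# not real PubMed identifiers.
TOY_TRIPLES = [
    ("hexokinase", "phosphorylates", "glucose", "doc-1"),
    ("glucose", "is_substrate_of", "glycolysis", "doc-2"),
    ("glucose", "is_converted_to", "glucose-6-phosphate", "doc-1"),
]

if __name__ == "__main__":
    conn = build_store(TOY_TRIPLES)
    for path in two_hop(conn, "hexokinase"):
        print(" -> ".join(path))
```

Keeping a source column on every row means each relationship can be traced back to the document it was extracted from. Longer paths would need further joins or a recursive query, which is where a dedicated graph store tends to become more convenient.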

I have also begun constructing some basic metabolic pathways in a graphical model, with the intention of adding cellular signaling pathways (and other data) relevant to human disease.

  • Applications include:

    • network and pathways analyses;
    • grounding of metabolic and cellular signaling pathways to external knowledge sources; and
    • in silico modeling (e.g., modeling the effects of genomic variants or dysregulated pathways and networks, and modeling therapeutic interventions).

Another interest is the creation and leveraging of multi-view and hyperbolic embeddings in knowledge graphs, enabling the encoding of various signals within the same graph. Graph signal processing methods are especially relevant for this task.

  • Applications include:

    • comparing metabolic/signaling networks in healthy/diseased patients; and
    • temporal views of metabolism and metabolites, to list a couple of examples.

In all cases, advanced NLP and ML methods will be used (as appropriate) to further explore textual and graphical knowledge stores for knowledge discovery.


Research Areas and Approaches

Research Areas | Programmatic Approaches

  • Information retrieval, preprocessing
    • Linux scripts [example-1, example-2]
    • GNU R
    • Python
    • Natural language processing: classification; clustering; constituency parsing; coreference resolution; embeddings; event detection/extraction; named entity recognition; polysemy, hypernymy; POS tagging; relation classification; relation extraction; semantic parsing; semantic role labeling; summarization; tagging; word sense disambiguation
    • Machine learning
  • Textual knowledge store
    • Apache Solr
  • Relational knowledge store
    • RDBMS: PostgreSQL (PSQL)
    • Comma separated files
  • Graphical knowledge store (knowledge graphs)
    • PostgreSQL (PSQL) [example]
    • Neo4j (Cypher)
    • Custom solutions
    • Comma separated files
    • Statistical relational learning
    • Knowledge graph embedding
  • Information overload
    • Document, text classification
    • Text summarization
  • Textual information extraction
    • Python: TF-IDF | RAKE | TextRank
    • NLP: named entity recognition | word sense disambiguation
    • NLP: information, relation extraction (dependency / syntactic parsing; noun phrase chunking)
    • Natural language processing
    • Machine learning
  • Graphical information extraction
    • Knowledge graph completion (link prediction)
    • Representation learning
    • Graph signal processing
  • Information retrieval | extraction
    • Natural language processing
    • Machine learning
  • Natural language understanding
    • Word embeddings
    • Memory based architectures
    • Language models
    • Question answering & reading comprehension
    • Natural language inference
  • In silico modeling
    • Bioinformatic approaches
    • Graphical models
    • Network, pathways analyses
  • Data visualization
    • Backend: Apache Solr
    • Frontend (GUI): HTML, JavaScript (jQuery; D3.js); custom solutions
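To illustrate the TF-IDF entry listed under textual information extraction, here is a minimal pure-Python sketch. It assumes whitespace tokenization, lowercasing, and the plain weighting tf × log(N / df); the function names `tf_idf` and `top_terms` are hypothetical, and a real pipeline would more likely rely on an established library and a proper tokenizer.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, scored as tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    docs = ["dna repair in yeast", "dna damage in yeast", "mitochondrial dna"]
    for doc, score in zip(docs, tf_idf(docs)):
        print(doc, "->", top_terms(score, k=1))
```

A term that occurs in every document scores zero under this weighting, which is why the highest-ranked terms are the ones that distinguish a document from the rest of the collection.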

Applications

As suggested in my Vision, there are numerous applications of this work, including but not limited to:

  • topic modeling
  • summarization | attention-based information retrieval
  • active / directed learning | question answering
  • automatic text understanding
  • cognitive computing
  • statistical inference (Bayesian …)
  • personal agents / assistants
  • advanced clustering methods: vector space models (VSM); knowledge graph traversals; …
  • advanced user interfaces | visualizations
  • dynamic network models | in silico modeling; e.g.:
    • temporal fluctuations in metabolite concentrations (health; disease)
    • effect on pathways due to defects in catalysis or signaling
    • effects on pathways due to genetic / epigenetic variations
    • identification of therapeutic targets
    • personalized, precision medicine
    • preventative medicine

Integral to many of those aims is the application of NLP, ML and graphical models; e.g.:

  • CNN | LSTM | bi-LSTM
  • attentional mechanisms; memory mechanisms
  • word embedding, VSM; language models
  • generative adversarial models applied to metabolic modeling
  • text, natural language understanding
  • analyses of knowledge graphs (KG) to predict previously unknown treatment and causative relations between biomedical entities
  • in silico network modeling + ML (one informs the other)
    • note (e.g.) that the Allen Institute | Google DeepMind | others are hugely invested in the application of ML to better understanding the brain, cognition, dynamic memory network modeling …