Victoria Stuart, Ph.D.:
Technical Biography

Education

B.Sc., Honours Biochemistry (Polynucleotide Chemistry), 1983, Dalhousie University
M.Sc., Occupational Hygiene (Genotoxicity), 1995, University of British Columbia
Ph.D., Biology (Molecular Genetics), 2000, University of Victoria
Postdoctoral (Molecular Genetics), 2001-2008, National Institute of Environmental Health Sciences

My knowledge and experience span

biochemistry; cell biology; metabolic pathways; cellular signaling pathways and networks
molecular genetics and genomics, including cancer biology
bioinformatics

with more recent programming experience including

Linux super-user
Python
relational databases (PostgreSQL)
graphical knowledge stores (knowledge graphs)
natural language processing / understanding
machine learning.

[See my curriculum vitae for additional detail.]

I have a lifelong interest in molecular genetics, with a focus on functional genomics: the phenotypic expression of the information encoded in our genomes.

Central to these interests is how information is encoded, retrieved and utilized.

Research: 1980 - 2008

Biochemistry; Biology; Molecular Genetics

polynucleotide chemistry
- site-directed mutagenesis
- DNA repair pathways
microbial genetics | transgenic rodent models
- spontaneous mutations (ageing)
- dietary mutagens and carcinogens
- DNA damage and repair
- mitochondrial genetics
bioinformatics
- gene expression profiles
- DNA, protein interactomes

Research: 2008 - Present

Aims

One of my longtime Aims is building a high-quality relational knowledge store using information extracted from PubMed and other biomedical data (metabolome; metabolic networks and pathways; cellular signaling networks; …) for use in recommendation, summarization, question answering, and biomedical knowledge discovery.

Background

My research at NIEHS (Durham, N.C.) involved genetic and bioinformatic analyses of DNA damage, repair and metabolism in yeast. Upon my return to Vancouver I continued to focus on bioinformatic approaches to leveraging molecular genomics data for a better understanding of metabolism, molecular genetics, functional genomics and clinical science.

To better address this Vision, I acquired expertise in Python (a powerful general-purpose programming language), relational databases, textual knowledge stores, web programming, natural language processing (NLP), and graphical models.

Computational Analyses

Coincident with my focus on NLP-based methods (late 2015) were breakthrough advances in machine learning (ML) – particularly in the computer vision domain, including the use of pretrained models and transfer learning. This period also saw stunning advances in other ML domains, including NLP, reinforcement learning, deep neural network architectures, generative adversarial models, ML platforms, etc.

Consequently, for more than 1.5 years (2015-2017) I immersed myself in machine learning: studying the theoretical background and installing and debugging major ML platforms (Theano; Caffe; Torch7; TensorFlow; …). During that period I also implemented various self-taught, personal ML projects.

Early in 2018 I returned to my primary focus of acquiring and developing the skills and understanding necessary for me to implement my Aims and Vision.

The emergence of pretrained language models in 2018 provided stunning advances and unparalleled opportunities in NLP and language understanding, including the processing of out-of-vocabulary words, domain adaptation (transfer and multitask learning), and syntactic analyses. These language models greatly facilitated ML-based advances in NLP, supplanting traditional NLP approaches, which tend to be cumbersome, domain-specific and rules-based.

2018 also saw significant advances in graph-based machine learning (embeddings, representations, attention, convolutions; knowledge graph completion; etc.) that facilitate large-scale link prediction (relation extraction) and latent knowledge discovery. Likewise, recent advances in graph signal processing can be used to infer global properties based on sampling, noise reduction, etc.

The integration of these technologies provides a rational and comprehensive approach to implementing my Aims and Vision.

My next steps include retrieving a high-quality subset of PubMed and PubMed Central and extracting relationships from those documents for incorporation into my knowledge graph. Knowledge graphs complement Solr and Postgres through their utility in visualizing complex datasets, establishing and mining relationships, and supporting rapid, complex queries.

Applications include:
- addressing information overload through classification, attentional models, and summarization; and
- question answering and recommendation.
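
As a minimal sketch of how extracted relationships might be held and queried in a knowledge graph, the following stores (subject, relation, object) triples in memory and answers a two-step query. It is an illustration only: the `TripleStore` class, its method names, and the relation labels are hypothetical, and the four triples are hand-written textbook facts rather than output extracted from PubMed. A production store would sit in PostgreSQL or Neo4j rather than in Python sets.

```python
class TripleStore:
    """Tiny in-memory store of (subject, relation, object) triples (hypothetical)."""

    def __init__(self):
        self.triples = set()
        self.by_subject = {}  # subject -> set of triples, for faster lookups

    def add(self, subject, relation, obj):
        triple = (subject, relation, obj)
        self.triples.add(triple)
        self.by_subject.setdefault(subject, set()).add(triple)

    def match(self, subject=None, relation=None, obj=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        if subject is not None:
            candidates = self.by_subject.get(subject, set())
        else:
            candidates = self.triples
        return sorted(
            t for t in candidates
            if (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)
        )

    def two_hop(self, start, first_relation, second_relation):
        """Follow two relations in sequence; return sorted (middle, end) pairs."""
        results = set()
        for _, _, middle in self.match(start, first_relation):
            for _, _, end in self.match(middle, second_relation):
                results.add((middle, end))
        return sorted(results)


# Hand-written illustrative triples (assumed labels, not extracted data).
kg = TripleStore()
kg.add("TP53", "encodes", "p53")
kg.add("p53", "activates", "CDKN1A")
kg.add("p53", "activates", "BAX")
kg.add("BRCA1", "participates_in", "DNA repair")

# Which genes are activated by the product of TP53?
print(kg.two_hop("TP53", "encodes", "activates"))
```

The two-step traversal is the kind of relationship mining that is awkward to express over flat text indexes but natural over a graph.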

I have also begun constructing some basic metabolic pathways in a graphical model, with the intention of adding further cellular signaling pathways (and other data) relevant to human disease.

Applications include:
- network and pathways analyses;
- grounding of metabolic and cellular signaling pathways to external knowledge sources; and
- in silico modeling (e.g., modeling the effects of genomic variants or dysregulated pathways and networks, and modeling therapeutic interventions).
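
The pathway analysis and in silico modeling ideas above can be sketched with a small directed graph. This is a simplified illustration under stated assumptions, not the actual model: the function names are hypothetical, only the first three glycolytic steps plus the entry step of the pentose phosphate pathway are included, and every reaction is treated as a one-way edge (in the cell, several of these steps are reversible). Disabling an enzyme stands in for a dysregulated or defective pathway step.

```python
from collections import deque

# Each reaction is (substrate, product, enzyme); treated as irreversible here.
REACTIONS = [
    ("glucose", "glucose-6-phosphate", "hexokinase"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "phosphofructokinase-1"),
    ("glucose-6-phosphate", "6-phosphogluconolactone",
     "glucose-6-phosphate dehydrogenase"),
]


def build_graph(reactions, disabled_enzymes=()):
    """Build an adjacency map, skipping reactions whose enzyme is disabled."""
    graph = {}
    for substrate, product, enzyme in reactions:
        if enzyme in disabled_enzymes:
            continue
        graph.setdefault(substrate, []).append((product, enzyme))
    return graph


def find_route(graph, source, target):
    """Breadth-first search; return a list of (enzyme, product) steps or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, steps = queue.popleft()
        if node == target:
            return steps
        for product, enzyme in graph.get(node, []):
            if product not in visited:
                visited.add(product)
                queue.append((product, steps + [(enzyme, product)]))
    return None


healthy = build_graph(REACTIONS)
defective = build_graph(REACTIONS, disabled_enzymes={"phosphoglucose isomerase"})

# Route from glucose to fructose-1,6-bisphosphate in the intact network.
print(find_route(healthy, "glucose", "fructose-1,6-bisphosphate"))
# With phosphoglucose isomerase disabled, the same route no longer exists.
print(find_route(defective, "glucose", "fructose-1,6-bisphosphate"))
```

Comparing reachability in the intact and perturbed graphs is the simplest form of the "effect of a defect on a pathway" question; a fuller model would add reversibility, stoichiometry and kinetics.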

Another area of interest is the creation and leveraging of multi-view and hyperbolic embeddings in knowledge graphs, enabling the encoding of various signals within the same graph. Graph signal processing methods are especially relevant for this task.

Applications include:
- comparing metabolic/signaling networks in healthy/diseased patients; and
- temporal views of metabolism and metabolites, to list a couple of examples.

In all cases, advanced NLP and ML methods will be used (as appropriate) to further explore textual and graphical knowledge stores for knowledge discovery.

Research Areas and Approaches

Research Areas	Programmatic Approaches
Information retrieval, preprocessing	Linux scripts [example-1; example-2]; GNU R; Python; Natural language processing (classification, clustering, constituency parsing, coreference resolution, embeddings, event detection/extraction, named entity recognition, polysemy/hypernymy, POS tagging, relation classification, relation extraction, semantic parsing, semantic role labeling, summarization, tagging, word sense disambiguation); Machine learning
Textual knowledge store	Apache Solr
Relational knowledge store	RDBMS: PostgreSQL (PSQL); Comma separated files
Graphical knowledge store (knowledge graphs)	PostgreSQL (PSQL) [example]; Neo4j (Cypher); Custom solutions; Comma separated files; Statistical relational learning; Knowledge graph embedding
Information overload	Document, text classification; Text summarization
Textual information extraction	Python: TF-IDF | RAKE | TextRank; NLP: named entity recognition | word sense disambiguation; NLP: information, relation extraction (dependency / syntactic parsing; noun phrase chunking); Natural language processing; Machine learning
Graphical information extraction	Knowledge graph completion (link prediction); Representation learning; Graph signal processing
Information retrieval | extraction	Natural language processing; Machine learning
Natural language understanding	Word embeddings; Memory based architectures; Language Models; Question answering & Reading comprehension; Natural language inference
In silico modeling	Bioinformatic approaches; Graphical models; Network, pathways analyses
Data visualization	Backend: Apache Solr; Frontend (GUI): HTML, JavaScript (JQuery; D3.js); Custom solutions

Applications

As suggested in my Vision, there are numerous applications of this work, including but not limited to:

topic modeling
summarization | attention-based information retrieval
active / directed learning | question answering
automatic text understanding
cognitive computing
statistical inference (Bayesian …)
personal agents / assistants
advanced clustering methods: vector space models (VSM); knowledge graph traversals; …
advanced user interfaces | visualizations
dynamic network models | in silico modeling; e.g.:
- temporal fluctuations in metabolite concentrations (health; disease)
- effect on pathways due to defects in catalysis or signaling
- effects on pathways due to genetic / epigenetic variations
- identification of therapeutic targets
- personalized, precision medicine
- preventative medicine

Integral to many of these aims is the application of NLP, ML and graphical models; e.g.:

CNN | LSTM | Bi-LSTM
attentional mechanisms; memory mechanisms
word embedding, VSM; language models
generative adversarial models applied to metabolic modeling
text, natural language understanding
analyses of KG to predict previously unknown treatment and causative relations between biomedical entities
in silico network modeling + ML (one informs the other)
- note (e.g.) that the Allen Institute | Google DeepMind | others are hugely invested in the application of ML to better understanding the brain, cognition, dynamic memory network modeling …
…
