  • BioNLP: natural language processing, applied in the biomedical domain. See also: Natural Language Processing.

  • Computational Linguistics: an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.

  • Dependency Parsing: (see also Syntactic/Semantic Parsing) a dependency parser analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads.

    The basic idea is that syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies. The sentence is an organized whole, the constituent elements of which are words. Every word that belongs to a sentence ceases by itself to be isolated, as in the dictionary. Between the word and its neighbors, the mind perceives connections, the totality of which forms the structure of the sentence.

    The structural connections establish dependency relations between the words. The dependencies are all binary relations: a grammatical relation holds between a governor (also known as a regent or a head) and a dependent.

    Thus, in the sentence “Winehouse performed …”, “performed” is the governor and “Winehouse” is the dependent (subordinate).

    Among other tasks, dependency parse trees may be applied to basic relation extraction. Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. Stanford dependencies (SD) are triplets: name of the relation, the governor, and the dependent.

    For example, the sentence “Winehouse effortlessly performed her song Rehab.” yields the following dependencies:

      nsubj(performed-3, Winehouse-1)
      advmod(performed-3, effortlessly-2)
      poss(Rehab-6, her-4)
      nn(Rehab-6, song-5)
      dobj(performed-3, Rehab-6)

    In this example, the shortest path between “Winehouse” and “Rehab” is:

      Winehouse nsubj performed dobj Rehab.

    and an extracted relation (triple) would be (Winehouse; performed; Rehab)

    Winehouse dependency parse
          [graphs above per Stanford CoreNLP online demo]
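
    The shortest-path step above can be sketched in a few lines of Python. This is only an illustration, not how Stanford CoreNLP computes anything: the triples are copied from the example, the helper names (parse_triples, shortest_path) are made up for this sketch, and words are keyed by their surface form, which assumes that no word occurs twice in the sentence.

```python
import re
from collections import deque

# Stanford-dependency triples copied from the example sentence:
# relation(governor-index, dependent-index)
TRIPLES = """
nsubj(performed-3, Winehouse-1)
advmod(performed-3, effortlessly-2)
poss(Rehab-6, her-4)
nn(Rehab-6, song-5)
dobj(performed-3, Rehab-6)
"""

PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_triples(text):
    """Return a list of (relation, governor, dependent) tuples."""
    return [(m.group(1), m.group(2), m.group(4)) for m in PATTERN.finditer(text)]


def shortest_path(triples, source, target):
    """Breadth-first search over the dependency graph, ignoring edge direction."""
    graph = {}
    for rel, gov, dep in triples:
        graph.setdefault(gov, []).append((rel, dep))
        graph.setdefault(dep, []).append((rel, gov))
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]  # paths alternate word, relation, word, ...
        if node == target:
            return path
        for rel, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [rel, neighbour])
    return None


triples = parse_triples(TRIPLES)
path = shortest_path(triples, "Winehouse", "Rehab")
print(" ".join(path))  # Winehouse nsubj performed dobj Rehab

# With a single governor on the path, the relation triple is (start, governor, end).
relation = (path[0], path[2], path[4])
```

    Treating the parse as an undirected graph is a simplification; it is enough here because the path only has to connect the two entities through their shared governor.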

  • Entity Normalization: the mapping of a named entity or type in the text to a unique identifier, possibly requiring disambiguation and contextual analysis.
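
    As a rough sketch of what such a mapping might look like, the following Python snippet resolves a mention to an identifier by dictionary lookup, using overlapping context words to choose between candidate senses. The lexicon, the cue words, and the identifiers (EX:0001, EX:0002) are all hypothetical placeholders, not entries from any real database, and real systems use far richer context models.

```python
import re

# Hypothetical lexicon: surface form -> list of (identifier, context cue words).
# The identifiers are placeholders invented for this sketch.
LEXICON = {
    "paris": [
        ("EX:0001", {"france", "french", "seine"}),  # the city in France
        ("EX:0002", {"texas", "usa"}),               # a city in Texas
    ],
}


def normalize(mention, context):
    """Return the identifier whose cue words best overlap the context, or None."""
    candidates = LEXICON.get(mention.lower())
    if not candidates:
        return None
    words = set(re.findall(r"\w+", context.lower()))
    # max() keeps the first candidate on ties, so the first sense is the default.
    best = max(candidates, key=lambda cand: len(cand[1] & words))
    return best[0]


print(normalize("Paris", "She flew to Paris, the capital of France."))
print(normalize("Paris", "Paris is a small city in Texas."))
```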

  • Functional Genomics: the standard interpretation is as described on Wikipedia:

    Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing) to describe gene (and protein) functions and interactions. Unlike structural genomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.

    However, I use that term more broadly, in the sense of:

    How is the information contained in our genome expressed, and what are the functional consequences of the expression of that information?

    My use of the term “functional genomics” thus spans genomics, molecular genetics, biochemistry, and bioinformatics. I am particularly fascinated by how the information in our genome is encoded and manifested.

    Individual variations in our genetic / epigenetic makeup determine who we are, and how we respond to both

    • extrinsic factors:
      • the environment (environmental stress: radiation, heat, famine, anoxia, toxins, pollutants, chemicals, …)
      • pathogens (bacterial, viral)


    • intrinsic factors:
      • metabolism (e.g. different functional isotypes of proteins, that affect how we process chemicals and drugs, relevant e.g. to toxicology and cancer chemotherapy …)
      • mutation (spontaneous: ageing; induced: environmental in nature but affecting the individual)
  • Graph database: data often exists as relationships between different objects. Relational databases (RDBMS) store highly structured data, but they do not store relationships as objects in their own right: connections between records are expressed through foreign keys and reconstructed with joins at query time. Graph databases, by contrast, store relationships and connections as first-class entities, and excel at managing highly connected data and complex queries. [… continued]
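
    To make the idea of relationships as first-class entities concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j or any other graph database is implemented; the class and method names (TinyGraph, add_node, relate, neighbours) are invented for illustration. The point is only that each relationship is a stored object with its own type and properties, so following a connection is a direct lookup rather than a join.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    type: str
    start: int  # id of the source node
    end: int    # id of the target node
    properties: dict = field(default_factory=dict)


class TinyGraph:
    """Toy property graph: nodes plus explicitly stored relationships."""

    def __init__(self):
        self.nodes = {}
        self.out_edges = {}  # node id -> list of outgoing Relationship objects

    def add_node(self, label, **props):
        node = Node(len(self.nodes), label, props)
        self.nodes[node.id] = node
        self.out_edges[node.id] = []
        return node

    def relate(self, start, rel_type, end, **props):
        rel = Relationship(rel_type, start.id, end.id, props)
        self.out_edges[start.id].append(rel)
        return rel

    def neighbours(self, node, rel_type):
        """Follow stored relationships of one type directly from a node."""
        return [self.nodes[r.end] for r in self.out_edges[node.id] if r.type == rel_type]


g = TinyGraph()
artist = g.add_node("Artist", name="Winehouse")
song = g.add_node("Song", title="Rehab")
rel = g.relate(artist, "PERFORMED", song, manner="effortlessly")

print([n.properties["title"] for n in g.neighbours(artist, "PERFORMED")])
```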

  • Information Extraction (IE): the process of extracting structured information (e.g. events, binary relations, etc.) from text and data, so that it can be used for another purpose, such as an information retrieval system (e.g. a search engine). IE creates structured information from unstructured text. See also: Relation Extraction. [image source]

    IE vs. IR

  • Information Retrieval (IR): employs highly scalable statistics-based techniques to index and search large volumes of text efficiently. Information retrieval is query-driven: you specify what information you need, and matching documents are returned in human-readable form.
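
    One common statistics-based approach is an inverted index with TF-IDF scoring. The sketch below is a minimal illustration under that assumption, not a description of any particular search engine; the three documents, their ids, and the function names are made up for the example, and the smoothed IDF formula is just one of several variants.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by summed TF-IDF over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed inverse document frequency: rarer terms weigh more.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical mini-corpus for illustration only.
docs = {
    "d1": "Neo4j is a graph database",
    "d2": "Information retrieval indexes large volumes of text",
    "d3": "Dependency parsing analyzes sentence structure",
}
index = build_index(docs)
print(search(index, len(docs), "graph database"))
```

    Only documents that share at least one term with the query are scored, which is what keeps lookup cheap as the collection grows.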

  • Knowledge Discovery in Databases (KDD): the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data. Often referred to as “data mining”, KDD is a process that includes data selection and preparation, data cleansing, the incorporation of prior knowledge about the data sets, and the accurate interpretation of the observed results.

  • Named Entity: the difference between an entity and a named entity is essentially the difference between a common noun and a proper noun. An entity can be nominal, i.e. a common thing like a city, whereas a named entity is more like a proper noun, such as a name (Paris). In other words, a named entity is something that deserves to have a name. A human is an entity, but if we give a human a name, this produces a named entity. Named entities may consist of more than one word.

  • Natural Language Processing (NLP): a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.

  • Neo4j: graph database management system. Neo4j is an ACID-compliant transactional database with native graph storage and processing. Neo4j is accessible using the Cypher Query Language. [… continued]

  • Ontology: a representation / specification of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. Commonly, ontologies are represented as a graph structure that encodes a taxonomy.
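
    A taxonomy stored as a graph can be sketched as a mapping from each class to its parent classes, with a traversal that collects all ancestors. The small "is a" hierarchy below is a toy example written for this sketch, not an excerpt from any published ontology, and the function name ancestors is likewise an assumption.

```python
# Toy "is a" hierarchy: class -> list of direct parent classes.
IS_A = {
    "kinase": ["transferase"],
    "transferase": ["enzyme"],
    "enzyme": [],
}


def ancestors(term, is_a=IS_A):
    """Collect every class reachable from term by following is-a edges."""
    found = []
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(is_a.get(parent, []))
    return found


print(ancestors("kinase"))
```

    Allowing a list of parents per class means the same code also handles hierarchies in which a class has more than one parent.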

  • Provenance: a reference to the literature from which a statement or its supporting evidence was derived.

  • Relation Extraction (RE): a subproblem of IE that addresses the extraction of labeled relations between two named entities. Dependency parsing and phrase structure parsing may be combined for relation extraction. To minimize cascading errors, accurate sentence splitting (chunking) is required prior to the dependency parsing step. See also: Information Extraction; Dependency Parsing.

  • Syntactic/Semantic Parsing: (see also: Dependency Parsing) processing of the sentence structure using statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a ‘noun phrase’), their roles (for example, the ‘subject’) and dependencies.

    An example of the application of semantic parsing in biomedical question-answering is indicated in Fig. 3 in: Titov & Klementiev (2011) [pdf]:

    Semantic parsing: biomedical QA

  • Text Mining (TM): comprises the discovery and extraction of knowledge from free text, and can extend to the generation of new hypotheses by joining the extracted information from several publications. Text mining solutions can achieve different objectives, depending on the tasks they then have to address. Primarily, we can distinguish four different categories of purposes for text-mining solutions: information retrieval, information extraction, building knowledge bases, and knowledge discovery. These categories are illustrated in the following figure [image source]:

    text mining
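
    The idea of generating hypotheses by joining information extracted from several publications can be sketched as a join over relation triples: if one paper links A to B and a different paper links B to C, then A and C form a candidate connection worth checking. The facts, entity names, and paper ids below are hypothetical placeholders, and the function name candidate_hypotheses is an assumption made for this sketch.

```python
from collections import defaultdict

# Hypothetical extracted relations: (subject, relation, object, source publication).
FACTS = [
    ("drug X", "inhibits", "protein Y", "paper-1"),
    ("protein Y", "promotes", "disease Z", "paper-2"),
    ("drug X", "binds", "protein W", "paper-3"),
]


def candidate_hypotheses(facts):
    """Join A-B and B-C facts from different sources into untested A-C candidates."""
    by_subject = defaultdict(list)
    for subj, rel, obj, source in facts:
        by_subject[subj].append((rel, obj, source))
    # Pairs that are already directly linked are not new hypotheses.
    known = {(subj, obj) for subj, _, obj, _ in facts}
    hypotheses = []
    for a, _, b, src1 in facts:
        for _, c, src2 in by_subject.get(b, []):
            if src1 != src2 and a != c and (a, c) not in known:
                hypotheses.append((a, b, c, (src1, src2)))
    return hypotheses


for a, b, c, sources in candidate_hypotheses(FACTS):
    print(f"{a} may be related to {c} via {b} (from {sources[0]} and {sources[1]})")
```

    Keeping the source of each fact alongside the triple also preserves provenance, so every candidate can be traced back to the publications it was joined from.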