Overview

I wanted to create a dictionary of dictionaries, where the inner dictionaries are tokens (words) from sentences, annotated with

  • parts of speech (e.g. NNP; PRP)
  • named entities (e.g. ORGANIZATION; PERSON)
  • dependency tags (e.g. nsubj; dobj)

and where the keys in the outer dictionary are the sentence numbers. I will then use that dictionary of annotated tokens for downstream natural language processing (NLP).

Here is a simplified example. 

{i: {}}

e.g.

{0: {'pos':'', 'dp':'', 'ner':''}}

Rationale for this approach

Using defaultdict overcomes several issues that arise when building dictionaries dynamically.

For example, the following appears to work while the outer key is the one the dictionary was initialized with (0), until you try to add an outer key > 0:

i = 0
j = 0

d = {i: {j: {}}}

d[0][0] = 'apple'
d[0][1] = 'banana'

print(d)
# {0: {0: 'apple', 1: 'banana'}}

d[1][0] = 'carrot'
'''
Traceback (most recent call last):
  File "<console>", line 1, in <module>
KeyError: 1
'''

Likewise, the following example encounters the same issue (the solution, shown in Example 3, is to deploy a defaultdict() within a defaultdict()):

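import json
from collections import defaultdict

i = 0

## StrictDict is the dict subclass defined under "Solutions", below
annotated_tokens_dict = {i: defaultdict(lambda: StrictDict({
            'gov':'',
            'dep':'',
            'depTag':'',
            'govGloss':'',
            'depGloss':'',
            'gov_pos':'',
            'dep_pos':'',
            'gov_ner':'',
            'dep_ner':''
        }
        ))}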

for i in range(1):
    annotated_tokens_dict[i][0]['gov_pos'] = 'NNP'
    annotated_tokens_dict[i][0]['depTag'] = 'nsubj'
    annotated_tokens_dict[i][1]['gov_pos'] = 'PRP'
    annotated_tokens_dict[i][2]['depTag'] = 'dobj'

print(json.dumps(annotated_tokens_dict, indent=2, sort_keys=True))
'''
annotated_tokens_dict:
{
  "0": {
    "0": {
      "dep": "",
      "depGloss": "",
      "depTag": "nsubj",
      "dep_ner": "",
      "dep_pos": "",
      "gov": "",
      "govGloss": "",
      "gov_ner": "",
      "gov_pos": "NNP"
    },
    "1": {
      "dep": "",
      "depGloss": "",
      "depTag": "",
      "dep_ner": "",
      "dep_pos": "",
      "gov": "",
      "govGloss": "",
      "gov_ner": "",
      "gov_pos": "PRP"
    },
    "2": {
      "dep": "",
      "depGloss": "",
      "depTag": "dobj",
      "dep_ner": "",
      "dep_pos": "",
      "gov": "",
      "govGloss": "",
      "gov_ner": "",
      "gov_pos": ""
    }
  }
}
'''

Problems:

1. Initialization issue (starting key out of range):

i = 1    ## annotated_tokens_dict is re-created as above, now with 1 as its only outer key

for i in range(2):
    print(i)
'''
0
1
'''

for i in range(2):
    annotated_tokens_dict[i][0]['gov_pos'] = 'NNP'

'''
Traceback (most recent call last):
  File "/mnt/Vancouver/apps/CoreNLP/_victoria/test.py", line 417, in <module>
    annotated_tokens_dict[i][0]['gov_pos'] = 'NNP'
KeyError: 0
'''

2. Inserting new keys and values:

The outer key [i] in annotated_tokens_dict[i][0][''] cannot be incremented past its initialization value:

i = 0    ## annotated_tokens_dict is created as above, with 0 as its only outer key

for i in range(2):    ## i.e., i = 0; i = 1
    annotated_tokens_dict[i][0]['gov_pos'] = 'NNP'

'''
Traceback (most recent call last):
  File "/mnt/Vancouver/apps/CoreNLP/_victoria/test.py", line 417, in <module>
    annotated_tokens_dict[i][0]['gov_pos'] = 'NNP'
KeyError: 1
'''
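
Both problems come down to the same asymmetry: the outer dictionary is a plain dict, so only the key supplied at creation time exists, while the inner defaultdict creates token entries on demand. A minimal sketch of that behaviour follows; the name annotations is hypothetical, and a plain dict stands in for StrictDict to keep the example short.

```python
from collections import defaultdict

# Outer level: plain dict with a single sentence key (0).
# Inner level: defaultdict that builds a fresh token record on first access.
annotations = {0: defaultdict(lambda: {'pos': '', 'dp': '', 'ner': ''})}

# New token keys under the existing sentence are created automatically.
annotations[0][3]['pos'] = 'NNP'

# A new sentence key is not: the plain outer dict raises KeyError.
try:
    annotations[1][0]['pos'] = 'PRP'
    outer_error = None
except KeyError as exc:
    outer_error = exc.args[0]

print(outer_error)  # 1
```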

Solutions


Create a dictionary with immutable keys

… e.g., preventing key creation through d[key] = val

Solution: create a child of dict with a special __setitem__ method that refuses to accept keys that didn’t exist when the dictionary was initially created.

class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)
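
A short usage sketch of the class above; the variable names are illustrative. One caveat, which is a property of Python's built-in dict rather than something specific to this class: inherited methods such as update() do not go through an overridden __setitem__, so they can still add new keys.

```python
class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)

token = StrictDict({'pos': '', 'dp': '', 'ner': ''})

token['pos'] = 'NNP'          # existing key: allowed

try:
    token['lemma'] = 'apple'  # unknown key: rejected by __setitem__
    rejected = False
except KeyError:
    rejected = True

# Caveat: dict.update() bypasses the overridden __setitem__.
token.update({'lemma': 'apple'})

print(rejected)          # True
print('lemma' in token)  # True
```

If that loophole matters, update() and setdefault() would need to be overridden as well.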

In my NLP use case I want to be able to add data for new sentences (i) as I process Stanford CoreNLP parse data; therefore, I want something like this,

d = {i: StrictDict({'pos':'', 'dp':'', 'ner':''})}

not this,

d = StrictDict({i: {'pos':'', 'dp':'', 'ner':''}})
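
The difference between the two layouts can be sketched as follows, assuming the StrictDict definition above (the names by_sentence and locked are hypothetical):

```python
class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)

i = 0

# Layout 1: plain outer dict, strict inner records.
by_sentence = {i: StrictDict({'pos': '', 'dp': '', 'ner': ''})}
by_sentence[1] = StrictDict({'pos': '', 'dp': '', 'ner': ''})  # new sentence: accepted

# Layout 2: strict outer dict, plain inner records.
locked = StrictDict({i: {'pos': '', 'dp': '', 'ner': ''}})
try:
    locked[1] = {'pos': '', 'dp': '', 'ner': ''}  # new sentence: rejected
    outer_locked = False
except KeyError:
    outer_locked = True

locked[0]['anything'] = 'goes'  # the inner dict is a plain dict, so this is accepted

print(sorted(by_sentence))  # [0, 1]
print(outer_locked)         # True
```

In the second layout the protection sits at the wrong level: sentences cannot be added, while the token metadata keys are unprotected.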


Inserting default values into a dictionary

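I also want every new entry to start out with the full set of metadata keys, each holding a default value, rather than adding keys by hand (and risking accidental new keys): a defaultdict provides a good solution.

Here, “dd” is my abbreviation for defaultdict() and “d” is the source dictionary: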

from collections import defaultdict
dd = defaultdict(lambda: <data_structure>, d)

where <data_structure> can be whatever you want it to be, e.g.:

  • None (NoneType)
  • '' (string)
  • [] (list)
  • {} (dictionary)
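
A brief sketch of that pattern with concrete factories and a made-up source dictionary (d and its contents are illustrative). Two details of the standard-library behaviour are worth noting: the factory is called only for keys that are missing, and passing d as the second argument copies its items shallowly, so nested values are shared with d.

```python
from collections import defaultdict

d = {'a': [1, 2]}

# list as the factory: missing keys start out as an empty list
dd = defaultdict(list, d)
dd['b'].append(3)     # 'b' is created on first access
dd['a'].append(99)    # existing key: the factory is not called

print(dict(dd))  # {'a': [1, 2, 99], 'b': [3]}
print(d)         # {'a': [1, 2, 99]}  (the inner list is shared with d)

# A lambda lets the default be any value, including a constant
counts = defaultdict(lambda: 0)
counts['NNP'] += 1
print(counts['NNP'])  # 1
```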


Example 1

If you don’t mind working on an existing dictionary, use this example.

If you want to work on a copy of an existing dictionary, see Example 2.

from collections import defaultdict

class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)

i = 0

d = defaultdict(lambda: StrictDict({'pos':'', 'dp':'', 'ner':''}))

type(d)
'''
  <class 'collections.defaultdict'>
'''

d
'''
  defaultdict(<function <lambda> at 0x7f5509e7e700>, {})
'''

d[0]['dp'] = 'foo'
d
'''
  defaultdict(<function <lambda> at 0x7fddac089700>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''}})
'''

d[0]['dpp'] = 'foo'    ## invalid key -- throws error:
'''
  Traceback (most recent call last):
    File "<console>", line 1, in <module>
    File "<console>", line 4, in __setitem__
  KeyError: "'dpp' is not a legal key of this StrictDict"
'''

d[1]['ner'] = 'bar'

d
'''
  defaultdict(<function <lambda> at 0x7fddac089700>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''},
              1: {'dp': '', 'ner': 'bar', 'pos': ''}})
'''

print(d)
'''
  defaultdict(<function <lambda> at 0x7fddac089700>, {0: {'pos': '', 'dp': 'foo', 'ner': ''}, 1: {'pos': '', 'dp': '', 'ner': 'bar'}})
'''

import json

print(json.dumps(d, indent=2, sort_keys=True))    ## sorted "pretty-print" 
'''
  {
    "0": {
      "dp": "foo",
      "ner": "",
      "pos": ""
    },
    "1": {
      "dp": "",
      "ner": "bar",
      "pos": ""
    }
  }
'''

To remove an unwanted key,

d.pop(1)    ## delete the second key, values:
'''
  {'dp': '', 'ner': 'bar', 'pos': ''}
'''

d
'''
  defaultdict(<function <lambda> at 0x7fddac089700>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''}})
'''
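
One side effect to keep in mind with this setup: merely reading a missing key from a defaultdict inserts it. A small sketch, with plain dicts standing in for StrictDict to keep it short; membership tests and .get() do not trigger the factory.

```python
from collections import defaultdict

d = defaultdict(lambda: {'pos': '', 'dp': '', 'ner': ''})
d[0]['dp'] = 'foo'

print(7 in d)      # False: membership tests do not create keys
print(d.get(7))    # None: .get() does not call the factory either
print(sorted(d))   # [0]

_ = d[7]           # plain indexing DOES create the entry
print(sorted(d))   # [0, 7]
```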


Example 2

To work on a copy of a dictionary, leaving the original unchanged, refer to my StackOverflow post,

How To Update Values in Dictionary of Lists

import copy
dict2 = copy.deepcopy(dict1)
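
Why deepcopy rather than a shallow copy: with nested dictionaries, a shallow copy duplicates only the outer level, so edits to the inner dictionaries still reach the original. A minimal sketch, with dict1 filled with made-up data:

```python
import copy

dict1 = {0: {'pos': '', 'dp': '', 'ner': ''}}

shallow = dict(dict1)          # same effect as copy.copy(dict1)
deep = copy.deepcopy(dict1)

shallow[0]['dp'] = 'nsubj'     # inner dict is shared: dict1 changes too
deep[0]['ner'] = 'PERSON'      # fully independent: dict1 is untouched

print(dict1)  # {0: {'pos': '', 'dp': 'nsubj', 'ner': ''}}
print(deep)   # {0: {'pos': '', 'dp': '', 'ner': 'PERSON'}}
```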

Everything is as in Example 1, with the addition of deepcopy.

  • defaultdict allows you to write to the copy of the dictionary
  • StrictDict() preserves the keys, preventing accidental addition of new or corrupted keys
  • deepcopy ensures that you work only on a copy (dd) of the source dictionary (d).

import copy

'''
Example 1:

  d = defaultdict(lambda: StrictDict({'pos':'', 'dp':'', 'ner':''}))
  d
    {0: {'dp': 'foo', 'ner': '', 'pos': ''}}
'''

# Here:

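#d = defaultdict(lambda: StrictDict({'pos':'', 'dp':'', 'ner':''}))
## deep-copy the data in d (not the lambda), then re-wrap each record in a StrictDict;
## calling copy.deepcopy() on a StrictDict directly fails, because deepcopy rebuilds
## dict subclasses key by key through __setitem__:
dd = defaultdict(lambda: StrictDict({'pos':'', 'dp':'', 'ner':''}),
                 {k: StrictDict(copy.deepcopy(dict(v))) for k, v in d.items()})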

dd
'''
  defaultdict(<function <lambda> at 0x7f3cbec5a160>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''}})
'''

dd[0]['dp'] = 'foo'

dd[0]['dpp'] = 'bar'    ## invalid key -- throws error:
'''
  Traceback (most recent call last):
    File "<console>", line 1, in <module>
    File "<console>", line 4, in __setitem__
  KeyError: "'dpp' is not a legal key of this StrictDict"
'''

dd
'''
  defaultdict(<function <lambda> at 0x7f3cbec5a280>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''}})
'''

dd[1]['dp'] = 'bar'    ## notice how this inserts that value AND includes all keys:

d    ## unaltered (as expected -- working on a deepcopy):
'''
  {0: {'dp': 'foo', 'ner': '', 'pos': ''}}
'''

dd
'''
  defaultdict(<function <lambda> at 0x7f3cbec5a280>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''},
              1: {'dp': 'bar', 'ner': '', 'pos': ''}})
'''

dd[1]['dp']
'''
  'bar'
'''

d    ## unaltered (as expected -- working on a deepcopy):
'''
  {0: {'dp': 'foo', 'ner': '', 'pos': ''}}
'''

dd.pop(1)    ## delete the second key, values:
'''
  {'dp': 'bar', 'ner': '', 'pos': ''}
'''

dd
'''
  defaultdict(<function <lambda> at 0x7f3cbec5a280>,
              {0: {'dp': 'foo', 'ner': '', 'pos': ''}})
'''

Example 3

Here is an example where I create a JSON-like data structure (a dictionary within a dictionary within a dictionary),

{0: {0: {'key':'value'} } }

e.g.

{sentence_number: {token_number: {'depparse_tag': 'tag'} } } .

While I could have used other data structures (e.g. a dictionary of tokens / metadata inside a list of sentences; …) I wanted to maintain a consistent data structure, in a JSON format, amenable to facile processing and human-readable output (pretty printed) – as needed.

The use of defaultdict() enables you to dynamically add new key:value data, without encountering TypeError: unhashable type: 'dict' and/or KeyError errors.

import json
from collections import defaultdict

class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)


d = defaultdict(lambda: defaultdict(lambda: StrictDict({'pos':'', 'dp':'', 'ner':''})))

d[0][0]['pos'] = 'NNP'
d[1][2]['dp'] = 'nsubj'
d[5][1]['ner'] = 'PERSON'
d[0][1]['dp'] = 'dobj'

print(json.dumps(d, indent=2, sort_keys=True))
'''
{
  "0": {
    "0": {
      "dp": "",
      "ner": "",
      "pos": "NNP"
    },
    "1": {
      "dp": "dobj",
      "ner": "",
      "pos": ""
    }
  },
  "1": {
    "2": {
      "dp": "nsubj",
      "ner": "",
      "pos": ""
    }
  },
  "5": {
    "1": {
      "dp": "",
      "ner": "PERSON",
      "pos": ""
    }
  }
}
'''
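
Note that the integer sentence and token numbers appear as strings in the output above: JSON object keys are always strings, so json.dumps converts them. A sketch of what that means on a round trip, using a small plain dict (the names data, restored and as_ints are illustrative):

```python
import json

data = {0: {1: {'pos': 'NNP', 'dp': '', 'ner': ''}}}

restored = json.loads(json.dumps(data))

print(0 in restored)              # False: the key is now the string "0"
print(restored["0"]["1"]["pos"])  # NNP

# Convert the keys back to int if the numeric form is needed again
as_ints = {int(s): {int(t): tok for t, tok in toks.items()}
           for s, toks in restored.items()}
print(as_ints == data)            # True
```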

StrictDict() test

This raises a KeyError, as 'foo' is not a permitted key:

d[1][0]['foo'] = 'bar'
'''
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 4, in __setitem__
KeyError: "'foo' is not a legal key of this StrictDict"
'''
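
A subtlety of this test: the KeyError is raised by the innermost StrictDict, but by that point both defaultdict levels have already created d[1][0] with its default values. A rejected write can therefore leave an empty token record behind. A sketch, assuming the same definitions as in Example 3:

```python
from collections import defaultdict

class StrictDict(dict):
    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
        dict.__setitem__(self, key, value)

d = defaultdict(lambda: defaultdict(lambda: StrictDict({'pos': '', 'dp': '', 'ner': ''})))

try:
    d[1][0]['foo'] = 'bar'   # rejected by StrictDict
except KeyError:
    pass

print(1 in d)         # True: the sentence entry was created anyway
print(0 in d[1])      # True: so was the token record
print(dict(d[1][0]))  # {'pos': '', 'dp': '', 'ner': ''}
```

If such leftovers are unwanted, they can be removed with pop(), as shown in Example 1.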