This builds on work summarized in my May 2018 blog post.




Problem Description

I maintain a collection of tens of thousands of science-related (mostly biochemistry and molecular biology) personal emails, summarized at the end of my Aug 2017 inaugural research blog post. As of 2019-05-14, there were 70,214 of these files.

In this information extraction project, I wanted to parse and insert those data into a PostgreSQL database.

Here are the numbers, per my StackOverflow answer:



Approach

  • Preprocess the Claws Mail (CM) messages. My mail directory structure (refer above) is a series of non-nested directories, which simplifies access. Preprocessing steps included:

    • transcoding to UTF-8
    • processing the headers for select fields (Date; From; Subject)
    • removing the headers and attachments
    • processing the message bodies
    • provenance: adding fields that indicate the message source, the processed file name, other fields of interest
    • using rsync + include / exclude files to track previously processed data (a brief sketch of these steps appears at the end of this section)

  • As my research-related email collection spans >2 decades over various operating systems (WinNT; WinXP; but mostly Linux: Ubuntu; Arch Linux) and email clients (by far mostly Claws Mail, but perhaps also distant vestiges of Eudora, MS Outlook, Mozilla Thunderbird), the encodings and formats are very heterogeneous. Developing the scripts for the preprocessing of those data was by far the most challenging and time-consuming aspect of this project.

  • As part of that processing I devised an approach to tag selected portions of that content within XML tagsets, inside XML files, thus enabling the export of those data to PostgreSQL.

  • I accomplished this via BASH scripts:

    • one to process the email, and
    • the other to initiate the export process.

The second BASH script calls SQL scripts that define / build the Postgres database and tables, ingress the XML-formatted Claws Mail data, and execute postprocessing of characters that are problematic with respect to the Postgres processing ( & < % ).
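As promised above, here is a minimal sketch of the rsync copy and UTF-8 transcoding steps. It is for illustration only: the Claws Mail source path and the loop layout are assumptions, not excerpts, and the real logic (with many more edge cases) lives in claws_decode-v09.sh.

BASEDIR=/mnt/Vancouver/projects/ie/claws   ## adjust for your system ("$BASEDIR")

## Copy selected CM folders to the preprocessing input directory, skipping
## previously processed messages via the rsync include / exclude files:
rsync -av \
  --include-from="$BASEDIR/data/claws_decode-rsync_includes" \
  --exclude-from="$BASEDIR/data/claws_decode-rsync_excludes" \
  "$HOME/Mail/" "$BASEDIR/data/input/"     ## "$HOME/Mail/": assumed CM (MH-format) mail directory

## Transcode each copied message to UTF-8, if it is not already UTF-8 (or plain ASCII):
find "$BASEDIR/data/input" -type f | while read -r f; do
  enc=$(file -b --mime-encoding "$f")      ## e.g.: iso-8859-1; windows-1252; utf-8; us-ascii
  if [[ "$enc" != "utf-8" && "$enc" != "us-ascii" ]]; then
    iconv -f "$enc" -t UTF-8//TRANSLIT "$f" -o "$f.utf8" && mv "$f.utf8" "$f"
  fi
done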


Solution

For details, see the comments in my scripts. Here are some key points (excerpted).

Plain Text (Claws Mail) Preprocessing

Scripts (run in this order)

* claws_decode-v09.sh                        ## << Run 1st (manually) 
* claws2postgres-v02.sh                      ## << Run 2nd (manually) 
* claws2postgres-v05.sql                     ## << Runs 3rd (auto-called by claws2postgres-v02.sh)
* claws2postgres-postprocessing-v01.sql    ## << Runs last (auto-called by claws2postgres-v02.sh)
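
In practice a run therefore looks something like this (working from the src/bash/ directory shown in the project tree, below):

cd /mnt/Vancouver/projects/ie/claws/src/bash
./claws_decode-v09.sh      ## preprocess the Claws Mail messages --> per-message XML files
./claws2postgres-v02.sh    ## build the database / table and ingress those XML files
                           ## (auto-calls the two *.sql scripts, above)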

(Aside: in case you’re wondering about line lengths; wrapping; etc. when viewing those scripts, I program in full-screen Neovim with set textwidth=220 in my ~/.vimrc.)

As I mention below, you’ll need to adjust some paths near the top of those BASH (*.sh) scripts. To assist in the interpretation of those paths, here is my CM project structure.

[victoria@victoria claws]$ pwd; ls -l
  /mnt/Vancouver/projects/ie/claws    ## "$BASEDIR"
  total 56
  drwxr-xr-x 2 victoria victoria  4096 Apr  8 09:06 bak
  drwxr-xr-x 2 victoria victoria  4096 Apr  7 12:36 bin
  drwxr-xr-x 7 victoria victoria  4096 May 14 23:23 data
  drwxr-xr-x 3 victoria victoria  4096 May 14 13:14 docs
  drwxr-xr-x 2 victoria victoria  4096 Apr  7 12:40 logs
  -rw-r--r-- 1 victoria victoria 25203 May 14 14:59 _README.html
  drwxr-xr-x 6 victoria victoria  4096 Apr 16 17:04 src
  drwxr-xr-x 2 victoria victoria  4096 May 12 09:41 tmp

  [victoria@victoria claws]$ tree -L 3 -F . -I 'output*|input*'
  .
  ├── data/
  │   ├── claws_decode_files_processed
  │   ├── claws_decode_files_processed_test
  │   ├── claws_decode-rsync_excludes
  │   ├── claws_decode-rsync_excludes_test
  │   ├── claws_decode-rsync_includes
  │   ├── claws_decode-rsync_includes_test
  │   └── cm_sample_data.tgz
  ├── docs/
  │   ├── claws_mail_processing.html
  │   └── _readme.txt
  ├── _README.html
  └── src/
      ├── bash/
      │   ├── base64-decode.pl*
      │   ├── claws2postgres-v02.sh*
      │   ├── claws_decode-v09.sh*
      │   ├── quoted-printable-decode.pl*
      │   └── versions/
      └── sql/
          ├── claws2postgres-postprocessing-v01.sql
          ├── claws2postgres-v05.sql
          └── versions/

  [victoria@victoria claws]$

# ----------------------------------------------------------------------------

  [victoria@victoria data]$ pwd; ls -l
    /mnt/Vancouver/projects/ie/claws/data
    total 7440
    -rw-r--r--   1 victoria victoria       0 May 14 11:39 claws_decode_files_processed
    -rw-r--r--   1 victoria victoria       0 May 14 11:39 claws_decode_files_processed_test
    -rw-r--r--   1 victoria victoria       0 May 14 11:39 claws_decode-rsync_excludes
    -rw-r--r--   1 victoria victoria   79141 May 14 11:24 claws_decode-rsync_excludes_test
    -rw-r--r--   1 victoria victoria   11451 May 13 21:55 claws_decode-rsync_includes
    -rw-r--r--   1 victoria victoria     209 May 14 10:51 claws_decode-rsync_includes_test
    drwxr-xr-x 418 victoria victoria   20480 May 14 11:41 input
    drwxr-xr-x   2 victoria victoria    4096 May 14 11:39 output
    drwxr-xr-x   2 victoria victoria  180224 May 14 11:24 output_test
    drwxr-xr-x   2 victoria victoria 2633728 May 14 14:58 output_tmp

  [victoria@victoria data]$

I excluded (tree -I) the input and output data directories (tens of thousands of files) from the schematic, above.

For your reference, the paths that you will need to edit (per your use) appear near the top of those BASH (*.sh) scripts.
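
As an illustration only (the variable names and values below are assumptions, not excerpts from my scripts), such a block might look like:

BASEDIR=/mnt/Vancouver/projects/ie/claws   ## project root ("$BASEDIR", per the listing above)
DATADIR=$BASEDIR/data                      ## rsync include / exclude lists; input/, output/ dirs
INPUT_DIR=$DATADIR/input                   ## working copies of the source CM folders
OUTPUT_DIR=$DATADIR/output                 ## processed, XML-formatted messages
CM_MAILDIR=$HOME/Mail                      ## source Claws Mail (MH-format) mail directory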


In broad strokes, claws_decode-v09.sh

  • copies (via rsync) selected CM folders to my preprocessing input directory (I work on copies of my original, source CM emails);

  • pre-processes those copies, attempting to convert them to UTF-8 if not already in that encoding, removing attachments, etc.

  • PostgreSQL / PSQL have difficulty dealing with some characters ( & < % ), triggering errors such as

  ... ERROR:  2200M: invalid XML document ... DETAIL:  line 10: xmlParseEntityRef: no name

To deal with those annoyances, I temporarily replace them (those mentioned with HTML character codes; others with unique alphanumeric strings), then restore them in PSQL postprocessing.
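
A minimal sketch of that masking step (ZZPCTZZ is an invented placeholder; the actual substitutions, and their reversal, are in claws_decode-v09.sh and claws2postgres-postprocessing-v01.sql):

## Mask characters that break the Postgres XML processing. Order matters: the ampersand
## is masked first, so the "&" introduced by "&lt;" is not itself re-masked.
## "$outfile" stands in for the message file being preprocessed.
sed -i 's/&/\&amp;/g ; s/</\&lt;/g ; s/%/ZZPCTZZ/g' "$outfile"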

As an example of the technical challenges faced, here is a portion of one of those unprocessed emails. 


Imagine (for example) trying to extract the unwrapped Subject: line(s) from that mess [1], complicated by the variations in header structure / content that appear among 1000’s of messages! Throw in non-UTF-8 encodings and characters (e.g. < & %) that affect downstream Postgres / PSQL processing, and so on!

[1] My solution vis-à-vis the Subject lines was to:

  • prepend START_ to the beginning of the header lines;
  • replace blank lines with triple backticks ( ``` ), as these won’t otherwise occur (I could also have employed a unique alphanumeric string, e.g. via pwgen 6 1);
  • unwrap all of those lines using Perl (the \v “trick”), giving one long line;
  • restore the linebreaks;
  • grep the first START_Subject: string;
  • polish that Subject line (removing the START_Subject: tag and extra spaces):
SUBJ=$(perl -pe 's/^([A-Z])/\nSTART_\1/g' tmp | \
    sed 's/^$/```/g' | \
    perl -pe 's/\v/ /g' | \
    perl -pe 's/\a{0,}```\s{0,}/\n/g' | \
    grep -m1 'START_Subject:' | \
    sed 's/START_Subject: //g ; s/\s\{2,\}/ /g ; s/\s\{1,\}$//g')


Note also the aberrant wrapping of the message bodies in some of the messages (example above), an issue that I previously faced and solved through the use of my base64-decode.pl and quoted-printable-decode.pl scripts!
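
For illustration, the quoted-printable case can be handled with Perl’s core MIME::QuotedPrint module; this is only a sketch of the idea, not my quoted-printable-decode.pl script itself:

## Sketch only: decode quoted-printable content, including the "=\n" soft line breaks
## responsible for the aberrant wrapping ("$infile" stands in for the message file).
perl -MMIME::QuotedPrint -0777 -pe '$_ = decode_qp($_)' "$infile" > "${infile}.decoded"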

I managed to sort it all out (again, see the comments in my claws_decode-v09.sh script). Note that:

  • at times I rely on Perl regular expressions, which in some cases appear to be more robust to non-UTF-8 characters than sed expressions, and which are well suited to removing linebreaks;
  • at other times sed expressions are the better choice; and
  • the order of some of the processing steps is important.

Continuing …

  • claws2postgres-v02.sh

    initiates the ingressing of processed CM (XML-formatted) data into PostgreSQL.

    It also calls (and passes some BASH variables to) claws2postgres-v05.sql, which in turn calls claws2postgres-postprocessing-v01.sql for post-PSQL processing (e.g. replacement of the HTML character codes and other unique alphanumeric strings introduced in claws_decode-v09.sh to facilitate various CM preprocessing steps).
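
A sketch of that hand-off (the database name and XML file name here are invented; see claws2postgres-v02.sh for the actual call):

## Sketch only: pass BASH values to the SQL script as psql variables, where they are
## referenced as :DATABASE_bash_var and :'bash_var' (see claws2postgres-v05.sql, below).
psql -d postgres \
  -v DATABASE_bash_var="claws_db" \
  -v bash_var="/mnt/Vancouver/projects/ie/claws/data/output/0000001.xml" \
  -f /mnt/Vancouver/projects/ie/claws/src/sql/claws2postgres-v05.sql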


XML Files

This was kind of fun!  I needed to find a way of tagging CM content (untagged, plain-text, poorly formatted: wrapping, indentation, …) in a manner acceptable to Postgres / PSQL. My solution, inspired by my May 2018 blog post, was to wrap content of interest inside XML tagsets during preprocessing of the CM messages, encapsulating each parsed CM message as a separate XML file!

<?xml version="1.0" encoding="UTF-8"?>
<message>
  <uuid>...</uuid>
  <subj>...</subj>
  <date>...</date>
  <from>...</from>
  <file>...</file>
  <src>...</src>
  <urls>...</urls>
  <body>...</body>
</message>

At the end of this step I end up with XML files, illustrated in this example.


Note the simple XML declaration, above: the use of more complicated declarations may break the Postgres XML processing.
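
For illustration, emitting one such file during preprocessing can be as simple as a BASH heredoc. This is a sketch only (the variable names are assumptions); the real code is in claws_decode-v09.sh.

## Sketch only: wrap one parsed CM message in the tagset above (variable names assumed;
## field values assumed already transcoded / masked, per the preprocessing above).
cat > "$OUTPUT_DIR/${UUID}.xml" << EOF
<?xml version="1.0" encoding="UTF-8"?>
<message>
  <uuid>${UUID}</uuid>
  <subj>${SUBJ}</subj>
  <date>${MSG_DATE}</date>
  <from>${FROM}</from>
  <file>${FILENAME}</file>
  <src>${SRC}</src>
  <urls>${URLS}</urls>
  <body>${BODY}</body>
</message>
EOF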


Loading XML Data Into PostgreSQL

I previously blogged (May 2018) about this.

In that post I described the use of PostgreSQL (PSQL) temporary tables for the XML processing via xpath.

However, when I tried that on this project (on my “test” dataset: 4,722 XML files), this was the result!

TL;DR:

  Parameter   Temp Tables   Direct Import   Reduction
  ---------   -----------   -------------   ---------
  Time        1048 min      1.75 min        599x
  Space       252,000 MB    18 MB           14,000x

  • The ingress of those 4,722 files took 17 h 28 min, occupying 252 GB disk space; after a VACUUM FULL, that footprint dropped to 18 MB!

  • My solution, further below, was to load those XML data directly into Postgres, reducing the processing time (in this test run) to 1 min 45 sec and the footprint to 27 MB!

    • (The post-solution footprint in the table above differs, 18 MB vs. 27 MB, because I made some additional code edits prior to rerunning the modified code.)

  • As you can see from the output below, the XML-formatted test data occupied 35,968 KB (36.0 MB).


XML Solution

OK, how did I do this?  See the comments in my claws2postgres-v05.sql script:

\c :DATABASE_bash_var

CREATE TABLE IF NOT EXISTS claws_table (
  uuid TEXT
  ,subj TEXT CHECK (length(subj) <= 2500)
  ,body TEXT NOT NULL
  ,"date" DATE
  ,"file" TEXT NOT NULL CHECK (length("file") <= 300)
  ,src TEXT CHECK (length(src) <= 300)
  ,"from" TEXT CHECK (length("from") <= 100)
  ,urls TEXT
  ,UNIQUE (uuid, "file")
  ,PRIMARY KEY (uuid)
);

-- Solution. Here, I dropped the use of a TEMP TABLE, described in my hmdb.sql script,
-- swapping this line,
--    CREATE TEMP TABLE tmp_table AS ...
-- for this one
--    INSERT INTO claws_table

INSERT INTO claws_table
SELECT 
  (xpath('//uuid/text()', x))[1]::text AS uuid
  ,(xpath('//subj/text()', x))[1]::text AS subj
  ,(xpath('//body/text()', x))[1]::text AS body
  ,(xpath('//date/text()', x))[1]::text::date AS "date"
  ,(xpath('//file/text()', x))[1]::text AS "file"
  ,(xpath('//src/text()', x))[1]::text AS src
  ,(xpath('//from/text()', x))[1]::text AS "from"
  ,(xpath('//urls/text()', x))[1]::text AS urls
FROM unnest(xpath('//message', XMLPARSE(DOCUMENT convert_from(pg_read_binary_file(:'bash_var'), 'UTF8')))) x
ON CONFLICT(uuid) DO NOTHING
;
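
After an ingress run, a quick sanity check against the populated table (using only the columns defined above):

-- Post-ingress sanity check.
SELECT count(*)    AS messages
      ,min("date") AS earliest
      ,max("date") AS latest
FROM claws_table;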

Production Run

On 2019-05-14 – using the versions of the scripts described in this post – I ran those scripts over my full Claws Mail (research-related) email folders: 70,214 email messages! Of course, that data is proprietary (mine), but further below I provide a tgz archive that contains the scripts and sample data, enabling the reproduction of these experiments. 


Program versions

[victoria@victoria 2_RESEARCH - NEWS]$ sed --version | head -n1
  sed (GNU sed) 4.7

awk --version | head -n1
  GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)

gawk --version | head -n1
  GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)

perl -v | head -n2

  This is perl 5, version 28, subversion 2 (v5.28.2) built for x86_64-linux-thread-multi
  
postgres --version | head -n1
  postgres (PostgreSQL) 11.2

psql --version | head -n1
  psql (PostgreSQL) 11.2 

Scripts + Sample Data

I’ve uploaded a zipped archive of the scripts used here, plus some sample Claws Mail data (email messages), to my Persagen.com website:

tgz Archive – Contents


As mentioned, you will need to make a few edits to the paths near the top of the aforementioned BASH (*.sh) scripts.  Enjoy!!


Q.E.D.