Part 1 (here!)   


Here is the source file, with which I am currently working:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

$ gzip -l gene_info.gz
         compressed        uncompressed  ratio uncompressed_name
          425798821          2754397759  84.5% gene_info

$ gzip -dc gene_info.gz | wc -l
20800987

So, unzipped:

  • 20,800,987 lines
  • 2,754,397,759 bytes (2.754 GB)

In Part 1 I extracted the human (Homo sapiens; tax_id 9606) data, then certain columns ($2, $3, $5: GeneID | Symbol | Synonyms):

gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | 
awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' | head
$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' | head

1 A1BG A1B|ABG|GAB|HYST2477
2 A2M A2MD|CPAMD5|FWP007|S863-7
3 A2MP1 A2MP
9 NAT1 AAC1|MNAT|NAT-1|NATI
10 NAT2 AAC2|NAT-2|PNAT
11 NATP AACP|NATP1
12 SERPINA3 AACT|ACT|GIG24|GIG25
13 AADAC CES5A1|DAC
14 AAMP -
15 AANAT DSPS|SNAT

… and saved those data to a file (gene_info_9606_genes):

gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' |
awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' > gene_info_9606_genes
$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' > gene_info_9606_genes
  • See SO 5374239 for more information on tab-separated values in awk.

In preparation for importing those (and other data) into PostgreSQL, I need to normalize those data; i.e., unwind the |-delimited arrays in the third column.

This is what I want:

1 A1BG A1B
1 A1BG ABG
1 A1BG GAB
1 A1BG HYST2477
2 A2M A2MD
2 A2M CPAMD5
2 A2M FWP007
2 A2M S863-7
3 A2MP1 A2MP
9 NAT1 AAC1
9 NAT1 MNAT
9 NAT1 NAT-1
9 NAT1 NATI
...

… and this is how I did it (for simplicity, working here with gene_info_9606_genes_head – a smaller, 10-line file that I extracted in Part 1):

echo ''; cat gene_info_9606_genes_head; echo ''; cat gene_info_9606_genes_head | 
while read line; do a=$(echo "$line" | awk '{print $1, $2}') && echo "$line" | 
sed "s/|/\n$a /g"; done; echo ''
$ echo ''; cat gene_info_9606_genes_head; echo ''; cat gene_info_9606_genes_head | while read line; do a=$(echo "$line" | awk '{print $1, $2}') && echo "$line" | sed "s/|/\n$a /g"; done; echo ''

1 A1BG A1B|ABG|GAB|HYST2477
2 A2M A2MD|CPAMD5|FWP007|S863-7
3 A2MP1 A2MP
9 NAT1 AAC1|MNAT|NAT-1|NATI
10 NAT2 AAC2|NAT-2|PNAT
11 NATP AACP|NATP1
12 SERPINA3 AACT|ACT|GIG24|GIG25
13 AADAC CES5A1|DAC
14 AAMP -
15 AANAT DSPS|SNAT

1 A1BG A1B
1 A1BG ABG
1 A1BG GAB
1 A1BG HYST2477
2 A2M A2MD
2 A2M CPAMD5
2 A2M FWP007
2 A2M S863-7
3 A2MP1 A2MP
9 NAT1 AAC1
9 NAT1 MNAT
9 NAT1 NAT-1
9 NAT1 NATI
10 NAT2 AAC2
10 NAT2 NAT-2
10 NAT2 PNAT
11 NATP AACP
11 NATP NATP1
12 SERPINA3 AACT
12 SERPINA3 ACT
12 SERPINA3 GIG24
12 SERPINA3 GIG25
13 AADAC CES5A1
13 AADAC DAC
14 AAMP -
15 AANAT DSPS
15 AANAT SNAT

Here is that command, again, with explanations:

echo ''; cat gene_info_9606_genes_head; echo ''; cat gene_info_9606_genes_head | 
while read line; do a=$(echo "$line" | awk '{print $1, $2}') && echo "$line" | 
sed "s/|/\n$a /g"; done; echo ''

In sequence:

  • cat gene_info_9606_genes_head | pipes (|) that file to the while ... ; do ...; done statement, that loops over the lines in the file

    • The syntax of the while statement differs slightly as a “one-liner” on the command-line, compared with multi-line statements in BASH scripts: note carefully the use and placement of the semicolons (;) in the while statement, above.
  • Next, I need to copy the first two columns (GeneID; Symbol: approved gene name) to a variable, that I can later use: a=$(echo "$line" | awk '{print $1, $2}'). That command echoes the line to awk, which prints (copies) the first two fields. Because it is wrapped in $(...). the output of that command is saved as variable $a.

  • The && basically continues the process, if the preceding bit(s) are successful.

  • The last piece, echo "$line" | sed "s/|/\n$a /g", completes the process. Here, we echo the line to sed, which splits the line on the |-delimiter, replacing the | and for each adding a newline (\n: line break; carriage return) and pastes the $a variable (i.e., the first two fields of the original line), followed by a space.

    • Note that | is used as both a pipe and a delimiter (depending on the context)!

    • Importantly, note that to use a variable within a sed expression, you must use double quotation marks around the sed expression!       [See StackOverflow 11146098: Use a variable in a sed command.]

      $ a='apple'
      $ echo $a
      $   apple
      $ echo '$a'
      $   $a
      $ echo "$a"
      $   apple
      
    • Related to that, notice also that I wrapped $line – where it appears – in double quotation marks: "$line"!

      Per SO 142321232, Why a variable assignment replaces tabs with spaces, I need to do that to preserve tabs (\t) in my output files.

      [As some of the column 3 Synonyms values contain commas it will be simpler, later, to import a TSV into PostgreSQL.]

    • Also note that lines lacking |-delimiters, e.g.

      3 A2MP1 A2MP
      14 AAMP -
      

      are ignored, as the expression

      echo "$line" | sed "s/|/\n$a /g"

    only matches and processes lines with | characters.   


Of course, you can trim / edit that command and pipe it to an output file:

cat gene_info_9606_genes_head | while read line; do a=$(echo "$line" | 
awk '{print $1, $2}') && echo "$line" | sed "s/|/\n$a /g" >> 
gene_info_9606_genes_head_normalized; done
cat gene_info_9606_genes_head | while read line; do a=$(echo "$line" | awk '{print $1, $2}') && echo "$line" | sed "s/|/\n$a /g" >> gene_info_9606_genes_head_normalized; done

Normally, we use sed -i ... to edit a file “in place.” However, in a loop that is not possible so I append that output to a file (> overwrites; >> appends).

$ cat gene_info_9606_genes_head_normalized

1 A1BG A1B
1 A1BG ABG
1 A1BG GAB
1 A1BG HYST2477
2 A2M A2MD
2 A2M CPAMD5
...
12 SERPINA3 GIG24
12 SERPINA3 GIG25
13 AADAC CES5A1
13 AADAC DAC
14 AAMP -
15 AANAT DSPS
15 AANAT SNAT

Addendum: Follow-on Notes

When I implemented this command on the full gene_info_9606_genes file (all the H. sapiens GeneID, Symbol, Synonyms data), I encountered a sed-related error:

$ rm -fR gene_info_9606_genes_normalized; 
cat gene_info_9606_genes | while read line; do a=$(echo "$line" | awk -F$'\t' 
'BEGIN {OFS = FS} {print $1, $2}') && echo "$line" | sed "s/|/\n$a\t/g" >> 
gene_info_9606_genes_normalized; done

sed: -e expression #1, char 23: unknown option to `s'

That error has to be occurring in the first part ('s/|/...) of the sed expression:

sed "s/|/\n$a\t/g"

As I previously extracted columns 1, 2 and 3 to separate files:

cat gene_info_9606_genes | awk -F$'\t' 'BEGIN {OFS = FS} {print $1}' > gene_info_9606_genes_geneid
cat gene_info_9606_genes | awk -F$'\t' 'BEGIN {OFS = FS} {print $2}' > gene_info_9606_genes_symbol
cat gene_info_9606_genes | awk -F$'\t' 'BEGIN {OFS = FS} {print $3}' > gene_info_9606_genes_synonyms

… I could examine them for problematic characters:

rg gene_info_9606_genes_symbol -n -e "/"

  44870:THRA1/BTR

rg gene_info_9606_genes_synonyms -e "/" | head -n5

  ABL|CHDSKM|JTK7|bcr/abl|c-ABL|c-ABL1|p150|v-abl
  EWS-ATF1|FUS/ATF-1|TREB36
  BCL-XL/S|BCL2L|BCLX|Bcl-X|PPP1R52
  FLVI2/BMI1|PCGF4|RNF51|flvi-2/bmi-1
  LRRC76|RDMS|SMDAX|YF5/A2

rg is ripgrep: you can also use grep.

“apple” does not appear anywhere in the file gene_info_9606_genes, so I replaced “/” with “apple” using Neovim:

  • select all text in gene_info_9606_genes (ggVG):
  • make the substitutions: :'<,'>s/\//apple/g

which gives, e.g. line 44870, 105371807 THRA1appleBTR -

That solves it:

rm -fR gene_info_9606_genes_normalized; 
cat gene_info_9606_genes | while read line; do a=$(echo "$line" | awk -F$'\t' 
'BEGIN {OFS = FS} {print $1, $2}') && echo "$line" | sed "s/|/\n$a\t/g" >> 
gene_info_9606_genes_normalized; done

[no error reported!]

For additional information on that issue, see SO 9366816: sed - unknown option to `s’

Now we need to replace those / in both the source (gene_info_9606_genes) and output (gene_info_9606_genes_normalized) files – again using Neovim:

  • select all text (ggVG) in Neovim
  • make the replacements: :'<,'>s/apple/\//g

Check:

rg gene_info_9606_genes_normalized -e "/" | head

  25	ABL1	bcr/abl
  466	ATF1	FUS/ATF-1
  598	BCL2L1	BCL-XL/S
  648	BMI1	FLVI2/BMI1
  648	BMI1	flvi-2/bmi-1
  755	C21orf2	YF5/A2
  837	CASP4	Mih1/TX
  969	CD69	BL-AC/P26
  969	CD69	GP32/28
  1045	CDX2	CDX2/AS

rg gene_info_9606_genes_normalized -i -e "thra1/btr"

  105371807	THRA1/BTR	-

Lastly, there are many genes with no Synonyms; those fields are a dash (-). Since they occur after a tab (\t) and appear at the end of a line ($), we can replace them with “NULL”, using Neovim:

  • select all text (ggVG) in Neovim
  • make the replacements: :'<,'>s/\t-$/NULL/g

Neovim regex example

Neovim regex example - before substitution          Neovim regex example - after substitution

[Note line 25, above]

That's it!