Part 2, here!   


This is quite easy to do on the command line in a Linux terminal. Here are some examples, using a file I am currently working with:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

$ ls -l gene_info.gz
-rw-r--r-- 1 victoria victoria 425798821 Apr 17 11:12 gene_info.gz

$ ls -lh gene_info.gz
-rw-r--r-- 1 victoria victoria 407M Apr 17 11:12 gene_info.gz

$ gzip -l gene_info.gz
         compressed        uncompressed  ratio uncompressed_name
          425798821          2754397759  84.5% gene_info

$ gzip -dc gene_info.gz | wc -l
20800987

So, unzipped:

  • 20,800,987 lines
  • 2,754,397,759 bytes (2.754 GB)
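Both figures can also be collected in a single decompression pass. The following is only a sketch: it builds a tiny, made-up stand-in for gene_info.gz (the file name sample.gz and the variable names are my own, not part of the NCBI data). Counting the bytes of the decompressed stream with wc is also a useful cross-check, because the uncompressed size that gzip stores in the archive is a 32-bit field, so some gzip versions misreport it with `gzip -l` for data of 4 GiB or more.

```shell
set -eu

# Hypothetical stand-in for gene_info.gz: three short tab-delimited lines.
workdir=$(mktemp -d)
sample="$workdir/sample.gz"
printf '#tax_id\tGeneID\tSymbol\n9\t1246500\trepA1\n9606\t1\tA1BG\n' | gzip -c > "$sample"

# Decompress once; wc always prints the line count before the byte count.
counts=$(gzip -dc "$sample" | wc -lc)

# Split the two numbers on whitespace (intentionally unquoted).
set -- $counts
n_lines=$1
n_bytes=$2

echo "lines: $n_lines"
echo "bytes: $n_bytes"
```

On the real archive, replace "$sample" with gene_info.gz; expect this to take about as long as the `wc -l` run above, since the whole file is still decompressed.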

To examine the first lines of the decompressed file (without extracting the whole archive to disk):

$ gzip -dc gene_info.gz | head > gene_info.gz.head

$ cat gene_info.gz.head 

#tax_id	GeneID	Symbol	LocusTag	Synonyms	dbXrefs	chromosome	map_location	description	type_of_gene	Symbol_from_nomenclature_authority	Full_name_from_nomenclature_authority	Nomenclature_status	Other_designations	Modification_date	Feature_type
7	5692769	NEWENTRY	-	-	-	-	-	Record to support submission of GeneRIFs for a gene not in Gene (Azotirhizobium caulinodans.  Use when strain, subtype, isolate, etc. is unspecified, or when different from all specified ones in Gene.).	other	-	-	-	-	20171118	-
9	1246500	repA1	pLeuDn_01	-	-	-	-	putative replication-associated protein	protein-coding	-	-	-	-	20180129	-
9	1246501	repA2	pLeuDn_03	-	-	-	-	putative replication-associated protein	protein-coding	-	-	-	-	20180129	-
9	1246502	leuA	pLeuDn_04	-	-	-	-	2-isopropylmalate synthase	protein-coding	-	-	-	-	20180129	-
9	1246503	leuB	pLeuDn_05	-	-	-	-	3-isopropylmalate dehydrogenase	protein-coding	-	-	-	-	20180129	-
9	1246504	leuC	pLeuDn_06	-	-	-	-	isopropylmalate isomerase large subunit	protein-coding	-	-	-	-	20180129	-
9	1246505	leuD	pLeuDn_07	-	-	-	-	isopropylmalate isomerase small subunit	protein-coding	-	-	-	-	20180129	-
9	1246509	ibp	pBPS1_01	-	-	-	-	Ibp protein	protein-coding	-	-	-	-	20180129	-
9	1246510	repA1	pBPS1_02	-	-	-	-	repA1 protein	protein-coding	-	-	-	-	20180129	-
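With 16 columns it is easy to lose track of which awk field number ($1, $2, …) goes with which column. One way to check, sketched here on a small made-up sample rather than the real archive (sample.gz and the variable names are hypothetical), is to print the header line with one numbered column name per line:

```shell
set -eu

# Hypothetical stand-in for gene_info.gz, with only the first five columns.
workdir=$(mktemp -d)
sample="$workdir/sample.gz"
printf '#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n9\t1246500\trepA1\tpLeuDn_01\t-\n' | gzip -c > "$sample"

# Take the header line, put each column name on its own line,
# and prefix it with its field number.
numbered=$(gzip -dc "$sample" | head -n 1 | tr '\t' '\n' | awk '{print NR ": " $0}')

echo "$numbered"
```

The number in front of each name is the field number to use in awk; e.g. Symbol is field 3 and Synonyms is field 5.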

I am interested in human genes (“tax_id” 9606), so I can just extract those data:

$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' > gene_info_9606

… that saves the human-related entries to a tab-delimited file (like the source), which I can examine in Vim or in my preferred editor, Neovim (nvim | nvr -s):

$ nvr -s gene_info_9606 

Notes:

  • You can also open and work with .gz archives directly in Vim, where they open quickly and are easy to search and navigate.

  • See SO 5374239 for more information on tab-separated values in awk.
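The tax_id filter can be written more compactly, because an awk pattern with no action prints every matching line. The sketch below runs on a small made-up sample (sample.gz, the taxid variable and the FAKE row are my own inventions for illustration) and passes the tax_id in with -v, so the same command can be reused for another organism:

```shell
set -eu

# Hypothetical stand-in for gene_info.gz; the last row (tax_id 96060) is
# made up to show that only exact matches on the first field are kept.
workdir=$(mktemp -d)
sample="$workdir/sample.gz"
printf '#tax_id\tGeneID\tSymbol\n9\t1246500\trepA1\n9606\t1\tA1BG\n9606\t2\tA2M\n96060\t7\tFAKE\n' | gzip -c > "$sample"

taxid=9606   # hypothetical variable: the tax_id to keep

# A bare pattern ($1 == taxid) prints each line whose first field matches.
gzip -dc "$sample" | awk -F'\t' -v taxid="$taxid" '$1 == taxid' > "$workdir/subset"

cat "$workdir/subset"
```

Since whole lines are printed unchanged, the original tabs are preserved and no output separator needs to be set.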

Alternatively, you can preview those entries in one step, without saving them to a file:

$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | head

9606	1	A1BG	-	A1B|ABG|GAB|HYST2477	MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|Vega:OTTHUMG00000183507	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20180408	-
9606	2	A2M	-	A2MD|CPAMD5|FWP007|S863-7	MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|Vega:OTTHUMG00000150267	12	12p13.31	alpha-2-macroglobulin	protein-coding	A2M	alpha-2-macroglobulin	O	alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M	20180414	-
9606	3	A2MP1	-	A2MP	HGNC:HGNC:8|Ensembl:ENSG00000256069	12	12p13.31	alpha-2-macroglobulin pseudogene 1	pseudo	A2MP1	alpha-2-macroglobulin pseudogene 1	O	pregnancy-zone protein pseudogene	20180329	-
9606	9	NAT1	-	AAC1|MNAT|NAT-1|NATI	MIM:108345|HGNC:HGNC:7645|Ensembl:ENSG00000171428|Vega:OTTHUMG00000097001	8	8p22	N-acetyltransferase 1	protein-coding	NAT1	N-acetyltransferase 1	O	arylamine N-acetyltransferase 1|N-acetyltransferase 1 (arylamine N-acetyltransferase)|N-acetyltransferase type 1|arylamide acetylase 1|monomorphic arylamine N-acetyltransferase	20180408	-
9606	10	NAT2	-	AAC2|NAT-2|PNAT	MIM:612182|HGNC:HGNC:7646|Ensembl:ENSG00000156006|Vega:OTTHUMG00000130826	8	8p22	N-acetyltransferase 2	protein-coding	NAT2	N-acetyltransferase 2	O	arylamine N-acetyltransferase 2|N-acetyltransferase 2 (arylamine N-acetyltransferase)|N-acetyltransferase type 2|arylamide acetylase 2	20180408	-
9606	11	NATP	-	AACP|NATP1	HGNC:HGNC:15	8	8p22	N-acetyltransferase pseudogene	pseudo	NATP	N-acetyltransferase pseudogene	O	arylamide acetylase pseudogene	20180329	-
9606	12	SERPINA3	-	AACT|ACT|GIG24|GIG25	MIM:107280|HGNC:HGNC:16|Ensembl:ENSG00000196136|Vega:OTTHUMG00000029851	14	14q32.13	serpin family A member 3	protein-coding	SERPINA3	serpin family A member 3	O	alpha-1-antichymotrypsin|cell growth-inhibiting gene 24/25 protein|growth-inhibiting protein 24|growth-inhibiting protein 25|serine (or cysteine) proteinase inhibitor, clade A, member 3|serpin A3|serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3	20180408	-
9606	13	AADAC	-	CES5A1|DAC	MIM:600338|HGNC:HGNC:17|Ensembl:ENSG00000114771|Vega:OTTHUMG00000159876	3	3q25.1	arylacetamide deacetylase	protein-coding	AADAC	arylacetamide deacetylase	O	arylacetamide deacetylase|arylacetamide deacetylase (esterase)	20180329	-
9606	14	AAMP	-	-	MIM:603488|HGNC:HGNC:18|Ensembl:ENSG00000127837|Vega:OTTHUMG00000155202	2	2q35	angio associated migratory cell protein	protein-coding	AAMP	angio associated migratory cell protein	O	angio-associated migratory cell protein	20180408	-
9606	15	AANAT	-	DSPS|SNAT	MIM:600950|HGNC:HGNC:19|Ensembl:ENSG00000129673|Vega:OTTHUMG00000180179	17	17q25.1	aralkylamine N-acetyltransferase	protein-coding	AANAT	aralkylamine N-acetyltransferase	O	serotonin N-acetyltransferase|arylalkylamine N-acetyltransferase|serotonin acetylase	20180408	-

You can also search that archive with grep or ripgrep (I see no speed advantage with ripgrep here), e.g. for gene A1BG:

$ time gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | rg -e "A1BG"

9606  1         A1BG      -	A1B|ABG|GAB|HYST2477	        MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|Vega:OTTHUMG00000183507	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20180408	-
9606  503538	A1BG-AS1  -	A1BG-AS|A1BGAS|NCRNA00181	HGNC:HGNC:37133|Ensembl:ENSG00000268895	                                19	19q13.43	A1BG antisense RNA 1	ncRNA	A1BG-AS1	A1BG antisense RNA 1	O	A1BG antisense RNA (non-protein coding)|A1BG antisense RNA 1 (non-protein coding)	20180329	-
0:10.37

$ time gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | grep "A1BG"

9606  1         A1BG      -	A1B|ABG|GAB|HYST2477	        MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|Vega:OTTHUMG00000183507	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20180408	-
9606  503538	A1BG-AS1  -	A1BG-AS|A1BGAS|NCRNA00181	HGNC:HGNC:37133|Ensembl:ENSG00000268895	                                19	19q13.43	A1BG antisense RNA 1	ncRNA	A1BG-AS1	A1BG antisense RNA 1	O	A1BG antisense RNA (non-protein coding)|A1BG antisense RNA 1 (non-protein coding)	20180329	-
0:10.47
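Note that both searches match "A1BG" anywhere in the line, which is why A1BG-AS1 is returned as well. If only the gene whose Symbol is exactly A1BG is wanted, one option is to compare the third field in awk instead. A minimal sketch on a made-up three-line sample (sample.gz and the variable names are hypothetical):

```shell
set -eu

# Hypothetical stand-in for gene_info.gz, with only the first four columns.
workdir=$(mktemp -d)
sample="$workdir/sample.gz"
printf '9606\t1\tA1BG\t-\n9606\t503538\tA1BG-AS1\t-\n9606\t2\tA2M\t-\n' | gzip -c > "$sample"

# Substring search: counts every line containing "A1BG" (also A1BG-AS1).
substring_hits=$(gzip -dc "$sample" | grep -c 'A1BG')

# Exact match: human entries whose Symbol (field 3) is exactly "A1BG".
exact=$(gzip -dc "$sample" | awk -F'\t' '$1 == "9606" && $3 == "A1BG"')

echo "substring hits: $substring_hits"
echo "exact match:    $exact"
```

This also folds the tax_id test and the search into a single awk call, so the separate grep step is not needed.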

You can also extract certain columns (fields $2, $3, $5: GeneID | Symbol | Synonyms):

$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' | head

1 A1BG A1B|ABG|GAB|HYST2477
2 A2M A2MD|CPAMD5|FWP007|S863-7
3 A2MP1 A2MP
9 NAT1 AAC1|MNAT|NAT-1|NATI
10 NAT2 AAC2|NAT-2|PNAT
11 NATP AACP|NATP1
12 SERPINA3 AACT|ACT|GIG24|GIG25
13 AADAC CES5A1|DAC
14 AAMP -
15 AANAT DSPS|SNAT

… save those data to a file (gene_info_9606_genes):

$ gzip -dc gene_info.gz | awk -F$'\t' 'BEGIN {OFS = FS} {if ($1 == "9606") print }' | awk -F$'\t' 'BEGIN {OFS = FS} {print $2, $3, $5}' > gene_info_9606_genes

… and examine it:

$ head -n25 gene_info_9606_genes

1 A1BG A1B|ABG|GAB|HYST2477
2 A2M A2MD|CPAMD5|FWP007|S863-7
3 A2MP1 A2MP
9 NAT1 AAC1|MNAT|NAT-1|NATI
10 NAT2 AAC2|NAT-2|PNAT
11 NATP AACP|NATP1
12 SERPINA3 AACT|ACT|GIG24|GIG25
13 AADAC CES5A1|DAC
14 AAMP -
15 AANAT DSPS|SNAT
16 AARS CMT2N|EIEE29
17 AAVS1 AAV
18 ABAT GABA-AT|GABAT|NPD009
19 ABCA1 ABC-1|ABC1|CERP|HDLDT1|TGD
20 ABCA2 ABC2
21 ABCA3 ABC-C|ABC3|EST111653|LBM180|SMDP3
22 ABCB7 ABC7|ASAT|Atm1p|EST140535
23 ABCF1 ABC27|ABC50
24 ABCA4 ABC10|ABCR|ARMD2|CORD3|FFM|RMP|RP19|STGD|STGD1
25 ABL1 ABL|CHDSKM|JTK7|bcr/abl|c-ABL|c-ABL1|p150|v-abl
26 AOC1 ABP|ABP1|DAO|DAO1|KAO
27 ABL2 ABLL|ARG
28 ABO A3GALNT|A3GALT1|GTB|NAGAT
29 ABR MDB
30 ACAA1 ACAA|PTHIO|THIO
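The filter and the column selection above run as two awk processes, but they can be combined into one by attaching the print action to the tax_id pattern. A sketch on a small made-up sample (sample.gz and the output file name genes are hypothetical):

```shell
set -eu

# Hypothetical stand-in for gene_info.gz, with only the first five columns.
workdir=$(mktemp -d)
sample="$workdir/sample.gz"
printf '9\t1246500\trepA1\tpLeuDn_01\t-\n9606\t1\tA1BG\t-\tA1B|ABG|GAB|HYST2477\n9606\t14\tAAMP\t-\t-\n' | gzip -c > "$sample"

# For human entries only, print GeneID, Symbol and Synonyms.
# OFS = FS makes the output tab-delimited, like the input.
gzip -dc "$sample" | awk -F'\t' 'BEGIN {OFS = FS} $1 == "9606" {print $2, $3, $5}' > "$workdir/genes"

cat "$workdir/genes"
```

Here setting OFS does matter: print with comma-separated fields joins them with OFS, which would otherwise default to a single space.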

Notes:

* gzip
  -c --stdout --to-stdout
  -d --decompress --uncompress

* awk
  -F sepstring Define the input field separator

* wc
  print newline, word, and byte counts for each file

* wc -l
  -l, --lines  print the newline counts

Use man <term> for more information on any of those Linux commands; e.g.:

$ man gzip

GZIP(1)                 General Commands Manual                 GZIP(1)

NAME
       gzip, gunzip, zcat - compress or expand files

SYNOPSIS
       gzip [ -acdfhklLnNrtvV19 ] [-S suffix] [ name … ]
       gunzip [ -acfhklLnNrtvV ] [-S suffix] [ name … ]
       zcat [ -fhLV ] [ name … ]

DESCRIPTION
       Gzip reduces the size of the named files using Lempel-Ziv coding
       (LZ77). Whenever possible, each file is replaced by one with the
       extension .gz, while keeping the same ownership modes, access and
       modification times. (The default extension is z for MSDOS,


… Continued in Part 2 (here!)