I needed to find and extract arXiv article identifiers from a file (e.g., the “1206.5538” in “arXiv:1206.5538”, “https://arxiv.org/abs/1206.5538”, or “https://arxiv.org/pdf/1206.5538.pdf”) so that I could bulk import them into JabRef. [JabRef is a graphical Java application for managing BibTeX (.bib) databases.]

Here is my solution.

TASK: find/return specific string in file

SOLUTION:

cat arxiv_test.txt | sed -r 's/^.*arxiv.{9}//p' | egrep -o '[0-9]{4}\.[0-9]{4,5}' | sort -u

Explanation:

Uses regular expressions with the sed and egrep commands:

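sed -r use extended regular expressions (GNU sed), needed for the unescaped {9}
^.*arxiv.{9} match from the beginning of the line (^.*) through “arxiv” plus the following 9 characters (.{9}), i.e. the “.org/abs/” or “.org/pdf/” of a URL; the substitution deletes that match. The match is case-sensitive, so “arXiv:1206.5538” lines pass through unchanged and are left for egrep to handle
p print the line when the substitution succeeds; because sed is run without -n, such lines are output twice, and sort -u later removes the duplicates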
egrep same as grep -E (-E, --extended-regexp: interpret PATTERN as an extended regular expression)
(e)grep -o print only the matching part of each line (--only-matching), not the entire line as grep normally does
[0-9]{4}\.[0-9]{4,5} match 4 digits, a period, then 4 or 5 digits
sort -u sort the results, return --unique matches
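
To see the pieces above working together, here is a minimal sketch on a made-up sample file. It is a variant of my one-liner, not a replacement for it: since grep -o already isolates the identifier, the sketch leaves out the sed step and uses grep -E (the same as egrep, as noted above). The function name extract_arxiv_ids and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): print the unique
# arXiv-style identifiers found in the file given as $1.
extract_arxiv_ids() {
    # -E: extended regex; -o: print only the matched text.
    # Pattern: 4 digits, a literal period, then 4 or 5 digits.
    grep -E -o '[0-9]{4}\.[0-9]{4,5}' "$1" | sort -u
}

# Made-up sample input covering the three forms from the introduction,
# plus a second identifier with a 5-digit suffix.
sample=$(mktemp)
cat > "$sample" <<'EOF'
See arXiv:1206.5538 for background.
Abstract page: https://arxiv.org/abs/1206.5538
PDF: https://arxiv.org/pdf/1206.5538.pdf
Another paper: https://arxiv.org/abs/1512.03385
EOF

# Prints each identifier once, in sorted order.
extract_arxiv_ids "$sample"
rm -f "$sample"
```

Note that the pattern matches any text of the form NNNN.NNNN(N), so a file containing unrelated numbers in that shape (version strings, decimals) would yield false positives; in that case a stricter pattern anchored on the “arXiv:” prefix or the URL would be needed.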

TEST FILE:


EXECUTION:


REFERENCES: