[Linux] Find, Extract String (arXiv Numbers) from File
I needed to find and grab arXiv article identifiers from a file (e.g., the “1206.5538” in “arXiv:1206.5538” or “https://arxiv.org/abs/1206.5538” or “https://arxiv.org/pdf/1206.5538.pdf”), so that I could bulk import them into JabRef. [JabRef is a graphical Java application for managing BibTeX (.bib) databases.]
Here is my solution.
TASK: find/return specific string in file
SOLUTION:
cat arxiv_test.txt | sed -r 's/^.*arxiv.{9}//p' | egrep -o '[0-9]{4}\.[0-9]{4,5}' | sort -u
Explanation:
Uses regular expressions with the sed and egrep commands:
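^.*arxiv.{9} | match from the beginning of the line (^.*) through “arxiv” plus the following 9 characters (.{9}), e.g. “.org/abs/”; the substitution deletes this match. The match is case-sensitive, so lines containing only “arXiv:” pass through unchanged
p | print the line after a successful substitution (without sed -n the line is also printed by default, so it appears twice; sort -u removes the duplicates)
egrep | same as grep -E (-E, --extended-regexp: interpret PATTERN as an extended regular expression)
(e)grep -o | print only the matching part of the line (--only-matching), not the entire line as grep normally does
[0-9]{4}\.[0-9]{4,5} | match 4 digits, a period, then 4 or 5 digits
sort -u | sort the results and keep only unique lines (--unique)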
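To see the pieces working together, here is a minimal sketch that runs the pipeline on a small sample file. The sample file and its contents are made up for illustration (the three reference styles mentioned above, plus one invented entry with a 5-digit suffix), and it assumes a sed and grep that accept -E, which is equivalent to sed -r and egrep. It also runs a variant without the sed step, since grep -o already isolates the identifiers on this input.

```shell
#!/bin/sh
# Hypothetical sample input: the three reference styles from the post,
# plus one invented entry with a 5-digit suffix.
sample=$(mktemp)
cat > "$sample" <<'EOF'
See arXiv:1206.5538 for details.
https://arxiv.org/abs/1206.5538
https://arxiv.org/pdf/1206.5538.pdf
https://arxiv.org/abs/1501.00001
EOF

# Pipeline from the post (sed -E / grep -E stand in for sed -r / egrep).
ids_full=$(sed -E 's/^.*arxiv.{9}//p' "$sample" \
  | grep -E -o '[0-9]{4}\.[0-9]{4,5}' \
  | sort -u)

# Variant without the sed step: grep -o alone extracts the identifiers.
ids_grep=$(grep -E -o '[0-9]{4}\.[0-9]{4,5}' "$sample" | sort -u)

# Print one identifier per line.
printf '%s\n' "$ids_full"

rm -f "$sample"
```

On this sample both variants print 1206.5538 and 1501.00001, one per line. The sed step mainly trims the URL prefix; the identifier pattern and sort -u do the actual extraction and de-duplication.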
TEST FILE:
EXECUTION:
REFERENCES: