TL/DR

  • I provide:

    • a BASH script, arxiv-rss.sh for daily retrieval of new, de-duplicated content from preselected arXiv groups. That content is parsed into two documents: keyword-matched articles of interest, plus the remaining articles, plus

    • an arxiv_keywords.txt file, used to grep articles-of-interest.

    • an external lookup file (sed_characters), used with sed to replace HTML code and non-translated $\small \LaTeX$ characters in titles

  • The script can be scheduled to run daily via crontab, or manually executed.

  • If you have desktop notification (e.g. Linux notify-send), then a persistent message will indicate the arrival of new content.


The Issue

Until today, I had been scannning five arXiv RSS feeds via RSS readers (most recently using the Feedbro Firefox addon):

FEED                          RESOLVES (BROWSER) TO                 TOPIC
--------------------------------------------------------------------------------------------
http://arxiv.org/rss/cs.AI    http://export.arxiv.org/rss/cs.AI     Artificial Intelligence
http://arxiv.org/rss/cs.CL    http://export.arxiv.org/rss/cs.CL     Computation and Language
http://arxiv.org/rss/cs.IR    http://export.arxiv.org/rss/cs.IR     Information Retrieval
http://arxiv.org/rss/cs.LG    http://export.arxiv.org/rss/cs.LG     Machine Learning
http://arxiv.org/rss/stat.ML  http://export.arxiv.org/rss/stat.ML   Machine Learning

feedbro-1.png

However, this is problematic for two reasons:  

  • the shear volume of new content
  • lack of de-duplication by RSS readers of arXiv content

For example, note the duplications here (Jun 10, 2019 RSS snapshot):

feedbro-2.png

BAH!!  

This evening (Jun 10, 2019) there were 344 new articles (Feedbro indicated 345, but per my queries below there were actually 344 unique new articles).

However, including the duplicate, cross-posted content there are actually 690 entries that you need to scan through in Feedbro (or other RSS readers) – as indicated in the screenshot immediately above and counted as follows.

## I copied the list from the browser to a file ("new"), and counted the unique entries:

[victoria@victoria ~]$ date
  Mon 10 Jun 2019 07:19:19 PM PDT

## LOTS of duplicated in raw content:

[victoria@victoria ~]$ rg new -i -e 'arxiv' | sort | head

  5 Parallel Prism: A topology for pipelined implementations of convolutional neural networks using computational memory. (arXiv:1906.03474v1 [cs.LG])
  A Closed-Form Learned Pooling for Deep Classification Networks. (arXiv:1906.03808v1 [cs.LG])
  A cost-reducing partial labeling estimator in text classification problem. (arXiv:1906.03768v1 [stat.ML])
  A cost-reducing partial labeling estimator in text classification problem. (arXiv:1906.03768v1 [stat.ML])
  A cost-reducing partial labeling estimator in text classification problem. (arXiv:1906.03768v1 [stat.ML])
  Adaptive Nonparametric Variational Autoencoder. (arXiv:1906.03288v1 [stat.ML])
  Adaptive Nonparametric Variational Autoencoder. (arXiv:1906.03288v1 [stat.ML])
  Adversarial Examples for Non-Parametric Methods: Attacks, Defenses and Large Sample Limits. (arXiv:1906.03310v1 [cs.LG])
  Adversarial Examples for Non-Parametric Methods: Attacks, Defenses and Large Sample Limits. (arXiv:1906.03310v1 [cs.LG])
  Adversarial Mahalanobis Distance-based Attentive Song Recommender for Automatic Playlist Continuation. (arXiv:1906.03450v1 [cs.IR])

[victoria@victoria ~]$ rg new -i -e 'arxiv' | sort | tail

  Understanding overfitting peaks in generalization error: Analytical risk curves for $l_2$ and $l_1$ penalized interpolation. (arXiv:1906.03667v1 [cs.LG])
  Using learned optimizers to make models robust to input noise. (arXiv:1906.03367v1 [cs.LG])
  Using learned optimizers to make models robust to input noise. (arXiv:1906.03367v1 [cs.LG])
  Variance Reduction in Gradient Exploration for Online Learning to Rank. (arXiv:1906.03766v1 [cs.IR])
  VideoFlow: A Flow-Based Generative Model for Video. (arXiv:1903.01434v2 [cs.CV] UPDATED)
  Watch, Try, Learn: Meta-Learning from Demonstrations and Reward. (arXiv:1906.03352v1 [cs.LG])
  Watch, Try, Learn: Meta-Learning from Demonstrations and Reward. (arXiv:1906.03352v1 [cs.LG])
  Watch, Try, Learn: Meta-Learning from Demonstrations and Reward. (arXiv:1906.03352v1 [cs.LG])
  What can AI do for me: Evaluating Machine Learning Interpretations in Cooperative Play. (arXiv:1810.09648v3 [cs.AI] UPDATED)
  WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking. (arXiv:1805.03947v2 [cs.IR] UPDATED)

[victoria@victoria ~]$ rg new -i -e 'arxiv' | wc -l
  690


The Solution

To address this issue, I wrote a BASH script, arxiv-rss.sh [plus associated plain-text keywords file: arxiv_keywords.txt] for daily retrieval of new, de-duplicated content from preselected arXiv groups. That content is parsed into two documents:

  • keyword-matched articles of interest;
  • the remaining articles.

I also filter / replace annoyances (HTML code; non-translated accented characters) in titles via sed plus an external lookup file (sed_characters).

The script can be scheduled to run daily via crontab, or manually executed.

I schedule my script to run daily at 7:30 am (PST). If there is new content, a persistent desktop notification appears, with a local link to the processed content (automatically opens in my Krusader file manager):

    arXiv-RSS-notify-send2.png

arxiv-rss-Krusader.png

I then open those txt files in Neovim, where I can scan them and view articles of interest by moving my cursor into the URL and pressing gx, which opens the URL in Firefox.

In today’s (Jun 10, 2019) download there were 344 new articles among my five arXiv RSS feeds, as indicated in the Feedbro RSS reader (345 – nonetheless displaying 690 articles), and in the line counts of my results files.

[victoria@victoria arxiv]$ date; pwd; ls -l
  Mon 10 Jun 2019 07:13:01 PM PDT
  /mnt/Vancouver/tmp/arxiv
  total 40
  -rw-r--r-- 1 victoria victoria 12029 Jun 10 19:06 arxiv-filtered
  -rw-r--r-- 1 victoria victoria 24561 Jun 10 19:06 arxiv-others
  drwxr-xr-x 2 victoria victoria  4096 Jun 10 13:52 old

[victoria@victoria arxiv]$ cat arxiv-filtered | wc -l
  107

[victoria@victoria arxiv]$ cat arxiv-others | wc -l
  237

[victoria@victoria arxiv]$ cat arxiv* | wc -l
  344

## Spot check for duplicates (there are none):

[victoria@victoria arxiv]$ rg . -i -e 'assessing incrementality'
  ./arxiv-others
  26:Assessing incrementality in sequence-to-sequence models. https://arxiv.org/pdf/1906.03293

[victoria@victoria arxiv]$ rg . -i -e 'latent feature'
  ./arxiv-others
  108:Latent feature disentanglement for 3D meshes. https://arxiv.org/pdf/1906.03281

[victoria@victoria arxiv]$ rg . -i -e 'reliable training'
  ./arxiv-others
  170:Reliable training and estimation of variance networks. https://arxiv.org/pdf/1906.03260

I am thus easily (literally, minutes  ) able to scan those files for new articles of interest, at the time reducing eye strain and lessening the probability of missing articles due to informational fatigue as I scan through those lists.

If this sounds useful, download my arxiv-rss.sh script. You’ll need to make a few edits:

  • local paths and preferred arXiv RSS feeds;
  • make it executable (chmod 755);
  • add it to crontab using your favorite text editor (e.g. sudo nvim /etc/crontab).

Also grab my arxiv_keywords.txt and sed_characters files, and edit those per your specifications.

Enjoy! 


    StackoverDoh.png