# Victoria's Data Science Notes

## Persagen.com

This file:  Persagen.com/files/datasci.html

## ALGORITHMS

[Algorithms] Blogs:

• The 280-Year-Old Algorithm Inside Google Trips [Google Research Blog: Sep 2016]  |  local copy [pdf]
• Algorithms Engineering is a lot of fun because algorithms do not go out of fashion: one never knows when an oldie-but-goodie might come in handy. Case in point: Yesterday, Google announced Google Trips, a new app to assist you in your travels by helping you create your own "perfect day" in a city. Surprisingly, deep inside Google Trips, there is an algorithm that was invented 280 years ago.

In 1736, Leonhard Euler authored a brief but beautiful mathematical paper [pdf;   local copy (pdf)] regarding the town of Königsberg and its 7 bridges, shown here: [ ... SNIP! ... ]
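Euler's criterion is simple to state and check in code: a connected multigraph admits an Eulerian walk iff it has zero or two odd-degree vertices. A minimal sketch (labelling the four land masses A-D is my own illustration):

```python
from collections import Counter

def odd_degree_vertices(edges):
    """Count edge endpoints; return the vertices of odd degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [v for v, d in degree.items() if d % 2 == 1]

def has_eulerian_walk(edges):
    """A connected multigraph has an Eulerian walk iff 0 or 2 vertices have odd degree."""
    return len(odd_degree_vertices(edges)) in (0, 2)

# Koenigsberg: four land masses (A, B, C, D) joined by seven bridges.
koenigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]
```

All four land masses have odd degree, so `has_eulerian_walk(koenigsberg)` is `False` - Euler's original conclusion.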

• Top Algorithms Used by Data Scientists KDNuggets.com (Sep 2016)

## DATA-RELATED

### [Data-Related] Data Munging

[Data-Related - Data Munging] Blogs:

• Approaching (Almost) Any Machine Learning Problem

• An average data scientist deals with loads of data daily. Some say over 60-70% of their time is spent in data cleaning, munging and bringing data to a suitable format so that machine learning models can be applied to it. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post come as a result of over a hundred machine learning competitions that I've taken part in. It must be noted that the discussion here is very general but very useful, and there can also be very complicated methods which exist and are practised by professionals. ...

• Victoria: great article [local copy (pdf)]! Touches on data preparation but mainly/most significantly a rational approach to designing, applying and optimizing ML algorithms.

• Excellent!  How to Rank 10% in Your First Kaggle Competition  |  pdf

• Preprocessing Data, a Python Workflow Part 1  |  reddit
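As a rough illustration of the munging step these posts describe - hand-rolled, not taken from any of them - standardizing numeric columns and integer-encoding categoricals might look like:

```python
def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn per-column standardization stats and categorical label maps."""
    stats = {}
    for c in numeric_cols:
        vals = [r[c] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[c] = (mean, var ** 0.5 or 1.0)  # guard against zero std
    labels = {c: {v: i for i, v in enumerate(sorted({r[c] for r in rows}))}
              for c in categorical_cols}
    return stats, labels

def transform(rows, stats, labels):
    """Standardize numerics, integer-encode categoricals (unseen value -> -1)."""
    out = []
    for r in rows:
        vec = [(r[c] - m) / s for c, (m, s) in stats.items()]
        vec += [labels[c].get(r[c], -1) for c in labels]
        out.append(vec)
    return out
```

Real pipelines (sklearn transformers, etc.) add imputation, rare-category handling and train/test separation, but the fit-then-transform split shown here is the essential discipline.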

## DATABASE-RELATED

[Database-Related] Query

• Kueri: NL interface for DB; free for commercial use  |  demo | my sample query:

## IMAGES - FFT

[Images - FFT] Blogs:

[Images - FFT] Papers:

• Kipf TN (2016) Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907

• We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

• More here  [large file; opens in new tab]; e.g.:

• How powerful are Graph Convolutions? (review of Kipf & Welling, 2016)

• ... This post is mainly a review of (Kipf and Welling, 2016). ...

Summary of this post:

• Graph convolutions are a generalisation of convolutions, and easiest to define in the spectral domain
• General Fourier transform scales poorly with size of data so we need relaxations
• (Kipf and Welling) use first order approximation in Fourier-domain to obtain an efficient linear-time graph-CNNs
• I illustrate here what this first-order approximation amounts to on a 2D lattice one would normally use for image processing, where actual spatial convolutions are easy to compute
• In this application the modelling power of the proposed graph convolutional networks is severely impoverished, due to the first-order and other approximations made.
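The layer rule Kipf & Welling arrive at is H' = relu(D^-1/2 (A + I) D^-1/2 H W), where A is the adjacency matrix and D the degree matrix of A + I. A toy pure-Python sketch (the graph and weights below are made up for illustration):

```python
import math

def gcn_layer(A, H, W):
    """One Kipf-Welling GCN layer: relu(D^-1/2 (A+I) D^-1/2 H W)."""
    n = len(A)
    # add self-loops
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_hat]
    # symmetric normalization
    A_norm = [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
              for i in range(n)]
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]
```

On a two-node path graph with features [[1.0], [0.0]] and identity-like weight [[1.0]], the layer averages the two nodes' features, showing the "smoothing over neighbours" behaviour the review discusses.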

## INFORMATION EXTRACTION

[IE] Papers:

• [Looks very promising!]  Singh M (2016) OCR++: A Robust Framework For Information Extraction from Scholarly Articles. arXiv:1609.06423

• This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text, table and figure headings, URLs and footnotes) and bibliography (citation instances and references). We analyze a diverse set of scientific articles written in English language to understand generic writing patterns and formulate rules to develop this hybrid framework. Extensive evaluations show that the proposed framework outperforms the existing state-of-the-art tools by a huge margin in structural information extraction, along with improved performance in metadata and bibliography extraction tasks, both in terms of accuracy (around 50% improvement) and processing time (around 52% improvement). A user experience study conducted with the help of 30 researchers reveals that the researchers found this system to be very helpful. As an additional objective, we discuss two novel use cases including automatically extracting links to public datasets from the proceedings, which would further accelerate the advancement in digital libraries. The result of the framework can be exported as a whole into structured TEI-encoded documents. Our framework is accessible online at OCR++.
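To get a feel for the rule-based style OCR++ formalizes - the patterns below are my own toy examples, not the framework's actual rules - a sketch:

```python
import re

def extract_metadata(text):
    """Toy rule-based extraction: e-mails by pattern, section headings by a
    numbered-line convention. OCR++ combines many such hand-crafted rules."""
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    headings = re.findall(r"^\d+\.?\s+[A-Z][^\n]*$", text, flags=re.M)
    return {"emails": emails, "headings": headings}
```

Usage: `extract_metadata("1 Introduction\nContact: jane@example.org\n2 Methods\n")` pulls out the e-mail address and both numbered headings; the real framework layers dozens of such rules over cleaned OCR output.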

## MEMORY-RELATED

[Memory] Papers

• Kiriansky V [MIT CSAIL] (2016) Optimizing Indirect Memory References with milk. pdf

• Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the Milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.

• Faster parallel computing: [MIT News Office: Sep 2016] New programming language delivers fourfold speedups on problems common in the age of big data. ... Researchers have designed a new programming language that lets application developers manage memory more efficiently in programs that deal with scattered data points in large data sets. In tests on several common algorithms, programs written in the new language were four times as fast as those written in existing languages. ...
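The transformation Milk performs - turning scattered indirect references into batches of sequential DRAM accesses - can be mimicked in miniature by sorting the index stream before the gather (a toy model only; the real compiler does this with optimized intermediate data structures and OpenMP annotations):

```python
def gather_random(data, indices):
    """Baseline: indirect references in arrival order (cache-hostile at scale)."""
    return sum(data[i] for i in indices)

def gather_milk_style(data, indices):
    """Milk-style idea: batch and reorder the indices so the pass over memory
    proceeds in address order, trading a sort for sequential DRAM accesses."""
    total = 0
    for i in sorted(indices):
        total += data[i]
    return total
```

Both produce the same result; the payoff of the reordering only shows up when `data` is far larger than cache, which is exactly the regime the paper targets.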

## MISCELLANEOUS

• Aitchison L (2016) Zipf's Law Arises Naturally When There Are Underlying, Unobserved Variables. journal  |  pdf

• Abstract. Zipf's law, which states that the probability of an observation is inversely proportional to its rank, has been observed in many domains. While there are models that explain Zipf's law in each of them, those explanations are typically domain specific. Recently, methods from statistical physics were used to show that a fairly broad class of models does provide a general explanation of Zipf's law. This explanation rests on the observation that real world data is often generated from underlying causes, known as latent variables. Those latent variables mix together multiple models that do not obey Zipf's law, giving a model that does. Here we extend that work both theoretically and empirically. Theoretically, we provide a far simpler and more intuitive explanation of Zipf's law, which at the same time considerably extends the class of models to which this explanation can apply. Furthermore, we also give methods for verifying whether this explanation applies to a particular dataset. Empirically, these advances allowed us to extend this explanation to important classes of data, including word frequencies (the first domain in which Zipf's law was discovered), data with variable sequence length, and multi-neuron spiking activity.

• Author Summary. Datasets ranging from word frequencies to neural activity all have a seemingly unusual property, known as Zipf's law: when observations (e.g., words) are ranked from most to least frequent, the frequency of an observation is inversely proportional to its rank. Here we demonstrate that a single, general principle underlies Zipf's law in a wide variety of domains, by showing that models in which there is a latent, or hidden, variable controlling the observations can, and sometimes must, give rise to Zipf's law. We illustrate this mechanism in three domains: word frequency, data with variable sequence length, and neural data.

• Wikipedia:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
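The Brown Corpus counts quoted above can be checked directly against the idealized prediction f(r) = f(1)/r; a quick sketch:

```python
def zipf_frequency(rank, top_count):
    """Predicted count at a given rank if frequency is inversely
    proportional to rank: f(r) = f(1) / r."""
    return top_count / rank

# Counts quoted above for the Brown Corpus.
counts = {"the": 69971, "of": 36411, "and": 28852}
predicted_of = zipf_frequency(2, counts["the"])    # ~34986 vs. observed 36411
predicted_and = zipf_frequency(3, counts["the"])   # ~23324 vs. observed 28852
```

The rank-2 prediction lands within about 4% of the observed count; rank 3 is rougher (~19% off), a reminder that Zipf's law is an asymptotic regularity, not an exact formula for the head of the distribution.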

• Benford's Law and Zipf's Law

• Wilson G (2016) Good Enough Practices in Scientific Computing. arXiv:1609.00037  |  pdf

• We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing.

• Published 2017: journal  |  pdf  [local copy]

• See also:  Best Practices for Scientific Computing  [journal; pdf (local copy)]

## NETWORKS, NETWORK ANALYSES

[Networks] Papers:

• Multi-scale structure and topological anomaly detection via a new network statistic: The onion decomposition [pdf]

• We introduce a network statistic that measures structural properties at the micro-, meso-, and macroscopic scales, while still being easy to compute and interpretable at a glance. Our statistic, the onion spectrum, is based on the onion decomposition, which refines the k-core decomposition, a standard network fingerprinting method. The onion spectrum is exactly as easy to compute as the k-cores: It is based on the stages at which each vertex gets removed from a graph in the standard algorithm for computing the k-cores. Yet, the onion spectrum reveals much more information about a network, and at multiple scales; for example, it can be used to quantify node heterogeneity, degree correlations, centrality, and tree- or lattice-likeness. Furthermore, unlike the k-core decomposition, the combined degree-onion spectrum immediately gives a clear local picture of the network around each node which allows the detection of interesting subgraphs whose topological structure differs from the global network organization. This local description can also be leveraged to easily generate samples from the ensemble of networks with a given joint degree-onion distribution. We demonstrate the utility of the onion spectrum for understanding both static and dynamic properties on several standard graph models and on many real-world networks.

• Blog post: Powerful new metric quickly reveals network structure at multiple scales
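My reading of the peeling procedure, as a sketch: each pass removes every vertex whose current degree does not exceed the minimum degree of the remaining graph, and the pass number on which a vertex falls is its onion layer (a refinement of its k-core number). This is one way to realize the bookkeeping the paper describes, not the authors' code:

```python
def onion_layers(adj):
    """Peel the graph pass by pass; adj maps each vertex to its neighbour set.
    Returns {vertex: layer}, where the layer is the pass on which it was removed."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    layer, layers = 0, {}
    while remaining:
        layer += 1
        k = min(len(nbrs) for nbrs in remaining.values())
        batch = [v for v, nbrs in remaining.items() if len(nbrs) <= k]
        for v in batch:
            layers[v] = layer
            for u in remaining.pop(v):
                if u in remaining:
                    remaining[u].discard(v)
    return layers
```

On a triangle with one pendant vertex, the pendant falls in layer 1 and the triangle in layer 2 - the pendant is "shallower" even though a plain degree count would barely distinguish them.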

## QUOTES

• "Machine learning is like high school sex: Everyone says they do it, nobody really does, and no one knows what it actually is." -- @Mikettownsend

• Nobel Laureate [1991: Economics] Ronald Harry Coase:

• "If you torture the data long enough, it will confess to anything."
• Alternative:  "If you torture the data long enough, it will confess." Cited in: Gordon Tullock, "A Comment on Daniel Klein's 'A Plea to Economists Who Favor Liberty'", Eastern Economic Journal, Spring 2001. [Source]

## REFERENCE

• Dictionary of Algorithms and Data Structures [NIST] This is a dictionary of algorithms, algorithmic techniques, data structures, archetypal problems, and related definitions. Algorithms include common functions, such as Ackermann's function. Problems include traveling salesman and Byzantine generals. Some entries have links to implementations and more information. Index pages list entries by area and by type.

## SIGNALS ANALYSIS

• Davis A [MIT CSAIL] (SIGGRAPH Asia 2015) Image-space modal bases for plausible manipulation of objects in video. pdf

• Davis A [MIT CSAIL | Microsoft Research | Adobe] (SIGGRAPH 2014) The visual microphone: passive recovery of sound from video. pdf

• When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects - a glass of water, a potted plant, a box of tissues, or a bag of chips - into visual microphones. We recover sounds from high-speed footage of a variety of objects with different properties, and use both real and simulated data to examine some of the factors that affect our ability to visually recover sound. We evaluate the quality of recovered sounds using intelligibility and SNR metrics and provide input and recovered audio samples for direct comparison. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object.

• Project webpage:  The Visual Microphone: Passive Recovery of Sound from Video [SIGGRAPH 2014]

• Amazing!  Hearing by seeing: "visual mic" uses potato-chip bag to recover sound from video footage. Researchers at MIT CSAIL, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.  |  YouTube

• Zhao M [MIT CSAIL] (2016) Emotion Recognition using Wireless Signals. pdf  |  reddit

• This paper demonstrates a new technology that can infer a person's emotions from RF signals reflected off his body. EQ-Radio transmits an RF signal and analyzes its reflections off a person's body to recognize his emotional state (happy, sad, etc.). The key enabler underlying EQ-Radio is a new algorithm for extracting the individual heartbeats from the wireless signal at an accuracy comparable to on-body ECG monitors. The resulting beats are then used to compute emotion-dependent features which feed a machine-learning emotion classifier. We describe the design and implementation of EQ-Radio, and demonstrate through a user study that its emotion recognition accuracy is on par with state-of-the-art emotion recognition systems that require a person to be hooked to an ECG monitor.

• Conclusion.

This paper presents a technology capable of recognizing a person's emotions by relying on wireless signals reflected off her/his body. We believe this marks an important step in the nascent field of emotion recognition. It also builds on a growing interest in the wireless systems' community in using RF signals for sensing, and as such, the work expands the scope of RF sensing to the domain of emotion recognition. Further, while this work has laid foundations for wireless emotion recognition, we envision that the accuracy of such systems will improve as wireless sensing technologies evolve and as the community incorporates more advanced machine learning mechanisms in the sensing process.

We also believe that the implications of this work extend beyond emotion recognition. Specifically, while we used the heartbeat extraction algorithm for determining the beat-to-beat intervals and exploited these intervals for emotion recognition, our algorithm recovers the entire human heartbeat from RF, and the heartbeat displays a very rich morphology. We envision that this result paves way for exciting research on understanding the morphology of the heartbeat both in the context of emotion-recognition as well as in the context of non-invasive health monitoring and diagnosis.

• New technology promises to wirelessly read human emotion: MIT's EQ-Radio has applications for mental health and advertising, but raises privacy concerns [CBC News: Sep 20, 2016]

• "Related" [see A1]:  How does Google know where I am?

• Google uses BSSID information from your WLAN Access Point to get an approximation of where you are located, even with GPS and WiFi turned off. Taken from here:

Google and others like Apple and Skyhook build a database which links WLAN BSSIDs to a geographic location. A BSSID is like the MAC address of an access point that gets broadcasted by that access point. It is therefore "publicly viewable" if the BSSID broadcast is enabled, which is the default for most access points. The BSSID operates on a lower layer than the IP stack; you don't even have to be connected to an access point to receive these broadcasts.

So, essentially, when you ARE using WiFi and GPS, Google's database of BSSIDs is updated with a geographic location associated with that BSSID, as you assumed. In your case, your AP is sending beacons advertising its BSSID, and because it is already in Google's database, Google Maps knows where you are based on the location of that AP.

So it's not that the ISP is giving Google the location of their routers; it's that your phone is helping to build a database of the access points around you, and Google uses this data for geolocation.

Sadly, even if you get a new router and keep any and all Android devices away from it, they will still be able to approximate your location based on the cell towers you associate with (or maybe even your neighbor's AP!), but it won't be nearly as accurate.

## UTILITIES

• Performance of Python vs Pandas vs Numpy  [web]

• tsfresh [GitHub] is a Python package that is used to automatically calculate a huge number of time series characteristics, the so-called features. Further, the package contains methods to evaluate the explanatory power and importance of such characteristics for regression or classification tasks.  More here: tsfresh.readthedocs.io
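Not the tsfresh API itself, but a hand-rolled taste of the kind of per-series characteristics it automates (tsfresh computes hundreds of these per series):

```python
def basic_features(series):
    """A few classic per-series features: mean, standard deviation,
    and lag-1 autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        lag1 = 0.0
    else:
        lag1 = sum((series[i] - mean) * (series[i + 1] - mean)
                   for i in range(n - 1)) / (n * var)
    return {"mean": mean, "std": var ** 0.5, "autocorr_lag1": lag1}
```

tsfresh's value is less in any single feature than in computing them all, then filtering by statistical relevance to the prediction target.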

## VISUALIZATION

[Visualization] Blogs:

• 5 Steps for Advanced Data Analysis using Visualization [KDNuggets.com: Oct 2016]

• A Dramatic Tour through Python's Data Visualization Landscape (including ggplot and Altair)  |  excellent: thorough, comprehensive review [local copy]  |  excerpt:
• The Options (in ~Descending Order of Subjective Complexity).

First, let's welcome our friends [footnote 2: Right off the bat, you're mad at me, so allow me to explain: I love bokeh and plotly. Indeed, one of my favorite things to do before sending out an analysis is getting "free interactivity" by passing my figures to the relevant bokeh/plotly functions; however, I'm not familiar enough with either to do anything more sophisticated. (And let's be honest - this post is long enough.) Obviously, if you're in the market for interactive visualizations (versus statistical visualizations), then you should probably look to them.]

matplotlib. The 800-pound gorilla - and like most 800-pound gorillas, this one should probably be avoided unless you genuinely need its power, e.g., to make a custom plot or produce a publication-ready graphic. (As we'll see, when it comes to statistical visualization, the preferred tack might be: "do as much as you easily can in your convenience layer of choice [i.e., any of the next four libraries], and then use matplotlib for the rest.")

pandas. "Come for the DataFrames; stay for the plotting convenience functions that are arguably more pleasant than the matplotlib code they supplant." - rejected pandas taglines (Bonus tidbit: the pandas team must include a few visualization nerds, as the library includes things like RadViz plots and Andrews Curves that I haven't seen elsewhere.)

Seaborn. Seaborn has long been my go-to library for statistical visualization; it summarizes itself thusly: "If matplotlib 'tries to make easy things easy and hard things possible,' seaborn tries to make a well-defined set of hard things easy too"

yhat's ggplot. A Python implementation of the wonderfully declarative grammar of graphics. This isn't a "feature-for-feature port of ggplot2," but there's strong feature overlap. (And speaking as a part-time R user, the main geoms seem to be in place.)

Altair. The new guy, Altair is a "declarative statistical visualization library" with an exceedingly pleasant API. Wonderful. Now that our guests have arrived and checked their coats, let's settle in for our very awkward dinner conversation. Our show is entitled

[ ... SNIP! ... ]

• An illustrated introduction to the t-SNE algorithm: This post is an introduction to a popular dimensionality reduction algorithm: t-distributed stochastic neighbor embedding (t-SNE). [OReilly.com]
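The input side of t-SNE rests on conditional similarities p(j|i) proportional to exp(-||x_i - x_j||^2 / 2*sigma_i^2), normalized over j != i. A minimal sketch with a fixed sigma (the full algorithm searches sigma_i per point to hit a target perplexity):

```python
import math

def conditional_p(points, sigma=1.0):
    """p[i][j]: probability that point i would pick j as its neighbour under
    a Gaussian kernel centred at i. Self-similarity is set to zero."""
    n = len(points)
    p = []
    for i in range(n):
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2 * sigma ** 2)) if j != i else 0.0
             for j in range(n)]
        z = sum(w)
        p.append([x / z for x in w])
    return p
```

Each row is a probability distribution, and nearby points dominate it; the embedding step then tries to reproduce these neighbour distributions in low dimension with a heavier-tailed (Student-t) kernel.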

• Hitchhiker’s Guide to d3.js

• Embedding Projector [Google | TensorFlow]  |  Open sourcing the Embedding Projector: a tool for visualizing high dimensional data  [local copy]  |  reddit

• How to plot a confidence heatmap in Tensorflow to show which features of the final image was important? [reddit]

• I am currently using the Tensorflow ImageNet model, and I want to see which parts of my final test image were most helpful. Is there any way to plot this as a heatmap?

• You could take the absolute gradient of the maximum softmax input wrt the input image; this will typically highlight some areas of interest. You could try the gradient of the softmax output as well, but keep in mind that the final softmax activations are not independent.

• How do I go about doing this? Are there any examples for that already?

• You may want to check arXiv:1312.6034

• reddit comment:

• In a NN, similar images have similar activations in the upper hidden units. (... You have to choose which hidden layer you will use based on your goal. ...) The lower layers will give you images that look similar pixel-wise, and the higher layers will give you images that look semantically similar.
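For a plain linear scorer the suggested gradient saliency has a closed form - the gradient of the top logit with respect to the input is just the winning class's weight row - which makes the idea easy to see in a toy (a real ConvNet needs autodiff/backprop for the same quantity):

```python
def linear_saliency(W, x):
    """Saliency = |d(max logit)/d(input)|. For logits = W @ x the gradient of
    logit c w.r.t. x is row W[c], so the map is the absolute winning row."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    top = max(range(len(logits)), key=logits.__getitem__)
    return [abs(w) for w in W[top]]
```

The heatmap then simply reshapes this per-input-dimension magnitude back to image shape; arXiv:1312.6034 (cited above) develops the ConvNet version of exactly this computation.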

• Awesome Interactive Journalism [GitHub]

• How to Use t-SNE Effectively: Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. By exploring how it behaves in simple cases, we can learn to use it more effectively.  |  reddit

• Andrej Karpathy implemented a very cool t-SNE graph here! CVPR 2015 papers: This year I also embedded the (1,2-gram) TF-IDF vectors of all papers with t-SNE and placed them in an interface where you can navigate them visually. I'm not sure if it's useful but it's really cool.  |  GitHub

• Step-by-step Artificial Neural Network Visualizer  |  reddit  |  related blog post (Python 'generators')

• various visualizations

• Visualizing CNN architectures side by side with mxnet

• wevi: word embedding visual inspector  |  arXiv  |  Xin Rong  |  word2vec Parameter Learning Explained

• World Births and Deaths, Simulated in Real-Time

• Victoria: I did a crude global death rate estimate that was pretty close (above: ~1.9 deaths/sec)!

[6 x 10^9 people] / [70 yrs (lifespan)] / [365.25 d/yr] / [24 hr/d] / [60 min/hr] / [60 sec/min] = 2.716 deaths/sec
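The same back-of-envelope arithmetic in code:

```python
# 6 billion people with a 70-year lifespan, converted to deaths per second
people = 6e9
lifespan_years = 70
seconds_per_year = 365.25 * 24 * 60 * 60   # 31,557,600
deaths_per_sec = people / (lifespan_years * seconds_per_year)   # ~2.716
```

The estimate lands a bit above the simulation's ~1.9/sec, mostly because a steady-state 70-year-lifespan model ignores population growth (today's deaths come from smaller past cohorts).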

[Visualization] Code:

[Visualization] Commercial:

[Visualization] Papers:

• Duarte E (2016) Living Globe: Tridimensional interactive visualization of world demographic data. arXiv:1607.05946  |  GitHub  |  GitXiv
• This paper presents Living Globe, an application for visualization of demographic data supporting the temporal comparison of data from several countries represented on a 3D globe. Living Globe allows the visual exploration of the following demographic data: total population, population density and growth, crude birth and death rates, life expectancy, net migration and population percentage of different age groups. While offering inexperienced users a default mapping of these data variables into visual variables, Living Globe allows more advanced users to select the mapping, increasing its flexibility. The main aspects of the Living Globe model and prototype are described as well as the evaluation results obtained using heuristic evaluation and usability testing. Some conclusions and ideas for future work are also presented.

• Lu Y (2016) Doubly Stochastic Neighbor Embedding on Spheres. arXiv:1609.01977  |  GitHub  |  GitXiv  |  DOSNES: Doubly Stochastic Neighbor Embedding on Spheres [reddit]  |  blog:post  |  demo
• Recently, Stochastic Neighbor Embedding (SNE) methods have widely been applied in data visualization. These methods minimize the divergence between the pairwise similarities of high- and low-dimensional data. Despite their popularity, the current SNE methods experience the "crowding problem" when the data include highly imbalanced similarities. This implies that the data points with higher total similarity tend to get crowded around the display center. To solve this problem, we normalize the similarity matrix to be doubly stochastic such that all the data points have equal total similarities. A fast normalization method is proposed. Furthermore, we show empirically and theoretically that the doubly stochasticity constraint often leads to approximately spherical embeddings. This suggests replacing a flat space with spheres as the embedding space. The spherical embedding eliminates the discrepancy between the center and the periphery in visualization and thus resolves the "crowding problem". We compared the proposed method with the state-of-the-art SNE method on three real-world datasets. The results indicate that our method is more favorable in terms of visualization quality.
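The doubly stochastic constraint at the core of DOSNES can be approximated with generic Sinkhorn-style alternating row/column scaling (a sketch of the constraint only; the paper proposes its own fast normalization method):

```python
def sinkhorn(S, iters=200):
    """Alternately normalize rows and columns of a positive similarity matrix
    until every row and column sums to (approximately) 1."""
    n = len(S)
    M = [row[:] for row in S]
    for _ in range(iters):
        for i in range(n):                           # row pass
            s = sum(M[i])
            M[i] = [x / s for x in M[i]]
        for j in range(n):                           # column pass
            s = sum(M[i][j] for i in range(n))
            for i in range(n):
                M[i][j] /= s
    return M
```

After normalization every point has equal total similarity, which is exactly what removes the "crowding" of high-degree points around the display center.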

• Matejka J [Autodesk Research, Toronto, Ontario] (2017) Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. pdf  |  blog  [local copy]

• Datasets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique varies from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and allows for control over the graphical appearance of resulting output.



• Simonyan (2013) Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034  |  reddit
• This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].

• van der Maaten L (AISTATS 2009) [parametric t-SNE] Learning a parametric embedding by preserving local structure. pdf  |  reddit
• The paper presents a new unsupervised dimensionality reduction technique, called parametric t-SNE, that learns a parametric mapping between the high-dimensional data space and the low-dimensional latent space. Parametric t-SNE learns the parametric mapping in such a way that the local structure of the data is preserved as well as possible in the latent space. We evaluate the performance of parametric t-SNE in experiments on three datasets, in which we compare it to the performance of two other unsupervised parametric dimensionality reduction techniques. The results of experiments illustrate the strong performance of parametric t-SNE, in particular, in learning settings in which the dimensionality of the latent space is relatively low.

• van der Maaten L (2013) Barnes-Hut-SNE. arXiv:1301.3342  |  GitXiv
• The paper presents an $\small O(N \log N)$-implementation of t-SNE -- an embedding technique that is commonly used for the visualization of high-dimensional data in scatter plots and that normally runs in $\small O(N^2)$. The new implementation uses vantage-point trees to compute sparse pairwise similarities between the input data objects, and it uses a variant of the Barnes-Hut algorithm - an algorithm used by astronomers to perform N-body simulations - to approximate the forces between the corresponding points in the embedding. Our experiments show that the new algorithm, called Barnes-Hut-SNE, leads to substantial computational advantages over standard t-SNE, and that it makes it possible to learn embeddings of data sets with millions of objects.

[Visualization] Tools: Machine Learning

[Visualization] Tools: Python

• Bowtie  [GitHub]  |  demo  |  Python library for writing interactive visualizations  [reddit]

## VSM: VECTOR SPACE MODELS

### word2vec

• Charts: BriteCharts  [API]