Data blog

Python web app for studying combo-word Chinese vocabulary using Streamlit

I remember new Chinese vocabulary best by breaking a word up into its component characters, so I made a game to practice this. For example, 半 (half) + 岛 (island) = 半岛 (peninsula).

Most frequent Chinese characters appearing in metro station names

When I arrive in a new Chinese city, one problem I have is not knowing how to read the names of the metro stations I need to go to. I scraped a few thousand station names from Wikipedia to familiarize myself with the most frequent characters.
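As a rough illustration of the counting step, here is a minimal sketch using Python's `collections.Counter`. The function name `char_frequencies` and the four station names are hand-typed examples I made up for this snippet; they are not the scraped Wikipedia data, and the post itself may do this differently.

```python
from collections import Counter


def char_frequencies(station_names):
    """Count how often each Chinese character appears across station names."""
    counts = Counter()
    for name in station_names:
        # Keep only characters in the basic CJK Unified Ideographs block,
        # so Latin letters, digits, and punctuation are ignored.
        counts.update(ch for ch in name if "\u4e00" <= ch <= "\u9fff")
    return counts


# Hypothetical sample input, for illustration only.
sample_names = ["人民广场", "南京东路", "南京西路", "中山公园"]
counts = char_frequencies(sample_names)

# Print characters from most to least frequent.
for char, n in counts.most_common():
    print(char, n)
```

With a few thousand real names, the top of this list is a handy set of characters to learn first.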

Historical analysis of FIRE strategies

On r/FIRE, a common question is ‘Is $X enough to retire?’. I analyzed historical inflation and SPY return data to get a sense of how often different retirement strategies succeed.
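One common way to frame "how often does a strategy succeed" is to replay every historical window of a given length and check whether the portfolio runs out of money. The sketch below assumes a simple rule (withdraw a fixed, inflation-adjusted amount at the start of each year); the function names and the toy numbers are hypothetical and are not taken from the post's data or method.

```python
def portfolio_survives(start_balance, annual_spend, returns, inflation):
    """Return True if the balance never goes negative over the given years."""
    balance = start_balance
    spend = annual_spend
    for yearly_return, yearly_inflation in zip(returns, inflation):
        balance -= spend  # withdraw living expenses at the start of the year
        if balance < 0:
            return False
        balance *= 1 + yearly_return  # remaining money grows (or shrinks)
        spend *= 1 + yearly_inflation  # next year's spending tracks inflation
    return True


def success_rate(start_balance, annual_spend, returns, inflation, horizon):
    """Fraction of historical windows of `horizon` years that survive."""
    n_windows = len(returns) - horizon + 1
    if n_windows < 1:
        raise ValueError("not enough years of data for this horizon")
    outcomes = [
        portfolio_survives(
            start_balance,
            annual_spend,
            returns[start:start + horizon],
            inflation[start:start + horizon],
        )
        for start in range(n_windows)
    ]
    return sum(outcomes) / len(outcomes)


# Toy inputs for illustration only: flat 0% returns and 0% inflation.
toy_returns = [0.0] * 12
toy_inflation = [0.0] * 12
print(success_rate(100, 10, toy_returns, toy_inflation, horizon=10))
```

Swapping in real yearly return and inflation series turns this into a crude historical backtest.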

Google Sheet template for net worth tracking

A lot of nerds have spreadsheets to track the details of their financials. I, of course, am proud of the custom-made columns in my own.

Black Mirror episode ranking analysis

Several websites have ranked all of the episodes in Seasons 1-6 of Black Mirror. I aggregated them into a spreadsheet along with my own ratings, and produced some figures and tables, including aggregate rankings and the most underrated and overrated episodes.

I Think You Should Leave Skit Ranking: A systematic review and meta-analysis

Multiple websites have ranked the best skits in “I Think You Should Leave.” I aggregated them into a spreadsheet along with my own ratings, and produced some figures and tables, including aggregate rankings and the most underrated and overrated skits.

Fruit picking in the East Bay

Many streets in Oakland and Berkeley are lined with fruit trees. I made a map of them and a flow chart to formally decide if it’s OK to pick the fruit from a tree.

Hofstede’s 6 culture dimensions - Streamlit

A recent Freakonomics episode described Hofstede’s framework of 6 culture dimensions. I made a web app to visualize the culture dimensions by country and compare them to a personal set of preferences.

Metro door opening durations - Cross-city comparison

Riding the metro in Mexico City, I immediately noticed how briefly their doors open at the stops. So I collected some data to compare this more quantitatively with the metro I take at home, BART.

Misleading title of British Medical Journal article

The title of a BMJ article indicates that the first dose of the Pfizer vaccine has 52% efficacy. I rant here about how the statistical analysis does not match the intuitive interpretation of efficacy after the first dose, and how this leads to inaccurate downstream citations.

Income inequality in USA, visualized

I had heard that “income inequality is getting worse,” but I never really had a quantified perspective of it. Therefore, I downloaded some data from the Census and visualized it here.

My personal data from 10 apps

I requested, processed, analyzed, and visualized data from Spotify, Twitter, Amazon, Facebook, Apple, LinkedIn, Uber, Venmo, Bank of America, and Tinder.

Analysis of 10,000+ fact checks on Politifact

Politifact is a handy nonprofit organization that rates the truth value of political statements, mostly by American politicians. For this post, I scraped the results of their fact checking since their inception in 2007 and visualized some trends across time, space, and the political spectrum.

Estimating the prevalence of code sharing in scientific research

I scraped over 100,000 full-text articles from the PubMed API to estimate how common code sharing is across different journals.

Analysis of Insight Data Science Fellows

Insight Data Science is a popular fellowship for PhDs going into data analytics. I wanted to get a better sense of where fellows came from and ended up, so I scraped some data from the Insight website and analyzed it.

Delays of US domestic flights: trends and predictability

Using data collected by the US Bureau of Transportation Statistics, we analyzed the relationships between basic properties of a flight (e.g. time of day, airline) and how much it was delayed. We also trained a classifier to predict if a flight would be delayed.

Brain Oscillations and the importance of waveform shape

We believe that the waveform shape of brain rhythms should be analyzed to extract more biological information from neural recordings.

Free supercomputing for research: A tutorial on using Python on the Open Science Grid

The Open Science Grid is a free supercomputing resource for academics. This step-by-step tutorial will allow any researcher to begin running their Python-based analysis using high-throughput computing for free.

Poster popularity at SfN 2016: Comparing across states and countries

I analyzed the geographic distribution of poster viewership for posters presented at the SfN 2016 annual meeting. Posters from some states (Minnesota) and countries (Netherlands) are more popular than others. But not significantly.

Poster popularity at SfN 2016: Cognition and systems are hot. Development is not.

At the annual neuroscience conference, I collected data to quantify the popularity of thousands of presented posters. As a first analysis, I related poster popularities to 8 of the major themes in neuroscience.

Lucha Libre Taco Shop: Official burrito review

Twenty-eight people applied burritology to assess their experiences eating burritos at the famous Lucha Libre Taco Shop in San Diego.

Olympics 2016: Normalizing results by sport

The United States is dominating in the Olympic medal count, but maybe that’s because of the disproportionate number of medals in swimming. What would the results look like if the number of medals was even for all sports?
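One way to make every sport count equally is to give each sport a total weight of 1 and split that weight evenly across the medals awarded in it. The sketch below assumes that scheme; the function name and the toy medal list (countries "A" and "B") are made up for illustration and are not the actual 2016 results or necessarily the weighting used in the post.

```python
from collections import Counter, defaultdict


def normalized_medal_scores(medals):
    """Score countries so that each sport contributes a total weight of 1.

    `medals` is a list of (country, sport) pairs, one pair per medal awarded.
    """
    medals_per_sport = Counter(sport for _, sport in medals)
    scores = defaultdict(float)
    for country, sport in medals:
        # A medal is worth less in sports that hand out many medals.
        scores[country] += 1.0 / medals_per_sport[sport]
    return dict(scores)


# Toy data for illustration only: A wins 3 of 4 swimming medals,
# B wins 1 swimming medal and the only judo medal.
toy_medals = [("A", "swimming")] * 3 + [("B", "swimming"), ("B", "judo")]
scores = normalized_medal_scores(toy_medals)
print(scores)
```

In this toy case A leads the raw count 3 to 2, but B comes out ahead once each sport is weighted equally.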

Which country is winning the 2016 Olympic games?: A Tableau Visualization

An interactive visualization that lets you set a weight for each medal category and compare performance across the globe. I was playing around with data visualization in Tableau Public using the Rio Summer 2016 Olympic medals dataset.

Extracting time series data from a published figure

Rather than zooming in on small figure panels, we can use simple image processing to extract an estimate of the signals plotted in papers.

100 Burritos in San Diego: 10-dimensional rating system

A group of San Diegans quantified over 100 burrito experiences by decomposing their meal into 10 dimensions. This post describes the data and has some preliminary analysis.

Phase-amplitude coupling: hidden in noise

Phase-amplitude coupling is a common analysis on neural oscillations. But in order to obtain meaningful results, we need to first preprocess the signal.

Empirical Mode Decomposition (EMD) tutorial

Rhythmic signal analysis can be improved with a transform out of the time domain. While Fourier techniques are traditionally applied, EMD offers an alternative approach to frequency analysis.