Projects

1. Burritos - repo, blog, ignite talk, seminar1, seminar2, poster, dashboard

I designed a 10-dimensional system to systematically rate burrito quality and applied this method to rate 350+ burritos (60+ restaurants, 70+ unique reviewers). I then analyzed the data to establish significant differences between rival taco shops, explore dimensionality reduction, and quantify inter-reviewer reliability. I made a couple of dashboards using Tableau and Dash to explore taco shop ratings around San Diego. I also loaded the data into an SQL database and demonstrated some example queries.
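
For a taste of the SQL piece, here's a minimal sketch of loading ratings into SQLite with pandas and running an example query (the table and column names here are made up, not the real schema):

```python
import pandas as pd
import sqlite3

# Hypothetical subset of the burrito ratings (not the real schema)
ratings = pd.DataFrame({
    "restaurant": ["Taco Stand", "Taco Stand", "Lolita's"],
    "overall": [4.5, 4.0, 3.5],
    "cost": [7.99, 7.99, 6.99],
})

conn = sqlite3.connect(":memory:")
ratings.to_sql("burrito", conn, index=False)

# Example query: average overall rating per restaurant
query = """
    SELECT restaurant, AVG(overall) AS avg_rating, COUNT(*) AS n_reviews
    FROM burrito
    GROUP BY restaurant
    ORDER BY avg_rating DESC
"""
print(pd.read_sql(query, conn))
```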

2. Brain rhythms - Papers, methods paper repo, lab toolbox, coupling toolbox

For my PhD research, I developed, implemented, and justified a fundamentally new approach to analyzing brain rhythms. Conventional analysis revolves around the Fourier transform, which decomposes brain signals into a sum of sine waves. However, brain rhythms are often nonsinusoidal in shape, so Fourier analysis does not fully capture the information in the signal. I wrote a review paper summarizing past reports of potentially interesting nonsinusoidal phenomena, developed a python- and pandas-based framework for analyzing waveform shape (methods paper and library), and applied it to uncover physiological information contained in waveform shape (paper in prep). Additionally, I wrote a paper demonstrating that a previous high-profile result was artifactual because it did not properly account for waveform shape.
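
As a rough illustration of the waveform-shape idea (a minimal sketch on simulated data, not the actual methods-paper pipeline), one simple shape feature is rise-decay symmetry, computed from peak and trough times:

```python
import numpy as np
from scipy.signal import sawtooth, find_peaks

# Simulate a nonsinusoidal 10 Hz rhythm (a sawtooth: slow rise, fast drop)
fs = 1000                    # sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
x = sawtooth(2 * np.pi * 10 * t)

peaks, _ = find_peaks(x)
troughs, _ = find_peaks(-x)

# Walk the extrema in temporal order, measuring rise (trough -> peak)
# and decay (peak -> trough) durations in samples
extrema = sorted([(i, "peak") for i in peaks] + [(i, "trough") for i in troughs])
rises, decays = [], []
for (i0, k0), (i1, k1) in zip(extrema[:-1], extrema[1:]):
    if k0 == "trough" and k1 == "peak":
        rises.append(i1 - i0)
    elif k0 == "peak" and k1 == "trough":
        decays.append(i1 - i0)

# Rise-decay symmetry: fraction of the cycle spent rising (0.5 for a sine)
rdsym = np.mean(rises) / (np.mean(rises) + np.mean(decays))
print(f"rise-decay symmetry: {rdsym:.2f}")  # close to 1.0 for this sawtooth
```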

3. Police officer misconduct - Crime Lab New York internship

Using scikit-learn, I built and iterated on several machine learning models trained on police officer activity to predict the likelihood of excessive-force complaints. These models are designed to be incorporated into an early intervention system (EIS) that police departments can use to allocate training and other resources efficiently. I also made heavy use of pandas and seaborn to visualize trends in officer behavior and variation across districts.
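
Here's a minimal sketch of the modeling setup (the features, data, and model choice below are illustrative, not Crime Lab's actual ones):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical officer-level activity features (not the real data)
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "arrests_past_year": rng.poisson(20, n),
    "stops_past_year": rng.poisson(80, n),
    "prior_complaints": rng.poisson(1, n),
})
# Hypothetical label: whether an excessive-force complaint occurred
y = (rng.random(n) < 0.05 + 0.03 * (X["prior_complaints"] > 2)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# An EIS ranks officers by predicted risk so resources go to the highest-risk cases
risk = model.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, risk))
```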

4. Keystroke biometrics - repo, dashboard, slides

Keystroke patterns offer a unique biometric that can identify and authenticate individuals in a way that is difficult for malicious users to reproduce. As an Insight Data Science fellow, I worked with a project management fellow and a data engineering fellow to build a system that continuously authenticates users from their keystrokes. I trained a gradient boosting model that identified users with a 98% true positive rate (recall) and a 98% true negative rate (specificity), and designed a dashboard that a security team could use to visualize suspicious user activity.
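
A minimal sketch of the modeling approach, assuming simulated dwell-time and flight-time features rather than the real keystroke data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Hypothetical keystroke features: per-key hold (dwell) times and
# between-key (flight) times, in ms (not the real feature set)
rng = np.random.default_rng(0)
n = 2000
genuine = rng.normal(loc=[100, 150], scale=15, size=(n // 2, 2))
imposter = rng.normal(loc=[120, 180], scale=25, size=(n // 2, 2))
X = np.vstack([genuine, imposter])
y = np.array([1] * (n // 2) + [0] * (n // 2))  # 1 = genuine user

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("TPR (recall):     ", tp / (tp + fn))
print("TNR (specificity):", tn / (tn + fp))
```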

5. Code sharing in neuroscience - repo, blog

I queried the PubMed API to obtain the full text of around 170,000 articles from 20 journals that publish neuroscience papers. I used keyword searches to estimate which articles relied on custom-written scripts and which publicly shared their code. This involved designing a tool to efficiently, semi-automatically sift through 200 papers to validate the keyword-based estimates. Ultimately, I was able to compare code sharing across journals and track how it has improved over time.
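
The full text came from NCBI's E-utilities; here's a minimal sketch of the query-and-search loop (the query syntax and keyword heuristics shown are illustrative, not the ones I actually used):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Find articles from a journal in PubMed Central (query syntax is illustrative)
r = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pmc",
    "term": '"J Neurosci"[Journal] AND 2018[PDAT]',
    "retmax": 5,
    "retmode": "json",
})
ids = r.json()["esearchresult"]["idlist"]

# Fetch the full text (XML) and apply crude keyword heuristics
for pmcid in ids:
    xml = requests.get(f"{EUTILS}/efetch.fcgi",
                       params={"db": "pmc", "id": pmcid, "retmode": "xml"}).text
    uses_code = any(kw in xml.lower() for kw in ("custom script", "custom code", "matlab"))
    shares_code = any(kw in xml.lower() for kw in ("github.com", "code is available"))
    print(pmcid, uses_code, shares_code)
```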

This project was expanded during a subsequent hackathon, where our team refined the text mining to compute a measure of the openness of each journal (“O-Factor”).

6. Flight delays - repo, report

Tom Donoghue and I analyzed a data set of 5 million domestic flights from 2015. We characterized trends in flight delays and built several machine learning models to predict whether a flight would be significantly delayed.
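
A minimal sketch of the trend characterization (the column names follow the public 2015 flights dataset, but treat them as assumptions):

```python
import pandas as pd

# A few rows shaped like the 2015 flights dataset (values made up)
flights = pd.DataFrame({
    "MONTH": [1, 1, 6, 6, 12],
    "AIRLINE": ["AA", "DL", "AA", "UA", "DL"],
    "DEPARTURE_DELAY": [5.0, -2.0, 45.0, 12.0, 88.0],
})

# Define "significantly delayed" as >15 minutes (a common threshold)
flights["delayed"] = flights["DEPARTURE_DELAY"] > 15

# Characterize trends: delay rate by month and by airline
print(flights.groupby("MONTH")["delayed"].mean())
print(flights.groupby("AIRLINE")["delayed"].mean())
```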

7. Currency exchange prediction - scraper, press

In a past life, I was super interested in trying to predict fluctuations in financial markets. Back when I used MATLAB (cringe), I built some machine learning models (logistic regression, SVM, neural net) from scratch (thanks Andrew Ng!) to predict future movements of the Euro-Dollar exchange rate from historical trends. At least I had the sense to use python to scrape very high quality (1-minute resolution!) historical exchange rates for several currency pairs over several years.
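
For flavor, here's what logistic regression "from scratch" looks like in NumPy (an illustrative reimplementation on fake data, not the original MATLAB code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # e.g., returns over the past 4 intervals (fake)
y = (X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(0, 0.5, 500) > 0).astype(float)

# Gradient descent on the logistic log loss
w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid of the linear model
    grad_w = X.T @ (p - y) / len(y)     # gradient of the log loss w.r.t. weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean((p > 0.5) == y))
```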

8. Politifact fact checks - repo, blog post

PolitiFact is a handy nonprofit organization that rates the truthfulness of political statements, mostly by American politicians. I scraped the results of their fact checking since their inception in 2007 and visualized trends across time, geography, and the political spectrum.
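
A minimal sketch of the scraping approach (the URL pattern and CSS selector are illustrative and may not match PolitiFact's current markup):

```python
import requests
from bs4 import BeautifulSoup

# Fetch one listing page of fact checks (URL pattern is an assumption)
page = requests.get("https://www.politifact.com/factchecks/?page=1")
soup = BeautifulSoup(page.text, "html.parser")

# Pull the text of each fact-check item (hypothetical selector)
for item in soup.select("li.o-listicle__item"):
    statement = item.get_text(strip=True)
    print(statement[:80])
```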

9. Personal data requests - repo, blog post

The EU’s General Data Protection Regulation (GDPR) has prompted many web companies to let their users easily download (a subset of) the personal data they have stored. I was curious what information the apps I use held (and what they would provide), so I requested, processed, analyzed, and visualized data from Spotify, Twitter, Amazon, Facebook, Apple, LinkedIn, Uber, Venmo, Bank of America, and Tinder.
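
A minimal sketch of the kind of processing involved, using a record shaped like Spotify's streaming-history export (the field names are assumptions; export formats change over time):

```python
import json
import pandas as pd

# One record shaped like Spotify's StreamingHistory export (illustrative)
records = json.loads("""[
  {"endTime": "2019-01-05 14:02", "artistName": "Radiohead",
   "trackName": "Weird Fishes/Arpeggi", "msPlayed": 257000}
]""")

df = pd.DataFrame(records)
df["endTime"] = pd.to_datetime(df["endTime"])
df["hours"] = df["msPlayed"] / 3.6e6  # milliseconds -> hours

# e.g., total listening time per artist
print(df.groupby("artistName")["hours"].sum().sort_values(ascending=False))
```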

10. Maps with Tableau - Tableau Public profile

I’ve used Tableau to make a few maps visualizing a weighted sum of Olympic medals, burrito ratings, and the popularity of posters at the Society for Neuroscience annual meeting (USA, international). For when I want to stay in Python, I made a simple example of plotting features as a function of US state.
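
For the Python route, here's a minimal state-level choropleth sketch using plotly (which may differ from the library in the linked example; the values are made up):

```python
import pandas as pd
import plotly.express as px

# Made-up per-state values to color the map
df = pd.DataFrame({
    "state": ["CA", "TX", "NY"],
    "value": [4.2, 3.8, 4.0],
})

fig = px.choropleth(df, locations="state", locationmode="USA-states",
                    color="value", scope="usa")
fig.show()
```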

11. Neuroscience poster popularity - Notebooks, blog

At the 2016 Society for Neuroscience annual meeting, I developed an efficient data collection system that allowed me to quickly count the number of people at over 3,000 posters. I then cross-referenced this data with the online abstract booklet to determine which themes were most popular. Additionally, I visualized how poster popularity varied with the state or country of the presenter, and determined which deviations were and were not significant.
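
A minimal sketch of the kind of significance test involved, assuming made-up counts and an exact binomial test (not necessarily the exact test I used):

```python
from scipy.stats import binomtest

# Is a region's share of "popular" posters higher than the overall rate?
# (numbers are made up for illustration)
n_region_posters = 120       # posters from the region
n_region_popular = 25        # of those, counted as "popular"
overall_popular_rate = 0.15  # popularity rate across all posters

result = binomtest(n_region_popular, n_region_posters, overall_popular_rate)
print(result.pvalue)  # small p-value -> deviation unlikely under the overall rate
```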

12. Insight Data Science Fellows - repo, blog

I was interested in the backgrounds of Insight Data Science fellows and the sorts of jobs they ended up getting, but the only information I could find was a long list of pictures on their website. So I used BeautifulSoup to scrape the fellows’ information and then used pandas and seaborn to visualize the prevalence of different universities, scientific fields, and companies the fellows went on to work for. I also examined interactions among these factors to uncover a few interesting statistical relationships.

13. Interactive visualization - script, blog

I made a simple interactive graph using Bokeh that projects a savings plan to determine when you will have enough money to retire comfortably. I know I’m too young to be thinking about this already.
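
A minimal sketch of the projection and plot (the savings parameters are made up, and this is simpler than the linked script):

```python
import numpy as np
from bokeh.plotting import figure, show

# Made-up plan: starting balance, monthly contribution, annual return
principal, monthly, annual_return = 10_000, 1_000, 0.05
months = np.arange(0, 40 * 12)
r = annual_return / 12

# Compound growth of the principal plus the future value of the contributions
balance = principal * (1 + r) ** months + monthly * ((1 + r) ** months - 1) / r

p = figure(title="Savings projection", x_axis_label="Years",
           y_axis_label="Balance ($)")
p.line(months / 12, balance)
show(p)
```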

Tutorials and teaching

Throughout these projects, I’ve also found it useful to make lectures and tutorials explaining concepts and tools. Here are some of them!

1. Neural signal processing - Notebooks, YouTube videos (1) (2)

To help undergrads integrate into the lab, other lab members and I have written tutorials on how to use the standard tools we use to analyze brain data. These have also covered empirical mode decomposition (EMD) and extracting data from figures in papers.
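
A staple from these tutorials is bandpass-filtering a signal and extracting its amplitude envelope with the Hilbert transform; here's a minimal sketch on simulated data:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

# Simulate a noisy 10 Hz oscillation
fs = 1000  # sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(len(t))

# 4th-order Butterworth bandpass around the 8-12 Hz alpha band
b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="bandpass")
x_filt = filtfilt(b, a, x)  # zero-phase filtering

# Instantaneous amplitude envelope via the analytic signal
amplitude = np.abs(hilbert(x_filt))
```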

2. Python on the Open Science Grid - tutorial, demo files

When I was using the free supercomputing resources provided by the Open Science Grid (OSG), there weren’t nearly as many resources for using python as there were for MATLAB or C. So I wrote a tutorial, which OSG shared further, to help others who were new to supercomputing get up and running with their custom python scripts.

3. Introduction to data science - Clustering slides, Linear regression notebook

As part of my advisor’s introduction to data science undergraduate course, I gave a couple of guest lectures on clustering and multiple linear regression.