Archive of entries posted by

Why are we such a litigious society?

As a European, I’ve always been fascinated by how trivial and common-sense matters end up in courts. I’ve been less fascinated and more annoyed by the piles of forms and disclaimers everywhere. Finally, annoyance takes over any kind of fascination faced with the medical bills – often so high because of lawsuit-protecting insurance. As Paul H. […]

Data & Visualization Tools to Track Ebola

I’ve received the following email (slightly edited for clarity): Can anyone recommend a turnkey, full-service solution to help the Liberian government track the spread of Ebola and get this information out to the public? They want something that lets healthcare workers update info from mobile phones, and a workflow that results in data visualizations. They […]

Measuring Beauty

I’ve come across a paper that was using “beauty” as one of the predictors. To measure beauty, the authors used Anaface. I don’t trust metrics without trying them on a gold standard first. So, I tried how well Anaface does on something that the art world considers one of the gold standards of beauty – […]

New New York data research organizations

In a single day, New York City obtained two data analysis/statistics/machine learning organizations: Microsoft Research New York City, with John Langford (machine learning), Duncan Watts (networks), and Dave Pennock (algorithmic economics); and an eBay technology center focusing on data, led by Chris Dixon, the co-founder of the recommendation engine company Hunch, which has recently […]

Agreement Groups in US Senate and Dynamic Clustering

Adrien Friggeri has a lovely visualization of US Senators’ movement between clusters: You have to click the image and play with it to appreciate it. The methodology isn’t yet published – but I can see how this could be very illuminating. The dynamic clustering aspect hasn’t been researched much – one of the notable pieces […]

Data Ecosystem

There has been a lot of talk about the growth of big data and data science recently. To a large extent, it’s just a shuffling of inversely correlated terms, and applied statisticians should just touch up their programming skills and then call themselves “data scientists” to be more hip. As for adversarial situations This post […]

Factual – a new place to find data

Factual collects data on a variety of topics, organizes them, and allows easy access. If you ever wanted to do a histogram of calorie content in Starbucks coffees or plot warnings with a live feed of earthquake data – your life should be a bit simpler now. Also see DataMarket, InfoChimps, and a few older […]

Rare name analysis and wealth convergence

Steve Hsu summarizes the research of economic historian Greg Clark and Neil Cummins: Using rare surnames we track the socio-economic status of descendants of a sample of English rich and poor in 1800, until 2011. We measure social status through wealth, education, occupation, and age at death. Our method allows unbiased estimates of mobility rates. […]

Statistical Murder

Robert Zubrin writes in “How Much Is an Astronaut’s Life Worth?” (Reason, Feb 2012): …policy analyst John D. Graham and his colleagues at the Harvard Center for Risk Analysis found in 1997 that the median cost for lifesaving expenditures and regulations by the U.S. government in the health care, residential, transportation, and occupational areas ranges […]

Beautiful Line Charts

I stumbled across a chart that’s in my opinion the best way to express a comparison of quantities through time: It compares the new PC companies, such as Apple, to traditional PC companies like IBM and Compaq, but on the same scale. If you’d like to see how iPads and other novelties compare, see here. […]

Data mining efforts for Obama’s campaign

From CNN: In July,, an online newsite focused on data mining and analytics software, ran an unusual listing in its jobs section: “We are looking for Predictive Modeling/Data Mining Scientists and Analysts, at both the senior and junior level, to join our department through November 2012 at our Chicago Headquarters,” read the ad. “We […]

DBQQ rounding for labeling charts and communicating tolerances

This is a mini research note, not deserving of a paper, but perhaps useful to others. It reinvents what has already appeared on this blog. Let’s say we have a line chart with numbers between 152.134 and 210.823, with the mean of 183.463. How should we label the chart with about 3 ticks? Perhaps 152.132, […]

Luck or knowledge?

Joan Ginther has won the Texas lottery four times. First, she won $5.4 million, then a decade later, she won $2 million, then two years later $3 million, and in the summer of 2010, she hit a $10 million jackpot. The odds of this have been calculated at one in eighteen septillion and luck like this could only […]

Examining US Legislative process with “Many Bills”

This is Many Bills, a visualization of US bills by IBM: I learned about it a few days ago from Irene Ros at Foo Camp. It definitely looks better than my own analysis of US Senate bills.

Traffic Prediction

I always thought traffic for a particular day and time could be easily predicted from historic data with regression. Google Maps now has this feature: It would be good to actually include season, holiday and similar information: the predictions would be better. I wonder if one can find this data easily, or if […]

Data mining and allergies

With all this data floating around, there are some interesting analyses one can do. I came across “The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008” by Perry Sheffield. There they correlate pollen counts with anti-allergy medicine sales – and indeed find that two days after high pollen […]
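A lagged comparison of this kind (pollen counts on one day against medication sales some days later) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `pearson`, `lagged_correlation` and `best_lag` are made up here, the inputs are assumed to be two equally spaced daily series, and the sketch does not reproduce the paper's actual method.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(pollen, sales, lag):
    """Correlate pollen on day t with sales on day t + lag (lag >= 0)."""
    if lag < 0 or lag >= len(sales):
        raise ValueError("lag must be between 0 and len(sales) - 1")
    # Keep only the days for which both values are available.
    n = min(len(pollen), len(sales) - lag)
    return pearson(pollen[:n], sales[lag:lag + n])


def best_lag(pollen, sales, max_lag):
    """Return the lag in 0..max_lag with the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(pollen, sales, k))
```

Calling `best_lag(pollen, sales, 7)` on two daily series would report which delay, up to a week, lines the two series up best. Correlation alone says nothing about causation, and real data would also need handling of weekly sales cycles and missing days.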

Weather visualization with WeatherSpark

WeatherSpark: prediction and observation quantiles, historic data, multiple predictors, zoomable, draggable, colorful, wonderful: Via Jure Cuhalev.

Get the Data

At GetTheData, you can ask and answer data-related questions. Here’s a preview: I’m not sure a Q&A site is the best way to do this. My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. Imagine doing a search of type: “give me datasets, where an […]

Poverty, educational performance – and what can be done about it

Andrew has pointed to Jonathan Livengood’s analysis of the correlation between poverty and PISA results, whereby schools with poorer students get poorer test results. I’d have written a comment, but then I couldn’t have inserted a chart. Andrew points out that a causal analysis is needed. This reminds me of an intervention that has been […]

Fattening of the world and good use of the alpha channel

In the spirit of Gapminder, the Washington Post created an interactive scatterplot viewer that uses the alpha channel to tell apart overlapping fat dots better than the sorting-by-circle-size that Gapminder uses: Good news: the rate of fattening of the USA appears to be slowing down. Maybe because of high gas prices? But what’s happening with Oceania?