
TinkerPlots, data exploration software for kids that’s all grown up.

I was blown away this morning when I watched two short movies about data exploration software called TinkerPlots. The software is marketed to schools for kids in grades 4-8. I love the idea that kids in school can get their hands dirty visually exploring data. And I'm even more excited that they have this tool available to them. Why has TinkerPlots flown under our radar for so long? It's been around for at least 4 years.

The designers of this software deserve praise for creating software that gets out of the way (a Stephen Few-ism, I think) and lets the user explore the data using simple commands. I will happily shell out the $89 to play with TinkerPlots.

Unfortunately, the TinkerPlots website makes it a bit difficult to see examples of the software in action. You can see some QuickTime movies showing TinkerPlots at work here and here. Here's a listing of all TinkerPlots movies.

This software isn’t nearly as sophisticated as some of the software mentioned on Stephen’s site. But, as da Vinci said, “simplicity is the ultimate sophistication.”

Would love to hear your thoughts on this. Is anyone out there using TinkerPlots?

Have bad graphs and faulty analysis led to evidence that Amazon has fake reviewers? Read on…

In my first post about Nick Bilton's flawed analysis of Amazon's Kindle, I left a few questions unanswered. One of those questions had to do with the ratings of the reviewers themselves. Since Amazon allows each review to be rated by anyone, it might be interesting to see whether the number of people who found a review useful varied with the number of stars the reviewer gave the Kindle. So I ran an analysis examining the Kindle2 reviews.

So here are 4 plots*. The first (upper left) shows all reviews. Along the horizontal axis is the number of people reported to have found the review useful. Along the vertical axis is the star rating of the review. The plot on the upper right shows the same distribution, but for non-verified purchasers of the Kindle2 only. The plot on the lower left shows the same distribution, but this time for reviewers who Amazon said actually purchased a Kindle2. The plot on the lower right brings the Amazon-verified and non-verified purchasers together. Each red + sign is an Amazon-verified purchaser and each blue circle is a non-verified purchaser.

Four scatterplots
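For readers who want to try a layout like this on their own data, here's a minimal Python/matplotlib sketch of the four panels described above. It isn't the code I used; the column names ("stars", "helpful_votes", "verified") are hypothetical placeholders, and the demo frame at the bottom is synthetic data, not the real Kindle2 reviews.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_review_panels(reviews: pd.DataFrame) -> None:
    """Scatter helpful-vote counts (x) against star ratings (y) in four panels.
    Expects hypothetical columns: 'stars', 'helpful_votes', boolean 'verified'."""
    verified = reviews[reviews["verified"]]
    unverified = reviews[~reviews["verified"]]
    # Equal-size random sample of non-verified reviews for the bottom-right
    # overlay (see the second footnote at the end of this post).
    sampled = unverified.sample(n=len(verified), random_state=0)

    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)
    axes[0, 0].plot(reviews["helpful_votes"], reviews["stars"], "o",
                    color="gray", alpha=0.3)
    axes[0, 0].set_title("All reviews")
    axes[0, 1].plot(unverified["helpful_votes"], unverified["stars"], "o",
                    color="steelblue", alpha=0.3)
    axes[0, 1].set_title("Non-verified purchasers")
    axes[1, 0].plot(verified["helpful_votes"], verified["stars"], "+",
                    color="red", alpha=0.5)
    axes[1, 0].set_title("Amazon-verified purchasers")
    axes[1, 1].plot(verified["helpful_votes"], verified["stars"], "+",
                    color="red", alpha=0.5, label="Verified (+)")
    axes[1, 1].plot(sampled["helpful_votes"], sampled["stars"], "o",
                    markerfacecolor="none", color="steelblue", alpha=0.5,
                    label="Non-verified (o)")
    axes[1, 1].set_title("Verified vs. non-verified (equal-size sample)")
    axes[1, 1].legend()
    for ax in axes.flat:
        ax.set_xlabel("People who found the review useful")
        ax.set_ylabel("Star rating")
    fig.tight_layout()
    plt.show()

if __name__ == "__main__":
    # Tiny synthetic demo frame -- NOT the real Kindle2 review data.
    rng = np.random.default_rng(1)
    demo = pd.DataFrame({
        "stars": rng.integers(1, 6, 500),
        "helpful_votes": rng.poisson(5, 500),
        "verified": rng.random(500) < 0.25,
    })
    plot_review_panels(demo)
```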

Evidence of fake reviews?

These four charts tell an interesting story. Each point on a chart represents a review, so in each chart (except the one on the bottom right**) you're seeing 9,212 points. The two charts on top are roughly the same. That's because the first chart shows all reviews and the second shows just the reviews submitted by non-verified Kindle2 purchasers. You may recall that 75% of the reviews of the Kindle2 were submitted by people who Amazon said didn't buy one, so those dots dominate the charts. But take a look at the chart on the bottom left. You'll notice that the cluster of reviews at the bottom of the top two charts, the one between 1 and 2 stars stretching all the way out to the end of the X axis, is gone. We knew that non-verified purchasers were four times more likely to give a one-star review than verified purchasers, but we didn't know that the 1-star non-verified reviewers were getting lots of people finding their reviews useful.

This dynamic really pops in the bottom right hand chart, the one with the red and blue lines in it. The blue line is made up of non-verified purchasers. As the number of people who said they found the review useful increases (starting around 8), the line dives down towards the 1-2 star ratings. The downward slope of the curve for the verified purchasers is much, much gentler.

This is a bit of a head-scratcher. I've heard people say that Amazon is full of fake reviews. These people aren't saying that Amazon is the one doing the faking; rather, the fakers are people with a product that competes against the one being reviewed, or just people with an axe to grind. Is this an example of that? Do the fakers get their friends to say that their reviews are helpful? Maybe the Kindle2 verified purchasers post reviews that people just don't find helpful. Right now, I don't know what the correct answer is. But I have a feeling that some intelligent text-mining of the data will help flesh out an answer. Be on the lookout for a post about just that topic, by Marc Harfeld, coming soon, right here.

*To make the graphs easier to decipher I've excluded any review with more than 50 people finding it useful. Extending the horizontal axis beyond 50 makes the plot very difficult to read. In all, this amounts to excluding 92 of the 9,304 reviews I have gathered on the Kindle2. Because the star ratings are integers between 1 and 5, I needed to add a small random jitter to the points (one 1-star review becomes 1.221, another becomes 1.1321) so that they wouldn't completely overlap each other on the scatterplot. I did the same to the counts of how many people found each review helpful.
**Please note, to make an apples-to-apples comparison for the chart on the bottom right, I had to reduce the number of non-verified reviewers down to the same number as the Amazon-verified reviewers. The sampling was a simple random sample, so it did not distort the distribution.
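For the curious, here is a rough Python sketch of the kind of pre-processing these two footnotes describe: dropping the reviews with more than 50 "useful" votes and jittering the integer values so identical points stop stacking on top of each other. Again, the column names are hypothetical placeholders, not the actual fields in my data set.

```python
import numpy as np
import pandas as pd

def prepare_for_plotting(reviews: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Apply the footnoted adjustments: drop reviews with more than 50
    'useful' votes, then jitter the integer star ratings and vote counts
    so overlapping points become visible on a scatterplot."""
    rng = np.random.default_rng(seed)
    kept = reviews[reviews["helpful_votes"] <= 50].copy()
    # One 1-star review becomes something like 1.221, another becomes 1.132,
    # so the points no longer sit exactly on top of each other.
    kept["stars_jittered"] = kept["stars"] + rng.uniform(0, 0.3, len(kept))
    kept["votes_jittered"] = kept["helpful_votes"] + rng.uniform(0, 0.3, len(kept))
    return kept
```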

Do you know the simplest, yet most overlooked lesson of Business Intelligence?

Below is a data set with 4 groupings of data and 2 columns for each grouping. The summary statistics (mean, variance, correlation, sum of squares, r², and linear regression line) are the same for all 4 groupings of X and Y values. If we stopped our analysis here, we could move forward confident that the 4 groups of data are the same. And we'd be dead wrong.

Anscombe's quartet (data table)

visualize these data
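If you'd like to check the claim for yourself, here's a quick sketch in Python (any stats package would do the same job) that runs the numbers on the standard published values of Anscombe's quartet. Every group comes out with essentially the same mean, variance, correlation, and fitted line.

```python
import numpy as np

# Standard published values of Anscombe's quartet (Anscombe, 1973).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = intercept + slope*x
    # Each group prints roughly: mean_y=7.50  var_y=4.12  r=0.816  y=3.00+0.500x
    print(f"{name}: mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.2f}  "
          f"r={np.corrcoef(x, y)[0, 1]:.3f}  fit: y={intercept:.2f}+{slope:.3f}x")
```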

In my 15 years in analytics I've seen good analysts, time and again, stop their analytical efforts when their data summaries don't tell a compelling story. I've sat through hours of meetings, going through page after page of data related to critical financial forecasts, looking at historical trends going back years, without seeing a single graph to show a trend. For whatever reason, data exploration for many analysts starts and ends with a table of summary statistics describing the data. What a shame. In relying on summary statistics alone, we give short shrift to one of our most powerful assets: our eyes.

To see what I mean, click here.

For years Edward Tufte and Stephen Few have been telling the BI community to "above all else, show the data." Make your intelligence visible. Go beyond the summary look of your data and show it, warts and all. In fact, the Business Intelligence Guru recommends looking at graphic representations of your data before you even look at summary statistics. There are tools available today that make looking at graphic distributions of data easier than ever. I have years of experience using JMP (link will take you to a fully functional 30-day free trial), from SAS, which has a distribution engine that makes it a snap to look at distributions. Even SAS/GRAPH, with its new statistical graphics (SG) procedures in version 9.2, makes it easy to view your data up close and personal.
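To show just how little code "showing the data" takes, here's a minimal sketch using Python's seaborn library, which happens to ship Anscombe's quartet as one of its example datasets (fetched on first use). The equivalent is a few clicks in JMP's distribution platform or a short SG-procedure step in SAS; the point is simply to look at the scatters before trusting the summary table.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# seaborn's example copy of Anscombe's quartet: columns 'dataset', 'x', 'y'.
df = sns.load_dataset("anscombe")

# One scatter panel per group, each with its fitted regression line.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```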

Lastly, I didn't invent the data that I'm using to make my point. I came across two references last week that made me think I should write about it. I watched an info viz legend, Jeff Heer, tell a story making the case for info viz. I didn't realize it then, but the story he told actually dates back to 1973 and also appears on the first page of Chapter 1 of Edward Tufte's book, "The Visual Display of Quantitative Information" (second edition, 2001). The story goes to the heart of why we need to show the data.

The credit for this eye-opening example goes to F.J. Anscombe, a statistician who created this data set in 1973 to make the case for graphing data before analyzing it. He was a man ahead of his time.