Old Spice Guy’s popularity on Twitter charted

Old Spice recently released about 14 ads with The Old Spice Guy (OSG) personally responding to Tweets from 14 celebrities. Some of the celebs are Hollywood types, others are Web Celebs like Guy Kawasaki, Biz Stone, and Kevin Rose. You can see OSG’s video replies here. They are great.

I put together a chart showing the number of Tweets that mention the words ‘old’ and ‘spice’. The chart shows just how quickly the Twitterverse filled up with Tweets about the OSG. Before 9am on July 13th, there was hardly any mention of the OSG, but then, within 6 hours, there’s a spike of about 2,300 tweets per hour about Old Spice. Alas, nothing lasts forever, and after peaking at 4,500 Tweets per hour, the Twitterverse quieted down and settled at around 400 Tweets per hour about the OSG.
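For readers who want to build a similar chart, here is a minimal sketch of the counting step. It is not the method used for the chart above: the `tweets_per_hour` function, the `(timestamp, text)` input format, and the four sample tweets are all made up for illustration.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(tweets, required_words=("old", "spice")):
    """Count tweets per clock hour whose text contains every required word.

    `tweets` is an iterable of (datetime, text) pairs (a hypothetical format).
    Matching is a naive whitespace split, so "Spice!" would not match "spice".
    """
    counts = Counter()
    for timestamp, text in tweets:
        words = set(text.lower().split())
        if all(word in words for word in required_words):
            # Truncate the timestamp to the start of its hour.
            hour = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    return counts


# Made-up sample data, only to show the shape of the input.
sample = [
    (datetime(2010, 7, 13, 8, 55), "Nothing to see here"),
    (datetime(2010, 7, 13, 9, 5), "The Old Spice guy is hilarious"),
    (datetime(2010, 7, 13, 9, 40), "old spice videos all day"),
    (datetime(2010, 7, 13, 10, 15), "I smell like Old Spice now"),
]

hourly = tweets_per_hour(sample)
for hour in sorted(hourly):
    print(hour, hourly[hour])
```

Plotting the hourly counts as a line over time gives the kind of spike-and-decay picture described above.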

BTW, the OSG says he’s hung up his towel.

Chart of the Old Spice Guy's popularity on Twitter

OSG Trend

Watch as The Biz Intel Guru fixes a poorly designed WSJ graphic

A friend of mine pointed me to a story in today’s WSJ (no subscription needed) with a hard to understand graphic in it. I’ve pasted the graphic below.

The designer chose to use the entire background of the chart to represent the number of sudden cardiac deaths in a given year. They used squares of different sizes to represent the number of explained and unexplained deaths from cardiac arrest. In this case, I think the designer was trying to give the reader an easy way to compare the parts to the whole, but it doesn’t work. Also, there are over 100 words of annotation on this otherwise skimpy graphic, which makes me think they could have done away with the graphic and just used the words instead.

Here’s the WSJ graphic:

Poorly designed WSJ graphic

WSJ graphic

Here’s what I think the chart should look like:

What do you think? Is my graphic clearer than the WSJ’s? What would you do differently? I’d love to hear your comments.

Have bad graphs and faulty analysis led to evidence that Amazon has fake reviewers? Read on…

In my first post about Nick Bilton’s flawed analysis of Amazon’s Kindle I left a few questions unanswered. One of those questions had to do with the ratings of the reviewers themselves. Since Amazon allows each review to be rated by anyone, it might be interesting to see if the number of people who found a review useful varied by the number of stars the reviewer gave to the Kindle. So I ran an analysis examining Kindle 2 reviews.

So here are 4 plots*. The first shows all reviews. Along the horizontal axis is the number of people reported to have found the review useful. Along the vertical axis is the star rating of the review. The plot on the upper right shows the same distribution, but for non-verified purchasers of the Kindle2 only. The plot on the lower left shows the same distribution, but this time for reviewers who Amazon said actually purchased a Kindle2. The plot on the lower right brings the Amazon verified and Amazon non-verified purchasers together. Each red + sign is an Amazon Verified purchaser and each blue circle is a non-verified purchaser.

Four scatterplots

Evidence of fake reviews?

These four charts tell us an interesting story. Each point on the chart represents a review. So in each chart (except on the bottom right**) you’re seeing 9,212 points. The two charts on top are roughly the same. That’s because the first chart shows all reviews and the second one shows just the reviews submitted by non-verified Kindle2 purchasers. You may recall that 75% of the reviews on the Kindle2 were submitted by people who Amazon said didn’t buy a Kindle2. So those dots dominate the charts. But take a look at the chart on the bottom left. You’ll notice that the cluster of reviews at the bottom of the top two charts, the ones between 1 and 2 stars and stretching out all the way to the end of the X axis, is gone. We knew that the non-verified purchasers were four times more likely to give a one star review compared to a verified purchaser, but we didn’t know that the 1 star non-verified reviewers were getting lots of people finding their reviews useful.

This dynamic really pops in the bottom right hand chart, the one with the red and blue lines in it. The blue line is made up of non-verified purchasers. As the number of people who said they found the review useful increases (starting around 8), the line dives down towards the 1-2 star ratings. The downward slope of the curve for the verified purchasers is much, much gentler.

This is a bit of a head-scratcher. I’ve heard people say that Amazon is full of fake reviews. These people aren’t saying that Amazon is the one doing the faking, but people who have some product that competes against the product being reviewed, or just people with an axe to grind. Is this an example of that? Do the fakers get their friends to say that their reviews are helpful? Maybe the Kindle2 verified purchasers post reviews that people just don’t find helpful. Right now, I don’t know what the correct answer is. But I have a feeling that some intelligent text-mining of the data will help flesh out an answer. Be on the lookout for a post about just that topic, by Marc Harfeld, coming soon, right here.

*To make the graphs easier to decipher I’ve excluded any review with more than 50 people finding the review useful. Taking the horizontal axis beyond 50 makes the plot very difficult to read. In all, this amounts to excluding 92 reviews out of the 9,304 I have gathered on the Kindle2. Because the star ratings are integers between 1 and 5, I needed to introduce a random jitter to the points (1 star becomes 1.221, another 1 star becomes 1.1321) so that they wouldn’t completely overlap each other on the scatterplot. I did the same to the values of how many people found each review helpful.
**Please note, to make an apples to apples comparison for the chart on the bottom right, I had to reduce the number of non-verified reviewers down to the same number as the Amazon-verified reviewers. The sampling was a simple random sample, so it did not distort the distribution.
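
The two footnotes describe jittering and down-sampling, and both are easy to sketch. The code below is an illustration, not the code behind the charts: the function names, the jitter width of 0.4, the symmetric noise (the examples above only show upward shifts), and the toy data are assumptions.

```python
import random


def jitter(values, width=0.4, rng=random):
    """Add uniform noise in [-width/2, +width/2] to each value.

    This spreads out integer ratings so overlapping points become visible.
    """
    return [v + rng.uniform(-width / 2, width / 2) for v in values]


def downsample(larger_group, target_size, rng=random):
    """Simple random sample without replacement, to match a smaller group."""
    return rng.sample(larger_group, target_size)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

stars = [1, 1, 1, 5, 5, 3]           # toy star ratings
jittered = jitter(stars, width=0.4, rng=rng)

non_verified = list(range(100))      # stand-ins for 100 non-verified reviews
matched = downsample(non_verified, 25, rng=rng)

print(jittered)
print(len(matched))
```

Because every review has the same chance of being kept, the smaller sample should look like the full group, just with fewer points.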

Stunning new software for geovisual analytics

I came across an exciting and novel piece of visualization software this morning and wanted to share it with the group. What’s novel about the software is that it combines some of the most powerful visualization techniques in one package, with all visualizations linked to each other, kind of like what you’d see in JMP, Tableau, and Panopticon, but with more of an emphasis on the geographical aspect of your data. When you click a point in the scatter plot, the corresponding point(s) light up on the map, bar chart, or any other graphic that is on the screen. These linked graphs are a great way to explore data.

The software was created at Linköping University and, as far as I can tell, it allows users to upload, explore, and visualize their own data, as well as OECD data. Unfortunately, it doesn’t look like the software itself can be downloaded. Here’s a link to the site describing the software and a link where you can demo the software with canned data. The BBC also did a 3 minute demo of the software here. I’ve also put a picture of some of the graphic capabilities of the software at the bottom of this post.

It has a geographic layer with excellent mapping capabilities, including choropleth maps (let’s ignore the little pie charts on their example…no one is perfect). While the maps in the demo aren’t incredibly detailed, I think you can add layers of your own, more detailed data. It has a scatter plot engine much like Trendalyzer (a tool that allows the user to animate time series data as well as change axis variables on the fly), a parallel coordinates plot function, which Stephen Few wrote about in 2006, a time graph, a table lens and other goodies. This is the only time I’ve seen the table lens made available outside of Advizor Analyst. If you’ve never seen a table lens visualization before, you should definitely check it out.

This platform is one of the best I’ve seen in terms of putting powerful visualization tools in the hands of info visualizers to enable them to show the data and tell their stories in an immersive and interactive fashion. In short, this is an important direction where the info viz world needs to venture. And you can’t beat the price (free).

What do you think?

Thanks to Max Kiesler at Design demo for bringing the software to my attention.

collage_gav

map, parallel coordinate plot, tablelens

Do you know the simplest, yet most overlooked lesson of Business Intelligence?

Below is a data set with 4 groupings of data and 2 columns for each grouping. The summary statistics (mean, variance, correlation, sum of squares, r², and linear regression line) are virtually the same for all 4 groupings of X and Y values. If we stopped our analysis here we could move forward confidently knowing that the 4 groups of data are the same. And we’d be dead wrong.

anscombes quartet

visualize these data
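
If you’d like to check the summary statistics yourself, here is a small sketch in plain Python. The numbers are the commonly published values of the data set; the `summarize` helper and its output keys are my own naming, not part of any library.

```python
from math import sqrt

# The four groupings, as commonly published. Groups 1-3 share the same X values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)   # sum of squares for X
    syy = sum((y - mean_y) ** 2 for y in ys)   # sum of squares for Y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows come out nearly identical, which is exactly the trap: nothing in these numbers hints at how different the four groupings look once they are plotted.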

In my 15 years in analytics I’ve seen good analysts, time and again, stop their analytical efforts when their data summaries don’t tell a compelling story. I’ve sat through hours of meetings, going through page after page of data related to critical financial forecasts, looking at historical trends going back years, without seeing a single graph to show a trend. For whatever reason, data exploration for many analysts starts and ends with a table of summary statistics describing the data. What a shame. In relying on summary statistics we give short shrift to one of our most powerful assets–our eyes.

To see what I mean, click here.

For years Edward Tufte and Stephen Few have been telling the BI community to, “above all else, show the data”. Make your intelligence visible. Go beyond the summary look of your data and show it, warts and all. In fact, the Business Intelligence Guru recommends looking at graphic representations of your data before you even look at summary statistics. There are tools available today that make looking at graphic distributions of data easier than ever. I have years of experience using JMP (link will take you to a fully functional 30 day free trial), from SAS, which has a distribution engine that makes it a snap to look at distributions. Even SAS/GRAPH, with its new statistical graphics (SG) procedures in version 9.2, makes it a snap to view your data up close and personal.

Lastly, I didn’t invent the data that I’m using to make my point. I came across two references last week that made me think that I should write about it. I watched an info viz legend, Jeff Heer, tell a story making the case for info viz. I didn’t realize it then, but that story he told actually dated back to 1973 and also appeared on the first page of Chapter 1 in Edward Tufte’s book, “The Visual Display of Quantitative Information” published in 2001. The story goes to the heart of why we need to show the data.

The credit for this eye-opening example goes to F.J. Anscombe, a statistician who created this data set in 1973 to make the case for graphing data before analyzing data. He was a man ahead of his time.

Bar graphs with a non-zero baseline? “Never”! says Biz Intel Guru. Here’s why…

Trying to understand the economy is tough business. Publishing your predictions about the economy on the web is even more difficult. So I was surprised when I came across a paper on economy.com’s website titled, “The Economic Impact of the American Recovery and Reinvestment Act” and noticed this chart.

zandi_unemployement

unemployment rate bar chart

The graph in question was taken from page 13 of the paper, written by Mark Zandi. It’s also featured on his homepage, here. Dr. Zandi is the chief economist and co-founder of economy.com, with a knack for verbally explaining complex things so clearly that non-economists can understand them. He is often heard on NPR and quoted in the WSJ and NYTimes weighing in on the economy. I’ve followed his career for over 15 years and respect his insights and success. It is out of that respect and admiration that I critique this graph.

The main problem with this bar chart is that it is telling two visual lies. The first one is quite serious, the second one, less so.

Bar charts must have a zero-based axis because we use the length of the bars to compare one bar to another bar. By breaking this rule economy.com’s unemployment rate chart makes it look like the unemployment rate will increase 6 fold from 2008Q3 to 2010Q4 without the stimulus, when in fact, the estimated increase is from roughly 6% to 11%, less than 2x. The lack of a zero baseline also adds a false visual comparison between the ‘economic stimulus’ and ‘no economic stimulus’ bars. For that let’s look at the two bars in 10Q4. The ‘no economic stimulus’ bar (blue bar) is about 11.2% versus the ‘economic stimulus’ bar (black bar) of 8.5%. The actual difference between the two percentages is 1.3x, but take a look at the length of the bars and the difference appears to be nearly 2x.
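
The arithmetic behind those ratios is worth spelling out. The sketch below uses only the approximate values read off the chart (6%, 11%, 11.2%, 8.5%) and the chart’s starting value of 5; the `bar_ratio` helper is a name made up for this illustration.

```python
def bar_ratio(value_a, value_b, baseline=0.0):
    """Ratio of two bar lengths when the axis starts at `baseline`."""
    return (value_a - baseline) / (value_b - baseline)


# Trend without stimulus: roughly 6% in 2008Q3 to roughly 11% in 2010Q4.
trend_actual = bar_ratio(11.0, 6.0)                  # true ratio of the values
trend_apparent = bar_ratio(11.0, 6.0, baseline=5.0)  # what the bar lengths show

# 2010Q4: no stimulus (about 11.2%) versus stimulus (about 8.5%).
q4_actual = bar_ratio(11.2, 8.5)
q4_apparent = bar_ratio(11.2, 8.5, baseline=5.0)

print(round(trend_actual, 2), round(trend_apparent, 2))
print(round(q4_actual, 2), round(q4_apparent, 2))
```

Starting the axis at 5 subtracts the same amount from every bar, which inflates every ratio between bars. The closer the baseline gets to the smallest value, the worse the exaggeration.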

I know Dr. Zandi had good intentions when he went with 5 as his starting value on the Y axis. His intent was to make the chart better show the trend over time, but in using a bar chart to display the data, he chose the wrong chart. What should he have used? Read on.

The second visual lie being told here is caused by the third dimension on the graph. Can we tell what the unemployment rate is expected to be in Q4 of 2010 with and without the stimulus? Looks to me like the no stimulus unemployment rate is expected to come in at 11.2% and the unemployment rate with stimulus is expected to be 8.5%. The angling of the Y axis makes it hard for the eye to track over to the value of the bar. To add insult to injury, the angle at the top of each bar makes it difficult to figure out where the ending value of the bar is. Should we reference the front side of the bar or the backside? Unfortunately, the corresponding data this graph is drawn from are not available from economy.com, so we can’t tell for sure where the points are. But we can try a little experiment.

bad_3d

3d bar chart is misleading

I whipped up the chart on the right using MS Excel 2007. The values for A, B, C, D are 10, 20, 30, 40 respectively. I’ve added the actual values to the top of each bar to make it a little easier to read. This 3D chart is actually insightful because it illustrates a serious problem with 3D charts–the bars misrepresent the data. Column D should line up with 40, but it doesn’t, it’s more like 38. If you’re telling a story as important as what’s going to happen to the economy after spending nearly $800 billion in taxpayer money, you should stay away from 3D bar charts because they tell lies about the data they represent.

And that brings us to the final flaw with this chart. Bar charts are generally best used for categorical or grouped data. For time-series data we usually want to go with a line chart, not a bar chart. The lines in the line chart help our eyes see trends in the data better than the individual bars in the bar chart. Line charts also allow us to start from a non zero baseline, which lets the graph’s creator show the trend by setting the axis minimum slightly below the smallest value in the data and the axis maximum slightly above the largest.
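
As a rough sketch of that last point, here is one way to pick line chart axis limits that hug the data. The `padded_limits` helper, the 5% padding, and the sample values are assumptions for illustration, not figures from Dr. Zandi’s paper.

```python
def padded_limits(values, pad_fraction=0.05):
    """Return (axis_min, axis_max) slightly outside the range of the data."""
    lowest, highest = min(values), max(values)
    pad = (highest - lowest) * pad_fraction
    if pad == 0:
        pad = 1.0  # all values equal: fall back to an arbitrary fixed margin
    return lowest - pad, highest + pad


# Illustrative unemployment rates only.
rates = [6.0, 8.5, 11.2]
axis_min, axis_max = padded_limits(rates)
print(axis_min, axis_max)
```

These limits suit a line chart, where we read position and slope. On a bar chart the axis minimum should stay at zero.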

Now let’s compare a non 3D bar chart to a line chart. Same data on each chart. I don’t have quarterly data in either graph, just yearly because the only hard data available in Dr. Zandi’s paper was yearly.

zandi_bars_final

Bar charts must have a zero baseline

zandi_lines_final

the BI Guru's improved line chart for time series data

I obeyed the cardinal rule of the zero baseline on the bar chart, and you can see that the magnitude of the difference between stimulus and non stimulus unemployment isn’t nearly as overstated as it was on the original chart. Even more important, the trend is much easier to grasp from the line chart than the bar chart. Notice how it just about leaps off the chart? With the bar chart, you need to go back and forth one or two times to discern the trend.

Lastly, I chose a soft, somewhat natural color palette to draw these charts. They’re much more pleasing to the eyes than black and blue.

–John

The Business Intelligence Guru

McDonald’s knows how to sell fast food, but they don’t know how to do info viz.

As an info viz junkie I was excited to see McDonald’s using incell bar charts on their food wrappers. Here’s an image off of a cheeseburger wrapper.

mcdonalds_cheeseburger_bar

incell bar chart

As a Stephen Few fan, however, I know that a key ingredient to high quality info viz is simplicity–it’s got to be easy to understand. To that end, what in the world does the dashed vertical line spanning calories through sodium represent? It looks like it’s set to 33%, but why? Is that some magic number in the world of nutrition? Is 33% “crossing the line” for a single item on their menu? There’s ample room for an explanation of the reference line on the wrapper, but none is provided.

McDonald’s has missed the mark on a few other things on their graphics. The icons to the right of each item are meaningless and the heavy border surrounding each item detracts from the data, but the reference line to me is the real head scratcher.

McDonald’s has a great opportunity to educate their customers about how the calories/sodium/protein/fat/carbs they’re eating right now fit into their overall daily intake of calories/sodium/protein/fat/carbs. I get the sense that the incell bar charts are an attempt to simplify this information, which is great. But their attempt comes up short because they didn’t explain the reference line.

Lastly, has anyone else seen these graphs on their wrappers at McDonald’s? I’m curious if this is a local test. I live near Philadelphia, PA.

BTW, I posted this entry to Stephen Few’s blog. Here is his response.
