Tag Archives: business intelligence

Pie Charts and faulty analytics in the NYTimes? Watch as the Biz Intel Guru fixes a seriously flawed blog post.

“Is Amazon Working Backward?” That’s the title of NYTimes blogger Nick Bilton’s post of Dec 24, 2009. Mr. Bilton is writing about Amazon’s product, the Kindle. Regarding the Kindle, he writes, “customers aren’t getting any happier about the end product.”

The day Mr. Bilton posted his story, best-selling author Seth Godin poked holes in it. Mr. Godin’s post is titled, “Learning from bad graphs and weak analysis.” Below is a brief listing of the serious flaws in Mr. Bilton’s approach. The listing is a mashup of Mr. Godin’s thoughts and mine.

1. Bilton should know better than to use pie charts because it’s really hard to determine the percentages when we’re looking at parts of a circle. Bar charts would’ve been much better. Stephen Few has stressed this for years. If you’re posting a chart in the NYTimes, you’d better have read your Stephen Few and Edward Tufte.
2. When your charts are the main support for your story, you’d better get them right. Mr. Bilton did get the table of numbers to the left of the pie charts correct. Perhaps he’d be better served by relying on them over the pie charts to make his point.
3. When you’re analyzing something, you shouldn’t compare different populations while ignoring their differences.

Mr. Godin cited 4 specific problems with the piece, ranging from the graphs being wrong (later corrected) to Bilton misunderstanding the nature of early adopters. In addition, Mr. Godin writes, “Many of the reviews are from people who don’t own the device.” Obviously, it’s hard to take a review of a Kindle seriously if the reviewer doesn’t own a Kindle. These are the different populations I’m talking about in item #3 above. I’ll address some of Mr. Godin’s concerns with Bilton’s post now and fill in some of the gaps that Godin left to be filled.

Mr. Bilton tried to make the case that each new version of the Kindle is worse than the one before it. His argument is based almost exclusively on the pie charts below, specifically, the gold slices of each pie. The gold slices are the percentage of one star reviews (lowest possible) each Kindle receives.

Here are the original 3 pies that Mr. Bilton showed in his post.
[Figure: Mr. Bilton’s original three pie charts]

Despite difficulties in estimating the size of each slice in a pie chart, it is apparent that the 7% slice in the first pie chart is much larger than 7%. His corrected version is here.

Another problem Godin has with Bilton’s piece goes to the nature of early adopters. “The people who buy the first generation of a product are more likely to be enthusiasts,” writes Godin. Early buyers are more forgiving than late ones. I can’t really argue with that insight. My brother, an avid tech geek, is an early adopter of lots of tech gadgets. He was the first person I knew to buy an Apple Newton. I don’t recall a single complaint from him about the Newton, despite it not being able to recognize handwriting, which was its main selling point.

Mr. Godin’s claim that many of the reviewers don’t own a Kindle intrigued me the most. If I could quantify the number of one star reviewers who don’t own a Kindle then I could show the difference in one star ratings between the two groups, owners and non-owners.

I recreated the dataset that Mr. Bilton used for his analysis, 18,587 reviews in all. I also read up on how Amazon determines if a reviewer is an “Amazon Verified Purchaser.” Basically, Amazon says that if the reviewer purchased the product from Amazon, they’ll be flagged with the Amazon Verified Purchase stamp. So let’s see, do the one star ratings vary between the Amazon Verified Purchaser reviews compared to the non-Amazon Verified Purchaser reviews? Why yes, they do!

[Figure: Amazon Kindle one-star reviews]

It’s clear from these charts that the reviewers who didn’t purchase a Kindle are much more likely to give a one-star rating than the reviewers Amazon verified as having purchased one. With each Kindle release, the non-verified reviewers were consistently four times more likely to give a one-star review than the Amazon Verified Purchasers, the ones who actually bought a Kindle. What’s up with that?

Let’s look at the reviews from the verified purchasers. The percentage of one-star ratings doubles from 2% for the Kindle 1 to 4% for the Kindle 2, and then edges up to 5% for the Kindle DX. Even so, a rise of three percentage points across three models is very weak support for Bilton’s claim that Kindle owners are getting progressively less happy.
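The comparison above boils down to grouping reviews by Kindle model and by the Amazon Verified Purchase flag, then taking the share of one-star ratings in each group. Here is a minimal Python sketch of that step. The record layout and the handful of sample rows are hypothetical stand-ins, not the actual review data; only the grouping logic is the point.

```python
from collections import defaultdict


def one_star_share(reviews):
    """Return {(model, verified): share of one-star reviews} per group."""
    totals = defaultdict(int)
    one_stars = defaultdict(int)
    for model, verified, stars in reviews:
        key = (model, verified)
        totals[key] += 1
        if stars == 1:
            one_stars[key] += 1
    return {key: one_stars[key] / totals[key] for key in totals}


# Hypothetical sample rows: (model, is_verified_purchase, star_rating).
sample_reviews = [
    ("Kindle 2", True, 5),
    ("Kindle 2", True, 4),
    ("Kindle 2", True, 1),
    ("Kindle 2", True, 5),
    ("Kindle 2", False, 1),
    ("Kindle 2", False, 5),
]

shares = one_star_share(sample_reviews)
for (model, verified), share in sorted(shares.items()):
    label = "verified" if verified else "non-verified"
    print(f"{model} ({label}): {share:.0%} one-star")
```

Dividing the non-verified share by the verified share for each model gives the "times more likely" figure quoted above.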

What about the reviewers who are happy to very happy with the Kindle, the four and five star reviewers? Once again, the non-verified Kindle reviewers provide consistently lower ratings than the reviewers who actually own a Kindle. And once again we see the trend of the non-verified reviewers liking each new version of the Kindle less than the previous one. The four and five star ratings for actual owners of the Kindle jibe with Mr. Godin’s claim that the early adopters are more likely to be enthusiasts than those late to the game.

[Figure: Four & five star Amazon Kindle reviews]

So there you have it, Mr. Godin’s hunches are correct!

What’s most interesting to me, though, is the fact that 75% of reviews of the Kindle aren’t made by people who own a Kindle. In my next post on this subject we’ll hear from a good friend of mine, and text mining expert, Marc Harfeld. We’ll mine the text of the 15,000 customer reviews looking for differences in the words used between the verified and non-verified Kindle owners. Perhaps that will shed light on this mystery. We’re also going to weight the reviews by the number of people who told Amazon that they found the review helpful. You’d think that a review that was helpful to 1 out of 3 people is different than a review that was found helpful by 18,203 out of 19,111 people, like this one.
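One possible way to do the helpfulness weighting, sketched in Python: treat each review’s helpful-vote count as its weight when averaging star ratings. This is an assumption about the scheme, not the method the follow-up post will necessarily use. The two vote counts come from the paragraph above; the star ratings attached to them are hypothetical.

```python
def weighted_mean_rating(reviews):
    """Average star rating, weighting each review by its helpful-vote count."""
    total_weight = sum(helpful for _, helpful, _ in reviews)
    if total_weight == 0:
        raise ValueError("no helpful votes to weight by")
    weighted_sum = sum(stars * helpful for stars, helpful, _ in reviews)
    return weighted_sum / total_weight


# (star_rating, helpful_votes, total_votes)
# Vote counts are the two quoted in the post; star ratings are hypothetical.
reviews = [
    (1, 1, 3),
    (5, 18203, 19111),
]

unweighted = sum(stars for stars, _, _ in reviews) / len(reviews)
weighted = weighted_mean_rating(reviews)
print(f"unweighted mean: {unweighted:.2f}, weighted mean: {weighted:.2f}")
```

With weights this lopsided, the heavily endorsed review dominates the average, which is exactly the effect the weighting is meant to capture.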

Lastly, we’d love to hear suggestions from you on other next steps we might take with this analysis.

Thanks for reading.

October’s Real Unemployment nears 18%, over 25MM Americans jobless. Oodles more insights in this award-winning Dashboard

My dashboard is updated with October’s unemployment data.

I’ve found some interesting commentary about October’s non-seasonally adjusted numbers in the business section of the NYTimes, by Floyd Norris, here.

[Figure: Unemployment in the US, Oct 2009]


The Best Insights into U.S. unemployment, revealed in this Dashboard

At precisely 8:30am, on the first Friday of each month, the Bureau of Labor Statistics releases its Employment Situation report, the most anticipated report for stock, bond, and currency traders in the world. The report is analyzed by a wide variety of sources like CNN, WSJ, Bloomberg, NYTimes, Economy.com, AP, and MSNBC.

The Employment Situation report is critical because it covers the single most important factor in the world’s economy, employment in the U.S. Put simply, if U.S. consumers are losing their jobs, spending will decrease. And since household spending accounts for more than two-thirds of the U.S. economy, any change in spending will have an impact on the rest of the world’s economy.

The Employment Situation report is important for another reason. According to Bernard Baumohl, author of the book, The Secrets of Economic Indicators, “Experts have a difficult time trying to predict the unemployment figures because so little other information is out yet for that month.”

With so much riding on this one report, the Business Intelligence Guru thought it the perfect area to apply his information visualization and analytical skills. After all, the data released by the Bureau of Labor Statistics are pretty lifeless–just a bunch of numbers in twenty different data tables. Trying to identify trends in such raw form data is difficult and time consuming. When high quality info viz is properly applied to such data, however, the fog lifts and insights come shining through.

The BLS tables contain different looks at employment and unemployment like:

  • Employment status by sex and age
  • Employment status by race, sex, and age
  • Employment status by education level
  • Unemployment by reason for unemployment
  • Unemployment by duration of unemployment
  • Average weekly hours of work
  • Average earnings (hourly/weekly) by type of industry
  • Monthly changes in employment

The challenge and opportunity here is to provide a clear, consolidated, and insightful view of related and relevant data from the BLS. The Employment Situation report for July 2009 contains nearly 1,000 words. The data tables in the report add approximately 300 data points to the document. But neither the text nor the web version of the report on the BLS website contains a single graph. It doesn’t take a Business Intelligence Guru to see that this is a ripe opportunity for a well-designed dashboard. And so, the Business Intelligence Guru presents you with the “Insights into Unemployment in the United States” dashboard for July 2009.

[Figure: The Business Intelligence Guru's Dashboard of U.S. Unemployment]

I intend to update this dashboard the first Friday of each month, shortly after the BLS releases the report, so check back then for timely updates.

Lastly, I’m always on the lookout for ways to improve my work, so feel free to leave suggestions and criticism.

Thanks.



Do you know the simplest, yet most overlooked lesson of Business Intelligence?

Below is a data set with 4 groupings of data and 2 columns for each grouping. The summary statistics (mean, variance, correlation, sum of squares, r², and linear regression line) are the same for all 4 groupings of X and Y values. If we stopped our analysis here we could move forward confidently knowing that the 4 groups of data are the same. And we’d be dead wrong.

[Table: Anscombe's quartet (caption: visualize these data)]

In my 15 years in analytics I’ve seen good analysts, time and again, stop their analytical efforts when their data summaries don’t tell a compelling story. I’ve sat through hours of meetings, going through page after page of data related to critical financial forecasts, looking at historical trends going back years, without seeing a single graph to show a trend. For whatever reason, data exploration for many analysts starts and ends with a table of summary statistics describing the data. What a shame. In relying on summary statistics we give short shrift to one of our most powerful assets–our eyes.

To see what I mean, click here.

For years Edward Tufte and Stephen Few have been telling the BI community to “above all else, show the data.” Make your intelligence visible. Go beyond the summary look of your data and show it, warts and all. In fact, the Business Intelligence Guru recommends looking at graphic representations of your data before you even look at summary statistics. There are tools available today that make looking at graphic distributions of data easier than ever. I have years of experience using JMP (link will take you to a fully functional 30 day free trial), from SAS, which has a distribution engine that makes it a snap to look at distributions. Even SAS/GRAPH, with its new statistical graphics (SG) procedures in version 9.2, makes it a snap to view your data up close and personal.

Lastly, I didn’t invent the data that I’m using to make my point. I came across two references last week that made me think that I should write about it. I watched an info viz legend, Jeff Heer, tell a story making the case for info viz. I didn’t realize it then, but that story he told actually dated back to 1973 and also appeared on the first page of Chapter 1 in Edward Tufte’s book, “The Visual Display of Quantitative Information” published in 2001. The story goes to the heart of why we need to show the data.

The credit for this eye-opening example goes to F.J. Anscombe, a statistician who created this data set in 1973 to make the case for graphing data before analyzing data. He was a man ahead of his time.
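For readers who want to check the numbers themselves, here is a short Python sketch that computes the summary statistics for each of the four groupings. The values are Anscombe’s published data set, typed in here because the table above is an image; the function and variable names are my own.

```python
from statistics import mean

# Anscombe's quartet (F.J. Anscombe, 1973). Groupings I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean x, mean y, slope, intercept, correlation) for one grouping."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # least-squares regression slope
    intercept = my - slope * mx        # regression line passes through the means
    r = sxy / (sxx * syy) ** 0.5       # Pearson correlation
    return mx, my, slope, intercept, r


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (mx, my, slope, intercept, r) in summaries.items():
    print(f"{name}: mean x={mx:.2f}, mean y={my:.2f}, "
          f"y = {intercept:.2f} + {slope:.2f}x, r={r:.2f}")
```

Rounded to two decimals, the four lines of output are indistinguishable, which is the whole point: only a plot reveals how different the groupings are.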

Bar graphs with a non-zero baseline? “Never!” says Biz Intel Guru. Here’s why…

Trying to understand the economy is tough business. Publishing your predictions about the economy on the web is even more difficult. So I was surprised when I came across a paper on Economy.com’s website titled, “The Economic Impact of the American Recovery and Reinvestment Act” and noticed this chart.

[Figure: unemployment rate bar chart]

The graph in question was taken from page 13 of the paper, written by Mark Zandi. It’s also featured on his homepage, here. Dr. Zandi is the chief economist and co-founder of Economy.com, with a knack for verbally explaining complex things so clearly that non-economists can understand them. He is often heard on NPR and quoted in the WSJ and NYTimes weighing in on the economy. I’ve followed his career for over 15 years and respect his insights and success. It is out of that respect and admiration that I critique this graph.

The main problem with this bar chart is that it is telling two visual lies. The first one is quite serious, the second one, less so.

Bar charts must have a zero-based axis because we use the length of the bars to compare one bar to another. By breaking this rule, Economy.com’s unemployment rate chart makes it look like the unemployment rate will increase sixfold from 2008Q3 to 2010Q4 without the stimulus, when in fact the estimated increase is from roughly 6% to 11%, less than 2x. The lack of a zero baseline also creates a false visual comparison between the ‘economic stimulus’ and ‘no economic stimulus’ bars. To see it, look at the two bars in 2010Q4. The ‘no economic stimulus’ bar (blue) is about 11.2% versus about 8.5% for the ‘economic stimulus’ bar (black). The actual ratio between the two percentages is 1.3x, but judging by the length of the bars, the difference appears to be 2x.
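The arithmetic behind that distortion is simple enough to sketch in a few lines of Python. A bar’s drawn length is its value minus wherever the axis starts, so the ratio our eyes see depends on the baseline. The function name is my own, and the baseline of 5 is the Y-axis starting value on the original chart; the percentages are the approximate ones read off the chart above.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of two bar lengths when the value axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# 2010Q4 without stimulus (about 11%) versus 2008Q3 (about 6%).
true_growth = apparent_ratio(11.0, 6.0)                # zero baseline
drawn_growth = apparent_ratio(11.0, 6.0, baseline=5)   # axis starts at 5

# 2010Q4: no stimulus (about 11.2%) versus stimulus (about 8.5%).
true_gap = apparent_ratio(11.2, 8.5)
drawn_gap = apparent_ratio(11.2, 8.5, baseline=5)

print(f"growth: true {true_growth:.2f}x, drawn {drawn_growth:.2f}x")
print(f"gap:    true {true_gap:.2f}x, drawn {drawn_gap:.2f}x")
```

With a baseline of 5, the 6% bar is only one unit long and the 11% bar is six units long, which is where the sixfold impression comes from.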

I know Dr. Zandi had good intentions when he went with 5 as his starting value on the Y axis. His intent was to make the chart better show the trend over time, but in using a bar chart to display the data, he chose the wrong chart. What should he have used? Read on.

The second visual lie being told here is caused by the third dimension on the graph. Can we tell what the unemployment rate is expected to be in Q4 of 2010 with and without the stimulus? Looks to me like the no stimulus unemployment rate is expected to come in at 11.2% and the unemployment rate with stimulus is expected to be 8.5%. The angling of the Y axis makes it hard for the eye to track over to the value of the bar. To add insult to injury, the angle at the top of each bar makes it difficult to figure out where the ending value of the bar is. Should we reference the front side of the bar or the backside? Unfortunately, the corresponding data this graph is drawn from are not available from Economy.com, so we can’t tell for sure where the points are. But we can try a little experiment.

[Figure: 3d bar chart is misleading]

I whipped up the chart on the right using MS Excel 2007. The values for A, B, C, D are 10, 20, 30, 40 respectively. I’ve added the actual values to the top of each bar to make it a little easier to read. This 3D chart is actually insightful because it illustrates a serious problem with 3D charts: the bars misrepresent the data. Column D should line up with 40, but it doesn’t; it’s more like 38. If you’re telling a story as important as what’s going to happen to the economy after spending nearly $800 billion in taxpayer money, you should stay away from 3D bar charts because they tell lies about the data they represent.

And that brings us to the final flaw with this chart. Bar charts are generally best used for categorical or grouped data. For time-series data we usually want to go with a line chart, not a bar chart. The lines in the line chart help our eyes see trends in the data better than the individual bars in the bar chart. Line charts also allow us to start from a non-zero baseline, which lets the graph’s creator show the trend by setting the axis minimum slightly below the smallest data value and the axis maximum slightly above the largest.
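As a rough illustration of picking those axis limits for a line chart, here is a small Python helper. The function name and the 10% padding are my own assumptions, not a rule from any charting tool; the sample rates are approximate values quoted earlier in this post.

```python
def padded_axis_limits(values, pad_fraction=0.1):
    """Return (low, high) axis limits padded slightly beyond the data range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Flat series: fall back to padding relative to the value itself.
        span = abs(hi) or 1.0
    pad = span * pad_fraction
    return lo - pad, hi + pad


rates = [6.0, 8.5, 11.2]  # approximate unemployment rates quoted in this post
low, high = padded_axis_limits(rates)
print(f"y-axis from {low:.2f} to {high:.2f}")
```

Most charting tools let you set the axis range by hand, so the two numbers returned here can be plugged in directly. This only makes sense for a line chart; for a bar chart the lower limit stays at zero.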

Now let’s compare a non 3D bar chart to a line chart. Same data on each chart. I don’t have quarterly data in either graph, just yearly because the only hard data available in Dr. Zandi’s paper was yearly.

[Figure: Bar charts must have a zero baseline]

[Figure: the BI Guru's improved line chart for time series data]

I obeyed the cardinal rule of the zero baseline on the bar chart, and you can see that the magnitude of the difference between stimulus and non stimulus unemployment isn’t nearly as overstated as it was on the original chart. Even more important, the trend is much easier to grasp from the line chart than the bar chart. Notice how it just about leaps off the chart? With the bar chart, you need to go back and forth one or two times to discern the trend.

Lastly, I chose a soft, somewhat natural color palette to draw these charts. They’re much more pleasing to the eyes than black and blue.

–John

The Business Intelligence Guru

