Posts Tagged ‘data’
Without Context, Data is Meaningless
Friday, November 21st, 2008
Without context, answers, facts and datum are meaningless. “The ultimate answer to life, the universe and everything” is undeniably 42, however, without the question, lost in an unfortunate galactic construction project, it’s useless.
Wikipedia is an awesome collection of facts, some even with context but, there are many areas, despite the best efforts of the contributors, that lack context. For example, Vincent Van Gogh has a wonderful entry detailing much of his life and explaining the inspirations of much of his works. However, a single link to the post-impressionism article and a few brief mentions of his contemporaries praise and admonitions provide context to his work. This is not a fault with the user driven model, but instead of the toolset. The unstructured nature of a Wiki doesn’t lend itself to comparative analysis - which is of course just one way to provide context to information.
Quantitative information alone is of course insufficient. What if we were to measure the sum of all works by Van Gogh. What measure would we use mass, dimension, net worth - a convoluted formula of all of the above? I am sure the answer would be 42. It’s absurd to compare everything using quantity alone. Qualitative analysis is a critical component in all evaluations. In this area Wikipedia’s model excels. Because the work is living, it can reflect the current thinking on a topic and through the edit history give us context of how perceptions have changed. Unfortunately, it still requires human beings to understand the changes and to provide the perspective - perhaps some graduate student is working on an automated system to solve this problem as their final thesis.
But if we leave the world of art and move back towards less abstract concepts that are easier to quantify, wikipedia fails. The inability to compare and contrast two articles of similar types, such as corporations leaves Wikipedia lacking. Consider Dell and HP. Both are publicly traded companies, both have Wikipedia entries and both have massive amounts of qualitative and quantitative data available about them. To look at the two companies side by side, we have to look beyond the Wiki walls and move to a more data oriented set of information. Google for example does a great job of comparing the corporate stock prices and a whole host of other providers give different toolsets for interacting with the publicly available financial data.
I think there should be a world where these two universes collide, empowering the public to contribute to the qualitative components of data sets in a wiki model, and also have comparative tools that act on the quantitative data.
Make Sure Your Data’s Right!
Monday, November 3rd, 2008Two widgets, both provided by major companies have some discrepancies in their data and the election hasn’t even started yet! I’m not sure who’s wrong here, Google or Microsoft, but either way someone’s data isn’t accurate. When it comes to something like an election - accuracy is very important. As best I can tell, Hawaii is the last poll to close and from the even hour discrepancy, it seems to me someone didn’t account for daylight savings or a timezone somewhere.
| Screen grabs taken at the same time | |
|---|---|
|
|
The live widgets for both companies are below. (more…)
Finding Good Data
Friday, October 17th, 2008
The amount of data available of publicly online is astounding. The US Government has done a pretty good job of providing detailed data to it’s citizens on pretty much anything they keep tabs on. The census data immediately comes to mind, but there is much much more if you take the time to look for it. I stumbled across the USASearch.gov search engine while looking for some population data for cities. The search engine is not as accurate as Google unless you know the government lingo for your data, but it’s pretty good. I suspect there’s a data set there somewhere that would enhance any businesses results and the only cost to you is figuring out how it applies to what you do.
Presenting Data Visually
Thursday, April 24th, 2008Anyone who’s immersed in the world of data and its presentation knows the value of a well designed and laid out graphic can lead to a project. Currently application developers have two primary tools at their disposal, Adobe Flex and the Google Chart API for making dynamic graphics. Silverlight may offer some similar features, but I’ve no experience building with it. Additionally, Java is also very capable, but when you see how simple the Google product is, you’d need a very strong reason to consider running anything like this in house again. As such I have left them out of the discussion.
I’ve used Flex for some simple reporting graphs. The real power in Flex lays in it’s ability to create entire interfaces, even entire websites entirely in a rich multimedia capable environment. I’ve been intoxicated with the raw power to manipulate the data in real time and interact in a way not easily possible on the web before. However, all of this power comes at a steep price, while the core SDK is free - for time pressed developers - the IDE environment saves a huge amount of effort. Because of it’s great power, sometimes using Flex is a little like bringing a machete to the dinner table - a little bit more power than you really need to cut the vegetables with. This is where Google Charts comes in.
| Group | Male | Female |
|---|---|---|
| 0 | 52 | 38 |
| 1 | 48 | 34 |
| 2 | 61 | 40 |
| 3 | 77 | 63 |
| 4 | 79 | 89 |
| 5 | 120 | 102 |
| 6 | 138 | 125 |
| 7 | 125 | 154 |
| 8 | 172 | 155 |
| 9 | 156 | 173 |
| 10 | 193 | 193 |
| 11 | 217 | 189 |
| 12 | 291 | 293 |
| 13 | 300 | 345 |
| 14 | 292 | 321 |
| 15 | 297 | 333 |
| 16 | 321 | 348 |
| 17 | 365 | 366 |
| 18 | 364 | 337 |
| 19 | 327 | 306 |
| 20 | 243 | 248 |
| 21 | 161 | 179 |
| 22 | 122 | 110 |
| 23 | 74 | 68 |
Way back when in December of 2007, Google anounced the Google Chart API. This revolutionized graphing for developers, although I suspect many may not realize it yet! Lets start our exploration with some fake data and see just what we can do with the API. We’ll use a small slice of some new user signup data I have at my disposal to get our hands dirty with.
Generating this table is no special feat, some simple database queries against your own internal data and presto, you’ve got a column of numbers too. But that’s where the fun begins. Consider this table for a minute, it’s pretty clear there’s a normal distribution curve there, but are they the same? Are there more men or women - or is it nearly equal? Are there other things this data could be telling us?
Let’s start with the marketing department’s questions. Are we attracting more men or women? If we sum the numbers in the table together, we get 4,595 men and 4,609 women. We can create a pie chart with these numbers so folks can look at a glance and see what that would look like.
http://chart.apis.google.com/chart? cht=p& chs=250x100& chd=t:4595,4609& chl=Male|Female& chco=0000ff,ff0000
The line break is for clarity only. To walk through the pieces, the cht is simply the chart type, p for pie (p3 is a 3d pie), chs is a width x height value, chd is the data, chl are the labels and chco are the colors. Interestingly, you can actually skip the labels on and still generate a graph, but obviously it becomes much clearer when you provide the labels.
Now lets look at the curves - obviously we have a peak during the day and lows at night, but just what does that look like and is it the same for men and women? Here we can use the bar chart.
http://chart.apis.google.com/chart? cht=bvg& chs=630x150& chbh=10,0,5& chco=0000ff,ff0000& chg=101,33.3& chxt=x,y& chxl=0:|12a|1a|2a|3a|4a|5a|6a|7a|8a|9a|10a|11a|12p|1p|2p|3p|4p|5p|6p|7p|8p|9p|10p|11p|1:|0|125|250|375& chd=t:13.9,12.8,16.3,20.5,21.1,32.0,36.8,33.3,45.9,41.6,51.5,57.9,77.6,80.0,77.9,79.2,85.6,97.3,97.1,87.2,64.8,42.9,32.5,19.7|10.1,9.1,10.7,16.8,23.7,27.2,33.3,41.1,41.3,46.1,51.5,50.4,78.1,92.0,85.6,88.8,92.8,97.6,89.9,81.6,66.1,47.7,29.3,18.1&
Now with this chart, there’s quite a bit more going on. Google recommends that for larger data sets you use their special encoding system, but in order to keep this simple I went ahead and used the decimal encoding so it would be easier to understand. For those who are ready for that next step, take a look at the docs for the API, there’s excellent pseudo code there for how to transition to the encoding method they use. cht still defines the chart type, in this case a horizontal bar chart with grouped data. chs is the size again; read the docs carefully if you decide to implement one of these. The bar chart can extend past your chosen size and you’ll simply crop your data. chbh comes in handy here by letting me shrink the size of the bars to 10 pixels and removing lots of padding. Last but not least chco sets the colors bringing us to the really new stuff.
For bar charts the data has to be limited into a range of 0-100. If your maximum data point is less than 100, you’re in luck, just feed the data in place of this method below. However, if your data doesn’t live in the realm of 0-100, you’ll need to consider doing some more work. The simplest method is to create a percentage (some PHP code is below outlining how to do that). The X and Y axis are blank by default with a bar graph. There are a number of methods for making them display but I find this method the most logical. If your data is broken out by percentages already, using the default Y axis value is fine, however, in our example, it’s not. So we’re going to pass two parameters. First chxt=x,y, what this does is tells the API that the first label set I pass in is for the X axis, and the second is for the Y. Now we can pass in the values we want to display on the X and Y axis. chxl defines our labels using position indexes to reference back to the values we had assigned in our chxt parameter. In our example, index 0 is X and index 1 is Y - notice they’re bolded: chxl=0:|label1|label2| … |labeln|1:|label1|label2| … and so on. The last parameter I skipped over from before is chg which sets up the grid based on a percentage. I passed in 101% spacing for the vertical bars along the X axis, and 33.3% for the horizontal bars because they lined up with my increments.
For those unsure how to generate the percentages in PHP, it’s very simple. The first thing we’ll do is loop over our dataset to determine what our largest value is. Then, instead of passing in the raw values as we did with the pie chart above, we’ll actually pass in their percentage of the maximum number. If your like me, you might want to step the scale up to the next logical number. I did that on my graph, but did not do it in the code below - I’ll save that for another blog post. Then you can at last loop over the data one more time to create the actual data strings. Once it’s all done, you’ll have a single string you can drop into your image call.
$result = $conn->query($select);
// get the maximum value in our data
$max = 0;
while($row = $result->fetch_assoc()){
if($row['Male'] > $max){ $max = $row['Male']; }
if($row['Female'] > $max){ $max = $row['Female']; }
}
$result->data_seek(0);
// generate the two datasets
$data_male = "";
$data_female = "";
while($row = $result->fetch_assoc()){
if(strlen($data_male) == 0){ $data_male .= ","; $data_female .= ","; }
$data_male .= round($row['Male']/$max,1);
$data_female .= round($row['Female']/$max,1);
}
// Concatenate the strings to complete the data parameter
$data = $data_male . "|" . $data_female;
There are so many different graph types available, I obviously can’t share them all in this already rather long post. I highly encourage you to read over the API. It’s very simple to understand and extremely easy to use. Google only requests that if you’ll use more than 250,000 graphs a day, you let them know.

