Bits and Bytes
By BRIAN ANGLISS
Nothing beats a journalist’s intuition for something being “not quite right.” But sometimes you want more than just your gut feeling, especially when the feeling is coming from a new published paper, a study, or even just some data you've been collecting for a while. Here are some simple tools available in spreadsheet programs that can help you mathematically check if your gut feel is accurate or not.
The following assumes that you have access to MS Excel, 2003 or later. It’s also written from the perspective of a PC user, so adjustments will need to be made for those of you who use Macs instead. Open Office Calc should have similar functionality, but I haven’t verified it.
Before you can do any sanity checks on your data, you need to get it into your spreadsheet first. If your data isn’t already in Excel format, then it may be in .csv, .txt. or .dat format. Open Excel, click “Open,” and select “All Files” to see the non-.xls/.xlsx files. Don’t simply double-click the file, as it may bring up the file in a text editor instead of your spreadsheet program.
Vertical lines indicate where Excel will set column breaks.
Notice that additional breaks are required between the
"Mar" and "Apr" data columns.
While .csv files will often open without additional effort, .txt and .dat files will often bring up a box that defines your choices on how the file should be divided up into columns. Many .txt and .dat files have varying column widths and/or explanatory text at the start and end of the file. These confuse Excel’s ability to automatically open the file, so you'll need to select “Fixed Width” instead. Click “Next” and bring up the data preview, an example of which is shown.
Scroll down until you find the data and then click on the white spaces between numbers to define the columns with vertical lines. You can also move existing lines around if you need to.
Once you’ve broken up the data the way you want it, click “Finish.” This will result in chopped-up explanatory text. Once you've opened the data you’re interested in, create a new worksheet where you can do your calculations and generate your graphs. Much scientific data (especially climate data) is done one row per year, one column per month. This row-oriented format is a problem for Excel since it functions best in column format, especially when creating graphs. There are several ways to switch the row data into column data, but one of the easier ones is to use the Excel “transpose” option.
Transpose is at the bottom center of the box.
To transpose row data into a column, select all the data in the row you care about and copy it by hitting Control “C”. Then go to the calculation worksheet, select the starting cell, then select “Paste Special” in either the Edit tab or by right clicking. One of the options in the box that comes up is “Transpose.” Select it and click “OK.” This will paste the data as a column starting at the cell you select and filling down the column.
A more graceful way to do this is with Excel’s “VLOOKUP” function, but it takes some time to get familiar with exactly how the function works.
Once you’ve got your data into a column, you can graph it by selecting Insert: Chart.
Click on the corner squares to drag in additional data.
Click on the box edge to drag to different data.
Here’s a couple of hints for easier graphing. First, when you’re selecting a lot of data, instead of dragging your cursor over all the data, select the first cell of the data, scroll down to the bottom of the data and then Shift-click the last cell of the data. This will select all the data between the two selected cells as the data you want to graph. Second, if you want to add more than one or two series of data, start by graphing a single data series. Click on the graph and then look at the cells of data. You’ll notice that the data’s name is surrounded by a green box, the Y-axis data is highlighted by a blue box, and the X-axis labels are surrounded by a purple box. If you have multiple columns of data you want on the same chart, you can click the square on the blue/green box corner and drag to include the other columns. Third, you can also make new charts of the same type as your original by copying the original chart, clicking on it, and then dragging the edge of the blue or green box until the new column of data is highlighted.
Once you’ve got your data on your graph, generating a trendline is easy. Click on the data on the graph, then right-click to bring up the formatting options. One of the options will be “Add Trendline.” Excel will give you several trendline options to choose between, as well as the option to display the trend’s equation and “R-squared” (R2) value on the chart. You’ll use the linear and polynomial trends most often, and you can add both by clicking “Add Trendline” twice. You should also display both the equation and the R2 value since they're what will tell you whether or not the data you’re sanity-checking makes sense.
If the data you’re working with comes from a scientist, you can compare your trends and R2 values against what is published in the paper. If your trend and/or R2 are close to what the paper says, then you can have some confidence in the results. If your numbers are significantly different, however, that’s a clue that something may not be right, and if you double-check your numbers and they’re still different, then it might be time to dig deeper into the paper or find expert help.
The reason you check both linear and polynomial trends is to see if there's a major difference in the R2 value between the two.The polynomial curve’s R2 value should always be higher than the linear trend’s R2, but if there is a big difference between the two (0.05 or more), then a linear approximation probably isn’t valid. Alternatively, if an accelerating trend is expected but the difference between the linear and polynomial trends’ R2 is 0.01 or less, then a linear approximation is more valid. In either case, if you’re sanity-checking someone else’s data and they’re claiming a trend that seems invalid from your check, then that’s a red flag. Don’t use polynomial trends with an order greater than two, however, as it’s easy to misinterpret the meaning of higher order polynomial trends.
The absolute value of R2 is as important as how the relative values of R2 for both types of trend are. Any R2 value that’s less than about 0.35 can be a result of random chance, so any conclusions based on such a low R2 should be well explained. The higher the R2 value is, the more certain it is that the trend is real.
While I’ve assumed data vs. time thus far, it’s also useful to graph one set of data vs. another, such as in the case of global temperature anomaly vs. atmospheric CO2 concentration. To do this, you generate a “scatter” chart (instead of a “line” chart) with CO2 on one axis and temperature on the other. Then you can generate both trends and the R2 values for comparison. In the case of CO2 and temperature, the positive trends and the high R2 values indicate that there’s a pretty strong correlation between high CO2 concentrations and higher temperature anomaly.
Cross-correlation of temp anomaly with CO2 concentration for CO2 lagging temp by 8 years. Note:1890 through 2000 are hidden for convenience.
Finally, while correlation does not prove causation, there is a way to check if causation is likely via correlations. To test causation, you run a set of what are known as “cross-correlations” between two sets of data. We do this with the Excel function “CORREL” where we select the two data series. In order to check if a causal claim is reasonable, you can run cross-correlations with different delays between one set of data and another. You delay one set of data by starting it a certain number of cells late and finishing the other set of data the same number of cells early. Square the correlation result to get the R2 value.
Be careful with cross-correlations. First, you can get in trouble fast if you’re just guessing about a correlation and don’t have some understanding of what you’re correlating (think of the high correlation between global warming and the drop in number of pirates). Second, CORREL assumes a linear relationship between the datasets you’re cross-correlating, so it can give wrong answers when there’s a non-linear relationship. But CORREL is still a good starting point for figuring out where you need to ask more questions.
Spreadsheet programs can be very powerful tools for quickly and easily checking claims based on statistical analyses of data. While I used a climate-based example, the same approach can be taken with regard to nearly any source of data, and I’ve used similar analyses to sanity-check health and economic data in the past.
* From the quarterly newsletter SEJournal, Summer 2011 issue.