Members of the LinkedIn group “Research, Methodology, and Statistics in the Social Sciences” have shown a growing interest in research topics such as rainfall, extreme temperatures and rising sea levels. While many group members regularly discuss questions of inferential statistics in other academic fields, even a basic look at climate and temperature research challenges our own knowledge of statistics. Recently, a member of the group posted the following question: “Can t-test be used to test/establish the statistical significance of maximum and minimum temperature data measured on twenty (24) hour basis over five(5) years?”. Group members were split as to whether Student’s t-test is the proper test for the analysis. I agree with the members who stressed that the data violate the basic assumptions of the t-test. Therefore, I suggest the non-parametric Mann-Whitney test for comparing the difference of location between the samples.
In this blog post, I first show that the temperature data for “San Andres” Island form a time series with neither a trend component nor a cycle component; nevertheless, the data exhibit a heavy seasonal component, hence the ambiguity as to which method one should use. Second, I show that the temperature data for “San Andres” Island do not follow a normal distribution; instead, they fit a Fisher-Tippett (2) distribution, which seems to agree with other temperature-related studies. Third, I show that a Mann-Whitney U test leads to the conclusion that maximum and minimum temperatures in “San Andres” show no difference in location when measured on an hourly basis.
The data were originally organized as follows.
At first glance, it is clear to any analyst that the data form a time series. The natural first step is to rearrange the data into two columns, which explains the group members’ intuition for running a t-test. I did rearrange the data into two columns for a basic inspection of their graphical behavior. Visual inspection reveals a time series with a clear fixed seasonal fluctuation, a not-so-clear long-term trend, and no clear cycle. Given the strong seasonal fluctuation, the first idea that comes to mind is to seasonally adjust the original series. The graph below shows the “raw” data.
Using a basic additive decomposition, I removed the seasonal part of the model by taking the difference at lag 24, that is, by subtracting from each hourly observation the value recorded 24 hours earlier. I labeled the seasonally adjusted series Maxt-24 and Mint-24. The partial autocorrelation graphs show the dependency of the time series before and after applying the lag-24 difference operator.
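The lag-24 difference described above can be sketched in a few lines of Python. This is only an illustration on synthetic hourly data, not the San Andres series; the function name `seasonal_difference` and the simulated values are assumptions made for the example.

```python
import numpy as np

def seasonal_difference(series, lag=24):
    """Return y[t] - y[t - lag]; the first `lag` observations are dropped."""
    values = np.asarray(series, dtype=float)
    if values.ndim != 1 or values.size <= lag:
        raise ValueError("need a 1-D series longer than the lag")
    return values[lag:] - values[:-lag]

# Synthetic hourly data (an assumption, NOT the real measurements):
# a fixed daily cycle around 30 degrees plus random noise, over 60 days.
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
daily_cycle = 3.0 * np.sin(2 * np.pi * hours / 24)
max_temp = 30.0 + daily_cycle + rng.normal(scale=0.5, size=hours.size)

# Seasonally adjusted series, analogous to Maxt-24 in the text.
max_t24 = seasonal_difference(max_temp, lag=24)

print("observations before and after:", max_temp.size, max_t24.size)
print("mean of the adjusted series:", round(float(max_t24.mean()), 3))
```

Because the simulated daily cycle repeats exactly every 24 hours, the difference removes it completely and leaves a series that fluctuates around zero, at the cost of losing the first 24 observations.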
The descriptive statistics of the unadjusted and the seasonally adjusted data are as follows.
The first decision to make is whether to work with the seasonally adjusted data or not. I ran the Augmented Dickey-Fuller unit root test on all variables. The test indicated no trend in any of the variables, so the only advantage of working with the seasonally adjusted data is that the seasonal component has been removed. The chart below shows the results of the stationarity test, while the graphs show how the additive decomposition eliminated the fixed daily fluctuation.
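To make the idea behind the unit root test concrete, here is a deliberately simplified sketch. It is the plain (non-augmented) Dickey-Fuller regression with a constant, written by hand with NumPy, so it is not the exact procedure used for the results in this post; a library routine such as `adfuller` in statsmodels would add lagged differences and report p-values. The function name and the simulated series are assumptions, and -2.86 is the usual asymptotic 5% critical value for the constant-only case.

```python
import numpy as np

# Asymptotic 5% Dickey-Fuller critical value (regression with constant, no trend).
DF_CRITICAL_5PCT = -2.86

def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(dy.size), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (dy.size - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(coef[1] / np.sqrt(cov[1, 1]))

# Synthetic examples (assumptions): white noise is stationary,
# while a random walk contains a unit root.
rng = np.random.default_rng(1)
stationary = rng.normal(size=2000)
random_walk = np.cumsum(rng.normal(size=2000))

stat_stationary = dickey_fuller_stat(stationary)
stat_walk = dickey_fuller_stat(random_walk)

for name, stat in [("stationary", stat_stationary), ("random walk", stat_walk)]:
    print(name, round(stat, 2), "reject unit root:", stat < DF_CRITICAL_5PCT)
```

A statistic below the critical value rejects the unit root hypothesis, which is the sense in which a series is said to show no stochastic trend.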
In my opinion, the seasonally adjusted series are more suitable for the comparison, since neither trend nor seasonality is present in them. A cycle still persists that may affect further analysis; however, for the purpose of a basic sample comparison it may not pose a problem. Suggestions and comments are welcome in this regard.
Once the data showed variation around zero, the next step was to check the normality of the data. A Jarque-Bera normality test was run on all four variables, and none of them showed a normally shaped probability density function. The table below shows the results.
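For readers who want to reproduce this kind of normality check, the following is a minimal sketch using `scipy.stats.jarque_bera`, with the statistic also computed by hand from the sample skewness and excess kurtosis. The samples are simulated and the helper name `jarque_bera_manual` is an assumption; none of this is the actual data behind the table.

```python
import numpy as np
from scipy import stats

def jarque_bera_manual(sample):
    """JB = n/6 * (S^2 + K^2/4), with S the skewness and K the excess kurtosis."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    excess_kurtosis = np.mean(centered ** 4) / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# Synthetic samples (assumptions): one normal, one right-skewed.
rng = np.random.default_rng(2)
normal_sample = rng.normal(size=2000)
skewed_sample = rng.gumbel(size=2000)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.jarque_bera(sample)
    print(name, "JB:", round(float(statistic), 2), "p-value:", float(p_value))
    print(name, "JB by hand:", round(float(jarque_bera_manual(sample)), 2))
```

A small p-value rejects the null hypothesis of normality, which is the outcome reported above for all four variables.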
At this point of the analysis, a quick Google search points to a number of articles on temperatures. I must add the caveat that I am not an expert on climate or temperatures. However, some of the papers one can find on Google suggest that temperatures follow a Fisher-Tippett (2) distribution, which stems from extreme value theory. Therefore, I ran a Kolmogorov-Smirnov test under the hypothesis that the seasonally adjusted data follow a Fisher-Tippett (2) distribution. The test does not reject that hypothesis, so the results are consistent with the data following such a distribution.
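A goodness-of-fit check of this kind can be sketched with SciPy as follows. The sketch assumes that the Fisher-Tippett (2) distribution corresponds to the two-parameter (location and scale) Gumbel family, which SciPy exposes as `gumbel_r`; that mapping, the variable names and the simulated sample are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a seasonally adjusted series (an assumption).
rng = np.random.default_rng(3)
adjusted = rng.gumbel(loc=0.0, scale=1.0, size=2000)

# Estimate location and scale by maximum likelihood.
loc_hat, scale_hat = stats.gumbel_r.fit(adjusted)

# Kolmogorov-Smirnov test against the fitted distribution.
statistic, p_value = stats.kstest(adjusted, "gumbel_r", args=(loc_hat, scale_hat))

print("fitted location and scale:", round(float(loc_hat), 3), round(float(scale_hat), 3))
print("KS statistic:", round(float(statistic), 4), "p-value:", round(float(p_value), 4))
```

A large p-value means the test does not reject the hypothesized distribution. One caveat worth keeping in mind: when the parameters are estimated from the same sample, the standard Kolmogorov-Smirnov p-value tends to be too generous, so a non-rejection should be read with some caution.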
On the basis of the Kolmogorov-Smirnov test, I treat both seasonally adjusted series as independent and identically distributed Fisher-Tippett (2) variables. That being said, the comparison of samples becomes a non-parametric location test whose null hypothesis is that the difference of location between the samples equals 0, while the alternative hypothesis is that the difference of location is greater than 0.
From t-test to non-parametric Mann-Whitney / U test:
The test that I suggest for comparing the samples is the Mann-Whitney U test. This non-parametric test makes it possible to assess whether two samples are drawn from the same population without assuming normality; it compares their locations through ranks rather than through means. Given that the Kolmogorov-Smirnov test and a couple of climate studies support the claim that temperature data follow the Fisher-Tippett (2) distribution, running the Mann-Whitney U test seems the most reasonable alternative to the Student’s t-test initially proposed by the group members.
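As a rough illustration of the suggested test, the sketch below runs a one-sided Mann-Whitney U test with `scipy.stats.mannwhitneyu`, matching the alternative hypothesis that the difference of location is greater than 0. The two samples are simulated, and the names `max_adj`, `min_adj`, `shifted` and `location_test` are assumptions; the output says nothing about the real temperature series.

```python
import numpy as np
from scipy import stats

def location_test(x, y, alpha=0.05):
    """One-sided Mann-Whitney U test: is x shifted to the right of y?"""
    statistic, p_value = stats.mannwhitneyu(x, y, alternative="greater")
    return float(statistic), float(p_value), bool(p_value < alpha)

# Synthetic stand-ins for the seasonally adjusted series (assumptions).
rng = np.random.default_rng(4)
max_adj = rng.gumbel(loc=0.0, scale=1.0, size=500)
min_adj = rng.gumbel(loc=0.0, scale=1.0, size=500)
shifted = max_adj + 1.0  # same shape, location moved up by one unit

for name, sample in [("same location", max_adj), ("shifted by 1", shifted)]:
    u_stat, p_value, reject = location_test(sample, min_adj)
    print(name, "U:", u_stat, "p-value:", p_value, "reject H0:", reject)
```

Failing to reject the null hypothesis corresponds to the conclusion that the two series show no difference in location. The test assumes that the observations are independent, which is one reason for working with the seasonally adjusted series.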