We used data mining techniques and climate data to predict the presence of the native pecan, Carya illinoensis, at locations in the United States and the world.
We obtained data on the geographic distribution of pecans from USGS.
We tried several algorithms for predicting pecan distribution from climate data, and they predicted the distribution with varying accuracy. The results are summarized in the table below, in order of increasing accuracy; the measures of accuracy are defined after the table:
|Model and Description ||Initial % agreement ||Cross-validated % agreement
|ZeroR: This is the "null" model. It predicts presence or absence of pecans based on the simplest rule: since most places don't have pecans, assume all places don't have pecans. It provides a baseline for comparing the other models.
|OneR: This is the simplest real model. It predicts presence or absence of pecans based on one climatic factor (aka an attribute): the one climatic factor that best predicts all the instances. The resulting classification model can be interpreted in ecological terms.
|J48 with small tree: This is a simplified "decision tree" model. It constructs a decision tree based on the most useful values of the most useful climatic factors. This complex tree is then simplified to make it run faster. The resulting classification model can be interpreted in ecological terms.
|J48 with full tree: This is a full "decision tree" model. It constructs a decision tree based on the most useful values of the most useful climatic factors. The resulting classification model can be interpreted in ecological terms.
|JRip using only raw climate factors: This is a rule-based model. It constructs a list of rules for where pecans will/won't be found based on the most useful values of the most useful climatic factors. This set of rules was built using only the raw climate factors, not the interactions (all others were built using the full data set). The resulting classification model can be interpreted in ecological terms.
|LB1: This is a "lazy" classifier. When classifying a location as "having pecans" or "not having pecans", it looks in the 103-dimensional space defined by the known locations and finds the nearest neighbor to the new location. If that neighbor has pecans, then the model predicts that the site will also have pecans. The resulting classification model is not interpretable in ecological terms.
|JRip: This is a rule-based model. It constructs a list of rules for where pecans will/won't be found based on the most useful values of the most useful climatic factors. The resulting classification model can be interpreted in ecological terms.
|LB3: This is another "lazy" classifier. It is just like LB1 except that it makes predictions based on the three nearest neighbors to the input location. Like LB1, it is not possible to interpret the classifier model in ecological terms.
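The nearest-neighbor idea behind LB1 and LB3 can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the function name `knn_predict`, the two-attribute stations, and their values are all made up, and the attributes are left unscaled for brevity (a real run would normally normalize them so that elevation does not dominate the distance).

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Predict a label for `query` from its k nearest known locations.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    # Rank known locations by Euclidean distance to the new location.
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Majority vote among the k nearest labels (k=1 is plain nearest neighbor).
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stations: (warmest-month temperature in C, elevation in feet) -> pecan flag.
stations = [
    ((27.0, 200.0), 1),
    ((26.8, 250.0), 1),
    ((27.5, 150.0), 1),
    ((18.0, 4000.0), 0),
    ((20.0, 3500.0), 0),
    ((26.9, 210.0), 0),  # an atypical pecan-free site surrounded by pecan sites
]
query = (26.9, 215.0)

one_nn = knn_predict(stations, query, k=1)    # follows the single closest station
three_nn = knn_predict(stations, query, k=3)  # outvotes the atypical neighbor
```

With these made-up stations, the single nearest neighbor is the atypical pecan-free site, while the three-neighbor vote sides with the surrounding pecan sites. This is one plausible reason a three-neighbor classifier can be less sensitive to unusual individual stations.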
- Initial % agreement: This is the percent of the time that the prediction of the model agreed with the actual pecan distribution. You would expect this to be high, since the model is being evaluated on the same data that was used to build it.
- Initial Kappa: Kappa compares the actual % agreement with the "null hypothesis": the expected percent agreement if the classifier made random predictions with the same frequency of pecan/no-pecan as the real data. It is 0 when agreement is no better than chance and 1 for perfect agreement (negative values mean agreement worse than chance). As with the initial % agreement, you'd expect this to be high since the classifier is being evaluated on the data that trained it.
- Cross-validated % agreement: This is a more realistic estimate of the accuracy of the classifier (note that it is always lower than the initial values). Here, the software reserves 10% of the sample and trains on the remaining 90%. The model trained on the 90% is then used to classify the "unseen" 10%. This process is repeated 10 times and the results averaged. This value gives an estimate for how effective that algorithm will be at generating a reliable classification model.
- Cross-validated Kappa: This is the Kappa calculated using the 10%/90% method described above.
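The measures above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the software used in the study: the function names are hypothetical, kappa is computed in the standard Cohen form, (observed - expected) / (1 - expected), the folds are formed by taking every tenth row, and predictions are pooled across folds before scoring. The sample of 100 sites, 18 with pecans, is made up, and the classifier is a ZeroR-style majority-class rule.

```python
from collections import Counter

def percent_agreement(actual, predicted):
    """Percent of sites where the prediction matches the observation."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

def kappa(actual, predicted):
    """Cohen's kappa: agreement beyond what the class frequencies alone would give."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq, predicted_freq = Counter(actual), Counter(predicted)
    expected = sum(actual_freq[c] * predicted_freq[c] for c in actual_freq) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: a single class, nothing left to explain
    return (observed - expected) / (1.0 - expected)

def build_zero_r(train_features, train_labels):
    """ZeroR-style model: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda features: majority

def cross_validate(features, labels, build, folds=10):
    """Hold out each fold in turn, train on the rest, and pool the predictions."""
    predictions = [None] * len(labels)
    for fold in range(folds):
        held_out = set(range(fold, len(labels), folds))  # every `folds`-th row
        train = [i for i in range(len(labels)) if i not in held_out]
        model = build([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions

# Made-up sample: 100 sites, 18 with pecans; features are unused by ZeroR.
labels = [1] * 18 + [0] * 82
features = [None] * len(labels)

predicted = cross_validate(features, labels, build_zero_r)
agreement = percent_agreement(labels, predicted)
chance_corrected = kappa(labels, predicted)
```

On this sample the majority-class rule agrees with the data 82% of the time, yet its kappa is 0, because that level of agreement is exactly what the class frequencies alone would produce.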
From this, we concluded:
- Using % agreement can be misleading. Even the worst classifier (ZeroR) was right 82% of the time. This is because most sites in the US don't have pecans, so predicting that "no sites will have pecans" works deceptively well. It is important also to look at the kappa statistic.
- Using actual data improved the accuracy of the classifiers. Going from ZeroR to OneR to the more complex models increased accuracy as we would expect.
- The more data you include, the better the accuracy. Large trees did better than small trees; 3-nearest-neighbors did better than 1-nearest-neighbor.
Although JRip (the rule-based classifier) was not the most accurate (LB3 was the most accurate), it was the most accurate classifier that also gave results that we could interpret ecologically. We therefore used it in most of our analyses. Its rule set is shown below (the numbers in parentheses give the number of locations where this rule gives the correct/incorrect prediction):
- (MWM >= 26.5) and (BAR5 >= 14.6915) and (PTOAE >= 1.1925) and (ELEV <= 300) => Pecan=1 (82.0/4.0)
- (AE >= 652.4943) and (PTOAE >= 1.1295) and (WATDGRC <= 3) and (WRET >= 104.8334) and (ELEV <= 625) => Pecan=1 (72.0/4.0)
- (MWM >= 24.6) and (CVRAIN <= 44.3185) and (WSTORAGE >= 181.796) and (ELEV <= 1030) => Pecan=1 (165.0/50.0)
- (MWM >= 24.3) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (PTOWATR >= 10.8738) and (Site <= 1517) => Pecan=1 (51.0/4.0)
- (AE >= 622.0895) and (COKLM >= 506.9) and (EXPREY <= 520.5728) and (PTOWATR >= 8.7045) => Pecan=1 (59.0/13.0)
- (MWM >= 24.8) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (RLOW <= 46.74) => Pecan=1 (52.0/24.0)
- (MWM >= 27.22) and (RLOW >= 71.88) and (EXPREY <= 439.1472) and (WRET >= 102.7854) and (TEMP <= 56.0959) => Pecan=1 (15.0/1.0)
- (MWM >= 27.44) and (CVRAIN <= 34.6388) and (WSTORAGE <= 161.2) => Pecan=1 (77.0/37.0)
- otherwise: => Pecan=0 (4064.0/52.0)
What does this mean?
Let's look at one of the 9 rules in detail:
(MWM >= 26.5) and (BAR5 >= 14.6915) and (PTOAE >= 1.1925) and (ELEV <= 300) => Pecan=1 (82.0/4.0)
In this rule:
- MWM is the Mean Temperature in the Warmest Month (C).
- BAR5 is the Biomass Accumulation Ratio: the amount of net above-ground productivity added to standing biomass each year. Higher values indicate areas where we would find rapidly growing forests; low values could be slow-growing forests or grasslands.
- PTOAE is the ratio of Potential Evapotranspiration (PET) to Actual Evapotranspiration. Higher values mark warmer, drier settings where precipitation is not high enough to match PET.
- ELEV is the elevation of the weather station in feet.
So, this statement can be read as:
Where the Mean of the Warmest Month is greater than or equal to 26.5 deg C, and where Biomass Accumulation Ratio is greater than or equal to 14.69, and where the ratio of Potential to Actual Evapotranspiration is greater than or equal to 1.19, and where the Elevation is less than or equal to 300 feet, expect to find pecans.
In other words, pecans are found in warm locations where a moderate amount of the productivity accumulates as standing biomass (think tree trunks, branches, etc) in environments on the dry side and at low elevations.
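The way a rule set like this is applied to a location can be sketched as an ordered list of tests. The Python below is an illustration only: the thresholds are copied from the rule set above, but the data layout, the function name `predict_pecan`, and the two example stations are hypothetical. It also assumes the rules are tried in the order listed, with "otherwise" as the fallback.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Each rule is a list of (attribute, comparison, threshold) conditions that must all hold.
RULES = [
    [("MWM", ">=", 26.5), ("BAR5", ">=", 14.6915), ("PTOAE", ">=", 1.1925), ("ELEV", "<=", 300)],
    [("AE", ">=", 652.4943), ("PTOAE", ">=", 1.1295), ("WATDGRC", "<=", 3),
     ("WRET", ">=", 104.8334), ("ELEV", "<=", 625)],
    [("MWM", ">=", 24.6), ("CVRAIN", "<=", 44.3185), ("WSTORAGE", ">=", 181.796), ("ELEV", "<=", 1030)],
    [("MWM", ">=", 24.3), ("TRANGE", ">=", 24.7), ("RLOW", ">=", 25.91),
     ("PTOWATR", ">=", 10.8738), ("Site", "<=", 1517)],
    [("AE", ">=", 622.0895), ("COKLM", ">=", 506.9), ("EXPREY", "<=", 520.5728), ("PTOWATR", ">=", 8.7045)],
    [("MWM", ">=", 24.8), ("TRANGE", ">=", 24.7), ("RLOW", ">=", 25.91), ("RLOW", "<=", 46.74)],
    [("MWM", ">=", 27.22), ("RLOW", ">=", 71.88), ("EXPREY", "<=", 439.1472),
     ("WRET", ">=", 102.7854), ("TEMP", "<=", 56.0959)],
    [("MWM", ">=", 27.44), ("CVRAIN", "<=", 34.6388), ("WSTORAGE", "<=", 161.2)],
]

def predict_pecan(site):
    """Return 1 if any rule matches the site's climate attributes, else 0."""
    for conditions in RULES:
        # all() stops at the first failed condition, so later attributes are not read.
        if all(OPS[op](site[attr], threshold) for attr, op, threshold in conditions):
            return 1
    return 0  # the "otherwise" rule: no pecans predicted

# Hypothetical stations; only the attributes that actually get evaluated are listed.
warm_lowland = {"MWM": 27.0, "BAR5": 15.0, "PTOAE": 1.2, "ELEV": 250}
cool_site = {"MWM": 18.0, "AE": 400.0}

warm_prediction = predict_pecan(warm_lowland)  # satisfies the first rule
cool_prediction = predict_pecan(cool_site)     # fails the first test of every rule
```

The made-up warm lowland station satisfies every condition of the rule examined above, so it is classified as a pecan site; the cool station fails the opening condition of each rule and falls through to the default.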
The image below shows the training data:
- Tan area is where pecans are found.
- Green squares show locations of weather stations without pecans.
- Red squares show locations of weather stations with pecans.
The image below shows the result of applying the classifier generated from the training data to the world climate data (this is an untrained extrapolation to new data):
- Red squares show locations of weather stations where the model predicts the absence of pecans.
- Yellow squares show locations of weather stations where the model predicts the presence of pecans.