DATA & METHODS

The methodology for this study is structured into four main parts:

1) Data Collection and Preprocessing

Point data for approximately 200 thousand fire events in British Columbia between 1950 and 2020 were obtained from the BC Fire Database. These data were cleaned to remove database errors, including incorrectly georeferenced points, points placed in water, points beyond the provincial border, and duplicate records. From the remaining fire records, a subset of 10 thousand points was randomly selected using a custom ArcPy script in a Jupyter Notebook within ArcGIS. These points were given a binary Is_Fire field with the value 1, indicating that they were records of actual fire events. An additional 10 thousand points were then randomly generated within the study area using the Create Random Points tool in ArcGIS. This second batch of points acted as simulated non-fire points and was assigned an Is_Fire value of 0. A total of 50 variables were established with temporal windows between 1961 and 2020. Most variables were sourced from existing government datasets; however, several were custom-generated using simple geoprocessing tools (Table X). Variable values were then assigned to each of the 20 thousand points using the ArcGIS Extract Values to Points and Spatial Join tools.
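
The sampling design described above can be sketched in plain Python. This is only an illustration and not the ArcPy script used in the study: the function name build_sample, the fixed seed, and the rectangular bounding box standing in for the provincial boundary are all assumptions. The Create Random Points tool constrains points to the study-area polygon, which a rectangle does not reproduce.

```python
import random


def build_sample(fire_records, bbox, n=10_000, seed=42):
    """Return fire and simulated non-fire points as (x, y, is_fire) tuples.

    fire_records: list of (x, y) tuples for cleaned fire events.
    bbox: (xmin, ymin, xmax, ymax); a hypothetical stand-in for the study area.
    """
    rng = random.Random(seed)
    # Random subset of the cleaned fire records, labelled Is_Fire = 1
    fires = [(x, y, 1) for x, y in rng.sample(fire_records, n)]
    # Uniform random locations inside the bounding box, labelled Is_Fire = 0
    xmin, ymin, xmax, ymax = bbox
    non_fires = [
        (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax), 0) for _ in range(n)
    ]
    return fires + non_fires


# Toy demonstration with 100 made-up records and 10 points per class
demo_records = [(float(i), float(i % 7)) for i in range(100)]
sample = build_sample(demo_records, bbox=(0.0, 0.0, 100.0, 7.0), n=10)
```

Fixing the seed makes the random subset reproducible, which helps when the same 20 thousand points need to be regenerated for later model runs.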

2) Variable Selection / Exploratory Regression

Selecting the variables that best explain the observed pattern of wildfire occurrence is critical to the explanatory power of the final model. Therefore, prior to the OLS regression and GWR analysis, a series of exploratory regression trials was run to identify the optimal set of explanatory variables. This was completed with the ArcGIS Exploratory Regression tool, using the binary Is_Fire field as the dependent variable and the aforementioned 50 variables as candidate explanatory variables (Table 1). After approximately ten rounds of exploratory analysis, the set of variables with the highest Adjusted R2 (Adj-R2) value and the lowest corrected Akaike Information Criterion (AICc) value was MSP, MAT, SHM, Road_Dist, Road_Den, Elev, PB_Kill, Tree_Ht, Fire_Den, Coast_Dist, and ALR_Dist (Table 1). These 11 variables were also selected for model development based on the following three criteria: 1) low multicollinearity between predictor variables, 2) representation of long-term variance in wildfire frequency, and 3) representation of a broad range of geographic factors, including population statistics, land cover/use, ecosystem ecology, and regional climate.
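
The two selection criteria named above can be written out as short functions. This is a minimal sketch using common textbook definitions. The exact constants used internally by the ArcGIS Exploratory Regression tool may differ, and the function names and the parameter count p = k + 2 are assumptions made for illustration.

```python
import math


def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def aicc(rss, n, k):
    """Small-sample corrected AIC for a least-squares fit.

    Uses the Gaussian form n * ln(RSS / n) + 2p, where p = k + 2 counts
    the k slopes, the intercept, and the error variance (an assumption).
    """
    p = k + 2
    aic = n * math.log(rss / n) + 2 * p
    return aic + (2 * p * (p + 1)) / (n - p - 1)
```

Both measures penalize additional variables: for the same residual sum of squares, a model with more explanatory variables receives a lower Adjusted R2 and a higher AICc, which is why the pair is useful for comparing candidate variable sets of different sizes.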

3) OLS Regression 

The most basic type of regression analysis is Simple Linear Regression. This regression model involves one independent variable and can be represented by the equation yᵢ = β₀ + β₁xᵢ + εᵢ, with slope β₁, y-intercept β₀, and residuals εᵢ (Brunsdon et al., 1996; Charlton & Fotheringham, 2009; Fotheringham et al., 2003). The least-squares approach is the most common means of fitting a linear regression model. This Ordinary Least Squares (OLS) method involves minimizing the sum of the squared residuals, that is, the squared differences between the observed and predicted values of the dependent variable. It is assumed that the residuals are independent and normally distributed around a mean of zero. In a global simple linear model, it is assumed that the values of β₀ and β₁ are constant throughout the entire study area and are therefore spatially homogeneous (Brunsdon et al., 1996; Charlton & Fotheringham, 2009; Fotheringham et al., 2003).
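
The least-squares fit described above has a closed-form solution for the single-variable case. The following sketch is illustrative only: the function name and the toy data are assumptions and are unrelated to the study dataset.

```python
def fit_simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares.

    Returns the intercept b0, the slope b1, and the list of residuals.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Toy data lying exactly on y = 1 + 2x
b0, b1, residuals = fit_simple_ols([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

Because a single pair of coefficients is estimated from all observations at once, the fitted relationship is the same everywhere in the study area, which is the global assumption that GWR relaxes.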


Before running the OLS Regression, multicollinearity among the explanatory variables in the point dataset was assessed. Summary statistics from the Exploratory Regression output report mean, maximum, and minimum values as well as standard deviations and Variance Inflation Factor (VIF) scores. All of the explanatory variables have VIF values less than 7.5, indicating low multicollinearity or redundancy. Collinearity was also assessed using a scatterplot matrix to visualize the relationships between each pair of model variables (Figure 1). Moderate correlations were observed for MAT & Elev (R2 = 0.48), MSP & SHM (R2 = 0.46), and SHM & Fire_Den (R2 = 0.4). Correlations between temperature and elevation as well as moisture and precipitation are to be expected and present no reason for concern.
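
The relationship between an R2 value and a VIF score can be made concrete with a short sketch. The function names below are assumptions. Note also that a true VIF uses the R2 from regressing one explanatory variable on all of the others, so a VIF derived from a single pairwise R2, as here, is only a lower bound.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov * cov / (var_a * var_b)


def vif_from_r2(r2):
    """Variance Inflation Factor implied by an R2 value (requires r2 < 1)."""
    return 1.0 / (1.0 - r2)


# Pairwise R2 reported for MAT & Elev
vif_mat_elev = vif_from_r2(0.48)  # about 1.92
```

Under this formula, the VIF threshold of 7.5 corresponds to an R2 of roughly 0.87, well above the moderate pairwise values reported for the selected variables.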


Finally, the OLS Regression was run with the ArcGIS Generalized Linear Regression tool using the Logistic (Binary) model type. The regression residuals were then tested for statistically significant spatial autocorrelation with the Spatial Autocorrelation (Global Moran's I) tool.
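
The index behind the residual autocorrelation test can be sketched as follows. This is a simplified illustration and not the ArcGIS implementation: it assumes inverse-distance weights and distinct point locations, and it computes only the Moran's I index, whereas the ArcGIS tool also reports a z-score and p-value for significance testing. The function name and toy data are assumptions.

```python
import math


def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights.

    coords: list of (x, y) tuples at distinct locations.
    values: one value per location, for example regression residuals.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_cross = 0.0  # sum of w_ij * dev_i * dev_j over all pairs
    weight_sum = 0.0      # sum of all weights w_ij
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            weighted_cross += w * dev[i] * dev[j]
            weight_sum += w
    return (n / weight_sum) * weighted_cross / sum(d * d for d in dev)


# Toy residuals on a line: spatially clustered versus alternating signs
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
i_clustered = morans_i(line, [1, 1, -1, -1])
i_alternating = morans_i(line, [1, -1, 1, -1])
```

Residuals that cluster in space yield a higher index than residuals that alternate between neighbors. Clustered residuals suggest that the global model is missing spatially structured variation, which motivates a local model such as GWR.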

4) GWR Analysis

In the real world, natural and human-driven processes and relationships are rarely independent of place. To overcome the constraints of global linear regression in spatial modeling, GWR analysis was used to incorporate location into the relationships between the dependent and explanatory variables. Geographically Weighted Regression (GWR) is a method of exploring spatially heterogeneous relationships between dependent and independent variables. Spatial heterogeneity is observed when the phenomenon being measured changes throughout the study area (Brunsdon et al., 1996; Charlton & Fotheringham, 2009; Fotheringham et al., 2003). Rather than fitting a single global model to the entire study area, GWR applies a combination of many locally weighted regressions (Brunsdon et al., 1996). This involves the use of a moving window that models the features in a dataset one at a time and compares the local results to the global parameters (Charlton & Fotheringham, 2009). For each feature in the dataset, a subset of neighboring features is identified and fitted to a regression model. The spatial dimensions of each neighborhood vary depending on the bandwidth parameters used in the GWR model. Once the neighborhoods are created, all of their features are given a local weight based on a distance matrix of all features within the dataset (Charlton & Fotheringham, 2009). Features closest to the regression point are assigned the highest weight, and features farther away, which have less influence on the regression, are given less weight. A variety of spatial kernels can be used to determine the weighting of features within the model (Charlton & Fotheringham, 2009). A Gaussian weighting scheme is continuous and smooth: the weight of features gradually decreases as the distance from the regression feature increases. Under a bisquare weighting scheme, feature weights decrease with distance from the regression feature with a steeper gradient than under the Gaussian scheme. Additionally, all features beyond a specified threshold distance are given weights of zero and therefore do not affect the local regression.
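
The two weighting schemes described above can be compared with a short sketch. The formulas are the commonly used forms of the Gaussian and bisquare kernels. The function names are assumptions, and the exact parameterization in any given GWR software package may differ.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Smoothly decaying weight that never reaches exactly zero."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def bisquare_weight(distance, bandwidth):
    """Weight that falls to zero at the bandwidth and stays zero beyond it."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2


# Weights at half the bandwidth and beyond the bandwidth
w_gauss_half = gaussian_weight(5.0, 10.0)
w_bisq_half = bisquare_weight(5.0, 10.0)
w_gauss_far = gaussian_weight(15.0, 10.0)
w_bisq_far = bisquare_weight(15.0, 10.0)
```

At the regression point itself both kernels return a weight of 1. At half the bandwidth the bisquare weight is already lower than the Gaussian weight, and beyond the bandwidth only the Gaussian kernel still assigns a small positive weight.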


GWR analysis for this study was initially attempted in ArcGIS; however, local multicollinearity errors prevented the regression from running. It is unclear whether this was due to problems with the input data or limitations of the software, although similar issues with the ArcGIS GWR tool have been previously documented. As a result, the GWR analysis was instead completed using the MGWR 2.0 software, which is based on the pysal/mgwr Python library. A series of trial runs was completed and an optimal bandwidth of 1530 was determined using a golden-section search approach. The final Logistic (Binary) GWR analysis was then completed using this bandwidth value and an adaptive bisquare weighting scheme. MGWR differs from most other GWR tools in that it standardizes variables by default. This setting performs a z-transformation on the Y (dependent) and local (independent) variables so that each variable has a mean of 0 and a standard deviation of 1. MGWR does not create explanatory variable coefficient rasters as is possible in ArcGIS; however, it outputs a CSV table containing the GWR residuals and the coefficient, standard error, and t-score values for each explanatory variable. These values were imported into ArcGIS and converted into a point feature dataset using the XY Table To Point tool.
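
The bandwidth search mentioned above can be illustrated with a generic golden-section search. This is a sketch of the general technique and not the MGWR implementation: the function names are assumptions, and toy_score is a made-up stand-in for a model-fit criterion such as AICc, with its minimum placed at 1530 only to echo the reported value.

```python
import math


def golden_section_min(score, lo, hi, tol=1.0):
    """Minimize a unimodal function score(bw) on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = score(c), score(d)
    while (b - a) > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = score(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = score(d)
    return (a + b) / 2


def toy_score(bandwidth):
    """Hypothetical fit criterion with a single minimum at 1530."""
    return (bandwidth - 1530.0) ** 2


best_bandwidth = golden_section_min(toy_score, 50.0, 20_000.0)
```

Each iteration shrinks the search interval by a constant factor while reusing one earlier evaluation, so only one new model fit is needed per step. This matters when every evaluation requires fitting a full GWR model to 20 thousand points.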

Figure 1. Scatterplot matrix of explanatory variable relationships.

Table 1. Summary information for all independent variables tested for model selection during exploratory regression analysis.  
