Article

Application of Principal Component Analysis and Cluster Analysis in Regional Flood Frequency Analysis: A Case Study in New South Wales, Australia

School of Engineering, Western Sydney University, Sydney 00917K, New South Wales, Australia
Water 2020, 12(3), 781; https://doi.org/10.3390/w12030781
Submission received: 6 February 2020 / Revised: 7 March 2020 / Accepted: 9 March 2020 / Published: 12 March 2020
(This article belongs to the Special Issue Flood Modelling: Regional Flood Estimation and GIS Based Techniques)

Abstract

This paper examines the applicability of principal component analysis (PCA) and cluster analysis in regional flood frequency analysis. A total of 88 sites in New South Wales, Australia are adopted. Quantile regression technique (QRT) is integrated with the PCA to estimate the flood quantiles. A total of eight catchment characteristics are selected as predictor variables. A leave-one-out validation is applied to determine the efficiency of the developed statistical models using an ensemble of evaluation diagnostics. It is found that the PCA with QRT model does not perform well, whereas cluster/group formed with smaller sized catchments performs better (with a median relative error values ranging from 22% to 37%) than other clusters/groups. No linkage is found between the degree of heterogeneity in the clusters/groups and precision of flood quantile prediction by the multiple linear regression technique.

1. Introduction

Design flood estimation in poorly gauged or ungauged catchments is a common problem in flood frequency analysis (FFA). To mitigate this problem, design floods can be estimated using regional flood frequency analysis (RFFA) [1,2,3]. There are essentially three steps in RFFA: (i) identification of hydrologically homogeneous regions; (ii) development of an appropriate regional flood estimation model for the identified region(s); and (iii) validation of the developed RFFA model [4,5,6]. Although geographical proximity and political boundaries have traditionally been adopted to identify homogeneous regions [7,8,9,10,11], regions formed in this manner often lack hydrological similarity [12,13,14,15,16].
Some researchers have also adopted climatic and catchment characteristics to form homogeneous regions by using multivariate statistical techniques such as cluster analysis [8,17,18,19,20,21,22,23,24,25]. Ward’s method is the most common method for clustering as it can form regions of roughly equivalent size and this method is considered to be more suitable for regionalization of flood data [26]. Finally, many researchers adopted a region of influence (ROI) approach to circumvent complications related to fixed state boundaries [6,11,12,13,16,25,27,28,29,30,31,32,33]. The ROI is much more flexible than the fixed region approach and can be easily incorporated with a range of RFFA methods. ROI can also effectively reduce regional heterogeneity by sub-region formation within a large region. In the ROI approach, various ways can be adopted to form sub-regions such as by geographical distance [34,35] or distance in a multi-dimensional catchment attributes space [4,6,11,29].
Once a region is formed based on an appropriate condition [17,34,35,36,37,38], the design flood can be estimated either by the index flood method [15,39,40] or a regression-based approach [6,41]. The regression-based approach is used in numerous studies. Both the ordinary least square and the generalized least square regression methods are adopted to estimate the coefficients of the prediction equations in regression-based approaches [6,29,41,42,43,44,45,46]. For regression-based RFFA models, a linear relationship is generally assumed between the dependent variable (flood quantiles) and predictor variables (catchment and climatic characteristics). It is also assumed that the predictor variables are uncorrelated among each other; however, in many practical situations this assumption is not fully satisfied [47].
Principal component analysis (PCA) is a statistical technique that generates statistically uncorrelated principal components (PCs), which are linear combinations of the original variables (catchment and climatic characteristics). PCA has been used previously to delineate homogeneous regions. For example, Burn [48], DeCoursey and Deal [49], Hawley and McCuen [50], Kar et al. [51], and Nathan and McMahon [22] adopted PCA to generate PCs from different physical, hydrological, and meteorological variables, and then used them in principal component regression (PCR) to estimate flood quantiles. Choi et al. [52], Haque et al. [53], Haque et al. [54], and Koo et al. [55] used PCs in a PCR for water demand forecasting. Although PCA produces uncorrelated PCs, one shortcoming is that, because variance is used as the objective function, statistical independence of the components is not guaranteed [53].
There is a lack of research on delineation of homogeneous regions in RFFA. To fill this research gap, this study examines the formation of homogeneous regions using PCA and cluster analysis. It also investigates the use of the PCR method in RFFA with ROI and fixed region approaches. Multiple linear regression (MLR) models are developed for the regions generated by cluster analysis to estimate flood quantiles. A leave-one-out validation technique is adopted to assess the performance of the developed models.

2. Study Area

A very large catchment can have significantly different flood frequency behavior from smaller catchments. Australian Rainfall and Runoff (ARR) [32,56] recommends an upper limit of 1000 km² for small to medium sized catchments [6], which appears to be a rational basis for selecting candidate catchments for this study. A total of 88 catchments from New South Wales (NSW), Australia, are selected. These are natural catchments, free from any major storage and land use change. The catchment areas vary from 8 to 1010 km², with a mean of 352 km² and a median of 260 km². Rahman et al. [57] recommend selecting catchments that have at least 20 years of flood data to develop RFFA models in Australia. For this study, the selected catchments have annual maximum (AM) flow records ranging from 25 to 82 years in length (mean of 41.5 years and median of 37 years). The selected catchments vary from mountain to coastal regions. The mean annual rainfall for the chosen catchments ranges from 625 to 1955 mm, with a mean of 1000 mm and a median of 910 mm. Figure 1 shows the location of the selected 88 catchments.
A summary of descriptive statistics of the selected catchment and climatic characteristics for the 88 sites is presented in Table 1. The design rainfall intensity of six-hour duration and two-year return period (I62) at each catchment centroid is obtained from the Australian Bureau of Meteorology website. The shape factor (SF) is defined as the shortest distance between a catchment’s centroid and outlet divided by the square root of the catchment area (A). Stream density (sden) is obtained as the total length of all the streamlines on a 1:100,000 topographic map divided by the catchment area (A). Mean annual rainfall (MAR) and mean annual evapotranspiration (MAE) data for each catchment are obtained from the Australian Bureau of Meteorology website. The fraction forest (forest) is obtained as the total forested area shown on a 1:100,000 topographic map divided by the catchment area (A). The mainstream slope (S1085) is obtained as the difference in elevations at 10% and 85% of the mainstream length (measured from the catchment outlet) divided by 0.75 of the mainstream length. The PCs are extracted from these characteristics to develop the PCR models using the quantile regression technique (QRT) and to form regions using cluster analysis.
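The three derived descriptors defined above (SF, sden, and S1085) are simple ratios, and a minimal sketch may make the definitions concrete. The function names, argument names, and units (km for lengths, km² for areas, m for elevations) are assumptions made for illustration; the input numbers are hypothetical and are not taken from Table 1.

```python
import math

def shape_factor(centroid_to_outlet_km, area_km2):
    """SF: shortest centroid-to-outlet distance divided by sqrt(catchment area)."""
    return centroid_to_outlet_km / math.sqrt(area_km2)

def stream_density(total_stream_length_km, area_km2):
    """sden: total mapped stream length divided by catchment area."""
    return total_stream_length_km / area_km2

def mainstream_slope_s1085(elev_10_m, elev_85_m, mainstream_length_km):
    """S1085: elevation difference between the 85% and 10% points of the
    mainstream (measured from the outlet) divided by 0.75 of its length."""
    return (elev_85_m - elev_10_m) / (0.75 * mainstream_length_km)

# Hypothetical catchment, for illustration only.
sf = shape_factor(12.0, 144.0)
sden = stream_density(300.0, 150.0)
s1085 = mainstream_slope_s1085(100.0, 250.0, 20.0)  # in m/km under the assumed units
print(sf, sden, s1085)
```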

3. Methods

3.1. Principal Component Analysis

Multiple linear regression (MLR) models become unstable as the number of predictor variables increases, particularly if the predictors are highly correlated. PCA is one of the multivariate statistical techniques that can be used to deal with highly correlated variables in regression [58]. In PCA, an original dataset of n variables, which are correlated to various degrees, is transformed into n uncorrelated PCs. The PCs are linear transformations of the original variables such that the sum of the variances of the PCs equals that of the original variables. Although the number of PCs equals the number of original variables, the first few PCs explain the majority of the variance in the data set, which allows the dimensionality of the original data set to be reduced [59]. The PCs are ordered from the highest to the lowest variance, i.e., the first PC explains the largest proportion of the variance in the data, the second PC explains the next largest proportion, and so on. The values of the PCs can be obtained from Equations (1) and (2):
$$PC_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n = \sum_{j=1}^{n} a_{1j}x_j, \tag{1}$$
$$PC_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n = \sum_{j=1}^{n} a_{2j}x_j, \tag{2}$$
where $x_1, x_2, \dots, x_n$ are the original variables and $a_{ij}$ is the coefficient (weight) of variable $x_j$ in the $i$-th PC. For each $i$, the coefficients form the eigenvector $\mathbf{a}_i = [a_{i1}, a_{i2}, \dots, a_{in}]^T$. The eigenvalues are the variances of the PCs. The covariance or correlation matrix of the data set is used to derive the coefficients $a_{ij}$. The eigenvalues of this matrix can be calculated by Equation (3):
$$|C - \lambda I| = 0, \tag{3}$$
where $C$ is the correlation/covariance matrix, $\lambda$ is the eigenvalue, and $I$ is the identity matrix. The PC coefficients, or the weights of the variables in the PCs, are then calculated for each eigenvalue $\lambda_i$ by Equation (4):
$$(C - \lambda_i I)\,\mathbf{a}_i = 0. \tag{4}$$
In the PCR analysis, the PCs are used as predictor variables in MLR [60]. The general form of the PCR model is as follows:
$$Y = \alpha + \beta_1 PC_1 + \beta_2 PC_2 + \dots + \beta_n PC_n,$$
where $Y$ is the dependent variable (the flood quantile here), $\alpha$ is the model intercept, and the $\beta$'s are the regression coefficients.
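As an illustration of how the first PC arises from the correlation matrix, the sketch below standardizes a small dataset, builds its correlation matrix, and finds the largest eigenvalue and its eigenvector. It is a minimal sketch under stated assumptions: power iteration is swapped in for a full eigen-solver so that only the standard library is needed, the data matrix is hypothetical (5 sites, 3 predictors), and all function names are invented for this example. It is not the procedure or software used in the study.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (sample) standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix C of already standardized data z."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(c, iterations=500):
    """Largest eigenvalue and unit eigenvector of C by power iteration."""
    p = len(c)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return eigenvalue, v

# Hypothetical data: 5 sites (rows) x 3 positively correlated predictors (columns).
data = [[1, 2, 1], [2, 3, 3], [3, 5, 2], [4, 4, 5], [5, 7, 4]]
z = standardize(data)
C = correlation_matrix(z)
lam1, a1 = first_component(C)
# PC1 score at each site: weighted sum of the standardized predictors.
pc1_scores = [sum(a * x for a, x in zip(a1, row)) for row in z]
explained = lam1 / len(C)  # share of the total variance carried by PC1
print(round(lam1, 3), round(explained, 3))
```

The variance of the PC1 scores equals the largest eigenvalue, and the diagonal of the correlation matrix sums to the number of variables, which is why each eigenvalue divided by that number gives the proportion of variance explained.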

3.2. Cluster Analysis

The statistical distance measurement representing the similarity (or dissimilarity) among the collections of attributes (similarity measurements) selected for each gauging site is used for grouping sites in cluster analysis. There are various clustering techniques available in the statistical literature [61] and are used to delineate hydrologically homogeneous regions [17,26,35,36,62,63,64,65]. Ward’s method is the most commonly used method since this can produce clusters of similar sizes [26]. Hence, Ward’s method is adopted in this study.
Ward’s approach is an agglomerative hierarchical algorithm that starts with each site being its own cluster (or region). The algorithm successively merges clusters using an analysis of variance approach in which the similarity between members of a region is measured in terms of the error sum of squares (ESS). For region k containing $N_k$ sites, the ESS is calculated as:
$$ESS_k = \sum_{j=1}^{N_k} (\mathbf{x}_j - \bar{\mathbf{x}})^T (\mathbf{x}_j - \bar{\mathbf{x}}),$$
where $\mathbf{x}_j = [x_1, x_2, \dots, x_p]^T$ is a vector of p characteristics measured at site j, and $\bar{\mathbf{x}}$ is a vector in which each element denotes the mean value of a characteristic across the $N_k$ sites in the region. The ESS is calculated for the hypothetical fusion of any two clusters at each step, and the fusion actually selected is the one that minimizes the increase in the total ESS across all regions.
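The merging rule can be sketched in a few lines. The code below is an illustrative, brute-force version of one Ward step: it evaluates the ESS increase for every candidate pair of clusters and merges the pair with the smallest increase. The function names and the four two-attribute sites are hypothetical, and a production analysis would normally rely on a statistical package rather than this sketch.

```python
def ess(sites):
    """Error sum of squares: squared distances of all sites to the cluster centroid."""
    p = len(sites[0])
    centroid = [sum(s[j] for s in sites) / len(sites) for j in range(p)]
    return sum(sum((s[j] - centroid[j]) ** 2 for j in range(p)) for s in sites)

def ward_step(clusters):
    """Merge the pair of clusters whose fusion gives the smallest ESS increase."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            increase = (ess(clusters[a] + clusters[b])
                        - ess(clusters[a]) - ess(clusters[b]))
            if best is None or increase < best[0]:
                best = (increase, a, b)
    increase, a, b = best
    merged = clusters[a] + clusters[b]
    rest = [c for i, c in enumerate(clusters) if i not in (a, b)]
    return rest + [merged], increase

# Hypothetical standardized attributes for four sites; each starts as its own cluster.
sites = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
clusters = [[s] for s in sites]
while len(clusters) > 2:
    clusters, last_increase = ward_step(clusters)
print(clusters)
```

With these inputs the two nearby pairs of sites are merged first, each at an ESS increase of 0.5, leaving two compact clusters.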

3.3. Region of Influence Approach

Formation of regions without fixed boundaries was first carried out by Acreman [7]. Based on this, Burn [12,13] introduced the ROI approach. In this approach, each site of interest (i.e., a catchment where flood quantiles are to be estimated) forms its own region. The regions identified in this way can overlap, so a gauged site can be part of more than one ROI. The ROI for a site of interest is formed from the group of sites in close proximity to it. More recently, the ROI approach has been adopted in ARR RFFA [32] and in a study carried out by Rahman et al. [6]. A weighted Euclidean distance in an M-dimensional space may be used to measure the proximity. The distance metric can be defined by:
$$D_{i,j} = \left[\sum_{m=1}^{M} W_m \left(X_i^m - X_j^m\right)^2\right]^{1/2},$$
where $D_{i,j}$ is the weighted Euclidean distance between sites i and j, M is the number of attributes included in the distance measure, $X_i^m$ and $X_j^m$ are the standardized values of attribute m at sites i and j, and $W_m$ is a weight applied to attribute m that reflects its relative significance. The attributes are standardized to remove their units, which avoids bias due to differences in scale.
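A minimal sketch of the distance metric and of ROI formation follows. It assumes that the attributes have already been standardized and that the ROI is simply the k gauged sites with the smallest distance to the site of interest; the site labels, attribute values, and function names are hypothetical.

```python
import math

def weighted_distance(x_i, x_j, weights):
    """Weighted Euclidean distance between two sites in attribute space."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j)))

def region_of_influence(target, gauged, weights, k):
    """Return the labels of the k gauged sites closest to the target site."""
    ranked = sorted(gauged,
                    key=lambda site: weighted_distance(target, gauged[site], weights))
    return ranked[:k]

# Hypothetical standardized attributes (two attributes per site, equal weights).
gauged_sites = {"A": [0.1, 0.0], "B": [2.0, 2.0], "C": [-0.5, 0.2]}
roi = region_of_influence([0.0, 0.0], gauged_sites, weights=[1.0, 1.0], k=2)
print(roi)
```

Because each site of interest is ranked against all gauged sites separately, the same gauged site can appear in the ROI of several different target sites.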

3.4. Homogeneity Assessment

Here, the heterogeneity measure (H) of Hosking and Wallis [15], which is based on L-moments, is adopted. A group of catchments is considered heterogeneous if H is sufficiently large: when H < 1, the group is taken as ‘acceptably homogeneous’; when 1 ≤ H < 2, the group is taken as ‘possibly heterogeneous’; and when H ≥ 2, the group is taken as ‘definitely heterogeneous’. There are three different measures of H: H1 is based on the L coefficient of variation; H2 is based on the L coefficient of variation and the L coefficient of skewness; and H3 is based on the L coefficient of skewness and the L coefficient of kurtosis [15].
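The thresholds above can be written as a small lookup. This sketch only encodes the classification rule; computing H itself requires L-moment ratios and a simulation step that are not shown here, and the function name is an assumption made for illustration.

```python
def classify_heterogeneity(h):
    """Map an H-statistic to the Hosking and Wallis heterogeneity class."""
    if h < 1:
        return "acceptably homogeneous"
    if h < 2:
        return "possibly heterogeneous"
    return "definitely heterogeneous"

# Example values of H, chosen arbitrarily to hit each class.
for h in (0.4, 1.5, 3.2):
    print(h, classify_heterogeneity(h))
```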

3.5. Evaluation Statistics

A leave-one-out validation technique is adopted for assessing the performance of the developed RFFA models. In leave-one-out validation, one site is left out at each step during the construction of a model, i.e., this site is treated as an ungauged site. The following performance statistics are computed for each of the models using the predicted flood quantile (Qpred) and the observed flood quantile (Qobs): relative error (RE), median absolute relative error (REr), Qpred/Qobs ratio, median Qpred/Qobs ratio (med_Qpred/Qobs), mean square error (MSE), root mean square error (RMSE), bias (BIAS), relative bias (RBIAS), relative root mean square error (RRMSE), and root mean square normalized error (RMSNE):
$$RE = \frac{Q_{pred} - Q_{obs}}{Q_{obs}} \times 100,$$
$$RE_r = \mathrm{median}\left[\mathrm{abs}(RE)\right],$$
$$MSE = \mathrm{mean}\left[(Q_{pred} - Q_{obs})^2\right],$$
$$RMSE = \sqrt{MSE},$$
$$BIAS = \mathrm{mean}(Q_{pred} - Q_{obs}),$$
$$RBIAS = \mathrm{mean}\left(\frac{Q_{pred} - Q_{obs}}{Q_{obs}}\right) \times 100,$$
$$RRMSE = \frac{\sqrt{\mathrm{mean}\left[(Q_{pred} - Q_{obs})^2\right]}}{\mathrm{mean}(Q_{obs})},$$
$$RMSNE = \sqrt{\mathrm{mean}\left[\left(\frac{Q_{pred} - Q_{obs}}{Q_{obs}}\right)^2\right]},$$
Here, Qobs is the observed flood quantile at a given site, obtained by carrying out at-site FFA using the log-Pearson type 3 (LP3) distribution in the FLIKE software [66]. In this study, six flood quantiles with annual exceedance probabilities (AEPs) of 50%, 20%, 10%, 5%, 2%, and 1% are considered as dependent variables.
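The evaluation statistics can be computed directly from paired lists of predicted and observed quantiles. The sketch below is illustrative only: the function name, the dictionary keys, and the three pairs of discharges are hypothetical, and the square roots in RRMSE and RMSNE follow the usual definitions of these statistics.

```python
import math
from statistics import mean, median

def evaluation_statistics(q_pred, q_obs):
    """Evaluation statistics for paired predicted and observed flood quantiles."""
    errors = [p - o for p, o in zip(q_pred, q_obs)]
    relative = [e / o for e, o in zip(errors, q_obs)]
    ratios = [p / o for p, o in zip(q_pred, q_obs)]
    mse = mean([e ** 2 for e in errors])
    return {
        "RE": [100 * r for r in relative],                 # per-site relative error (%)
        "REr": median([abs(100 * r) for r in relative]),   # median absolute RE (%)
        "med_ratio": median(ratios),                       # median Qpred/Qobs
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "BIAS": mean(errors),
        "RBIAS": 100 * mean(relative),
        "RRMSE": math.sqrt(mse) / mean(q_obs),
        "RMSNE": math.sqrt(mean([r ** 2 for r in relative])),
    }

# Hypothetical predicted and observed quantiles (m3/s) at three sites.
stats = evaluation_statistics([110.0, 180.0, 500.0], [100.0, 200.0, 400.0])
for name, value in stats.items():
    print(name, value)
```

In this toy example the largest site dominates MSE, RMSE, and BIAS, while REr and the median ratio are unaffected by its magnitude, which is why the scale-free statistics are easier to compare across catchments of different sizes.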

4. Results and Discussion

4.1. Principal Component Analysis

Table 2 shows the magnitude and sign of the correlations between the original catchment and climatic characteristics (predictors). Correlations between the predictors, i.e., catchment area, rainfall intensity, shape factor, stream density, mean annual rainfall, mean annual evapo-transpiration, slope and fraction forest (‘A’, ‘I62’, ‘SF’, ‘sden’, ‘MAR’, ‘MAE’, ‘S1085’, and ‘forest’, respectively) are calculated using the method described in the Methods section. From Figure 2 and Table 2, it can be seen that catchment area has a negative correlation (−0.208; p-value 0.052) with rainfall intensity, indicating that rainfall intensity tends to decrease as catchment area increases. Rainfall intensity has a positive correlation with shape factor, stream density, fraction forest, mean annual rainfall and mean annual evapo-transpiration; the strongest positive correlations are between mean annual rainfall and rainfall intensity (0.83; p-value ≈ 0) and between mean annual evapo-transpiration and rainfall intensity (0.67; p-value ≈ 0). This indicates that mean annual rainfall increases with rainfall intensity. Although these values are not close to ±1, they have very small p-values (<0.10), indicating that these correlations are significant. Slope has a positive correlation of 0.387 with fraction forest and a negative correlation of −0.286 with mean annual evapo-transpiration, and both coefficients have small p-values (<0.10). Fraction forest also has a positive correlation with mean annual rainfall (0.405; p-value ≈ 0), and mean annual evapo-transpiration has positive correlations with stream density and mean annual rainfall (0.392 and 0.533; p-values ≈ 0). All the other correlation coefficients range from −0.007 to 0.303 and are statistically insignificant.
From the above discussion, it can be said that some of the predictor variables have a notable degree of correlation between them. Therefore, PCA is applied to the eight selected predictors, i.e., catchment area, rainfall intensity, shape factor, stream density, mean annual rainfall, mean annual evapo-transpiration, slope and fraction forest (‘A’, ‘I62’, ‘SF’, ‘sden’, ‘MAR’, ‘MAE’, ‘S1085’, and ‘forest’, respectively) to obtain eight uncorrelated PCs. Figure 3 shows the transformed PCs, which are uncorrelated. The eigenvalues, together with their percentage contributions, represent the amount of variability in the data explained by each PC and are presented in Table 3. Table 3 confirms that the first two PCs explain the largest share of the variability in the data set, with PC1 and PC2 accounting for 35.3% and 20.5%, respectively. The proportions of the other PCs (PC3, PC4, PC5, PC6, PC7, and PC8) range from 1.8% to 13.4%. Although PC1 and PC2 have the highest percentages among all the PCs, together they account for only 55.8% of the variance, meaning that these two PCs explain just over half of the variability in the data set. To explain at least 85% of the variability in the data, the first five PCs are required, although the individual percentage is quite low for some of them.
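As a worked check of the reported significance of the correlation between catchment area and rainfall intensity, the sketch below converts a correlation coefficient into a t-statistic. This assumes the standard t-test for a Pearson correlation with n − 2 degrees of freedom; the source does not state which test was used, so this is an assumption, and the function name is invented for the example.

```python
import math

def correlation_t_statistic(r, n):
    """t-statistic for testing a Pearson correlation r from n paired values."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Correlation between A and I62 reported in the text, with n = 88 sites.
t_area_intensity = correlation_t_statistic(-0.208, 88)
print(round(t_area_intensity, 3))
```

The magnitude of the result is about 1.97. For 86 degrees of freedom the two-sided critical values are roughly 1.66 at the 10% level and 1.99 at the 5% level, so this correlation is significant at 10% but falls just short at 5%, which agrees with the reported p-value of 0.052.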
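The rule of retaining enough PCs to reach a target share of the variance can be sketched as follows. The eigenvalues in the example are purely illustrative and are not the values in Table 3; only the two reported percentages for PC1 and PC2 are taken from the text. The function name and the default threshold argument are assumptions.

```python
def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of leading PCs whose cumulative variance share meets the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return count
    return len(eigenvalues)

# Illustrative eigenvalues only (not from Table 3).
illustrative_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]
n_components = components_needed(illustrative_eigenvalues)

# Cumulative share of the first two PCs, using the percentages reported in the text.
pc1_pc2_share = round(35.3 + 20.5, 1)
print(n_components, pc1_pc2_share)
```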

4.2. Development and Testing of Regression Equation in Fixed Region and ROI Framework

To select the most influential PCs, the significance level (p-value) of each PC in the regression analysis is examined, with the selection criterion set to p ≤ 0.1. Based on this criterion, PC1, PC2, PC3, PC4, and PC5 are found to be significant and are used to develop the regression equation for estimating the flood quantiles. However, the coefficient of determination (R²) and adjusted coefficient of determination (adj_R²) of the regression equation are quite small, at 0.46 and 0.43, respectively. The developed regression equation is then tested in both the fixed region and ROI frameworks with leave-one-out validation. To apply the fixed region (FR) approach, all 88 sites are grouped together as ‘one region’ and the flood quantiles at each site for the six AEPs (50%, 20%, 10%, 5%, 2%, and 1%) are estimated by leave-one-out validation.
In the ROI framework, 10 sites are first grouped together to form one region and the flood quantiles are estimated. At each subsequent iteration, five new sites are added to form a larger region until the number of sites reaches 30. Thereafter, ten sites are added at each iteration until the number of sites reaches 80. The resulting ROI models are referred to below by their number of sites (KNN10, KNN15, ..., KNN80). Leave-one-out validation is applied at each iteration.
Table 4 shows the evaluation statistics for the 5% AEP flood. The first column lists the statistics calculated for both the fixed region and ROI cases, which are MSE, RMSE, BIAS, RBIAS, RRMSE, RMSNE, REr, and med_Qpred/Qobs, based on observed and predicted values for the 5% AEP flood. Table 4 shows that, for the 5% AEP flood, the fixed region has the lowest MSE. The lowest REr is found for KNN80 (46.59%), and the med_Qpred/Qobs closest to 1 is found for KNN15. It is clear from Table 4 that KNN10 performs the poorest, with the largest MSE and REr, which means it is preferable to select more than ten sites to form a region when using PCR. Durocher et al. [67] carried out a study in Southern Quebec (Canada) and their results show an RMSE in the range of 38 m³/s and 45 m³/s for the 10% and 1% AEP floods using a spatial copula method. A number of other studies [47,68,69] were carried out for the same dataset in Quebec. Their results show RMSE values of 41 m³/s to 51 m³/s for the 10% AEP flood, and 49 m³/s to 70 m³/s for the 1% AEP flood. These studies were carried out using ordinary kriging in PCA-space, a generalized additive model, and a single artificial neural network. The studies by Durocher et al. [67], Chokmani and Ouarda [68], Chebana et al. [47], and Shu and Ouarda [69] show RBIAS values ranging from −5% to −20% for the 10% AEP flood and −7% to −27% for the 1% AEP flood. A study carried out by Rahman et al. [6] found RBIAS values ranging from 22% to 69% for the six AEP floods.
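The sequence of region sizes and the leave-one-out loop can be sketched as below. This is a simplified illustration under stated assumptions: proximity is measured in ln(area) only and the model is a one-predictor log-log regression, whereas the study uses PCs of eight characteristics; the synthetic data follow an exact power law so that the predictions can be checked. All function names and values are hypothetical.

```python
import math

def roi_sizes(start=10, small_step=5, switch=30, large_step=10, stop=80):
    """Region sizes: steps of 5 up to 30 sites, then steps of 10 up to 80 sites."""
    sizes, n = [], start
    while n <= stop:
        sizes.append(n)
        n += small_step if n < switch else large_step
    return sizes

def fit_line(x, y):
    """Ordinary least squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def loo_roi_predictions(area, q, k):
    """Leave each site out in turn and predict it from its k nearest other sites."""
    predictions = []
    for i in range(len(area)):
        others = [j for j in range(len(area)) if j != i]
        others.sort(key=lambda j: abs(math.log(area[j]) - math.log(area[i])))
        roi = others[:k]
        b0, b1 = fit_line([math.log(area[j]) for j in roi],
                          [math.log(q[j]) for j in roi])
        predictions.append(math.exp(b0 + b1 * math.log(area[i])))
    return predictions

sizes = roi_sizes()
# Synthetic sites following Q = 2 * A^0.7 exactly (illustration only).
areas = [10.0, 20.0, 50.0, 100.0, 200.0, 500.0]
flows = [2.0 * a ** 0.7 for a in areas]
predicted = loo_roi_predictions(areas, flows, k=3)
print(sizes)
```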
For further clarification on the number of sites required to form regions in case of using this technique in FFA, boxplots are examined based on their RE and Qpred/Qobs ratio values for both the fixed region and ROI framework.
Figure 4 and Figure 5 show boxplots of the RE and Qpred/Qobs ratio values for both the fixed region and ROI frameworks for the 5% AEP flood. Both Figure 4 and Figure 5 start with the fixed region approach, followed by the ROI models one by one. From Table 4, one can see that the fixed region performs better than the ROI models. However, Figure 4 and Figure 5 show that, although the box is smaller for the fixed region (i.e., a smaller error range), its median is not close to the expected line (the expected lines are set at zero and one for Figure 4 and Figure 5, respectively). KNN10 performs as poorly in the boxplots as it does in Table 4. KNN15 and KNN25 both show promising results in Figure 4 and Figure 5, with a smaller box and a median very close to the expected line. However, KNN25 appears to have shorter whiskers than KNN15. There are a number of outliers for all the models, shown by small circles, but not all of them are visible in the figures because the plotting ranges are limited to −300 to +300 for the RE and −2 to +3 for the Qpred/Qobs ratio for better visibility. In Figure 5, none of the upper whiskers is visible within the set range, which raises the question of how well the regression model fits. The remaining ROI models show a poorer fit, with larger boxes and median RE values far from the expected line.
Table 5 and Table 6 compare the RE and the Qpred/Qobs ratio values, respectively, for all the AEPs for both the fixed region and ROI models. Rows 2 to 18 of Table 5 and Table 6 show the mean, median and standard deviation (Std_Dev) for the selected AEPs for both the fixed region and ROI models. The last three rows show the overall mean, median, and Std_Dev across the AEPs. All the RE values are converted to absolute values by ignoring their sign. The lowest Qpred/Qobs ratio values and RE values are shown in blue in both Table 5 and Table 6. Although KNN25 appears to be the best model in the boxplots, ahead of KNN15, Table 5 and Table 6 suggest that KNN15 outperforms KNN25, especially in terms of the Qpred/Qobs ratio values. Neither the fixed region approach nor KNN80 shows better performance in this case. As seen earlier, KNN10 shows the worst results. The other models generate mixed results. In some cases, very large RE (%) and Qpred/Qobs ratio values are also found (i.e., for stations 206026, 210068, 210076, and 222016). As seen from the R² and adj_R² values, this regression model represents a poor fit. The analysis for both the fixed region and ROI frameworks supports this finding.

4.3. Application of Cluster Analysis

4.3.1. Cluster Formation

Figure 6 shows the dendrogram from the hierarchical cluster analysis (Ward’s method) using all the predictors (standardized to zero mean and unit variance). A total of five clusters are identified in Figure 6; each of the clusters has more than seven stations (cluster 4 has 23 stations, whereas the other clusters have 15 to 18 stations). The details of the five clusters are provided in Table 7. The median streamflow record lengths for the clusters are in the range of 36 to 40 years. Cluster 5 contains catchments that are relatively large (area ranging from 454 to 1010 km² with a median value of 835.5 km²). The median areas for the other clusters are in the range of 156 to 365 km². For the design rainfall intensity (I62), the median values range from 38 mm to 59 mm for all the clusters. The highest rainfall intensity is found for cluster 3, in the range of 76 mm to 133 mm. The variable ‘SF’ is found to be similar for all five clusters, whereas ‘sden’ is found to be higher for clusters 1 and 3 and lowest for cluster 4. The variables ‘MAR’ and ‘MAE’ are found to be higher for cluster 1, i.e., 1480.2 mm and 1382.7 mm (median), respectively. Cluster 2 shows a relatively higher slope. Finally, the fraction forest area is found to be relatively higher for clusters 1 and 2.

4.3.2. Homogeneity Analysis of the Clusters

To investigate the degree of homogeneity of the five clusters, the heterogeneity measure proposed by Hosking and Wallis [15] is applied to each cluster individually. According to Hosking and Wallis [38], any station showing Di ≥ 3 is considered to be discordant. Based on this criterion, no discordant station is found for clusters 1, 2, 4, and 5, whereas one discordant station (Di = 3.01) is found for cluster 3. The heterogeneity measure is applied to the five clusters to calculate the H-statistics (H1, H2, and H3). For cluster 3, although the Di value is not very large, the heterogeneity measure is applied twice: first with the discordant station included in the cluster and then with the discordant station removed.
Table 8 presents the heterogeneity measures for each cluster. It can be seen from Table 8 that none of the clusters forms a homogeneous region. The lowest H-statistics are found for cluster 4; however, as its values lie in the range 1 ≤ H ≤ 2, cluster 4 is ‘possibly heterogeneous’. Two sets of H-statistics are shown for cluster 3 because one discordant station (station 419029) is found in this cluster. Removal of the discordant station does not improve the result for cluster 3. Although the values of H2 and H3 are smaller for some clusters, H1, which is the main indicator of heterogeneity in a group, is much higher than 1.00. It is of interest to check how these heterogeneous clusters perform in regional flood estimation. Hence, QRT is applied to each cluster in the next section, with leave-one-out validation used to assess its performance.

4.3.3. Development of Prediction Equation and Performance Testing

For the development of prediction equation, the dependent (flood quantiles) and predictor variables are natural-log transformed (i.e., a log-log modelling is adopted). A stepwise procedure based on their level of significance (p ≤ 0.1) is applied to select the best set of predictor variables. For different clusters, selection of the predictor variables generated different sets of equations because of their different levels of significance. The general regression equation for all clusters is given below for 5% AEP flood (Q20) and the regression coefficients for each variable for each cluster are given in Table 9.
The general form of the regression equation for the clusters:
$$\ln Q_{20} = \beta_0 + \beta_1 \ln(A) + \beta_2 \ln(I_{62}) + \beta_3 \ln(SF) + \beta_4 \ln(sden) + \beta_5 \ln(MAR) + \beta_6 \ln(MAE) + \beta_7 \ln(forest) + \beta_8 \ln(S1085),$$
The R² and adj_R² values of the models are found to be quite high for all the clusters except cluster 5. For cluster 1, the R² and adj_R² values of the selected model are 0.93 and 0.89, respectively; for cluster 2 they are 0.98 and 0.96, respectively; for cluster 3 they are 0.82 and 0.77, respectively; for cluster 4 they are 0.68 and 0.63, respectively; and for cluster 5 they are 0.66 and 0.48, respectively. It is evident that, except for cluster 5, the clusters generate regression models with satisfactory goodness-of-fit.
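To illustrate how a fitted log-log equation is applied, the sketch below back-transforms a prediction to discharge units and computes the adjusted coefficient of determination from its standard formula. The coefficients and catchment values are hypothetical and are not those in Table 9; predictors dropped by the stepwise selection are represented by a missing (zero) coefficient. The function names and dictionary keys are assumptions.

```python
import math

def predict_quantile(coefficients, characteristics):
    """Back-transform a log-log regression: Q = exp(b0 + sum of b_k * ln(x_k))."""
    log_q = coefficients["intercept"]
    for name, value in characteristics.items():
        # A predictor without a coefficient was not selected and contributes nothing.
        log_q += coefficients.get(name, 0.0) * math.log(value)
    return math.exp(log_q)

def adjusted_r2(r2, n_sites, n_predictors):
    """Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_sites - 1) / (n_sites - n_predictors - 1)

# Hypothetical model in which only catchment area A was selected.
hypothetical_model = {"intercept": math.log(2.0), "A": 0.5}
q20 = predict_quantile(hypothetical_model, {"A": 100.0, "I62": 40.0})
example_adj_r2 = adjusted_r2(0.5, n_sites=11, n_predictors=2)
print(q20, example_adj_r2)
```

The adjustment penalizes models with many predictors relative to the number of sites, which is one reason why a small cluster fitted with several predictors can show a large gap between R² and adj_R².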
Figure 9 and Figure 10 show the standardized residuals versus the fitted or predicted value plots and normality plots for 5% AEP flood for all the five clusters. It is necessary for the residuals of a linear regression model to satisfy homoscedastic pattern as heteroscedasticity in the residuals indicates that a non-linear model is more appropriate for the data. It is evident from the standardized residual versus the fitted value plots that the residuals lie between −2 and +2 and no specific pattern is visible in the plots. No pattern in the spread of the residuals indicates that the residuals are homoscedastic, which satisfies the linearity model assumption. The Q-Q plots do not completely follow the reference line except for cluster 2, yet there is no specific pattern, which indicates that the normality assumption is not grossly violated.
Figure 11 and Figure 12 show the boxplots of the selected quantiles for all the clusters in terms of RE and Qpred/Qobs ratio values. The expected line in Figure 10 is set at zero as it indicates an un-biased model. For Figure 12, the expected line is set at one as the ratio being one is indicative of an unbiased model. Figure 11 has a set boundary of −300 to +300, whereas Figure 12 has a set boundary of −3 to 3 to have better visibility.
It is seen in Figure 11 and Figure 12 that cluster 5 is the worst performing group. It is clear that most of the predictions are underestimations. The boxplots are not visible for all the quantiles as there are cases with a gross underestimation. The median of 50% AEP flood is far below the expected line in case of both Figure 11 and Figure 12 for cluster 5. Cluster 5 is made of stations having larger area than the other clusters. Poor performance of cluster 5 in QRT indicates that pooling of larger catchments into single group do not represent a viable choice in RFFA. Cluster 2 seems to be the best performing out of all the five clusters. Cluster 4 showing the best H1 value does not perform as good as cluster 2 in the leave-one-out validation.
Figure 13 shows plots of predicted vs observed flood quantiles for all the AEP floods for the five clusters. These plots generally present a good agreement between the predicted and observed flood quantiles. For cluster 1, there are a few cases of over-estimation when the observed flows are in between 100 m3/s to 200 m3/s and in case of larger observations ranging from 1200 m3/s to 1800 m3/s, there are some under-estimations by the regression model. For smaller discharges, cluster 2 seems to be performing well, although as the discharge gets larger the prediction by the model gets more erroneous, which is also visible from the boxplots. In case of clusters 3 and 4, the models perform well for smaller discharges; for the larger discharges, the models provide gross under-estimation. Cluster 5 is the worst performing group as seen from Figure 13. Cluster 2 performs the best in case of 5% AEP flood. As 5% AEP is the most frequently adopted flood quantile in design flood estimation, it can be said that regions formed based on small to medium sized area with a small range in other catchment characteristics will generate better prediction than other groups. Looking into the homogeneity analysis for all the five clusters, it can be concluded that homogeneity does not play a vital role in enhancing the prediction accuracy.
Table 10 and Table 11 compare the mean, median and standard deviation (Std Dev) of the absolute relative error (REabs) and the absolute Qpred/Qobs ratio for all the clusters and the selected AEPs. Both tables again confirm that cluster 5 performs worst, with very large REabs and Qpred/Qobs ratio values. Clusters 3 and 4 also show a mixture of under- and over-estimation. Clusters 1 and 2 show promising results, although for cluster 1 the mean REabs and mean Qpred/Qobs ratio values for the 1% AEP flood are quite high. Cluster 2 shows the lowest overall REabs and Qpred/Qobs values, further confirming its better performance.
Table 12 summarizes the leave-one-out validation statistics (MSE, RMSE, BIAS, RBIAS, RRMSE, RMSNE, REr, and med_Qpred/Qobs), based on observed and predicted flood values, for all five clusters for the 5% AEP flood. An MSE close to zero is preferable, as zero indicates no prediction error. However, Table 12 shows that the MSE values for the five clusters are all very large, in the range of 25,000 to 35,000,000. The smallest MSE, 25,309, is found for cluster 2. The RMSE values for the clusters range from 159 m³/s to 5800 m³/s. Cluster 2 shows the lowest RMSE (159 m³/s), which again indicates that it is the best performing group. Cluster 2 also shows the smallest values of RRMSE, RMSNE, BIAS, and RBIAS. Cluster 1 has large values for both BIAS and RBIAS (1480.61 and 388.4, respectively). Clusters 4 and 5 show a large negative BIAS, and cluster 5 shows a very large negative RBIAS. The error values here are notably higher than those reported in Durocher et al. [67], Chokmani and Ouarda [68], Chebana et al. [47], and Shu and Ouarda [69]. Rahman et al. [6] adopted independent component regression to develop flood prediction equations using the same data set as this study and obtained error values similar to those found here. It should be noted that the values of MSE, RMSE, and BIAS depend on catchment size: a larger catchment generally has a larger discharge, which is likely to result in higher MSE, RMSE, and BIAS.
Both REr and med_Qpred/Qobs are smallest for cluster 2. Cluster 5 is the worst performing group, with an REr of 403.44 and a med_Qpred/Qobs of −3.03. Hence, a group of stations with smaller catchment areas and a smaller range of other catchment characteristics, such as cluster 2, is likely to generate more accurate flood predictions with QRT in the study region.
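Several of the statistics discussed above can be sketched in a few lines. The definitions used here are common textbook forms and are assumptions rather than the exact equations of this study: BIAS is taken as the mean of predicted minus observed, and REr as the median absolute relative error in percent. The function name and the sample values are hypothetical.

```python
import math
import statistics


def evaluation_stats(q_obs, q_pred):
    """Return MSE, RMSE, BIAS, REr (%) and median Qpred/Qobs for paired flows."""
    errors = [p - o for o, p in zip(q_obs, q_pred)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    abs_rel_err = [abs(e) / o * 100.0 for e, o in zip(errors, q_obs)]
    ratios = [p / o for o, p in zip(q_obs, q_pred)]
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),               # same units as the flows (m^3/s)
        "BIAS": sum(errors) / len(errors),    # negative means under-estimation
        "REr": statistics.median(abs_rel_err),
        "med_ratio": statistics.median(ratios),
    }


# Hypothetical observed and predicted flood quantiles (m^3/s), not study data.
example = evaluation_stats([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])

# RMSE is the square root of MSE: the cluster 2 MSE of 25,309 gives about 159 m^3/s.
rmse_cluster2 = math.sqrt(25309)
```

Because MSE, RMSE, and BIAS carry the units of discharge, they grow with catchment size, whereas REr and the median ratio are scale free and are easier to compare across clusters.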

4.4. Comparison with ARR RFFA Model

This section compares the REr values of the ARR RFFA model [32] with those of the PCR KNN15, PCR KNN25, and QRT models for cluster 2. The ARR RFFA model is developed using a Bayesian generalized least squares based parameter regression technique to estimate regional flood quantiles from Australian flood data [32]. The REr values for the four models are compared in Table 13. Table 13 shows that the REr values for the ARR RFFA model (ranging from 56% to 64%) are generally greater than those of the PCR KNN15, PCR KNN25, and QRT models for cluster 2 (ranging from 42% to 69%, 42% to 60%, and 22% to 37%, respectively). The ARR RFFA model is developed with data from 558 stations from NSW, Victoria and Queensland [58], whereas the PCR KNN15, PCR KNN25, and QRT models for cluster 2 are developed with 15, 25 and 16 stations from NSW, respectively. This may be a reason for the differences in REr values. Nevertheless, it is promising that the REr values from the PCR KNN15, PCR KNN25, and QRT models for cluster 2 are comparable to, or lower than, those of the ARR RFFA model. From this study, it may be argued that PCR may not be a good choice in RFFA for NSW. Moreover, a group of stations with smaller catchment areas, such as cluster 2, may form a better RFFA grouping. Further research with additional catchment characteristics data may enhance the reliability of PCR and cluster analysis based RFFA models in the study region.
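The REr ranges quoted above for the two PCR models can be traced back to the median absolute relative errors in Table 5. The short sketch below takes the KNN15 and KNN25 medians for the six AEPs from that table and extracts their minimum and maximum; the variable names are illustrative only.

```python
# Median absolute relative error (%) for the 50%, 20%, 10%, 5%, 2% and 1% AEP floods,
# copied from the KNN15 and KNN25 columns of Table 5.
median_re = {
    "PCR KNN15": [42.70, 51.25, 55.39, 53.41, 58.87, 68.67],
    "PCR KNN25": [42.70, 43.64, 48.77, 52.62, 58.25, 59.66],
}

# Range of REr across the six AEPs for each model.
re_ranges = {model: (min(values), max(values)) for model, values in median_re.items()}

for model, (low, high) in re_ranges.items():
    print(f"{model}: {low:.2f}% to {high:.2f}%")
```

The extremes correspond to the 42% to 69% and 42% to 60% ranges stated in the text, with the smallest error at the 50% AEP flood and the largest at the 1% AEP flood in both cases.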

5. Conclusions

A total of 88 stations from NSW, Australia and eight catchment characteristics variables are used in this study to compare regression-based RFFA models. Principal components are derived by applying principal component analysis to the catchment characteristics data set, and a multiple linear regression technique is applied to predict flood quantiles. The first five principal components are selected as the predictor variables in the regression equations. According to the R2 and adj_R2 values of the developed regression equations, the principal component regression based RFFA models perform quite poorly.
The application of cluster analysis resulted in five clusters from the selected 88 stations. Cluster 1 has the smallest catchment areas and larger rainfall intensity, mean annual rainfall, mean annual evapo-transpiration, shape factor, forest, and stream density values. Stations in cluster 2 have smaller sized catchments along with moderate values for rainfall intensity, shape factor, stream density, and mean annual rainfall, although the mean annual evapo-transpiration is the highest in cluster 2. Cluster 3 seems to have all the catchment characteristics uniformly distributed. Cluster 4 is characterized by medium sized catchment areas and smaller rainfall intensity, mean annual rainfall, mean annual evapo-transpiration, shape factor, forest, and stream density values. The largest catchment areas are found in cluster 5, although its other characteristics are quite small, with a large variance in forest cover. A quantile regression technique is applied to all five clusters with leave-one-out validation. Based on the findings of this study, cluster 2 is the best performing group among the five clusters. It is also found that a group with relatively smaller catchment areas and a small range of other catchment characteristics is likely to result in more accurate RFFA models in the study region.
It is also found that cluster analysis does not generate any homogeneous groups of catchments for the selected dataset in NSW. Furthermore, the degree of heterogeneity does not have any link with RFFA model performance for the dataset used in this study. The best cluster-based quantile regression model (cluster 2) yields lower median relative errors (22% to 37%) than the currently recommended ARR RFFA model in Australia (56% to 64%), although the two models are developed from very different numbers of stations.

Author Contributions

Conceptualization, A.S.R. and A.R.; methodology, A.S.R.; software, A.S.R.; validation, A.S.R. and A.R.; formal analysis, A.S.R.; investigation, A.S.R.; data curation, A.S.R.; writing—original draft preparation, A.S.R.; writing—review and editing, A.R.; visualization, A.S.R. and A.R.; supervision, A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Blöschl, G.; Sivapalan, M.; Wagener, T.; Savenije, H.; Viglione, A. (Eds.) Runoff Prediction in Ungauged Basins: Synthesis across Processes, Places and Scales; Cambridge University Press: Cambridge, UK, 2013.
  2. Ouarda, T.B.M.J.; St-Hilaire, A.; Bobée, B. A review of recent developments in regional frequency analysis of hydrological extremes. Revue des Sciences de l’eau 2008, 21, 219–232.
  3. Ouarda, T.B.M.J.; Bâ, K.M.; Diaz-Delgado, C.; Carsteanu, A.; Chokmani, K.; Gingras, H.; Quentin, E.; Trujillo, E.; Bobée, A.B. Intercomparison of regional flood frequency estimation methods at ungauged sites for a Mexican case study. J. Hydrol. 2008, 348, 40–58.
  4. Haddad, K.; Rahman, A.; Ling, F. Regional flood frequency analysis method for Tasmania, Australia: A case study on the comparison of fixed region and region-of-influence approaches. Hydrol. Sci. J. 2015, 60, 2086–2101.
  5. Ouarda, T.B.M.J. Regional hydrological frequency analysis. In Encyclopedia of Environmetrics; El-Shaarawi, A.H., Piegorsch, W.W., Eds.; Wiley: New York, NY, USA, 2013.
  6. Rahman, A.S.; Khan, Z.; Rahman, A. Application of Independent Component Analysis in Regional Flood Frequency Analysis: Comparison between Quantile Regression and Parameter Regression Techniques. J. Hydrol. 2019, 581, 124372.
  7. Acreman, M.C. Regional Flood Frequency Analysis in the UK: Recent Research-New Ideas; Institute of Hydrology: Wallingford, UK, 1987.
  8. Acreman, M.C.; Sinclair, C.D. Classification of drainage basins according to their physical characteristics; an application for flood frequency analysis in Scotland. J. Hydrol. 1986, 84, 365–380.
  9. Eng, K.; Tasker, G.D.; Milly, P.C.D. An analysis of region-of-influence methods for flood regionalisation in the Gulf-Atlantic rolling plains. J. Am. Water Resour. Assoc. 2005, 41, 135–143.
  10. Pilgrim, D.H. Australian Rainfall and Runoff; Institution of Engineers: Barton, Australia, 1987.
  11. Tasker, G.D.; Hodge, S.A.; Barks, C.S. Region of influence regression for estimating the 50 year flood at ungauged sites. J. Am. Water Resour. Assoc. 1996, 32, 163–170.
  12. Burn, D.H. An appraisal of the “region of influence” approach to flood frequency analysis. Hydrol. Sci. J. 1990, 35, 149–165.
  13. Burn, D.H. Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res. 1990, 26, 2257–2265.
  14. Chebana, F.; Ouarda, T.B.M.J. Depth and homogeneity in regional flood frequency analysis. Water Resour. Res. 2008, 44, W11422.
  15. Hosking, J.R.M.; Wallis, J.R. Some statistics useful in regional frequency analysis. Water Resour. Res. 1993, 29, 271–281.
  16. Merz, R.; Blöschl, G. Flood frequency regionalisation—Spatial proximity vs. catchment attributes. J. Hydrol. 2005, 302, 283–306.
  17. Burn, D.H. Cluster analysis as applied to regional flood frequency. J. Water Res. Plan. Man. 1989, 115, 567–582.
  18. Burn, D.H.; Boorman, D.B. Estimation of hydrological parameters at ungauged catchments. J. Hydrol. 1993, 143, 429–454.
  19. Himeidan, Y.E.S.; Hamid, E.E.H. Rainfall variability in New Halfa agricultural scheme (Sudan). Univ. Khartoum J. Agric. Sci. 2019, 14, 383–391.
  20. Hughes, J.M.R.; James, B. A hydrological regionalization of streams in Victoria, Australia, with implications for stream ecology. Mar. Freshw. Res. 1989, 40, 303–326.
  21. Mosley, M.P. Delimitation of New Zealand hydrologic regions. J. Hydrol. 1981, 49, 173–192.
  22. Nathan, R.J.; McMahon, T.A. Identification of homogeneous regions for the purposes of regionalisation. J. Hydrol. 1990, 121, 217–238.
  23. Rasheed, A.; Egodawatta, P.; Goonetilleke, A.; McGree, J. A Novel Approach for Delineation of Homogeneous Rainfall Regions for Water Sensitive Urban Design—A Case Study in Southeast Queensland. Water 2019, 11, 570.
  24. Santos, C.A.G.; Moura, R.; da Silva, R.M.; Costa, S.G.F. Cluster Analysis Applied to Spatiotemporal Variability of Monthly Precipitation over Paraíba State Using Tropical Rainfall Measuring Mission (TRMM) Data. Remote Sens. 2019, 11, 637.
  25. Tasker, G.D. Comparing methods of hydrologic regionalisation. J. Am. Water Resour. Assoc. 1982, 18, 965–970.
  26. Hosking, J.R.M.; Wallis, J.R. Regional Frequency Analysis: An Approach based on L-moments; Cambridge University Press: New York, NY, USA, 1997.
  27. Eng, K.; Milly, P.C.; Tasker, G.D. Flood regionalisation: A hybrid geographic and predictor-variable region-of-influence regression method. J. Hydrol. Eng. 2007, 12, 585–591.
  28. Eng, K.; Stedinger, J.R.; Gruber, A.M. Regionalisation of streamflow characteristics for the Gulf-Atlantic rolling plains using leverage-guided region-of-influence regression. In Proceedings of the World Environmental and Water Resources Congress 2007: Restoring Our Natural Habitat, Tampa, Florida, 15–19 May 2007; pp. 1–11.
  29. Gaál, L.; Kyselý, J.; Szolgay, J. Region-of-influence approach to a frequency analysis of heavy precipitation in Slovakia. Hydrol. Earth Sys. Sci. Discuss. 2008, 12, 825–839.
  30. Haddad, K.; Rahman, A. Regional flood frequency analysis in eastern Australia: Bayesian GLS regression-based methods within fixed region and ROI framework: Quantile regression vs. parameter regression technique. J. Hydrol. 2012, 430–431, 142–161.
  31. Micevski, T.; Hackelbusch, A.; Haddad, K.; Kuczera, G.; Rahman, A. Regionalisation of the parameters of the log-Pearson 3 distribution: A case study for New South Wales, Australia. Hydrol. Process. 2015, 29, 250–260.
  32. Rahman, A.; Haddad, K.; Kuczera, G.; Weinmann, P.E. Regional flood methods. Aust. Rainfall Runoff. 2019, 3, 105–146.
  33. Zrinji, Z.; Burn, D.H. Regional flood frequency with hierarchical region of influence. J. Water Res. Plan. Man. 1996, 122, 245–252.
  34. Burn, D.H.; Goel, N.K. The formation of groups for regional flood frequency analysis. Hydrol. Sci. J. 2000, 45, 97–112.
  35. Castellarin, A.; Burn, D.H.; Brath, A. Assessing the effectiveness of hydrological similarity measures for regional flood frequency analysis. J. Hydrol. 2001, 241, 270–285.
  36. Burn, D.H. Catchment similarity for regional flood frequency analysis using seasonality measures. J. Hydrol. 1997, 202, 212–230.
  37. Lim, Y.H.; Lye, L.M. Regional flood estimation for ungauged basins in Sarawak, Malaysia. Hydrol. Sci. J. 2003, 48, 79–94.
  38. Zrinji, Z.; Burn, D.H. Flood frequency analysis for ungauged sites using a region of influence approach. J. Hydrol. 1994, 153, 1–21.
  39. Bates, B.C.; Rahman, A.; Mein, R.G.; Weinmann, P.E. Climatic and physical factors that influence the homogeneity of regional floods in south-eastern Australia. Water Resour. Res. 1998, 34, 3369–3382.
  40. Fill, H.D.; Stedinger, J.R. Using regional regression within IF procedures and an empirical Bayesian estimator. J. Hydrol. 1998, 210, 128–145.
  41. Haddad, K.; Rahman, A.; Stedinger, J.R. Regional flood frequency analysis using Bayesian generalized least squares: A comparison between quantile and parameter regression techniques. Hydrol. Process. 2012, 26, 1008–1021.
  42. Griffis, V.W.; Stedinger, J.R. The use of GLS regression in regional hydrologic analyses. J. Hydrol. 2007, 344, 82–95.
  43. Micevski, T.; Kuczera, G. Combining site and regional flood information using a Bayesian Monte Carlo approach. Water Resour. Res. 2009, 45.
  44. Ouali, D.; Chebana, F.; Ouarda, T.B.M.J. Quantile regression in regional frequency analysis: A better exploitation of the available information. J. Hydrometeorol. 2016, 17, 1869–1883.
  45. Rahman, A.; Charron, C.; Ouarda, T.B.M.J.; Chebana, F. Development of regional flood frequency analysis techniques using generalized additive models for Australia. Stoch. Environ. Res. Risk A 2018, 32, 123–139.
  46. Rahman, A. A quantile regression technique to estimate design floods for ungauged catchments in south-east Australia. Australas. J. Water Resour. 2005, 9, 81–89.
  47. Chebana, F.; Charron, C.; Ouarda, T.B.M.J.; Martel, B. Regional frequency analysis at ungauged sites with the generalized additive model. J. Hydrometeorol. 2014, 15, 2418–2428.
  48. Burn, D.H. Delineation of groups for regional flood frequency analysis. J. Hydrol. 1988, 104, 345–361.
  49. DeCoursey, D.G.; Deal, R.B. General Aspects of Multivariate Analysis with Applications. Misc. Publ. 1974, 1275, 47.
  50. Hawley, M.E.; McCuen, R.H. Water yield estimation in western United States. J. Irrig. Drain. Div. 1982, 108, 25–34.
  51. Kar, A.K.; Goel, N.K.; Lohani, A.K.; Roy, G.P. Application of clustering techniques using prioritized variables in regional flood frequency analysis—Case study of Mahanadi Basin. J. Hydrol. Eng. 2011, 17, 213–223.
  52. Choi, T.H.; Kwon, O.E.; Koo, J.Y. Water demand forecasting by characteristics of city using principal component and cluster analyses. Environ. Eng. Res. 2010, 15, 135–140.
  53. Haque, M.M.; de Souza, A.; Rahman, A. Water demand modelling using independent component regression technique. Water Resour. Manag. 2017, 31, 299–312.
  54. Haque, M.M.; Rahman, A.; Hagare, D.; Kibria, G. Principal component regression analysis in water demand forecasting: An application to the Blue Mountains, NSW, Australia. J. Hydrol. Environ. Res. 2013, 1, 49–59.
  55. Koo, J.Y.; Yu, M.J.; Kim, S.G.; Shim, M.H.; Koizumi, A. Estimating regional water demand in Seoul, South Korea, using principal component and cluster analysis. Water Sci. Tech. Water Supply 2005, 5, 1–7.
  56. Ball, J.; Babister, M.; Nathan, R.; Weeks, W.; Weinmann, P.E.; Retallick, M.; Testoni, I. Australian Rainfall and Runoff-A Guide to Flood Estimation; Engineers Australia: Canberra, Australia, 2019.
  57. Rahman, A.; Haddad, K.; Haque, M.; Kuczera, G.; Weinmann, P.E. Australian Rainfall and Runoff Project 5: Regional Flood Methods: Stage 3 Report (No. P5/S3, p. 025); Technical Report; Engineers Australia: Canberra, Australia, 2015.
  58. Çamdevıren, H.; Demir, N.; Kanik, A.; Keskin, S. Use of principal component scores in multiple linear regression models for prediction of Chlorophyll-a in reservoirs. Ecol. Model. 2005, 181, 581–589.
  59. Olsen, R.L.; Chappell, R.W.; Loftis, J.C. Water quality sample collection, data treatment and results presentation for principal components analysis–literature review and Illinois River watershed case study. Water Res. 2012, 46, 3110–3122.
  60. Pires, J.C.M.; Martins, F.G.; Sousa, S.I.V.; Alvim-Ferraz, M.C.M.; Pereira, M.C. Selection and validation of parameters in multiple linear and principal component regressions. Environ. Modell. Softw. 2008, 23, 50–55.
  61. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice-Hall International, Inc.: New Jersey, NJ, USA, 2007.
  62. Baeriswyl, P.A.; Rebetez, M. Regionalization of precipitation in Switzerland by means of principal component analysis. Theor. Appl. Climatol. 1997, 58, 31–41.
  63. Bhaskar, N.R.; O’Connor, C.A. Comparison of method of residuals and cluster analysis for flood regionalization. J. Water Resour. Plan. Manag. 1989, 115, 793–808.
  64. Dinpashoh, Y.; Fakheri-Fard, A.; Moghaddam, M.; Jahanbakhsh, S.; Mirnia, M. Selection of variables for the purpose of regionalization of Iran’s precipitation climate using multivariate methods. J. Hydrol. 2004, 297, 109–123.
  65. Rao, A.R.; Srinivas, V.V. Regionalization of watersheds by hybrid-cluster analysis. J. Hydrol. 2006, 318, 37–56.
  66. Kuczera, G. FLIKE HELP; Chapter 2 FLIKE Notes; University of Newcastle: Callaghan, Australia, 1999.
  67. Durocher, M.; Burn, D.H.; Zadeh, S.M. A nationwide regional flood frequency analysis at ungauged sites using ROI/GLS with copulas and super regions. J. Hydrol. 2018, 567, 191–202.
  68. Chokmani, K.; Ouarda, T.B.M.J. Physiographical space-based kriging for regional flood frequency estimation at ungauged sites. Water Resour. Res. 2004, 40.
  69. Shu, C.; Ouarda, T.B.M.J. Flood frequency analysis at ungauged sites using artificial neural networks in canonical correlation analysis physiographic space. Water Resour. Res. 2007, 43.
Figure 1. Spatial location of the selected 88 catchments in New South Wales (NSW), Australia.
Figure 2. Correlation between the predictor variables before applying principal component analysis (PCA).
Figure 3. Uncorrelated principal components (PCs) after application of PCA.
Figure 4. Boxplots for RE (%) of fixed region (FR) and ROI (in case of 5% AEP flood).
Figure 5. Boxplots for Qpred/Qobs of fixed region (FR) and ROI (in case of 5% AEP flood).
Figure 6. Result of Ward’s hierarchical clustering method by using eight selected catchment characteristics (dendrogram).
Figure 7 and Figure 8 show boxplots of the distribution of the eight predictors for the clusters. Figure 7 shows the boxplots for ‘A’, ‘I62’, ‘SF’, and ‘sden’, and Figure 8 shows the boxplots for ‘MAR’, ‘MAE’, ‘S1085’, and ‘forest’. Figure 7 and Figure 8 indicate that the stations in cluster 1 have the smallest catchment area range and the largest design rainfall intensity, as well as larger ‘MAR’ and ‘MAE’ values. These stations also seem to have higher percentages of forest area along with higher stream density. Stations in cluster 2 have moderate catchment areas, along with moderate ‘I62’, ‘SF’, ‘sden’, and ‘MAR’ values; however, ‘MAE’ is the highest in cluster 2. Cluster 3 seems to have all the predictors relatively uniformly distributed. Cluster 4 is characterized by medium catchment areas and smaller ‘I62’, ‘SF’, ‘sden’, ‘MAR’, ‘MAE’, ‘S1085’, and ‘forest’ values. The largest areas are found in cluster 5, although its other predictors are quite small, with a large variance in fraction forest area.
Figure 7. Boxplots of area, rainfall intensity, shape factor and stream density for all clusters: (a) boxplot of area; (b) boxplot of rainfall intensity; (c) boxplot of shape factor; and (d) boxplot of stream density.
Figure 8. Boxplots of mean annual rainfall, mean annual evapo-transpiration, slope and fraction forest for all the clusters: (a) boxplot of mean annual rainfall; (b) boxplot of mean annual evapo-transpiration; (c) boxplot of slope; and (d) boxplot of fraction forest.
Figure 9. Standardized residuals vs fitted value plots for all clusters in case of 5% AEP flood.
Figure 10. Normality plots for all clusters in case of 5% AEP flood.
Figure 11. Boxplots of RE for all clusters and AEPs: (a) cluster 1 (C1); (b) cluster 2 (C2); (c) cluster 3 (C3); (d) cluster 4 (C4); and (e) cluster 5 (C5).
Figure 12. Boxplots of Qpred/Qobs ratios for all clusters and AEPs: (a) cluster 1 (C1); (b) cluster 2 (C2); (c) cluster 3 (C3); (d) cluster 4 (C4); and (e) cluster 5 (C5).
Figure 13. Comparison of observed and predicted flood quantiles for all the quantiles and clusters.
Table 1. Summary of descriptive statistics of selected catchment and climatic characteristics.
| Variables | Range | Median | Mean | Standard Deviation |
|---|---|---|---|---|
| Catchment area (A) in km² | 8–1010 | 260 | 351.9 | 281.4 |
| Rainfall intensity (I62) in mm/h | 31.3–87.3 | 43.1 | 45.4 | 11.3 |
| Shape factor (SF) | 0.3–1.6 | 0.8 | 0.8 | 0.2 |
| Stream density (sden) in /km | 0.5–5.5 | 2.8 | 2.7 | 1.1 |
| Mean annual rainfall (MAR) in mm | 626.2–1953.2 | 1000.3 | 909.9 | 304.5 |
| Mean annual evapo-transpiration (MAE) in mm/y | 980.4–1543.3 | 1223.7 | 1185.6 | 126.3 |
| Fraction forest (forest) | 0–1 | 0.5 | 0.5 | 0.3 |
| Mainstream slope (S1085) in m/km | 1.5–49.8 | 12.9 | 9.1 | 10.8 |
Table 2. Correlation coefficients with their corresponding p-values between the independent variables.
| Variable, r (p-value) | A | I62 | SF | sden | forest | MAR | MAE |
|---|---|---|---|---|---|---|---|
| I62 | −0.208 (0.052) | | | | | | |
| SF | −0.054 (0.619) | 0.035 (0.746) | | | | | |
| sden | −0.175 (0.102) | 0.367 (0) | 0.037 (0.733) | | | | |
| forest | −0.116 (0.283) | 0.33 (0.002) | −0.007 (0.951) | 0.046 (0.667) | | | |
| MAR | −0.314 (0.003) | 0.83 (0) | −0.058 (0.592) | 0.361 (0.001) | 0.405 (0) | | |
| MAE | −0.094 (0.381) | 0.671 (0) | 0.136 (0.206) | 0.392 (0) | −0.031 (0.771) | 0.533 (0) | |
| S1085 | −0.331 (0.002) | −0.121 (0.262) | 0.051 (0.637) | −0.081 (0.451) | 0.387 (0) | −0.021 (0.844) | −0.286 (0.007) |
Table 3. Eigenvalue of different components and their significance level.
| Name | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Eigenvalue | 2.822 | 1.641 | 1.070 | 0.915 | 0.702 | 0.432 | 0.278 | 0.141 |
| Proportion | 0.353 | 0.205 | 0.134 | 0.114 | 0.088 | 0.054 | 0.035 | 0.018 |
| Cumulative | 0.353 | 0.558 | 0.692 | 0.806 | 0.894 | 0.948 | 0.982 | 1 |
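The Proportion and Cumulative rows of Table 3 follow directly from the eigenvalues: each proportion is the eigenvalue divided by the sum of all eigenvalues. The sketch below reproduces that arithmetic from the tabulated eigenvalues; the variable names are illustrative only.

```python
from itertools import accumulate

# Eigenvalues of PC1 to PC8 as listed in Table 3.
eigenvalues = [2.822, 1.641, 1.070, 0.915, 0.702, 0.432, 0.278, 0.141]

total = sum(eigenvalues)  # close to 8, the number of standardized predictors
proportion = [ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f}, cumulative={c:.3f}")
```

The first five components together account for about 89% of the total variance, which corresponds to the five principal components used as predictors in the regression equations.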
Table 4. Statistical evaluation for fixed region (FR) and region of influence (ROI) for 5% annual exceedance probability (AEP) flood.
| 5% AEP | KNN10 | KNN15 | KNN20 | KNN25 | KNN30 | KNN40 | KNN50 | KNN60 | KNN70 | KNN80 | FR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MSE | 443296.69 | 228884.66 | 183799.06 | 187180.23 | 179135.55 | 181519.00 | 170279.10 | 166945.45 | 163644.38 | 163850.83 | 163447.17 |
| RMSE | 665.81 | 478.42 | 428.72 | 432.64 | 423.24 | 426.05 | 412.65 | 408.59 | 404.53 | 404.78 | 404.29 |
| BIAS | −65.51 | 1.20 | −11.94 | −13.12 | −0.82 | −20.07 | −14.73 | −21.27 | −18.68 | −6.20 | −0.29 |
| RBIAS | 22.24 | −0.44 | 55.31 | 63.94 | 54.40 | 61.69 | 65.54 | 56.98 | 54.34 | 69.90 | 65.48 |
| RRMSE | 0.11 | 0.00 | 0.02 | 0.02 | 0.00 | 0.03 | 0.03 | 0.04 | 0.03 | 0.01 | 0.00 |
| RMSNE | 5.40 | 3.15 | 3.28 | 3.03 | 2.38 | 2.77 | 2.28 | 2.05 | 2.05 | 2.69 | 2.43 |
| med_REr | 59.03 | 53.41 | 54.46 | 52.61 | 55.12 | 51.48 | 49.01 | 50.74 | 47.15 | 46.59 | 48.08 |
| med_Qpred/Qobs | 1.03 | 1.01 | 1.16 | 1.07 | 1.14 | 1.19 | 1.20 | 1.16 | 1.16 | 1.18 | 1.17 |
Table 5. Mean, median and standard deviation of relative error for fixed region (FR) and ROI.
| AEP | RE | FR | KNN10 | KNN15 | KNN20 | KNN25 | KNN30 | KNN40 | KNN50 | KNN60 | KNN70 | KNN80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50% | Mean_abs | 139.75 | 229.93 | 114.81 | 152.82 | 123.56 | 133.99 | 122.83 | 122.97 | 127.89 | 131.03 | 133.88 |
| 50% | Median_abs | 51.76 | 63.84 | 42.70 | 45.76 | 42.70 | 48.91 | 42.19 | 44.84 | 46.70 | 49.10 | 50.12 |
| 50% | Std Dev_abs | 356.39 | 623.98 | 156.38 | 310.14 | 204.57 | 238.60 | 218.48 | 247.15 | 227.90 | 266.56 | 323.67 |
| 20% | Mean_abs | 122.18 | 206.85 | 117.88 | 134.93 | 132.02 | 126.40 | 117.35 | 112.29 | 111.24 | 113.35 | 125.28 |
| 20% | Median_abs | 48.32 | 51.22 | 51.25 | 54.02 | 43.64 | 50.44 | 46.03 | 50.88 | 50.09 | 46.27 | 45.36 |
| 20% | Std Dev_abs | 276.38 | 530.31 | 164.84 | 242.12 | 211.61 | 243.99 | 264.68 | 230.54 | 244.19 | 246.78 | 286.58 |
| 10% | Mean_abs | 118.13 | 208.47 | 130.45 | 140.34 | 135.02 | 127.57 | 125.56 | 115.12 | 105.84 | 107.25 | 122.45 |
| 10% | Median_abs | 48.62 | 55.19 | 55.39 | 54.23 | 48.77 | 50.32 | 49.57 | 51.92 | 53.24 | 47.21 | 41.70 |
| 10% | Std Dev_abs | 223.95 | 504.74 | 182.24 | 260.47 | 236.11 | 225.52 | 247.62 | 200.09 | 197.48 | 200.13 | 247.30 |
| 5% | Mean_abs | 118.95 | 213.73 | 153.93 | 143.57 | 135.83 | 127.49 | 132.75 | 121.48 | 111.44 | 108.40 | 124.36 |
| 5% | Median_abs | 48.09 | 59.03 | 53.41 | 54.46 | 52.62 | 55.13 | 51.48 | 49.02 | 50.74 | 47.15 | 46.59 |
| 5% | Std Dev_abs | 213.32 | 498.85 | 276.47 | 297.00 | 272.61 | 201.63 | 244.73 | 193.92 | 172.80 | 174.95 | 239.47 |
| 2% | Mean_abs | 130.02 | 225.32 | 197.48 | 154.64 | 144.06 | 140.61 | 150.67 | 137.73 | 126.73 | 118.43 | 134.54 |
| 2% | Median_abs | 52.16 | 66.95 | 58.87 | 56.14 | 58.25 | 60.52 | 54.35 | 52.14 | 50.76 | 53.10 | 53.28 |
| 2% | Std Dev_abs | 258.97 | 516.14 | 542.05 | 374.04 | 359.08 | 251.26 | 308.61 | 256.72 | 218.79 | 194.39 | 279.09 |
| 1% | Mean_abs | 143.45 | 239.70 | 248.66 | 174.50 | 164.45 | 163.20 | 170.96 | 156.71 | 144.20 | 130.90 | 147.16 |
| 1% | Median_abs | 53.56 | 71.95 | 68.67 | 61.21 | 59.66 | 59.80 | 52.97 | 50.99 | 49.75 | 47.46 | 51.92 |
| 1% | Std Dev_abs | 322.07 | 548.06 | 848.84 | 458.31 | 459.05 | 351.10 | 400.20 | 341.21 | 294.90 | 239.05 | 335.53 |
| Overall | Mean | 128.75 | 220.67 | 160.53 | 150.13 | 139.16 | 136.54 | 136.69 | 149.15 | 121.22 | 131.30 | 133.48 |
| Overall | Median | 49.77 | 61.39 | 54.90 | 54.42 | 51.31 | 55.06 | 50.63 | 47.42 | 50.17 | 45.15 | 45.95 |
| Overall | Std Dev | 278.68 | 536.23 | 443.19 | 330.54 | 303.33 | 255.48 | 286.29 | 350.60 | 228.51 | 255.40 | 264.04 |
Table 6. Mean, median and standard deviation of Qpred/Qobs ratio for fixed region (FR) and ROI.
Table 6. Mean, median and standard deviation of Qpred/Qobs ratio for fixed region (FR) and ROI.
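| AEP | Ratio statistic | FR | KNN10 | KNN15 | KNN20 | KNN25 | KNN30 | KNN40 | KNN50 | KNN60 | KNN70 | KNN80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50% | Mean_abs | 2.00 | 2.72 | 1.67 | 2.09 | 1.84 | 1.96 | 1.81 | 1.84 | 1.86 | 1.87 | 1.94 |
| 50% | Median_abs | 1.16 | 1.09 | 1.10 | 1.12 | 1.15 | 1.12 | 1.10 | 1.12 | 1.17 | 1.16 | 1.14 |
| 50% | Std Dev_abs | 3.44 | 6.20 | 1.60 | 2.95 | 2.02 | 2.39 | 2.10 | 2.33 | 2.12 | 2.56 | 3.12 |
| 20% | Mean_abs | 1.85 | 2.53 | 1.69 | 1.90 | 1.90 | 1.85 | 1.79 | 1.77 | 1.71 | 1.73 | 1.89 |
| 20% | Median_abs | 1.14 | 1.14 | 1.09 | 1.19 | 1.20 | 1.23 | 1.09 | 1.18 | 1.09 | 1.11 | 1.16 |
| 20% | Std Dev_abs | 2.68 | 5.23 | 1.60 | 2.38 | 2.08 | 2.36 | 2.57 | 2.20 | 2.36 | 2.40 | 2.80 |
| 10% | Mean_abs | 1.83 | 2.53 | 1.81 | 1.93 | 1.92 | 1.87 | 1.88 | 1.81 | 1.72 | 1.73 | 1.89 |
| 10% | Median_abs | 1.15 | 1.17 | 1.24 | 1.19 | 1.13 | 1.33 | 1.12 | 1.22 | 1.17 | 1.14 | 1.13 |
| 10% | Std Dev_abs | 2.23 | 4.98 | 1.70 | 2.62 | 2.38 | 2.22 | 2.44 | 1.97 | 1.94 | 1.98 | 2.46 |
| 5% | Mean_abs | 1.88 | 2.58 | 2.04 | 1.97 | 1.94 | 1.87 | 1.98 | 1.90 | 1.80 | 1.76 | 1.93 |
| 5% | Median_abs | 1.19 | 1.26 | 1.19 | 1.24 | 1.12 | 1.21 | 1.22 | 1.22 | 1.21 | 1.18 | 1.20 |
| 5% | Std Dev_abs | 2.18 | 4.93 | 2.61 | 3.02 | 2.79 | 2.07 | 2.47 | 1.99 | 1.77 | 1.79 | 2.44 |
| 2% | Mean_abs | 2.00 | 2.67 | 2.45 | 2.08 | 2.07 | 2.04 | 2.16 | 2.07 | 1.96 | 1.86 | 2.04 |
| 2% | Median_abs | 1.17 | 1.28 | 1.20 | 1.13 | 1.21 | 1.23 | 1.22 | 1.21 | 1.26 | 1.21 | 1.21 |
| 2% | Std Dev_abs | 2.70 | 5.12 | 5.29 | 3.83 | 3.68 | 2.61 | 3.16 | 2.68 | 2.30 | 2.07 | 2.89 |
| 1% | Mean_abs | 2.15 | 2.77 | 2.95 | 2.29 | 2.27 | 2.27 | 2.37 | 2.25 | 2.14 | 1.99 | 2.17 |
| 1% | Median_abs | 1.27 | 1.22 | 1.29 | 1.22 | 1.33 | 1.33 | 1.17 | 1.20 | 1.26 | 1.23 | 1.26 |
| 1% | Std Dev_abs | 3.33 | 5.47 | 8.36 | 4.68 | 4.69 | 3.62 | 4.09 | 3.54 | 3.08 | 2.54 | 3.47 |
| Overall | Mean | 1.95 | 2.63 | 2.10 | 2.04 | 1.99 | 1.98 | 2.00 | 1.94 | 1.87 | 1.82 | 1.98 |
| Overall | Median | 1.17 | 1.18 | 1.14 | 1.17 | 1.19 | 1.19 | 1.16 | 1.18 | 1.18 | 1.17 | 1.20 |
| Overall | Std Dev | 2.79 | 5.31 | 4.34 | 3.32 | 3.08 | 2.59 | 2.87 | 2.50 | 2.29 | 2.23 | 2.87 |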
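
As a rough illustration of how summary statistics like those in Table 6 could be produced from leave-one-out predictions, the sketch below computes the mean, median and standard deviation of the Qpred/Qobs ratio and of the absolute relative error. The function name `summarise`, the sample discharges, the use of the sample standard deviation, and the definition RE = (Qpred − Qobs)/Qobs × 100 are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, median, stdev


def summarise(q_pred, q_obs):
    """Return (mean, median, std dev) of the Qpred/Qobs ratio and of |RE| in %."""
    ratios = [p / o for p, o in zip(q_pred, q_obs)]
    # Assumed definition: RE = (Qpred - Qobs) / Qobs * 100
    abs_re = [abs(p - o) / o * 100.0 for p, o in zip(q_pred, q_obs)]
    return {
        "ratio": (mean(ratios), median(ratios), stdev(ratios)),
        "abs_re": (mean(abs_re), median(abs_re), stdev(abs_re)),
    }


# Hypothetical observed and predicted flood quantiles (m^3/s), illustration only
q_obs = [100.0, 200.0, 400.0, 50.0]
q_pred = [120.0, 150.0, 400.0, 100.0]

summary = summarise(q_pred, q_obs)
print(summary)
```

Because the ratio is bounded below by zero but unbounded above, a few large overestimates pull the mean well above the median. Table 6 shows that pattern: the overall means lie between about 1.8 and 2.6, while the overall medians stay between 1.14 and 1.20.
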
Table 7. Detail of the clusters formed by Ward’s hierarchical clustering method.
| Characteristic: range (median) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| No. of stations | 15 | 16 | 16 | 23 | 18 |
| Period of records | 29–80 (36) | 26–71 (40) | 30–82 (36) | 25–70 (37) | 32–56 (37.5) |
| Area (km²) | 20–363 (156) | 103–673 (194.5) | 8–391 (161) | 14–740 (365) | 454–1010 (835.5) |
| I62 (mm) | 50–88 (58.4) | 31–54 (43.1) | 76–133 (41.3) | 31–48 (38.4) | 34–54 (44.9) |
| SF | 0.4–1.02 (0.8) | 0.4–1.02 (0.8) | 0.4–1.7 (0.8) | 0.2–1.2 (0.7) | 0.4–1 (0.8) |
| sden | 2.2–5.2 (3.9) | 1–3 (1.9) | 3.2–5.5 (3.9) | 0.5–3.1 (1.7) | 2–5 (3.1) |
| MAR (mm) | 1128–1954 (1480.2) | 672–1310 (937.5) | 744–1289 (815.2) | 656–1204 (851.2) | 626–1265 (791.9) |
| MAE (mm) | 1280–1544 (1382.7) | 980–1341 (1094.6) | 1069–1378 (1200.5) | 1044–1342 (1165.2) | 1107–1396 (1245.9) |
| S1085 | 3–23 (9.9) | 8–50 (29.6) | 4–28 (11.2) | 1–18 (6.7) | 3–19 (7.2) |
| forest | −1 (0.9) | 0.5–1 (0.9) | 0.05–1 (0.3) | 0–0.83 (0.2) | 0.03–1 (0.5) |
Table 8. Heterogeneity measures for each cluster.
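| | Number of Stations | H1 | H2 | H3 |
|---|---|---|---|---|
| Cluster 1 | 15 | 5.11 | 4.93 | 3.71 |
| Cluster 2 | 16 | 7.38 | 2.92 | −0.05 |
| Cluster 3 | 16 | 7.59 | 6.13 | 3.62 |
| | 15 | 7.55 | 5.26 | 2.74 |
| Cluster 4 | 23 | 1.93 | 1.16 | 0.64 |
| Cluster 5 | 18 | 5.54 | 4.06 | 2.70 |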
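
Assuming H1, H2 and H3 are the Hosking and Wallis heterogeneity measures, the thresholds commonly quoted for them (H < 1 acceptably homogeneous, 1 ≤ H < 2 possibly heterogeneous, H ≥ 2 definitely heterogeneous) can be applied to the H1 column of Table 8 as in the sketch below. The function name `classify` is hypothetical, and the unlabelled row of the table is left out.

```python
def classify(h):
    """Label a heterogeneity measure using the assumed conventional thresholds."""
    if h < 1.0:
        return "acceptably homogeneous"
    if h < 2.0:
        return "possibly heterogeneous"
    return "definitely heterogeneous"


# H1 values for the labelled clusters in Table 8
h1 = {
    "Cluster 1": 5.11,
    "Cluster 2": 7.38,
    "Cluster 3": 7.59,
    "Cluster 4": 1.93,
    "Cluster 5": 5.54,
}

labels = {name: classify(value) for name, value in h1.items()}
for name, label in labels.items():
    print(f"{name}: H1 = {h1[name]:.2f} -> {label}")
```

Under these assumed thresholds, Cluster 4 is the only cluster with H1 below 2.
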
Table 9. Coefficients of each variable for each cluster in the regression (Equation (16)) for design flood estimation.
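| Cluster | β0 | β1 | β2 | β3 | β4 | β5 | β6 | β7 | β8 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.29 | 0.94 | 1.53 | 0 | 0.75 | 0 | −2.14 | −0.15 | 0 |
| 2 | −9.84 | 0.29 | 7.26 | −1.80 | −0.52 | −3.93 | 3.75 | −0.80 | 0 |
| 3 | −4.62 | 0.81 | 3.19 | −1.21 | 0 | 0 | 0 | 0 | 0 |
| 4 | −0.94 | 0.72 | 0.96 | 0 | 0.52 | 0 | 0 | 0 | 0 |
| 5 | 4.10 | 0.49 | 4.42 | −0.34 | 0.74 | −0.62 | −2.65 | 0 | −0.22 |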
Table 10. Comparison of mean, median and standard deviation of REabs for all five clusters (blue indicates the lowest value).
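| AEP | RE statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|---|
| 50% | Mean_abs | 76.25 | 56.02 | 104.51 | 64.69 | 471.67 |
| 50% | Median_abs | 48.41 | 36.56 | 54.86 | 33.76 | 183.76 |
| 50% | Std Dev_abs | 89.91 | 70.76 | 108.34 | 76.86 | 580.00 |
| 20% | Mean_abs | 89.88 | 36.57 | 71.11 | 82.74 | 542.75 |
| 20% | Median_abs | 43.06 | 25.92 | 42.39 | 39.87 | 424.41 |
| 20% | Std Dev_abs | 210.38 | 29.72 | 89.01 | 113.28 | 317.63 |
| 10% | Mean_abs | 173.56 | 27.91 | 66.26 | 90.36 | 503.40 |
| 10% | Median_abs | 31.84 | 24.90 | 35.48 | 45.83 | 422.76 |
| 10% | Std Dev_abs | 558.54 | 25.44 | 79.88 | 130.68 | 254.32 |
| 5% | Mean_abs | 412.96 | 27.82 | 66.62 | 94.14 | 453.41 |
| 5% | Median_abs | 28.01 | 22.86 | 32.80 | 46.33 | 403.45 |
| 5% | Std Dev_abs | 1500.24 | 24.66 | 76.09 | 140.28 | 208.10 |
| 2% | Mean_abs | 1427.02 | 34.25 | 71.15 | 96.05 | 403.47 |
| 2% | Median_abs | 21.34 | 26.55 | 46.46 | 47.05 | 380.24 |
| 2% | Std Dev_abs | 5427.65 | 30.44 | 82.16 | 145.37 | 170.96 |
| 1% | Mean_abs | 3637.99 | 44.37 | 78.85 | 96.41 | 376.95 |
| 1% | Median_abs | 21.17 | 30.53 | 42.31 | 40.13 | 342.35 |
| 1% | Std Dev_abs | 13984.54 | 35.89 | 94.50 | 146.08 | 154.74 |
| Overall | Mean | 969.61 | 37.82 | 76.42 | 87.40 | 458.61 |
| Overall | Median | 33.56 | 26.16 | 42.92 | 43.07 | 387.51 |
| Overall | Std Dev | 6121.12 | 39.71 | 87.63 | 125.93 | 313.49 |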
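
The colour highlighting mentioned in the caption of Table 10 does not carry over into plain text. As a small sketch, the lowest median REabs for each AEP can be recovered directly from the tabulated values; the helper name `best_cluster` is hypothetical and the numbers are the Median_abs rows of Table 10.

```python
# Median absolute RE (%) for Clusters 1 to 5, copied from Table 10
median_abs_re = {
    "50%": [48.41, 36.56, 54.86, 33.76, 183.76],
    "20%": [43.06, 25.92, 42.39, 39.87, 424.41],
    "10%": [31.84, 24.90, 35.48, 45.83, 422.76],
    "5%": [28.01, 22.86, 32.80, 46.33, 403.45],
    "2%": [21.34, 26.55, 46.46, 47.05, 380.24],
    "1%": [21.17, 30.53, 42.31, 40.13, 342.35],
}


def best_cluster(values):
    """Return (1-based cluster number, value) of the smallest entry."""
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]


best = {aep: best_cluster(values) for aep, values in median_abs_re.items()}
for aep, (cluster, value) in best.items():
    print(f"{aep} AEP: Cluster {cluster} ({value:.2f}%)")
```
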
Table 11. Comparison of mean, median and standard deviation of Qpred/Qobs ratios for all five clusters (blue indicates the lowest value).
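| AEP | Statistic | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|---|
| 50% | Mean_abs | 1.38 | 1.21 | 1.52 | 1.35 | 4.68 |
| 50% | Median_abs | 1.08 | 0.97 | 0.78 | 1.00 | 1.85 |
| 50% | Std Dev_abs | 1.13 | 0.89 | 1.43 | 0.95 | 6.10 |
| 20% | Mean_abs | 1.60 | 1.09 | 1.33 | 1.44 | 4.43 |
| 20% | Median_abs | 1.03 | 0.85 | 0.89 | 0.88 | 3.24 |
| 20% | Std Dev_abs | 2.21 | 0.47 | 1.11 | 1.34 | 3.18 |
| 10% | Mean_abs | 2.47 | 1.07 | 1.30 | 1.49 | 4.03 |
| 10% | Median_abs | 1.02 | 1.02 | 0.91 | 0.90 | 3.23 |
| 10% | Std Dev_abs | 5.66 | 0.38 | 1.00 | 1.52 | 2.54 |
| 5% | Mean_abs | 4.88 | 1.07 | 1.31 | 1.52 | 3.53 |
| 5% | Median_abs | 1.04 | 1.01 | 0.98 | 0.89 | 3.03 |
| 5% | Std Dev_abs | 15.07 | 0.37 | 0.98 | 1.62 | 2.08 |
| 2% | Mean_abs | 15.02 | 1.10 | 1.36 | 1.54 | 3.03 |
| 2% | Median_abs | 0.95 | 0.98 | 1.01 | 0.87 | 2.80 |
| 2% | Std Dev_abs | 54.35 | 0.45 | 1.04 | 1.67 | 1.71 |
| 1% | Mean_abs | 37.11 | 1.15 | 1.42 | 1.54 | 2.77 |
| 1% | Median_abs | 0.99 | 0.93 | 1.08 | 0.91 | 2.42 |
| 1% | Std Dev_abs | 139.92 | 0.56 | 1.17 | 1.67 | 1.55 |
| Overall | Mean | 10.41 | 1.12 | 1.37 | 1.48 | 3.75 |
| Overall | Median | 1.00 | 0.96 | 0.96 | 0.94 | 2.88 |
| Overall | Std Dev | 61.26 | 0.54 | 1.10 | 1.46 | 3.25 |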
Table 12. Statistical evaluation for cluster 1 for 5% AEP flood (blue indicates best result).
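| Cluster | MSE | RMSE | BIAS | RBIAS | RRMSE | RMSNE | REr | med_Qpred/Qobs |
|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 34034166.24 | 5833.88 | 1480.61 | 388.4 | 2.34 | 15.07 | 28.01 | 1.04 |
| Cluster 2 | 25309.52 | 159.09 | 5.54 | 7.30 | 0.01 | 0.37 | 22.86 | 1.01 |
| Cluster 3 | 58766.74 | 242.42 | 11.39 | 30.91 | 0.04 | 0.99 | 32.79 | 0.98 |
| Cluster 4 | 91284.23 | 302.13 | −47.02 | 51.92 | 0.12 | 1.66 | 45.54 | 0.89 |
| Cluster 5 | 17065693.55 | 4131.06 | −4056.77 | −453.41 | 3.61 | 4.96 | 403.44 | −3.03 |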
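
A quick consistency check on Table 12 is that RMSE should equal the square root of MSE. The sketch below applies that check to the MSE and RMSE values read from the table; the helper name `rmse_matches` and the tolerance are arbitrary choices made for illustration.

```python
import math

# (MSE, RMSE) pairs for each cluster, copied from Table 12
table12 = {
    "Cluster 1": (34034166.24, 5833.88),
    "Cluster 2": (25309.52, 159.09),
    "Cluster 3": (58766.74, 242.42),
    "Cluster 4": (91284.23, 302.13),
    "Cluster 5": (17065693.55, 4131.06),
}


def rmse_matches(mse, rmse, rel_tol=1e-4):
    """True when sqrt(MSE) agrees with the reported RMSE within rel_tol."""
    return math.isclose(math.sqrt(mse), rmse, rel_tol=rel_tol)


checks = {name: rmse_matches(mse, rmse) for name, (mse, rmse) in table12.items()}
for name, ok in checks.items():
    print(f"{name}: sqrt(MSE) matches RMSE -> {ok}")
```
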
Table 13. Comparison of absolute RE (%) values between ARR RFFA model and PCR KNN15, PCR KNN25, and QRT models for cluster 2.
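| AEP | ARR RFFA Model Absolute RE (%) | PCR_KNN15 Absolute RE (%) | PCR_KNN25 Absolute RE (%) | Cluster 2 Absolute RE (%) |
|---|---|---|---|---|
| 50% | 63.07 | 42.7 | 42.70 | 36.56 |
| 20% | 57.25 | 51.25 | 43.64 | 25.92 |
| 10% | 57.48 | 55.39 | 48.77 | 24.9 |
| 5% | 58.85 | 53.41 | 52.62 | 22.86 |
| 2% | 60.39 | 58.87 | 58.25 | 26.55 |
| 1% | 64.06 | 68.67 | 59.66 | 30.53 |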
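
To put the comparison in Table 13 into numbers, the sketch below computes the difference, in percentage points, between the ARR RFFA model and the Cluster 2 model for each AEP. The values are copied from the first and last data columns of the table; the variable names are illustrative only.

```python
# (ARR RFFA absolute RE %, Cluster 2 absolute RE %) for each AEP, from Table 13
table13 = {
    "50%": (63.07, 36.56),
    "20%": (57.25, 25.92),
    "10%": (57.48, 24.90),
    "5%": (58.85, 22.86),
    "2%": (60.39, 26.55),
    "1%": (64.06, 30.53),
}

# Positive values mean the Cluster 2 model has the lower absolute RE
reduction = {aep: round(arr - c2, 2) for aep, (arr, c2) in table13.items()}
for aep, diff in reduction.items():
    print(f"{aep} AEP: {diff:.2f} percentage points lower for Cluster 2")
```

On these figures the Cluster 2 values are lower than the ARR RFFA values at every AEP, with the gap ranging from about 26.5 to about 36 percentage points.
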

Share and Cite

MDPI and ACS Style

Rahman, A.S.; Rahman, A. Application of Principal Component Analysis and Cluster Analysis in Regional Flood Frequency Analysis: A Case Study in New South Wales, Australia. Water 2020, 12, 781. https://doi.org/10.3390/w12030781
