Evaluating the Performance of Extreme Learning Machine Technique for Ore Grade Estimation

Due to the complex geology of vein deposits and their erratic grade distributions, ore grade tends to be overestimated or underestimated. These estimated grades determine whether mining the ore deposit is profitable. In this study, five Extreme Learning Machine (ELM) variants based on the hard limit, sigmoid, triangular basis, sine and radial basis activation functions were applied to predict ore grade. The motivation is that the activation function has been identified as playing a key role in achieving optimum ELM performance. Assessing how strongly the activation function influences the final outputs of the ELM is therefore worth investigating. The study also applies ELM as an ore grade estimator, an application that is yet to be explored in the literature. The results from the five ELM variants were analysed and compared with the state-of-the-art benchmark methods of Back-propagation Neural Network (BPNN) and Ordinary Kriging (OK). The statistical test results revealed that the ELM with the sigmoid activation function (ELM-Sigmoid) was the best of all the investigated methods (ELM-Hard limit, ELM-Triangular basis, ELM-Sine, ELM-Radial basis, BPNN and OK), because it produced the lowest MAE (0.0175), MSE (0.0005) and RMSE (0.0229) and the highest R² (91.93%) and R (95.88%). It was concluded that ELM-Sigmoid can be used by field practitioners as a reliable alternative ore grade estimation technique.


Introduction
An important aspect of mining is ore grade estimation, since it determines the viability of actively mining a mineral of interest. This process mainly involves estimating the reserve and grade using statistical procedures with samples obtained during drilling to determine the feasibility of mining the resource. Geostatistics is the conventional ore grade estimation technique, and it is effective in the grade prediction of relatively uniform and massive deposits [1-4]. However, in extremely heterogeneous data sets, the geostatistical technique tends to perform poorly because the variograms obtained are complicated, are mostly of no use for further analyses, and tend to overestimate or underestimate the resource [3,5]. Also, manual tasking during the geostatistical resource estimation process encourages bias and may introduce errors in the predicted ore grade values. These practical limitations are found in the most widely used geostatistical technique, Ordinary Kriging (OK). Over the years, in the quest to improve the performance of OK, various kriging techniques such as indicator kriging [6,7], disjunctive kriging [8-10], multigaussian kriging [11,12], probability kriging [13-15], lognormal kriging [16,17] and outlier restricted kriging [18,19] were developed. These modifications made the resource estimation process more time-consuming, computationally complex and expensive.

Table 1. Review of AI techniques applied in ore grade estimation.

| Author | Technique | Observation |
| --- | --- | --- |
| Wu and Zhou [20] | Multilayer Feedforward Neural Network (MLFNN) with Dynamic Quick-Propagation (DQP) variant | The technique overestimated and underestimated low-frequency values such as high-grade values as a result of smoothing. The number of data points employed in the study was 51. |
| Al-Alawi and Tawo [22] | BPNN | Required more data for prediction. Hence it possessed poor generalisation ability. Number of sample points used was 163. |
| Kapageridis and Denby [23,47]; Kapageridis et al. [48-50]; Kapageridis [24] | Radial Basis Function (RBF) and Multilayer Perceptron (MLP) | The neural network's resource estimates gave comparable results to kriging with fewer sample data. The number of drill hole data used in these studies ranged from 50 to 3600. |
| Matías et al. [51] | MLP, Regularisation Networks (RN) and RBF | Kriging outperformed the MLP, RN and RBF using a total of 1932 samples. |
| Samanta et al. [52] | Kohonen Neural Network (KNN) | KNN and kriging models performed almost equally well. However, grade values were generally overestimated due to the high nugget effect. The total number of drill holes used in the research was 497. |
| Samanta et al. [25] | MLFNN and SLFN with the Adaboost algorithm, BPNN | NN generally did not perform well due to the data's low spatial correlation and the high noise of the gold data used. A total of 275 drill hole data was employed in the study. |
| Chatterjee et al. [35] | ANN | NN outperformed OK using 5149 data points. |
| Samanta et al. [53] | MLFNN with jump network and Genetic Algorithm (GA) | OK performed slightly better than the NN model. The number of exploratory borehole data used in the research was 181. |
| Mahmoudabadi et al. [26] | Levenberg-Marquardt Backpropagation (LMBP) with GA | NNs were quite sensitive when the MLP was used with back propagation-based algorithms (LMBP) in generating initial weight values with a limited training dataset. The study applied a total of 65 drill hole data. |
| Li et al. [21] | Wavelet Neural Network (WNN) | WNN accurately captured the local nonlinearity of the dynamic systems due to its multiscale, multiresolution and localisation ability using 200 drill hole data. |
| Chatterjee et al. [33] | GA and k-means clustering NN ensemble with SVM and RBF kernel | The estimated results obtained using the SVM and RBF outperformed OK. The number of drill hole data applied in the research was 4745. |
| Badel et al. [54] | MLFNN with Conjugate Gradient Method (CGM) optimisation and K-means clustering | Results of the Multiple Indicator Kriging (MIK) were more similar to the actual grade values. MIK also had better local precision than the ANN technique using a total number of 1802 data points. |
| Guo [55] | MLP, X-Ray Diffraction (XRD) and Levenberg-Marquardt (LM) | MLP training preferred correlated data. The dataset used was 82 drill holes. |
| Dutta et al. [56] | ANN-GA | The results obtained from the hybrid NN models generally did not perform well using a total number of 168 borehole data. |
| Dutta et al. [36] | Support Vector Regression (SVR) and LM Backpropagation (LMBP) NN algorithm | The SVR model gave the best results out of the NN and OK methods applied. However, the upgrade was minimal due to the existence of extreme sample values in the 3500 drill hole data employed in the study. |
| Tahmasebi and Hezarkhani [27] | Adaptive Neuro-Fuzzy Inference System (ANFIS) | ANFIS gave better results than FL, ANN and Kriging. The number of data points used in the study was 258. |
| Tahmasebi and Hezarkhani [31]; Tahmasebi and Hezarkhani [58] | Coactive Neuro-Fuzzy Inference System (CANFIS) with GA and ANFIS-GA | The CANFIS-GA produced the best results due to its high correlation coefficient; however, ANFIS-GA gave the least error on the testing data set. Data from 156 boreholes was applied in the study. |
| Maleki et al. [59] | Support Vector Machine (SVM), Backpropagation Neural Networks (BPNN) | The SVM was fast and gave more accurate results than that of the BPNN model using 4000 data samples. |
| Gholamnejad et al. [32] | MLFNN with Tanh activation function and Levenberg-Marquardt (LM) | The predicted values were deemed acceptable and had a correlation coefficient of 0.8. The number of sample points employed in the study was 2068. |
| Granek [60] | SVM and Convolutional Neural Network (CNN) | The CNN model was quite complicated, challenging to modify and computationally demanding but had an advantage over SVM by recognising anomalous structures in data. The number of sample points used in the research was 70. |
| Li et al. [37] | Self-adaptive Learning-based Particle Swarm Optimisation Support Vector Regression (SLPSO-SVR) model | The SLPSO-SVR technique performed better than PSO-SVR, ANN, comprehensive learning PSO-SVR and Grid-SVR. This technique had many advantages which included its rapid training ability and grade estimation using 2000 sample data points. |
| Jafrasteh and Fathianpour [61] | Local Linear Radial Basis Function (LLRBF) with Skewed activation function (SG), Simultaneous Perturbation Artificial Bee Colony algorithm (SPABC) and BPNN | The standard RBF trained with SPABC-BP algorithm showed higher generalisation ability and better prediction of ore grades for highly skewed data than LLRBF-SG-SPABC-BP and LLRBF-SPABC-BP. The technique was ideal for capturing nonlinear mappings in the 1250 data points used in the study. |
| Jafrasteh et al. [34] | Random Forest (RF), Gaussian Process (GP), MLP with LM | GP gave the best performance compared to the others since it provided a smoother interpolation and offered a more accurate prediction. IK was the next to provide a better estimation, and MLP gave the worst performance. However, all the techniques were sensitive to sudden variations of the copper concentrations. The number of data points employed in the study was 5647. |
| Singh et al. [30] | Recurrent Neural Network (RNN) | RNN gave comparable results with kriging. Kriging, however, performed slightly better than RNN. The number of sample points used was 3298. |
| Jahangiri et al. [62] | Gustafson-Kessel (GK) clustering algorithm with ANN | The accuracy of the results was poor based on the prediction of some elements; however, predictions were more accurate than the mine's estimation techniques. The number of borehole data applied was 1755. |

Due to these shortfalls, alternative resource estimation techniques using Artificial Intelligence (AI) have been applied in ore grade estimation. AI techniques, especially Artificial Neural Networks (ANNs), have effectively been employed in mineral resource estimation using limited data and in highly heterogeneous data sets [20-32]. Most of the techniques applied in the literature (Table 1) for ore grade estimation are feedforward neural networks. Their popularity stems from their ability to approximate complex nonlinear mappings between input and output and to produce models for a large class of data. From Table 1, the most widely used ANN approach in ore grade estimation is the Backpropagation Neural Network (BPNN). Even though the majority of the AI techniques used (Table 1) outperformed kriging [20,22-24,26-28,31,33-37], a few suffered from smoothing in noisy datasets [20,25,30,34,38], which is not ideal in ore grade estimation. Despite the broad applicability of BPNN, the technique has a wide range of limitations, including overfitting, slow convergence and limited reasoning ability. BPNN also requires the model training parameters to be tuned manually in order to obtain optimum results, and training can end in local minima with suboptimal solutions [39-41]. A sequential trial-and-error process is required because there is no laid-down procedure for determining the number of hidden neurons needed for model development [42,43]. These issues of feedforward neural networks, including the slow learning speed caused by gradient-based learning algorithms during training and the iterative tuning of all network parameters, were addressed with the introduction of the Extreme Learning Machine (ELM) by Huang et al. [39].
In light of its strengths and mathematical convenience, this study adopted the ELM approach for ore grade estimation. The ELM was designed for the Single Hidden Layer Feedforward Neural Network (SLFN); it randomly chooses the input weights and hidden layer biases and can still learn adequately on a given data set. It does this by adopting function approximation on a finite training set, which allows almost any non-linear activation function to be applied to produce distinct predictions [39]. In practical applications of the ELM, however, different variants exist depending on the activation function used. The activation function has been found to be a key factor in the ELM achieving optimum prediction performance.
Therefore, assessing the impact of the activation function on ELM performance for ore grade estimation has some scientific value worth investigating. In line with that, this study applied the following activation functions: radial basis, hard limit, sigmoid, triangular basis and sine. Despite the wide application of ELM to diverse science and engineering problems [44-46], there is close to no application of ELM to ore grade estimation. Furthermore, no ore grade estimation study has assessed and compared the ELM variants with the state-of-the-art benchmark methods of BPNN and OK. Therefore, taking into consideration the strength of the ELM, this paper aims at:
- determining the viability of the ELM variants as a novel approach for ore grade estimation using exploratory data from a mine in Ghana;
- determining their generalisation and predictive ability on a heterogeneous data set; and
- performing comparative analyses between the developed ELM variants (ELM-Sine, ELM-Sigmoid, ELM-Radial Basis, ELM-Triangular Basis and ELM-Hard Limit) and the benchmark techniques of BPNN and OK.

Study area
This research was conducted in a mine (hereafter Mine X) in Ghana. The Mine X deposit is found in the Ashanti belt of the Birimian Supergroup and consists mainly of volcanic rocks. The Birimian Supergroup consists of northeast-striking belts with significant faults. The deposits found in the Ashanti belt are mesothermal gold vein-type deposits [64]. The mineralisation is found in steep NNE-SSW to NE-SW trending shears. The gold occurs in two principal ore types: quartz veins with free-milling gold, and sulphide ore containing arsenopyrite, pyrite and rare pyrrhotite and marcasite with refractory gold [65]. A map of the study area is shown in Fig. 1.

Table 1 (continued).

| Author | Technique | Observation |
| --- | --- | --- |
|  |  | The obtained results were promising but required more sample data to ensure the developed model has a good generalisation ability as the samples were few (89 samples). |
| Zhang et al. [28] | Weighted Least Square Support Vector Regression (WLS-SVR) | The robust weighted WLS-SVR outperformed BPNN, OK and Inverse Distance Weighting (IDW) due to its strong predictive and generalisation ability. The number of samples used in the study was 2304. |

Materials
Secondary data was obtained from Mine X. The data comprised assay, survey, lithology and collar files obtained from an exploratory drilling programme. The collar data contains the hole ID, the X, Y and Z coordinates of each collar, the maximum depth of the drill hole and the type of drilling method. The assay data contains the composite hole ID, From, To and the assay value observed in that composited section of the borehole. The survey file consists of composite borehole sections at a particular location with their dip and bearing. The lithology file comprises composite collar data, and the interval between "From" and "To" is assigned the observed rock type. The rock types observed were Schist (SC), Meta Volcanic (MV), Quartz (QU), Greywacke (GK), Phyllite (PH) and Laterite (LAT). The entire data set comprised 3759 drilled holes.

Statistical description of data
The samples were initially taken at varying depths, which made compositing necessary. Compositing involves averaging the original assay values over pre-specified lengths. This creates a homogeneous support for the data to be used for estimation and minimises data variability. Compositing therefore produces more robust statistical and structural analyses. The statistics were computed on 1 m composite samples to identify whether populations within the deposit were significantly different. They were also used to assess the effects of the data distribution on the methods used for the grade estimations. The total number of samples obtained after compositing was 301 507 for the assay and (X, Y, Z) coordinates. Table 2 summarises the descriptive statistics of the entire assay data and the (X, Y, Z) coordinates. A cumulative frequency graph was also developed to ascertain the data distribution of the assay values (Fig. 2).

Methods
This research applied two primary AI techniques for ore grade estimation, the results of which were then compared with those obtained from OK. The AI techniques used include BPNN and ELM. The ELM for ore grade estimation was assessed based on five different activation functions: Radial basis, Sine, Hard limit, Sigmoid and Triangular basis. The OK technique was carried out using Datamine software, whereas the AI techniques were carried out using MATLAB and Python programs.

Extreme Learning Machine
ELM is a learning algorithm for the SLFN developed by Huang et al. [39]. Rather than iteratively tuning the parameters within the network, ELM chooses its hidden neuron parameters randomly, based on a Gaussian probability distribution, whilst the Moore-Penrose generalised pseudo-inverse is used to determine the output weights of the SLFN analytically [66]. In order to train a SLFN, Eq. (1) [39] is used.

$$\min_{g} \left\| R\left(w_1, \ldots, w_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}\right) g - Q \right\| \qquad (1)$$

where $g_i = [g_{i1}, g_{i2}, \ldots, g_{im}]^{T}$ is the output weight vector linking the $i$th hidden node with the output node, $b_i$ is the threshold of the $i$th hidden neuron, and $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^{T}$ is the weight vector connecting the $i$th hidden node and the input neurons. In training a SLFN, the least-squares solution is found using Eq. (2) [67].
The smallest norm of output weight is achieved when $\hat{g} = R^{\dagger} Q$, where $Q$ is the matrix of target outputs, $R$ is the output matrix of the hidden layer and $R^{\dagger}$ is its Moore-Penrose generalised inverse.
$w_i \cdot y_j$ represents the inner product of $w_i$ and $y_j$, whereby the weight vector $w_i$ is chosen randomly. Eq. (4) represents the output matrix of the hidden layer [44].
Suppose the number of hidden nodes $\tilde{N}$ is equal to the number $N$ of distinct training data points, $\tilde{N} = N$; the matrix $R$ is then square and invertible once the input weight vectors $w_i$ and the hidden biases $b_i$ have been chosen randomly [39]. It is important to note that the ELM prediction performance depends on the type of activation function used.
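The training procedure described above can be illustrated with a minimal sketch. It assumes NumPy, a sigmoid activation, standard-normal random hidden parameters and synthetic data; the function names `elm_train` and `elm_predict` and the test function are hypothetical and are not taken from the study. The variable names follow this section's symbols (`R` for the hidden-layer output matrix, `Q` for the targets, `g_hat` for the output weights).

```python
import numpy as np


def sigmoid(x):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def elm_train(Y, Q, n_hidden=50, seed=0):
    """Fit an ELM: random hidden layer, analytic output weights.

    Y is an (N, n) array of inputs and Q an (N,) array of targets.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((Y.shape[1], n_hidden))  # random input weights w_i
    b = rng.standard_normal(n_hidden)                # random hidden thresholds b_i
    R = sigmoid(Y @ w + b)                           # hidden-layer output matrix
    g_hat = np.linalg.pinv(R) @ Q                    # g_hat = pinv(R) Q, no iteration
    return w, b, g_hat


def elm_predict(Y, w, b, g_hat):
    """Evaluate the trained network on new inputs."""
    return sigmoid(Y @ w + b) @ g_hat


# Synthetic stand-in for normalised (X, Y, Z) coordinates and grade (assumption).
rng = np.random.default_rng(42)
Y = rng.uniform(-1.0, 1.0, size=(200, 3))
Q = np.sin(Y[:, 0]) + 0.5 * Y[:, 1] - Y[:, 2] ** 2

w, b, g_hat = elm_train(Y, Q, n_hidden=50)
rmse = float(np.sqrt(np.mean((elm_predict(Y, w, b, g_hat) - Q) ** 2)))
print(f"training RMSE on synthetic data: {rmse:.4f}")
```

Only the output weights are solved for; the hidden weights and thresholds stay at their random initial values, which is why no gradient-based iteration appears in the sketch.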
The main purpose of an activation function is to determine whether a neuron should be activated, which is achieved by calculating the weighted sum of the inputs and adding a bias. This makes the output of the node non-linear. If no activation function is applied to a neural network, the output acts as a simple linear regression function with limited learning ability [68-70]. The selection of the right activation function in a neural network is crucial, since an unsuitable activation function can result in the loss of information from the input parameters during forward propagation and, consequently, exponentially vanishing gradients during backpropagation [71]. Several types of activation functions can be found in the literature; the commonly used triangular basis, sigmoid, hard limit, sine and radial basis functions are applied in this study.
The sigmoid activation function, also referred to as the logistic function, is non-linear and is widely applied in feedforward neural networks [72,73]. The major advantages of the sigmoid activation function are highlighted by Neal [74] and include ease of understanding and suitability for shallow networks. The equation for the sigmoid activation function is shown in Eq. (5) [72].

The sine activation function is sinusoidal in nature. Hence, it differs from the common activation functions in that it rises and falls. The study by Sopena et al. [75] showed that sinusoids improve accuracy and shorten training time. Although the sine activation function has been applied, it is rarely used because networks that use it are difficult to train [76,77]. It is also reported to saturate, its output converging to zero and flattening as x approaches infinity; it also has numerical problems and can converge to local minima [78]. The sine activation function is governed by Eq. (6) [78,79].

The Triangular Basis Function (TBF) is a function whose graph is shaped like a triangle, more specifically an isosceles triangle. It is quite useful in signal processing and, when used as an integral transform function, produces more realistic signals. The signals from the function fall within the range of -1 to 1. The triangular basis activation function is expressed in Eq. (7) [80], where $\hat{q}_i(y)$ is the TBF and $y$ is an independent variable.

The Radial Basis Function (RBF) is based on the Gaussian curve. RBF applies a parameter which calculates the mean value of a function. RBF is a real-valued function $g$ whose value depends solely on the distance from the origin, thus (Eq. (8)) [72]:

$$g(u) = g(\lVert u \rVert) \qquad (8)$$

or, alternatively, on the distance from another fixed point, i.e. a centre $c$ (Eq. (9)) [72]:

$$g(u) = g(\lVert u - c \rVert) \qquad (9)$$

Therefore, any function $g$ that satisfies $g(u) = g(\lVert u \rVert)$ is a radial function. The Euclidean distance norm and the radial basis function, which is commonly taken to be Gaussian, are combined to obtain an output. Summing these terms gives Eq. (10) [72], where $y(u)$ is the output and $w_i$ is the weight.

$$y(u) = \sum_{i=1}^{N} w_i \, g(\lVert u - c_i \rVert) \qquad (10)$$

The hard limit function is essentially a transfer function that allows the output neuron to produce a 1 if the input attains a threshold; otherwise, it outputs a 0. It is often used in the perceptron learning rule and, as a transfer function, it calculates the output of a layer based on its input. The hard limit activation function is governed by Eq. (11) [81].
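The five activation functions discussed above can be sketched in a few lines of Python. The definitions below assume the conventional MATLAB-style forms (`tribas`, `radbas`, `hardlim` with a threshold of 0); the study's own Eqs. (5)-(11) may be parameterised differently, so this is an illustration rather than a transcription.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sine(x):
    """Sinusoidal activation."""
    return math.sin(x)


def tribas(x):
    """Triangular basis: peak of 1 at x = 0, zero outside [-1, 1] (assumed form)."""
    return max(0.0, 1.0 - abs(x))


def radbas(x):
    """Gaussian radial basis exp(-x^2) (assumed form)."""
    return math.exp(-x * x)


def hardlim(x):
    """Hard limit: 1 once the input reaches the threshold 0, else 0 (assumed threshold)."""
    return 1.0 if x >= 0.0 else 0.0


ACTIVATIONS = {
    "sigmoid": sigmoid,
    "sine": sine,
    "tribas": tribas,
    "radbas": radbas,
    "hardlim": hardlim,
}

# Evaluate every function on the same inputs to compare their shapes.
for name, f in ACTIVATIONS.items():
    print(name, [round(f(x), 4) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)])
```

Note that the hard limit output is piecewise constant, so it carries far less information per hidden node than the smooth functions.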

Backpropagation Neural Network
The AI technique most commonly used in ore grade estimation is BPNN, since it serves as the primary form of neural network [22,26,28,51,55,59]. The basic BPNN structure (Fig. 3) has three layers: the input, hidden and output layers; however, multiple hidden layers are accepted in the BPNN architecture. External input parameters, i.e. the X, Y and Z coordinates, are received into the network via the input layer, one per input neuron, $X_j = (X_1, X_2, X_3, \ldots, X_m)^{T}$, and are assigned specific weights $w_{ij}$ and a bias $b_i$ (Eq. (12)) [43]. The input values are then transformed into weighted inputs and transferred to the hidden layer. A nonlinear activation function is then used to decide whether the data in the neuron should be activated, after which the transformed data is passed on through the output neuron. As shown in Eq. (12), the input of the output layer is obtained from the output of the hidden layer. A linear activation function is used to transform the output of the hidden layer into the final network output $y_i$.
Designing a BPNN model involves the critical process of selecting a suitable number of hidden layers, the number of hidden neurons, the training algorithm and the transfer function. Several studies have shown that, for solving complicated problems, a BPNN with a single hidden layer is sufficient as a universal approximator [43,82,83]. Hence, one hidden layer was used in this research. The hyperbolic tangent transfer function was employed in the hidden layer, whereas the linear transfer function was applied in the output layer to give the ore grade value for the BPNN model. The training algorithm applied was the Levenberg-Marquardt optimisation method, which is primarily used for solving nonlinear least-squares problems. The Levenberg-Marquardt algorithm works by combining the gradient descent and Gauss-Newton methods. The gradient descent technique updates the parameters in the steepest-descent direction to reduce the sum of squared errors. The Gauss-Newton method, in contrast, reduces the sum of squared errors by assuming the least-squares function to be locally quadratic in the parameters and finding the minimum of that quadratic [84]. The detailed mathematical background of the Levenberg-Marquardt algorithm can be found in [84].
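As a rough illustration of the architecture just described (hyperbolic tangent hidden layer, linear output), the following sketch evaluates a single forward pass. The weights are random placeholders rather than trained values, the layer sizes are chosen only for the example, and the name `bpnn_forward` is hypothetical; Levenberg-Marquardt training itself is not shown.

```python
import math
import random


def bpnn_forward(x, W1, b1, w2, b2):
    """Forward pass: tanh hidden layer followed by a linear output neuron."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


rng = random.Random(0)
n_in, n_hidden = 3, 10  # three coordinates in, ten hidden neurons (illustrative)

# Placeholder parameters; a trained network would supply these instead.
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b2 = rng.uniform(-1, 1)

# Normalised (X, Y, Z) coordinates of one sample (made-up values).
grade = bpnn_forward([0.2, -0.4, 0.7], W1, b1, w2, b2)
print(grade)
```

Because tanh is bounded by 1 in magnitude, the output can never exceed the sum of the absolute output weights plus the output bias, which is a quick sanity check on any implementation.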

Ordinary kriging
Geostatisticians use the variogram as a fundamental tool to measure the spatial continuity of the ore grade data. The experimental variogram is the average variability between samples versus the distance between samples [85]. This variogram model is computed using Eq. (13) [86], where: $u$ is a vector of coordinates; $z(u)$ is the variable under consideration as a function of spatial location; $h$ is the distance between the two points, expressed as a vector; $N(h)$ is the number of pairs found at distance $h$ apart; and $Z(u + h)$ is the value of a second variable at a location $h$ units from $u$. The spherical model is widely applied in most orebodies (Fig. 4). This model is characterised by Eqs. (14)-(16) [87],
where: $a$ is the range and corresponds to the intuitive idea of the range of influence of the regionalised variables. Beyond this value, samples are no longer auto-correlated; $C_0$ is the nugget variance and represents the random portion of the variations of the regionalised variables; $C$ is the spatial variance and is the predictable/structural part of the spatial variance; and $C + C_0$ is the sill. The grade estimation model can be expressed in matrix form, as shown in Eq. (17) [86]. In matrix $[C]$, $s_{ij}$ is the covariance between any two data points $i$ and $j$; in matrix $[D]$, $s_{V v_i}$ is the average covariance between a sample and the block to be estimated; and matrix $[W]$ holds $a_i$, the weight to be assigned to sample $i$.
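The variogram quantities described in this section can be sketched as follows. The code assumes the usual textbook forms: a semivariogram equal to half the mean squared difference between paired values, and a spherical model that rises from the nugget to the sill $C_0 + C$ at the range $a$. Regularly spaced one-dimensional samples and the function names are assumptions made for the example.

```python
def experimental_semivariogram(z, lag):
    """Half the mean squared difference of values `lag` steps apart (1-D, regular spacing)."""
    if lag < 1 or lag >= len(z):
        raise ValueError("lag must be between 1 and len(z) - 1")
    pairs = [(z[i], z[i + lag]) for i in range(len(z) - lag)]
    return sum((head - tail) ** 2 for head, tail in pairs) / (2 * len(pairs))


def spherical(h, nugget, c, a):
    """Spherical variogram model with nugget C0, spatial variance C and range a."""
    if h == 0:
        return 0.0
    if h >= a:
        return nugget + c  # sill: samples are no longer auto-correlated
    r = h / a
    return nugget + c * (1.5 * r - 0.5 * r ** 3)


# Toy grades along one drill hole (made-up values).
grades = [1.0, 3.0, 2.0, 4.0]
print(experimental_semivariogram(grades, 1))  # 1.5
print(experimental_semivariogram(grades, 2))  # 0.5

# Model values using the X-direction parameters reported for this deposit
# (C0 = 13.4, C = 9.4, a = 21.2).
for h in (0.0, 5.0, 10.0, 21.2, 50.0):
    print(h, round(spherical(h, 13.4, 9.4, 21.2), 3))
```

With these parameters the nugget accounts for more than half of the sill, which is consistent with the erratic grade behaviour the paper attributes to the deposit.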

Statistical evaluation tools
The efficiency of the predicted results of the various models was compared based on their Correlation Coefficient (R), Mean Square Error (MSE), Mean Absolute Error (MAE), Coefficient of Determination (R²) and Root Mean Squared Error (RMSE), which are shown in Eqs. (18)-(22) [28,43,55,88,89],
where: $Z$ is the actual assay value obtained from drilling; $Z^{*}$ is the predicted assay value; $\bar{Z}$ is the mean of the actual grade; $\bar{Z}^{*}$ is the mean of the predicted grade; $n$ is the sample index; and $N$ is the total number of samples.
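A minimal sketch of the five evaluation measures is given below, assuming their standard definitions. R² is taken here as the square of the Pearson correlation coefficient, an assumption that agrees with the reported ELM-Sigmoid values (0.9588² ≈ 0.9193); the function names and the sample values are illustrative only.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(z - zp) for z, zp in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Square Error."""
    return sum((z - zp) ** 2 for z, zp in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(mse(actual, predicted))


def corr(actual, predicted):
    """Pearson correlation coefficient R between actual and predicted grades."""
    n = len(actual)
    mean_z = sum(actual) / n
    mean_zp = sum(predicted) / n
    cov = sum((z - mean_z) * (zp - mean_zp) for z, zp in zip(actual, predicted))
    var_z = sum((z - mean_z) ** 2 for z in actual)
    var_zp = sum((zp - mean_zp) ** 2 for zp in predicted)
    return cov / math.sqrt(var_z * var_zp)


def r_squared(actual, predicted):
    """Coefficient of determination, taken as R squared (assumption)."""
    return corr(actual, predicted) ** 2


# Made-up actual and predicted grades.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
print(corr(actual, predicted), r_squared(actual, predicted))
```

The same consistency check applies to the error measures: the reported RMSE of 0.0229 is the square root of the reported MSE of 0.000524.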

Data preprocessing for model development
For good predictions using ANN, enough data is required. This study adopted the popular hold-out cross-validation technique commonly employed in ANN modelling to divide the data [43]. In this technique, the training data set should be larger than the testing data set. Hence, 80% of the data, representing 241 206 data points, was used for training the network, whereas the remaining 20%, representing 60 301 data points, was used for testing. The data division was done randomly to prevent bias. Both divisions nevertheless possessed similar statistical characteristics, which supports the models' ability to generalise and predict ore grade accurately.
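The random 80/20 hold-out division can be sketched as below. The seed, the rounding rule and the function name `holdout_split` are assumptions made for the example; the paper does not state how the random division was implemented.

```python
import random


def holdout_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly partition sample indices into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division to avoid ordering bias
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


# Total number of composited samples reported in the study.
train_idx, test_idx = holdout_split(301507)
print(len(train_idx), len(test_idx))  # 241206 60301
```

Comparing summary statistics (mean, variance, skewness) of the two index sets afterwards is one simple way to confirm that both divisions have similar statistical characteristics.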
During the data preparation, the data sets were first normalised. This was necessary because the grade values range between 0 and 2000 while the coordinates range between -1600 and 13 700, so the variables fall in different ranges. Without normalisation, the large values (coordinates) would dominate the results because of their magnitude, even though they may not be more important as predictors. The aim was therefore to bring the values of the dataset to a standard scale without distorting the differences in the ranges of values, and thereby to improve the model's validation accuracy. In this research, the scaler applied scaled the data to the range from -1 to 1. The normalisation formula used is expressed in Eq. (23) [90],
where: $p_i$ is the normalised data, $q_i$ is the actual drill hole data, $q_{\max}$ and $q_{\min}$ are the maximum and minimum values of the actual drill hole data, and $p_{\min}$ and $p_{\max}$ are set at -1 and 1, respectively.
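A minimal sketch of the scaling step follows, assuming the usual min-max form $p_i = p_{\min} + (q_i - q_{\min})(p_{\max} - p_{\min}) / (q_{\max} - q_{\min})$, which matches the symbol definitions above. The function names are hypothetical, and the inverse mapping is included because predictions made on the normalised scale need to be converted back to grade units.

```python
def normalise(values, p_min=-1.0, p_max=1.0):
    """Min-max scale raw values q_i to the interval [p_min, p_max]."""
    q_min, q_max = min(values), max(values)
    scale = (p_max - p_min) / (q_max - q_min)
    return [p_min + (q - q_min) * scale for q in values]


def denormalise(scaled, q_min, q_max, p_min=-1.0, p_max=1.0):
    """Map scaled values back to the original units."""
    scale = (q_max - q_min) / (p_max - p_min)
    return [q_min + (p - p_min) * scale for p in scaled]


# Values spanning the grade range quoted in the text (0 to 2000).
grades = [0.0, 1000.0, 2000.0]
scaled = normalise(grades)
print(scaled)                             # [-1.0, 0.0, 1.0]
print(denormalise(scaled, 0.0, 2000.0))   # [0.0, 1000.0, 2000.0]
```

In a hold-out setting, `q_min` and `q_max` would normally be taken from the training data and reused for the testing data, so that both sets share one scale.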

Results and discussion
Models developed

ELM model for ore grade estimation
Based on the experimental results, the optimum number of neurons for the developed ELM model was 50. The five activation functions (sigmoid, hard limit, sine, triangular basis and radial basis) were applied separately. The optimum ELM model had three input nodes, a single hidden layer made up of 50 hidden neurons and one output, giving the structure [3-50-1]. The three input variables were the X, Y and Z coordinates, while the output was the ore grade.

BPNN model for ore grade estimation
The input and output data sets used in the ELM model were the same as those used to develop the BPNN model. The developed BPNN model was made up of three layers: the input, hidden and output layers. As demonstrated in the literature, a single hidden layer is capable of approximating any complex problem [82]; hence, one hidden layer was employed in this study. The hyperbolic tangent and linear transfer functions were employed in the hidden and output layers to capture both the non-linearity and linearity between the input-output data. In training the BPNN model, the Levenberg-Marquardt algorithm was applied [91]. The best BPNN model obtained in this research has three input nodes, ten neurons in the hidden layer and a single output node, with the structure [3-10-1].

OK model for ore grade estimation
As shown in Table 2, the grade distribution is positively skewed with possible outliers. Therefore, a top cut was needed to minimise the influence of outliers on the mean and the skewness. The top cut value was based on analysis of the log probability curves (Fig. 5), in which inflexions at specific points indicate subpopulations. The bottom cut value is shown to be at 0.01 g/t and, because of the high-grade values in the data, the top cut value is at 15 g/t. Since the deposit is located in the Birimian, and due to the high-grade records in that structure, geologists usually consider the top cut value to be 12 g/t. However, in this research, 15 g/t is considered the top cut value based on the observed inflexion (Fig. 5).
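The effect of a top cut can be illustrated with a short sketch. It assumes that grades above the top cut are capped at the cut value rather than discarded; the paper does not state which treatment was used, and the function name `apply_top_cut` and the sample grades are made up for the example.

```python
def apply_top_cut(grades, top_cut=15.0):
    """Cap every grade above the top cut at the top cut value (g/t)."""
    return [min(g, top_cut) for g in grades]


# Made-up composite grades in g/t, including one extreme outlier.
raw = [0.5, 2.3, 15.0, 40.2, 1200.0]
capped = apply_top_cut(raw)
print(capped)  # [0.5, 2.3, 15.0, 15.0, 15.0]

# The mean drops sharply once the outliers are capped.
print(sum(raw) / len(raw), sum(capped) / len(capped))
```

Capping keeps the sample locations in the data set for variography while limiting the leverage of a few extreme assays on the mean and skewness.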
Structural analyses were first performed on the data. The experimental variogram was generated using the exploratory data with a top cut value of 15 g/t. Based on the experimental points at various lag distances, the variogram parameters were then obtained. The nugget variance was $C_0 = 13.4$, the spatial variance was $C = 9.4$, and the ranges $a$ in the X, Y and Z directions were 21.2, 15.03 and 72.54, respectively. These were obtained as the model parameters from the experimental variogram. The spherical model was then superimposed onto the experimental variogram based on these parameters (Fig. 6). Based on the results of the structural analyses, ore grade estimation using OK was then conducted. The performance of OK in the ore grade prediction is summarised in Table 4 based on the statistical evaluation tools.

Comparative analysis of ELM variants
Considering the dimensioned error statistics (MSE, MAE and RMSE) shown in Table 3, the ELM-Sigmoid model obtained the lowest MAE, MSE and RMSE values of 0.0175, 0.000524 and 0.022900, respectively. These results were interpreted using the rule of thumb that the closer the error values are to zero, the more closely the model approximates the actual data. The next best technique, which performed nearly as well as ELM-Sigmoid, was ELM-Sine, closely followed by ELM-Radial basis. The ELM-Triangular basis performed fairly, while the ELM-Hard limit gave the poorest results.
For R² (Table 3), the nearer the value is to 1 or 100%, the better the predicted results. In effect, R² shows how much of the variation in the actual data is explained by the model's predictions. It is observed in Table 3 that the ELM-Sigmoid had the highest R² value of 91.93%, followed by ELM-Sine with 87.91%. The ELM-Radial basis and ELM-Triangular basis models performed fairly, with R² values of 68.86% and 55.69%, while the ELM-Hard limit performed poorly with an R² of 33.37%.
The correlation coefficient R measures the strength of the relationship between two variables, here the actual and estimated values, and falls within the range of -1 to 1. In effect, it indicates the level of prediction accuracy of the model. In Table 3, the ELM-Sigmoid and ELM-Sine had R values exceeding 0.9, making them perform better than the other models; ELM-Sigmoid had the highest R of 0.96. In contrast, ELM-Radial basis had an R of 0.83, while ELM-Triangular basis and ELM-Hard limit had R values of 0.75 and below 0.6, respectively. The ELM test results are further illustrated in Fig. 7(a)-(e) for visual observation. It can therefore be stated that the ELM-Sigmoid demonstrated strong calibration power and the best generalisation on the training and testing data, with great adaptability compared to the other models.

Performance evaluation of ELM-sigmoid with other investigated techniques
The ELM-Sigmoid was the best performing model among the ELM variants applied in this study. Hence, the ELM-Sigmoid was evaluated against the state-of-the-art methods of OK and BPNN. The comparison (Table 4) aimed to determine whether the proposed ELM-Sigmoid technique could produce results comparable or superior to those obtained from OK and BPNN. Comparatively, ELM-Sigmoid was superior to BPNN and OK, as it recorded the lowest errors and the best R² and R values. BPNN had MAE, MSE and RMSE of 0.0555, 0.0054 and 0.0735, while OK recorded 1.4127, 6.1599 and 2.4819, respectively. The ELM-Sigmoid had 0.0175, 0.0005, 0.0229, 0.9193 and 0.9588 for MAE, MSE, RMSE, R² and R. A visual comparison of the models based on the actual and predicted grade values is shown in Fig. 8. The interpretation is that the ELM-Sigmoid model fitted the actual ore grade data better than the other methods. The strength of the ELM-Sigmoid comes from the fact that there is less manual tasking (human interference) and there are few adjustable parameters to fine-tune in the model development process. Moreover, optimum predictions were achieved because the ELM, unlike gradient-descent-based algorithms that can become trapped in local minima, produces globally best solutions.

Conclusions
In this study, ELM has successfully been applied to ore grade prediction on heterogeneous data sets. Five variants of the ELM, based on the triangular basis, radial basis, hard limit, sine and sigmoid activation functions, were developed and tested with data from a mine in Ghana. The proposed ELM techniques were then compared with the established state-of-the-art methods of BPNN and the conventional OK. The ELM-Sigmoid model generated grade predictions closer to the actual grade values than the other models applied. It gave the lowest MAE, MSE and RMSE values of 0.0175, 0.0005 and 0.0229 and the highest R² and R of 0.9193 and 0.9588, respectively. Based on the results obtained, it was concluded that the proposed ELM-Sigmoid model has shown encouraging application potential in ore grade estimation and can therefore serve as a suitable alternative to the established BPNN and OK models employed in this research. The efficiency of the proposed ELM-Sigmoid model was attributed to its inherent ability to select its weights and biases randomly, its faster computational speed, the reduced manual tasking in model development, and its production of global minimum solutions, since it is not a gradient descent type of algorithm that can become trapped in local minima.
The developed model for the studied mine could be adopted as an alternative ore grade estimation tool, since the OK method that is usually employed produced poor estimates due to the heterogeneous nature of the deposit. Furthermore, the proposed methodology can be replicated for other deposits, because it has shown excellent generalisation ability and a self-adaptive character that allows it to learn automatically from a given dataset from any mine.