Issues associated with deploying CNN transfer learning to detect COVID-19 from chest X-rays

CNN architectures—brief overview

In recent years, deep learning algorithms in general, and convolutional neural networks (CNNs) in particular, have led to many breakthroughs in computer vision applications such as segmentation, recognition and object detection [22]. Deep learning methods have been successful in automating feature-representation learning and are gradually eliminating the tedious task of handcrafted feature engineering. CNNs attempt to mimic the structure and operation of the human visual cortex by adopting a hierarchy of feature-representation layers. This multi-layer feature representation makes it possible to learn different image features automatically and has enabled CNNs to outperform handcrafted-feature methods [23].

In the 1960s, Hubel and Wiesel [24] studied the monkey visual cortex and found cells that are responsible for constructing the image and detecting light signals in the receptive field. In the same vein, they showed that the monkey’s visual field can be represented using a topographic mapping. In the 1980s, Fukushima and Miyake [25] proposed the Neocognitron, a self-organizing neural network that is regarded as a predecessor of the CNN. In [26], LeCun et al.’s groundbreaking work introduced modern CNN models for handwritten digit recognition; the architecture was later popularized and became known as LeNet. After LeNet, convolutional layers and the backpropagation training algorithm became fundamental building blocks of most modern CNN architectures. In 2012, the AlexNet architecture, proposed by Krizhevsky et al. [27], won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [28] by outperforming other methods and reducing the top-5 error from 26% to 15.3%. This was a turning point, after which CNNs became an exceptionally popular tool in many computer vision tasks. Roughly speaking, AlexNet is similar to LeNet but has a deeper structure and was trained on 1.2 million high-resolution images. Complex architectures with millions of parameters and hyperparameters to train and fine-tune need a substantial amount of computational time and power, and AlexNet popularized the use of powerful computational resources such as graphics processing units (GPUs) to compensate for the increase in trainable parameters.

AlexNet opened the door for researchers around the world to design novel CNN models that are deep yet efficient, especially after ILSVRC became an annual venue for the recognition of new CNN models. The participation of technology giants such as Google, Microsoft and Facebook also helped push research in this direction; in particular, the depth of CNN architectures increased dramatically from 8 layers in 2012 to 152 layers in 2015, which helped the recognition error rate drop to 3.5%. CNN architectures pre-trained on ImageNet have been open-sourced and were immediately used by researchers to transfer the learned knowledge to other application domains, with promising results [29]. Transfer learning (TL) is particularly useful in domains such as medical image analysis, where millions of labeled images are not available; it is therefore natural to take the weights and biases of CNN architectures trained on ImageNet and other large databases and fine-tune them for medical image analysis. Hence, we opt to use 12 deep learning architectures in TL mode and modify their final layers to match the number of classes in our investigation. The deep learning architectures that we used for COVID19 detection from X-ray images are AlexNet, VGG16, VGG19, ResNet18, ResNet50, ResNet101, GoogleNet, InceptionV3, SqueezeNet, Inception-ResNet-v2, Xception and DenseNet201.

In what follows we briefly describe each of the 12 CNN architectures used here and highlight their distinct properties. Giving full details of all 12 CNN models is beyond the scope of this work; we direct the interested reader to the many survey articles on deep learning and CNN architectures, such as [30, 31].

The AlexNet architecture, proposed by Krizhevsky et al. [27], is the winner of ILSVRC 2012 and significantly outperformed handcrafted-feature methods. AlexNet consists of 5 convolutional layers and 3 fully connected layers, together with the rectified linear unit (ReLU) activation function, whose use in deep CNNs it popularized. It can be regarded as a scaled-up version of LeNet: a deeper architecture trained on a larger dataset of images (ImageNet) that benefitted from GPU computational power. A variant of AlexNet with fine-tuned hyperparameters, later named ZF-Net, won ILSVRC 2013 [28]. We use AlexNet in transfer learning mode and modify its last layer according to the number of X-ray image classes, i.e. instead of the 1000 classes that AlexNet was trained on, we use 4 classes, because 4 X-ray classes are used here: COVID19, Bacteria, Viral and Normal. The same TL approach is used for the rest of the CNN models.

VGG architectures were proposed by Oxford University’s Visual Geometry Group [32], hence the acronym VGG. They demonstrated that using small filters of size 3-by-3 in all of the convolutional layers throughout the network leads to better performance. The main intuition behind VGG architectures is that multiple small filters in sequence can imitate the effect of larger filters. Owing to their simplicity of design and generalization power, VGG architectures are widely used. We use VGG16 and VGG19, which consist of 16 and 19 weight layers (convolutional and fully connected), respectively.
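The intuition that stacked small filters can imitate a larger filter can be checked with simple arithmetic. The following is a minimal sketch, not taken from the VGG paper: it assumes stride-1 convolutions, ignores biases, and uses a hypothetical channel width of 64 purely for illustration.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels_in: int, channels_out: int) -> int:
    """Number of weights (biases ignored) in one convolutional layer."""
    return kernel_size * kernel_size * channels_in * channels_out


channels = 64  # hypothetical channel width, chosen only for illustration

# Two stacked 3x3 layers cover the same 5x5 window as a single 5x5 layer.
two_small = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

# Three stacked 3x3 layers cover the same 7x7 window as a single 7x7 layer.
three_small = 3 * conv_weights(3, channels, channels)
one_7x7 = conv_weights(7, channels, channels)

print("receptive fields:", stacked_receptive_field(3, 2), stacked_receptive_field(3, 3))
print("weights, 2 x (3x3) vs 5x5:", two_small, one_5x5)
print("weights, 3 x (3x3) vs 7x7:", three_small, one_7x7)
```

Under these assumptions the stacked 3-by-3 layers reach the same window size with fewer weights, and they also interleave additional non-linearities between the layers.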

The GoogleNet architecture, proposed by Szegedy et al. [33] from Google in 2014, is the winner of ILSVRC 2014. The novelty of GoogleNet is the inception module, which is a small network inside a bigger network. Furthermore, 1-by-1 convolutional layers/blocks are used for dimensionality reduction and feature aggregation. In total, GoogleNet is 22 layers deep with 9 inception modules. Inception V1 (GoogleNet) was later improved in terms of batch normalization, representational bottleneck and computational complexity, which resulted in Inception V2 and V3. Here we opt to use GoogleNet and InceptionV3 [34] in transfer learning mode. In the same vein, we use Xception [35], another architecture from Google, proposed by F. Chollet, which uses the idea of an extreme inception module in which depthwise convolutional layers are applied first, followed by pointwise convolutional layers. In other words, inception modules are replaced by depthwise separable convolutions in such a way that the total number of parameters is the same as in InceptionV3 but the performance on large datasets (350 million images of 17,000 classes) is significantly higher.
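To make the depthwise-then-pointwise idea concrete, the weight count of a depthwise separable convolution can be compared with that of a standard convolution. This is a minimal sketch that ignores biases; the layer sizes (3-by-3 kernel, 128 input channels, 256 output channels) are hypothetical and are not taken from the Xception paper.

```python
def standard_conv_weights(kernel_size: int, channels_in: int, channels_out: int) -> int:
    """Weights of a standard convolution: every output channel sees every input channel."""
    return kernel_size * kernel_size * channels_in * channels_out


def separable_conv_weights(kernel_size: int, channels_in: int, channels_out: int) -> int:
    """Weights of a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * channels_in  # one spatial filter per input channel
    pointwise = channels_in * channels_out               # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
print(standard, separable, round(standard / separable, 1))
```

For this hypothetical layer the separable form needs roughly an order of magnitude fewer weights, which suggests how Xception can afford more layers within an InceptionV3-sized parameter budget.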

ResNet architectures were proposed by He et al. [36] from Microsoft and won ILSVRC 2015. The main innovation in ResNet architectures is the use of residual layers and skip connections to address the vanishing gradient problem, which can stop the weights in the network from updating any further. This is a particular problem in deep networks because the value of the gradient can vanish, i.e. shrink to zero, when the chain rule is applied several times in succession. Skip connections help gradients flow backwards directly from the final layers to the initial layer filters, enabling CNN models to be as deep as 152 layers.
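The effect of repeated chain-rule multiplication, and of a skip connection, can be illustrated with a scalar toy model. This is a sketch under strong simplifying assumptions: each layer is reduced to a single hypothetical local derivative of 0.01, and a residual block is modeled as y = x + F(x), so that its derivative is 1 + F'(x).

```python
def plain_gradient(local_derivatives):
    """Gradient through a plain chain of layers: the product of the local derivatives."""
    grad = 1.0
    for d in local_derivatives:
        grad *= d
    return grad


def residual_gradient(local_derivatives):
    """Gradient through residual blocks y = x + F(x): each factor is 1 + F'(x)."""
    grad = 1.0
    for d in local_derivatives:
        grad *= 1.0 + d
    return grad


# Hypothetical local derivatives for a 20-layer network (illustration only).
derivatives = [0.01] * 20
plain = plain_gradient(derivatives)
residual = residual_gradient(derivatives)
print(plain, residual)
```

In the plain chain the gradient collapses towards zero, whereas the identity path of the skip connection keeps every factor close to one.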

DenseNet, first proposed in 2016 by Huang et al. from Facebook [37], can be regarded as a logical extension of ResNet. In DenseNet, each layer is connected to every other layer in the network in a feed-forward manner, so that each layer takes the features of all preceding layers as inputs. This reduces the risk of vanishing gradients, requires fewer parameters to train and encourages feature-map reuse. The authors also point out that when datasets are used without augmentation, DenseNet is less prone to overfitting. There are a number of DenseNet architectures; we opt to use DenseNet201 for our analysis of COVID19 detection from X-ray images, using the weights trained on the ImageNet dataset in TL mode.

SqueezeNet is a small architecture proposed by Iandola et al. [38] in 2016 that uses the idea of a fire module, which contains 3 filters of size 1-by-1 (the squeeze layer) feeding into an expand layer (4 filters of size 1-by-1 and 4 filters of size 3-by-3). Although SqueezeNet has 50× fewer parameters than AlexNet, it achieves the same accuracy as AlexNet on ImageNet.
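The parameter saving of a fire module can be sketched by counting weights for the filter configuration quoted above (3 filters of size 1-by-1 feeding 4 filters of size 1-by-1 and 4 filters of size 3-by-3). The number of input channels (16) is a hypothetical value chosen for illustration, and biases are ignored.

```python
def fire_module_weights(channels_in, squeeze_1x1, expand_1x1, expand_3x3):
    """Weights of a fire module: 1x1 squeeze filters feeding 1x1 and 3x3 expand filters."""
    squeeze = channels_in * squeeze_1x1         # 1x1 filters reduce the channel count
    expand_small = squeeze_1x1 * expand_1x1     # 1x1 expand filters
    expand_large = 9 * squeeze_1x1 * expand_3x3 # 3x3 expand filters
    return squeeze + expand_small + expand_large


def plain_3x3_weights(channels_in, channels_out):
    """Weights of an ordinary 3x3 convolution with the same number of output channels."""
    return 9 * channels_in * channels_out


channels_in = 16  # hypothetical number of input channels
fire = fire_module_weights(channels_in, squeeze_1x1=3, expand_1x1=4, expand_3x3=4)
plain = plain_3x3_weights(channels_in, channels_out=4 + 4)
print(fire, plain)
```

Squeezing to a few channels before the 3-by-3 filters is what keeps the count low; the exact saving depends on the assumed input width.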

Inception-ResNetV2 is a combined architecture proposed by Szegedy et al. [34] in 2016 that uses inception blocks and residual layers together. The aim of using residual connections is to avoid the degradation problem caused by deep networks and to reduce the training time. The Inception-ResNetV2 architecture used here contains 20 inception-resnet blocks that allow the network to become 164 layers deep, and we use the pre-trained weights in these layers to assist in detecting COVID19 in X-ray images.

Proposed CNN

In this study, we designed a CNN model for COVID-19 detection from chest radiography images, guided by the fact that, in order to properly classify and detect COVID-19, radiologists need to discriminate COVID-19 X-rays from normal chest X-rays first, and then from other viral and bacterial infections, so that the patient can be isolated and treated properly. Therefore, we designed the CNN to make one of the following predictions: (a) Normal (i.e. no infection), (b) COVID-19, (c) Viral infection (non-COVID-19) and (d) Bacterial infection. The rationale behind using these 4 cases is to help radiologists prioritize COVID-19 patients for PCR testing and apply treatments according to the specific cause of infection. With these requirements in mind, we designed a simple CNN architecture, named CNN-X, which consists of 4 parallel layers with 16 filters in each layer, using 3 different filter sizes (3-by-3, 5-by-5 and 9-by-9). Batch normalization and the rectified linear unit (ReLU) are then applied to the convolved images, followed by two different types of pooling operation: average pooling and maximum pooling. The rationale behind using different filter sizes is to detect local features with filters of size 3-by-3 and more global features with filters of size 9-by-9, while the 5-by-5 filters are intended to detect what is missed by the other two.

The different pooling operations are used to reduce the dimensionality of the feature maps. A stride of size 3 is adopted with the pooling operations to reduce the dimension of the resulting feature maps further, taking into consideration that images contain redundant information and that skipping a row and a column after each pooling window does not cause a massive loss of information. See Fig. 1, where we visually depict the difference between pooling of size 3-by-3 with stride 2 and pooling of size 2-by-2 with stride 3, and conclude that little information is lost while the size of the image/feature map is reduced further. The proposed architecture is not deep; hence the feature map (i.e. convolved image) is not yet a very abstract representation of the input image and still contains redundant information.
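The size reduction achieved by the two pooling configurations compared in Fig. 1 follows from the usual output-size formula for unpadded pooling. The sketch below assumes no padding and a hypothetical 224-by-224 feature map; neither value is stated in the text.

```python
def pooled_size(input_size: int, window: int, stride: int) -> int:
    """Output length along one dimension for unpadded pooling."""
    return (input_size - window) // stride + 1


def skipped_lines_between_windows(window: int, stride: int) -> int:
    """Rows (or columns) not covered between two consecutive pooling windows."""
    return max(0, stride - window)


size = 224  # hypothetical feature-map side length, for illustration only

overlapping = pooled_size(size, window=3, stride=2)  # 3-by-3 pooling with stride 2
skipping = pooled_size(size, window=2, stride=3)     # 2-by-2 pooling with stride 3

print(overlapping, skipped_lines_between_windows(3, 2))
print(skipping, skipped_lines_between_windows(2, 3))
```

With a 2-by-2 window and stride 3, exactly one row and one column are skipped between consecutive windows, which is the trade-off the paragraph above describes.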

Feature maps from the four parallel layers are then concatenated before the fully connected layer. Weights are initialized using the Glorot method [39] and optimized with the Adam optimizer [40] with an initial learning rate of 0.0003. Training is conducted for 20 epochs with a mini-batch size of 15. We visualize the structure of the proposed CNN model in Fig. 2.
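For readers unfamiliar with Glorot initialization, the following sketch draws weights from the uniform variant of the scheme, whose bound is sqrt(6 / (fan_in + fan_out)). The fan-in and fan-out values are assumptions: they correspond to a 3-by-3 layer with 16 filters applied to a single-channel image, which the text does not state explicitly, and the actual MATLAB implementation may differ in detail.

```python
import math
import random


def glorot_uniform_limit(fan_in: int, fan_out: int) -> float:
    """Bound of the uniform distribution used by Glorot (Xavier) initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


# Assumed layer: 3x3 kernel, 1 input channel, 16 filters.
fan_in = 3 * 3 * 1
fan_out = 3 * 3 * 16
limit = glorot_uniform_limit(fan_in, fan_out)
weights = glorot_uniform(fan_in, fan_out, random.Random(0))
print(round(limit, 4), len(weights), len(weights[0]))
```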

Dataset description

To investigate and test the CNN architectures explained in sections III and IV, we used X-ray images collected from 3 publicly available sources. The first dataset is a collection of 111 COVID-19 chest X-ray images collected by Cohen [17]. The second dataset is a collection of 5840 chest X-ray images of confirmed normal, bacterial and other non-COVID-19 viral infection cases from Kermany et al. [41]. The third dataset contains 73 confirmed COVID-19 chest X-rays collected from the following websites: Radiological Society of North America (RSNA), Radiopaedia, and Italian Society of Medical and Interventional Radiology (SIRM). This dataset is also publicly available in [42]. In total, 6024 chest X-ray images from the 3 datasets are used, divided into four classes as follows: 1575 normal chest X-rays, 2771 confirmed bacterial infection cases, 1494 viral (non-COVID-19) cases and 184 COVID19 images. Examples of all four radiographic X-ray classes are shown in Fig. 3.

To shed more light on the number and nature of the artifacts present in the 3 datasets used in this work, we inspected every single image to check whether it contains an artifact and, if so, of which type. In Table 1 we report the percentage of images that contain some form of artifact, and in Fig. 4 we highlight different types of artifacts such as text and medical devices.

Each database contains images of different sizes (i.e. different pixel resolutions). Table 1 shows the variety of image resolutions in the databases by presenting the minimum and maximum pixel resolution found in each database.

As can be seen from the percentages in Table 1, a large number of images contain some form of artifact that may affect the diagnostic results produced by CNN models. It is undesirable for artifacts to sway the results of any machine learning classifier, and we show the effect of these artifacts on the diagnostic decisions made by CNN models in the rest of this paper, especially in part A of section III.

Figure 4 depicts different types of text and medical device traces present in the 3 datasets used in our experiments. Some of the artifacts, such as those at the corners of the images, can be removed by cropping or automatic segmentation, but artifacts like the one in the middle image in Fig. 4 are harder to remove, whether automatically or manually. It should also be noted that, despite the small amount of background present in the chest X-ray images, it still affects the decisions of CNN models, as we demonstrate in the next section.

Details of how the images are distributed among the training, validation and test sets are explained in the next section.

Experimental setup and results

We adopted a transfer learning (TL) approach to investigate the performance of the CNN architectures discussed here and to compare it with that of the proposed CNN-X architecture. TL is the process of applying knowledge (learned weights) gained from solving one problem to a different but related problem. The weights obtained from training the 12 CNN models on the ImageNet dataset are used in TL mode such that the weights in all layers are retrained on our X-ray images. All images in the training and testing sets are resized to the input dimensions that each architecture was designed for. No preprocessing is applied to the input images because none of the methods in the literature (so far) mention it, and we followed the same norm. The training parameters in TL for all 12 CNN architectures are as follows: number of epochs = 20, mini-batch size = 15, initial learning rate = 0.0003. All experiments were conducted using MATLAB version 2019b on a machine with a 3.30 GHz Core i5 CPU and 16 GB of RAM. To measure CNN classification performance, four metrics were recorded: sensitivity, specificity, F1-score and classification confidence. To calculate these metrics, the following measures of test classification were computed:

• True Positive (TP): number of correctly identified disease X-ray images.

• False Negative (FN): number of incorrectly classified disease X-ray images.

• True Negative (TN): number of correctly identified healthy X-ray cases.

• False Positive (FP): number of incorrectly identified healthy X-ray cases.

Furthermore, TP refers to disease (COVID-19, bacterial or viral) X-ray images correctly identified as disease X-ray images, while FP refers to normal or other pneumonia cases incorrectly identified as COVID-19. Sensitivity measures the proportion of diseased cases correctly detected by the CNNs, while specificity measures the proportion of healthy cases correctly identified as healthy. The equations for sensitivity and specificity are provided in the appendix, which also contains the F1-score equation. Because the number of COVID-19 chest X-ray images is small in comparison with the other 3 classes, relying on the sensitivity and specificity of CNN models alone can be misleading. Therefore, we also report an estimate of the 95% confidence interval (see the appendix) of the classification error of each CNN model used here, under the assumption that the CNN classification output is normally distributed, i.e. follows a Gaussian distribution. The smaller the confidence interval, the more reliable the predictive model is, and hence the more likely the CNN model is to work on other datasets.
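The metrics described above can be sketched as follows. The sensitivity, specificity and F1-score definitions are the standard ones; the confidence interval uses the common normal-approximation formula, error ± 1.96·sqrt(error·(1 − error)/n), which is an assumption here because the appendix formula is not reproduced in this section. The counts are made up for illustration and are not results from this study.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Proportion of diseased cases that are correctly detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of healthy cases that are correctly identified as healthy."""
    return tn / (tn + fp)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


def error_confidence_interval(error: float, n: int, z: float = 1.96):
    """Normal-approximation interval for a classification error measured on n test images."""
    half_width = z * math.sqrt(error * (1.0 - error) / n)
    return max(0.0, error - half_width), min(1.0, error + half_width)


# Made-up counts, for illustration only (not results from this study).
tp, fn, tn, fp = 90, 10, 180, 20
n = tp + fn + tn + fp
error = (fp + fn) / n

sn = sensitivity(tp, fn)
sp = specificity(tn, fp)
f1 = f1_score(tp, fp, fn)
low, high = error_confidence_interval(error, n)
print(sn, sp, round(f1, 3), round(low, 3), round(high, 3))
```

Because the interval width shrinks with the square root of the number of test images, a class with few test images yields a wide interval even when its sensitivity looks high.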

Three different scenarios, discussed next, were used to test the performance of the 12 off-the-shelf CNN architectures as well as our proposed CNN-X model.

Scenario 1: normal vs COVID-19 classification (all data).

In this scheme, the CNN architectures are trained on 1341 normal X-ray images and 111 COVID-19 cases, while 234 normal cases and 73 COVID-19 cases are used for testing. Table 2 below shows the results obtained from all 13 CNN architectures. The aim of this scenario is to assess how well COVID-19 can be differentiated from normal chest X-rays.

It can be seen from the table above that all of the CNN models (except VGG16 and VGG19) can be deployed successfully to detect COVID-19 X-rays with a sensitivity above 90%. However, the specificity of some of the techniques is below 90%, and these can be avoided in practice. In this vein, one can opt to rely on the highest performing architectures such as Xception, DenseNet201, SqueezeNet and Inception-ResNetV2, as their specificity is > 99%. It should be noted that the performance of our proposed CNN architecture is comparable to that of other state-of-the-art CNN models: it achieves 93% sensitivity and 97% specificity, which is better than AlexNet, GoogleNet, VGG19 and VGG16. Despite the excellent results in Table 2, this is not a realistic scenario on which to build machine learning algorithms for COVID-19 detection at present, because there is no guarantee that the system does not classify other pneumonia infections as COVID-19 and vice versa. Furthermore, differentiating extreme COVID-19 cases from normal chest X-rays may not be of clinical significance; it is the discrimination of COVID-19 from other types of pneumonia that is of particular interest. Hence, we designed the second scenario to address the task of discriminating COVID-19 cases from other viral, bacterial and normal X-ray images.

Scenario 2: normal vs COVID-19 vs viral (non-COVID-19) vs bacteria

In this scenario we aim to classify X-ray images into the 4 respective classes of normal, COVID-19, Bacteria and Viral (non-COVID-19). This scenario addresses the limitation of the first scenario: ultimately, any machine learning algorithm needs to discriminate COVID-19 chest X-rays not only from normal X-rays but also from other viral and bacterial infections. This is a necessary condition for stopping the spread of the virus and preparing COVID-19 patients for special treatment.

A total of 1341 normal X-rays, 2529 Bacteria cases, 1346 Viral X-rays and 111 COVID-19 X-rays are used for training. For testing, 234, 242, 148 and 73 X-rays of the normal, Bacteria, Viral and COVID-19 classes are used, respectively. It is worth noting that we train the models on 111 COVID chest X-rays from the COVIDx dataset but test them on 73 chest X-rays from a different source. This is critical for examining how well the feature maps learnt by a CNN from one source transfer to images coming from a different source. Table 3 below shows the classification performance obtained in this scenario.
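As a consistency check, the training and testing counts given above can be summed per class and compared with the class totals reported in the dataset description (1575 normal, 2771 bacterial, 1494 viral and 184 COVID-19 images, 6024 in total). The sketch below only restates numbers from the text; the dictionary keys are illustrative names.

```python
# Per-class counts quoted in the text for scenario 2.
train = {"normal": 1341, "bacterial": 2529, "viral": 1346, "covid19": 111}
test = {"normal": 234, "bacterial": 242, "viral": 148, "covid19": 73}

# Training plus testing images per class.
totals = {label: train[label] + test[label] for label in train}

# Ratio between the largest training class and the COVID-19 training class.
imbalance = max(train.values()) / train["covid19"]

print(totals)
print(sum(train.values()), sum(test.values()), sum(totals.values()))
print(round(imbalance, 1))
```

The counts are consistent with the dataset description, and they show that the largest training class is more than twenty times the size of the COVID-19 class, which is the imbalance that scenario 3 sets out to reduce.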

Scenario 3: normal vs COVID-19 vs viral vs bacteria (training on part of the data)

In this scenario we used part of the dataset to train the CNN models in order to see how each architecture performs with a smaller number of image samples. The rationale is twofold: the main challenge in medical image analysis is often the limited amount of data available for investigation, and using fewer images reduces the bias caused by an unbalanced number of images per class in the training phase. Hence, this scenario is designed to give more insight into how these CNN models perform when few image samples are available.

In this scenario, the four classes are used with 350 X-ray images each of the normal, Bacteria and viral classes and 111 COVID-19 X-rays for training, whereas the test images for the four classes are the same as in scenario 2.
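A reduced training set of this kind can be built by capping each class at 350 images. The sketch below assumes the images are drawn at random, which the text does not state, and uses placeholder image identifiers rather than real file names.

```python
import random


def subsample(items, max_count, seed=0):
    """Return at most `max_count` items, drawn at random without replacement."""
    if len(items) <= max_count:
        return list(items)
    return random.Random(seed).sample(items, max_count)


# Placeholder identifiers standing in for the scenario 2 training images.
class_sizes = {"normal": 1341, "bacterial": 2529, "viral": 1346, "covid19": 111}
training_ids = {
    label: [f"{label}_{index}" for index in range(size)]
    for label, size in class_sizes.items()
}

# Cap every class at 350 images; the COVID-19 class is already smaller and is kept whole.
reduced = {label: subsample(ids, 350) for label, ids in training_ids.items()}
print({label: len(ids) for label, ids in reduced.items()})
```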

Table 3 shows the experimental results obtained from scenario 2 and scenario 3, where Sn and Sp stand for sensitivity and specificity, respectively. It clearly shows that none of the CNN architectures performs well at differentiating X-rays into all four classes. Perhaps the only exception is Inception-ResnetV2, which performs better than the rest of the architectures, especially on normal X-rays, with a sensitivity of > 76% using all image samples. The good performance of Inception-ResnetV2 is likely due to the combination of residual learning with inception blocks, which yields better performance than using ResNet or GoogleNet/Inception architectures alone. Furthermore, we notice that all CNN models work well at detecting two of the classes, namely Bacteria and COVID-19, but do not perform well at classifying normal and viral X-rays into their respective classes. This suggests that the deployed CNN models learn the features of bacterial and COVID-19 infections better than those of normal cases and non-COVID19 viral infections.

In other words, X-ray images of viral infections and normal cases share features with each other and with the other classes, which makes them hard to distinguish. The second-best performing architecture, using all image samples, is Xception, with sensitivities of 97%, 94%, 66% and 82% for bacteria, COVID-19, normal and viral chest infections, respectively. In scenario 3, where only 350 images each are used from the normal, bacterial and viral chest X-ray classes, Inception-ResnetV2 again outperforms all other CNN architectures, including CNN-X. This confirms the effectiveness of Inception-ResnetV2 in terms of design and learning power. Nonetheless, we remind the reader that the input images have not been segmented and contain artifacts that may contribute to the CNN prediction but have no relation to COVID-19 infection. We confirm this point in the next section (see Figs. 5 and 6), where we show the region(s) of the image used by the CNNs; some, if not all, of these regions are artifacts.

Direct comparison of the best results obtained here, which are those of Inception-ResnetV2, with other works in the literature is not possible because the COVID-19 images used for testing here are different and, more importantly, the number of test images is 73, which is higher than the number used in [2] and [7], where the CNNs were tested on only 8 COVID-19 images. Nonetheless, our results outperform COVID-Net [2] in terms of sensitivity for viral and normal X-ray classification. The sensitivity of Inception-ResNet-V2 also outperforms that of COVID-Net for bacterial, COVID-19 and viral infection classification.

In scenario 2, the proposed CNN-X architecture does not outperform any of the 12 CNN models used if we take the overall classification error of each CNN architecture into consideration; see the 4th column of Table 5 in the appendix. Nonetheless, CNN-X’s overall classification error is 0.341, which is comparable and close to SqueezeNet and VGG19, with classification errors of 0.324 and 0.303, respectively. In scenario 3, CNN-X, with a classification error of 0.377, outperforms 7 CNN models, namely ResNet101, Xception, VGG16, AlexNet, SqueezeNet, ResNet18 and DenseNet201, with classification errors of 0.396, 0.418, 0.436, 0.443, 0.446, 0.449 and 0.494, respectively. The classification errors of scenario 2 and scenario 3 can be seen in Table 5 and Table 6 in the appendix, together with the classification confidence and F1-score of each class. Table 4 contains the elapsed training time of each of the 13 CNN models used here.

Next, we qualitatively analyse the performance of all CNN models used here by visually inspecting the most discriminating regions of the X-ray images used by the CNNs. This step is critical so that radiologists can see which regions the CNNs use to predict the presence of pneumonia in input X-ray images.

CNN interpretability

There are many ways to visualize the region(s) used by CNNs to predict the class label of an input image, such as gradient-weighted class activation mappings (Grad-CAM), global average pooling class activation mappings and others [21, 43, 44]. To interpret the output decision made by any of the CNN architectures investigated in this study, heatmaps of the most discriminating regions are generated and visualized for the test images using the method introduced in [21], which is known as class activation mappings (CAM). Using CAMs, one can highlight the class-specific distinctive regions used by a CNN that lead to its prediction. After a CNN model is fully trained, a test image is fed into the network and feature maps are extracted from the final convolutional layer. In what follows we briefly introduce the procedure for generating CAMs. Let \(A_u(x,y)\) be the activation of unit \(u\) of the last convolutional layer at spatial position \((x,y)\). Let

$$G_u=\sum_{x,y} A_u(x,y)$$

(1)

be the result of the global average pooling operation for unit \(u\) (up to a constant normalization factor). The input to the SoftMax layer can then be defined as follows:

$$S_l=\sum_{u} w_u^l G_u$$

(2)

where \(l\) is the class label and \(w_u^l\) is the weight of class \(l\) for unit \(u\). Here, \(w_u^l\) indicates the importance of the activation \(A_u\) for a given class \(l\). The probability score output by SoftMax for a given class \(l\) can then be defined as follows:

$$P_l=\exp\left(S_l\right)\times {\left(\sum_{l'} \exp\left(S_{l'}\right)\right)}^{-1}$$

(3)

Substituting Eq. (1) into Eq. (2) we obtain the following:

$$S_l=\sum_{u} w_u^l \sum_{x,y} A_u(x,y)=\sum_{x,y} \sum_{u} w_u^l A_u(x,y)$$

(4)

The activation map of each class \(l\) can then be defined at each spatial position \((x,y)\) as follows:

$$M_l(x,y)=\sum_{u} w_u^l A_u(x,y)$$

(5)

Finally, substituting the activation map for each class label from Eq. (5) into Eq. (4), we obtain the input to SoftMax for each class label \(l\) as follows:

$$S_l=\sum_{x,y} M_l(x,y).$$

(6)

Hence, \(M_l(x,y)\) indicates the discriminative power of the activation maps at spatial position \((x,y)\) that leads the CNN to classify the input image into class \(l\). To allow comparison with the input image, bilinear up-sampling is then applied to resize the activation map to the input image size accepted by each CNN model.
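The derivation in Eqs. (1) to (6) can be checked numerically on a toy example. The sketch below uses plain Python lists and made-up activations and weights for two units on a 2-by-2 grid; it computes the class activation map of Eq. (5) and confirms that summing it over all positions, as in Eq. (6), gives the same class score as weighting the pooled activations of Eq. (1).

```python
def pooled_activations(activations):
    """Eq. (1): sum each unit's activation map over all spatial positions."""
    return [sum(sum(row) for row in unit_map) for unit_map in activations]


def class_score_from_pooled(pooled, class_weights):
    """Class score as the weighted sum of the pooled activations."""
    return sum(w * g for w, g in zip(class_weights, pooled))


def class_activation_map(activations, class_weights):
    """Eq. (5): weighted sum of the unit activation maps at every position."""
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for unit_map, weight in zip(activations, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * unit_map[y][x]
    return cam


def class_score_from_cam(cam):
    """Eq. (6): sum the class activation map over all spatial positions."""
    return sum(sum(row) for row in cam)


# Made-up activations of two units on a 2x2 grid and weights for one class.
activations = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
class_weights = [0.5, -1.0]

cam = class_activation_map(activations, class_weights)
score_pooled = class_score_from_pooled(pooled_activations(activations), class_weights)
score_cam = class_score_from_cam(cam)
print(cam, score_pooled, score_cam)
```

Both routes give the same score, so each entry of the map can be read as the contribution of that spatial position to the class score. The bilinear up-sampling step is omitted from this sketch.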

In Fig. 5 we show the image regions used by the CNN models that lead to a successful class prediction. It can be observed that only on very few occasions do the CNN algorithms focus on the frontal region of the chest (i.e. the lung region), where one searches for signs/features of COVID-19 and other infections. Rather, they often use regions outside the frontal view of the chest area; see the 1st column of row (b) and the 3rd and 4th columns of row (e) of Fig. 5. Direct overlaps of CAM hot spots with text can be seen in Fig. 5, especially in the 1st column of row (b), the 1st, 3rd and 4th columns of row (e), the 1st column of row (g) and the 1st–4th columns of row (j). Medical device traces can also be used by CNNs to derive their decisions, as can be seen in Fig. 5, 1st column of rows (b, c, g–j).

Furthermore, ranking the 13 CNN architectures deployed in this study according to their CAMs provides a new way of selecting CNN architectures that is not based solely on the classification results obtained. According to the intersection (overlap) between the lung region and the distribution of CAM hot spots, we ranked the 13 CNN models into 7 categories (R1 being the best and R7 the worst) as follows:

R1: ResNet50.

R2: InceptionV3.

R3: ResNet18 and InceptionResNet.

R4: ResNet101 and Xception.

R6: DenseNet201, SqueezeNet and AlexNet.

R7: VGG16 and VGG19.
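One way to quantify the overlap between the lung region and the CAM hot spots is sketched below. This is only an illustration: the text does not state how the overlap was measured for the ranking above, the 50% threshold is a hypothetical choice, the toy map and lung mask are made up, and the map is assumed to have a positive peak value.

```python
def hot_spot_mask(cam, fraction=0.5):
    """Mark positions whose CAM value is at least `fraction` of the peak value."""
    peak = max(max(row) for row in cam)
    threshold = fraction * peak
    return [[value >= threshold for value in row] for row in cam]


def overlap_fraction(hot_mask, lung_mask):
    """Fraction of hot-spot positions that fall inside the lung region."""
    hot = 0
    inside = 0
    for hot_row, lung_row in zip(hot_mask, lung_mask):
        for is_hot, is_lung in zip(hot_row, lung_row):
            if is_hot:
                hot += 1
                if is_lung:
                    inside += 1
    return inside / hot if hot else 0.0


# Made-up 3x3 CAM with one hot spot in a corner and one inside the lung region.
cam = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.0],
]
lung_mask = [
    [False, False, False],
    [False, True, True],
    [False, True, True],
]

hot = hot_spot_mask(cam)
score = overlap_fraction(hot, lung_mask)
print(score)
```

A score close to 1 would indicate that a model mostly attends to the lungs, while a low score flags predictions driven by text, devices or background.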

In the same vein, incorrect classification may be caused by these artifacts; see Fig. 6, where we show examples of images misclassified by the CNNs and their corresponding CAMs to highlight the most discriminating regions that led to the CNN decisions. For example, the 4th column of most of the rows in Fig. 6 is an X-ray image where text on the image leads to an incorrect classification decision. Specifically, there is a letter R in the top-left corner and small text in the top-right corner of a viral X-ray image, and most of the CNN architectures cheated by using features of this text to obtain their final prediction.

In row (j), column 3 of Fig. 6, we can clearly see that InceptionResNet used the small amount of background in the image to derive its incorrect decision. We draw this conclusion because there is a direct overlap between the CAM and the background region present in this image. The first column of rows (e) and (m) in Fig. 6 is a good example where regions outside the ROI have been used by the VGG19 and Xception architectures to obtain the final classification prediction.

Therefore, we conclude that using X-ray images as they are, without preprocessing to segment the region of interest and remove hidden noise, is not good practice and results in biased and misleading classification predictions.

In other words, we want a CNN model that learns the symptoms (i.e. features) of COVID-19 and whose classification prediction is based solely on these features.