In silico methodologies, such as (quantitative) structure-activity relationships ((Q)SARs), are
available to predict a wide variety of toxicological properties and biological activities for
structurally diverse substances. To gain insight into the scientific value of these predictions,
the capacity of the prediction models to generate sufficiently reliable results for a particular
type of compound needs to be evaluated. In the current study, performance parameters for
predicting the endpoint ‘bacterial mutagenicity’ were calculated for a battery of common
(Q)SAR tools, namely Toxtree, Derek Nexus, VEGA Consensus and Sarah Nexus. Printed
paper and board food contact material (FCM) constituents were chosen as study substances
since many of these lack experimental data, making them an interesting group for in silico
screening. Accuracy, sensitivity, specificity, positive predictivity, negative predictivity and
Matthews correlation coefficient for the individual models and for the combination of VEGA
Consensus and Sarah Nexus were determined and compared. Our results demonstrate that
performance varies among the four models but can be improved by applying a combination
strategy. Furthermore, the importance of the applicability domain is illustrated. Limited
performance in predicting the mutagenic potential of substances that are new to the model (i.e.
not included in the training set) is reported. In this context, the generally poor sensitivity for
these new substances is also addressed.
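
The six performance parameters named above follow from the counts of a 2x2 confusion matrix (predicted versus experimental mutagenicity). The sketch below uses the standard textbook definitions. The function name `performance_parameters` and the example counts are hypothetical illustrations and are not taken from the study, and the handling of empty denominators (returning NaN) is an assumption rather than the authors' stated procedure.

```python
import math


def performance_parameters(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion counts.

    tp: experimentally mutagenic, predicted mutagenic
    tn: experimentally non-mutagenic, predicted non-mutagenic
    fp: experimentally non-mutagenic, predicted mutagenic
    fn: experimentally mutagenic, predicted non-mutagenic
    """

    def ratio(numerator: float, denominator: float) -> float:
        # Assumption: an undefined ratio is reported as NaN instead of raising.
        return numerator / denominator if denominator else float("nan")

    total = tp + tn + fp + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "positive_predictivity": ratio(tp, tp + fp),
        "negative_predictivity": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not data from the study).
    for name, value in performance_parameters(tp=30, tn=50, fp=10, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it more informative when mutagenic and non-mutagenic substances are unevenly represented.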