Most topologies assigned all strains to the same main clades as in the whole-genome phylogeny, with the following exceptions: 33-rpoB assigned F. hispaniensis to clade 2, and 19-iglC assigned W. persica to clade 2; in both cases the strain was placed in the F. noatunensis subsp. orientalis subgroup. This is
an interesting observation, as rpoB was recently suggested as an alternative marker to 16S rDNA in metagenomic studies [21]. The level of incompatibility and the difference in resolution compared to the whole-genome reference topology were decreased, in some cases considerably, by selecting an optimal combination of markers. Moreover, topologies based on an optimal set of markers had significantly higher average statistical support (i.e. average bootstrap). Generally, both the degree of compatibility and resolution were improved by concatenating sets of two to seven markers in all possible combinations. However, some combinations, in particular
with respect to incompatibility, might result in poorer topologies than a topology estimated from a single marker. This observation is consistent with previous work in which concatenation of sequence data has resulted in biased phylogenetic estimates [50]. All incompatible phylogenetic signals were removed in topologies based on optimised sets of two to seven markers, in contrast to random concatenation. Totally congruent topologies were obtained by concatenating as few as two markers (08-fabH and 35-tpiA). These two markers were included in all optimal sets. Hence, by selecting an optimal set of markers, a large improvement in resolution and compatibility can be obtained over random concatenation. An exhaustive search strategy was employed to find the optimal set of markers, since the total number of available markers was relatively small. It should be pointed out that the number of possible marker combinations increases rapidly with the number
of markers considered and soon becomes computationally intractable. As all 742 gene fragments of the core genome in the analysed population have recently become available [3], an interesting extension of the current work would be to find the optimal set of markers based on all of those genes. Such an optimisation could be carried out using one of the many available optimisation techniques, such as simulated annealing [51, 52]. It should be noted that we only attempt to minimise the objective metrics, incongruence or resolution difference, with respect to the whole-genome topology. There is no guarantee that the whole-genome topology accurately reflects the true underlying species topology, as systematic errors and statistical inconsistencies in the phylogenetic inference method could be amplified when analysing whole-genome data [50, 53–55].
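To illustrate the scaling argument, the following Python sketch counts the candidate marker sets and contrasts an exhaustive search with a simulated annealing search over sets of fixed size. It is a minimal sketch under stated assumptions and not the procedure used in this study: all function names are hypothetical, and `toy_score` is a placeholder objective. In practice the objective would infer a topology from the concatenated markers and return its incongruence or resolution difference relative to the whole-genome topology.

```python
import math
import random
from itertools import combinations


def count_marker_sets(n_markers, min_size=2, max_size=7):
    """Number of distinct marker sets containing min_size..max_size markers."""
    return sum(math.comb(n_markers, k) for k in range(min_size, max_size + 1))


def toy_score(marker_set, target):
    """Placeholder objective (assumption): distance to a hidden 'ideal' set.
    A real objective would compare an inferred topology to the reference."""
    return len(marker_set ^ target)


def exhaustive_search(markers, score, min_size=2, max_size=7):
    """Score every marker set; only feasible for a small marker panel."""
    candidates = (
        frozenset(combo)
        for k in range(min_size, max_size + 1)
        for combo in combinations(markers, k)
    )
    return min(candidates, key=score)


def simulated_annealing(markers, score, set_size, n_steps=5000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Minimise score over marker sets of fixed size by swapping one marker
    per step, with a geometric cooling schedule (illustrative defaults)."""
    pool = set(markers)
    if not 0 < set_size < len(pool):
        raise ValueError("set_size must be between 1 and len(markers) - 1")
    rng = random.Random(seed)
    current = frozenset(rng.sample(sorted(pool), set_size))
    current_score = score(current)
    best, best_score = current, current_score
    for step in range(n_steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, n_steps - 1))
        # Neighbour: replace one selected marker with one unselected marker.
        removed = rng.choice(sorted(current))
        added = rng.choice(sorted(pool - current))
        candidate = (current - {removed}) | {added}
        delta = score(candidate) - current_score
        # Always accept improvements; accept worse sets with probability
        # exp(-delta / t), which shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
    return best
```

For 742 genes, `count_marker_sets(742)` exceeds 10^16 sets of two to seven markers, which is why a heuristic such as the swap-based annealing above, which evaluates only `n_steps` candidate sets, becomes attractive. Unlike the exhaustive search, it offers no guarantee of finding the global optimum.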