Computational sequence alignment can be used to compare closely related genomes at the nucleotide level. On an evolutionary timescale, however, the genomes of different species can diverge significantly through frequent nucleotide insertions, deletions, and rearrangements. This inherent genetic variability makes straightforward sequence alignment nearly impossible, so genomes are instead typically compared by the presence or absence of a library of common genes. When common genes cannot be easily identified, few computational methods are available for comparing whole genome sequences. In this invention, the analysis of nucleotide or amino acid frequencies in a genome provides an alternative and phylogenetically deeper way of understanding and identifying relationships between and within organisms.
This invention establishes a method for extracting the frequencies of words from the genomes of organisms. It also defines measures for evaluating the likelihood that one sequence is derived from another, for which several alternative “distances” are designed and implemented. The same strategy can be applied to identify specific genes within a genome and to produce genome profiles that indicate how the word distribution of a section of the genome relates to that of the rest. These genomic profiles reflect very deep phylogenetic relationships that cannot be elucidated by traditional sequence alignment and phylogenetic techniques.
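The general idea of comparing genomes by word frequencies can be illustrated with a minimal Python sketch. It is not the invention's actual method: the word length `k`, the function names, and the choice of Jensen-Shannon divergence as the "distance" are all assumptions made for illustration, since the description does not specify them. The profile function also compares each window against the whole sequence rather than strictly "the rest", which is a simplification.

```python
from collections import Counter
from math import log2


def word_frequencies(sequence, k=3):
    """Relative frequencies of overlapping length-k words (k is an assumed parameter)."""
    sequence = sequence.upper()
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word-frequency tables.

    Chosen here only as one plausible "distance"; it ranges from 0
    (identical distributions) to 1 (no words in common).
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = (pw + qw) / 2
        if pw > 0:
            divergence += 0.5 * pw * log2(pw / mean)
        if qw > 0:
            divergence += 0.5 * qw * log2(qw / mean)
    return divergence


def genome_profile(genome, window=1000, step=500, k=3):
    """Divergence of each sliding window from the whole-sequence word distribution."""
    reference = word_frequencies(genome, k)
    profile = []
    for start in range(0, len(genome) - window + 1, step):
        section = word_frequencies(genome[start:start + window], k)
        profile.append((start, jensen_shannon_divergence(section, reference)))
    return profile
```

Under these assumptions, two sequences would be compared with `jensen_shannon_divergence(word_frequencies(a), word_frequencies(b))`, and windows with unusually high values in `genome_profile` would mark sections whose word usage departs from the sequence as a whole.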
Patent Pending
Tech Ventures Reference: IR 2546