Interval compatibility across patches #85

@eholgersen

Hi GenomeKit team! I have a request regarding hg38 patch compatibility. Suppose we are using an annotation built on hg38.p12 and define the following three intervals:

import genome_kit as gk

genome = gk.Genome('ncbi_refseq.v109') # uses hg38.p12

itv = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38')
itv_p12 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p12')
itv_p13 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p13')

Only the interval itv_p12 will work with the object genome. For example, if we try to retrieve the sequence:

genome.dna(itv) # error
genome.dna(itv_p12) # works
genome.dna(itv_p13) # error

Similar errors occur when trying to find overlapping genes and transcripts, or when creating new intervals from a combination of these three intervals.

However, the sequence information on the main chromosome does not change between patches, and the intervals are actually compatible with each other. It would be useful to have either:

  • Support for combining main chromosome intervals across patches
    • In practice, this would mean that the above operations do not error out, but take advantage of the fact that the coordinates are the same across patches and return the same result as if itv_p12 were used.
  • A way of explicitly lifting/translating intervals across patches
    • Imagine we had something like a genome.make_compatible(itv) function that returns the equivalent interval on the genome's own patch, provided the interval is on a main chromosome of the same major assembly.
      • In this case, genome.make_compatible(itv) and genome.make_compatible(itv_p13) should return itv_p12
    • It would be easiest if genome.make_compatible(itv_p12) still returned itv_p12 so we can call the function without checking the reference patch first.
    • If the intervals are not compatible (e.g. different major assemblies, or non-main chromosome), the function should throw an error.
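To make the second option concrete, here is a rough sketch of the behaviour I have in mind. It is not GenomeKit code: PatchInterval is a hypothetical stand-in for gk.Interval, make_compatible is written as a free function that takes the target reference name rather than a method on the genome, and the set of "main" chromosomes (chr1-22, chrX, chrY) is my assumption.

```python
from dataclasses import dataclass, replace

# Assumption: "main chromosome" means the primary autosomes plus chrX/chrY.
MAIN_CHROMOSOMES = frozenset([f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"])


@dataclass(frozen=True)
class PatchInterval:
    """Hypothetical stand-in for gk.Interval; field names are illustrative."""
    chromosome: str
    strand: str
    start: int
    end: int
    refg: str  # e.g. 'hg38', 'hg38.p12'


def major_assembly(refg: str) -> str:
    """Drop the patch suffix: 'hg38.p12' -> 'hg38', 'hg38' -> 'hg38'."""
    return refg.split(".", 1)[0]


def make_compatible(itv: PatchInterval, target_refg: str) -> PatchInterval:
    """Return itv re-labelled onto target_refg, or raise if that is unsafe."""
    # Already on the requested patch: return unchanged, no checks needed.
    if itv.refg == target_refg:
        return itv
    # Different major assemblies (e.g. hg19 vs hg38) need a real liftover.
    if major_assembly(itv.refg) != major_assembly(target_refg):
        raise ValueError(f"{itv.refg} and {target_refg} are different assemblies")
    # Only main-chromosome coordinates are assumed stable across patches.
    if itv.chromosome not in MAIN_CHROMOSOMES:
        raise ValueError(f"{itv.chromosome} is not a main chromosome")
    # Same coordinates, new reference label.
    return replace(itv, refg=target_refg)


itv = PatchInterval("chr7", "-", 65967466, 65968201, "hg38")
itv_p12 = PatchInterval("chr7", "-", 65967466, 65968201, "hg38.p12")
itv_p13 = PatchInterval("chr7", "-", 65967466, 65968201, "hg38.p13")

converted = [make_compatible(i, "hg38.p12") for i in (itv, itv_p12, itv_p13)]
```

With this, all three entries of converted compare equal to itv_p12, so downstream calls such as genome.dna(...) would only ever see intervals on the genome's own patch.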

This would be especially useful when dealing with intervals saved in a database. Currently, we are restricted to always working with the patch that the interval was saved on, which limits our choice of annotations. This problem will get worse over time as new patches and annotations are released.

Let me know if any clarifications are needed. Thank you!
