Hi GenomeKit team! I have a request regarding hg38 patch compatibility. Suppose we are using an annotation built on hg38.p12 and define the following three intervals:
import genome_kit as gk
genome = gk.Genome('ncbi_refseq.v109') # uses hg38.p12
itv = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38')
itv_p12 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p12')
itv_p13 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p13')
Only the interval itv_p12 will work with the object genome. For example, if we try to retrieve the sequence:
genome.dna(itv) # error
genome.dna(itv_p12) # works
genome.dna(itv_p13) # error
Similar errors occur when trying to find overlapping genes and transcripts, or creating new intervals based on a combination of these three intervals.
However, the sequence information on the main chromosome does not change between patches, and the intervals are actually compatible with each other. It would be useful to have either:
- Support for combining main chromosome intervals across patches
  - In practice, this would mean that the above operations do not error out, but take advantage of the fact that the coordinates are the same across patches and return the same result as if itv_p12 were used.
- A way of explicitly lifting/translating intervals across patches
  - Imagine we had something like a genome.make_compatible(itv) function that returns an interval on the same patch if the interval is on the main chromosome of the same major assembly.
    - In this case, genome.make_compatible(itv) and genome.make_compatible(itv_p13) should return itv_p12
  - It would be easiest if genome.make_compatible(itv_p12) still returned itv_p12 so we can call the function without checking the reference patch first.
  - If the intervals are not compatible (e.g. different major assemblies, or non-main chromosome), the function should throw an error
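To make the second option concrete, here is a rough sketch of the behaviour I have in mind. It is not GenomeKit code: the Interval dataclass is a stand-in for gk.Interval, make_compatible is the hypothetical function proposed above (written as a free function rather than a Genome method), and the rules for parsing patch names and recognising main chromosomes are my own assumptions.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Interval:
    """Stand-in for gk.Interval with just the fields this sketch needs."""
    chromosome: str
    strand: str
    start: int
    end: int
    reference_genome: str


# Assumption: main chromosomes are chr1..chr22, chrX, chrY and chrM.
# Alt, fix-patch and unplaced contigs are deliberately excluded.
_MAIN_CHROMOSOME = re.compile(r"chr([1-9]|1\d|2[0-2]|X|Y|M)")


def major_assembly(reference_genome: str) -> str:
    """Strip an optional '.pNN' patch suffix, e.g. 'hg38.p12' becomes 'hg38'."""
    return reference_genome.split(".p", 1)[0]


def make_compatible(interval: Interval, target_reference: str) -> Interval:
    """Hypothetical helper: re-label an interval onto target_reference."""
    # Already on the requested patch: return the interval unchanged.
    if interval.reference_genome == target_reference:
        return interval
    # Different major assemblies (e.g. hg19 vs hg38) need a real liftover.
    if major_assembly(interval.reference_genome) != major_assembly(target_reference):
        raise ValueError(
            f"{interval.reference_genome} and {target_reference} "
            "are different major assemblies"
        )
    # Only main chromosomes keep their coordinates across patches.
    if not _MAIN_CHROMOSOME.fullmatch(interval.chromosome):
        raise ValueError(f"{interval.chromosome} is not a main chromosome")
    return replace(interval, reference_genome=target_reference)


itv = Interval("chr7", "-", 65967466, 65968201, "hg38")
itv_p12 = Interval("chr7", "-", 65967466, 65968201, "hg38.p12")
itv_p13 = Interval("chr7", "-", 65967466, 65968201, "hg38.p13")

# All three end up as the same hg38.p12 interval.
converted = [make_compatible(i, "hg38.p12") for i in (itv, itv_p12, itv_p13)]
```

In GenomeKit itself the target reference would presumably come from the genome object rather than being passed as a string, but I have not checked how that would fit with the existing API.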
This would be especially useful when dealing with intervals saved in a database. Currently, we are restricted to always working with the same patch that the interval was saved on, which limits our choice of annotations. This problem will get worse over time.
Let me know if any clarifications are needed. Thank you!