Ideas for speed improvement #2

@tetedange13

Hi there,

Your tool works perfectly but is pretty slow: 10 minutes on a GTF of 1,756,105 rows
=> And that was on a filtered GTF; the real one has tens of millions of rows...

Could this be due to the use of Python regex to parse the "attributes" column?

I saw that it could be worth compiling the regex first, outside the main loop (see: https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile)
=> In practice, NO time gain
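
The lack of gain is consistent with how Python's `re` module works: module-level functions such as `re.findall` keep an internal cache of compiled patterns, so a pattern that is reused on every row is not recompiled each time. Below is a minimal timing sketch for comparing the two variants. The pattern `ATTR_PATTERN` and the sample line are assumptions for illustration, not the regex actually used by the tool.

```python
import re
import timeit

# Hypothetical attribute pattern: the tool's real regex is not shown in this issue.
ATTR_PATTERN = r'(\S+) "([^"]*)"'
ATTR_RE = re.compile(ATTR_PATTERN)

# Made-up attributes string in the usual GTF layout (key "value"; ...).
SAMPLE = 'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_biotype "pseudogene";'


def parse_uncompiled(attributes):
    # re.findall fetches the compiled pattern from the module's cache on each call.
    return dict(re.findall(ATTR_PATTERN, attributes))


def parse_compiled(attributes):
    # Uses the pattern object compiled once, outside any loop.
    return dict(ATTR_RE.findall(attributes))


if __name__ == "__main__":
    n = 100_000
    print("uncompiled:", timeit.timeit(lambda: parse_uncompiled(SAMPLE), number=n))
    print("compiled:  ", timeit.timeit(lambda: parse_compiled(SAMPLE), number=n))
```

Any difference between the two timings comes from the cache lookup, not from recompiling the pattern, so it is expected to be small compared with the matching itself.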

There are other regex implementations out there, e.g. https://github.com/mrabarnett/mrab-regex

Otherwise, simply replace the regex with a plain attributes.split("; ")
=> But that can be less robust
=> And in practice the time gain is not huge either
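
For reference, a rough sketch of what a split-based parser could look like. The function name `parse_attributes` is hypothetical, and it assumes the usual GTF layout of `key "value";` pairs. It splits on `;` and strips whitespace rather than splitting on an exact separator string, so it tolerates inconsistent spacing.

```python
def parse_attributes(attributes):
    """Parse a GTF attributes string into a dict without using regex."""
    result = {}
    for field in attributes.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece left after the trailing ';'
        # The key is everything before the first space; the rest is the value.
        key, _, value = field.partition(" ")
        result[key] = value.strip().strip('"')
    return result


print(parse_attributes('gene_id "G1"; gene_name "ABC"; level 2;'))
```

The robustness concern shows up in two places: a `;` inside a quoted value would be split in the middle, and a key that appears more than once on the same line keeps only its last value in this dict-based version.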

Or maybe switch to Polars (pola.rs), which is written in Rust?

Thanks!
Felix.
