gfidx/presentation/index.qmd
eternal-flame-AD 7e8a46ded0
Upload presentation
Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
2024-08-06 19:49:57 -05:00

143 lines
No EOL
6.4 KiB
Text

---
title: "General Feature index (GFidx)"
subtitle: "Space-efficient and fast index for querying large General Feature Format (GFF) files"
author: ["<redacted>"]
date: "2024-08-01"
format:
revealjs:
transition: fade
theme: sakura.scss
slideNumber: true
embed-resources: true
csl: american-chemical-society.csl
bibliography: presentation.bib
---
## Introduction {.smaller}
- General Feature Format (GFF) files are used to store genomic features and annotations. The GFF3 format is a widely used standard for representing genomic feature hierarchies.
:::: {.columns}
::: {.column width="50%"}
![Sequence Ontology Relationship](http://sequenceontology.org/img/a_slice_of_so.jpg)
:::
::: {.column width="50%"}
Each feature in a GFF file is represented as a line of tab-separated fields. The fields include:
- The range and location of the feature
- The Sequence Ontology (SO) [@noauthor_httpwwwsequenceontologyorg_nodate] term of the feature
- The Sequence Ontology (SO) parent of the feature
- Other attributes such as gene name, gene type, aliases, etc.
:::
::::
## Motivation {.smaller}
- GFF files can be very large, making it difficult to query them efficiently:
- In the Human GENCODE GFF3 file, there are 3.7 million features. A lot of genes have more than 400 features associated with them.
:::: {.columns}
::: {.column width="50%"}
- Often only a subset of the features are needed for analysis.
- Inefficient to load the entire file into memory.
- In a cluster environment, the whole file may be split across multiple nodes
- Looking up a feature by ID or attribute requires a linear scan of the file, which is time and resource consuming.
:::
::: {.column width="50%"}
- Example pipelines and operations include:
- Gene-centric analysis: large number of ID -> feature queries
- RNA alternative splicing analysis: large number of exon -> gene queries
- Genome browser: range to feature queries
- General queries: Ancestors, descendants and overlaps of a feature
:::
::::
## Existing Solutions
- Heuristic-based solutions:
- Assume that features are ordered in a way that related features are close to each other.
- No guarantee that features are ordered in this way
- Only solves the problem for looking up by relation
- Relational databases:
- Load the GFF file into a database
- Requires predefined schema for the attributes
- Querying by range or name/ID prefix is still slow
## GFidx - General Feature index {.smaller}
![GFidx structure](diagrams/all_indices.svg)
- A range index is first built to save the approximate location of each feature in the GFF file. This allows further indexing to only save a reference to the feature in the GFF file instead of the whole feature.
- Two additional types of indices are built:
- Attribute indices: for looking up features by ID or attribute prefix
- Relation tree: for looking up features by parent-child relationships
## Methodology - Lookup by range {.smaller}
:::: {.columns}
::: {.column width="40%"}
- For each sequence, split the range into non-overlapping intervals and record the location of the first and last feature that starts in each interval in the GFF file.
- Split the intervals recursively until the number of features in each interval is small enough, forming a tree structure.
- Some extra data may be returned but by controlling the granularity of the tree, a balance between the amount of unnecessary data and the size of the index can be achieved.
:::
::: {.column width="60%"}
![Range index structure](diagrams/range_index.svg){style="width: 100%"}
:::
::::
## Methodology - Lookup by attribute {.smaller}
- Instead of recording the IDs and attributes by feature, we sort them into a Trie structure, and only record the location of the corresponding feature in the GFF file.
- Fast lookup by ID or attribute prefix.
- Space-efficient: the long ENSEMBL gene IDs are stored only once.
- Fast to build: Only one pass through the GFF file is needed.
![](diagrams/attr_trie.svg){style="width: 100%"}
## Methodology - Lookup by relation {.smaller}
- Parent-child relationships are stored in a tree structure.
- Each node contains the range of the feature and a list of children.
- The tree is built by iterating through the GFF file and adding each feature to the corresponding parent node in the tree.
- Since GFF3 requires all relations to be in the Sequence Ontology structure, the tree structure can be pre-built and reused for different queries.
![](diagrams/relation_tree.svg){style="width: 100%"}
## Results {.smaller}
- A full index of the Human GENCODE [@frankish_gencode_2019] GFF3 file (1.62 GB decompressed) can be built using 3 processors in less than 20 seconds.
- All queries are done in less than 0.2 seconds. Demo queries include:
- All features of CDCA8 (Cell Division Cycle Associated 8) gene - 34 features, 600 $\mu$s, 272 KB downloaded.
- All genes in the SLC35 family (Nucleotide Sugar Transporter Family) - 2,926 features, 128 ms, 22 MB downloaded.
- All genes in chr3 from 650,000 to 1,500,000 bp - 243 features, 1.70 ms, 120 KB downloaded.
- More effort is needed to optimize the space efficiency of the index:
- Simply using an off-the-shelf binary format, the uncompressed index containing all relavant attributes is 268 MB.
- Most of the space is used by the relation index, nearly 100 MB.
## Conclusion
- GFidx has a potential to be a useful tool for both in-memory and over-the-network querying of large GFF files in certain pipelines and applications.
- Web-based services could benefit from GFidx by only downloading the relevant parts of the GFF file.
- Further space optimization and parallelization could be done especially on the trie building and relation tree building steps.
## Future Work {.smaller}
- More research on common queries and operations on GFF files.
- More flexibility on the granularity of indexes to save time and space.
- More efficient encoding of feature locations.
- Compression of the index using BGZF (used in BAM [@noauthor_bam_nodate] files).
- Parallelization of the trie and tree building.
- A WebAssembly build for use in web applications such as IGV Genome Browser: only download the file range that is currently being viewed.
- Support for features with multiple parents (allowed in GFF3 but rarely used).
- Test suites for real-world GFF3 files.
- Integration with existing bioinformatics tools and pipelines.
## References
::: {#refs}
:::