gfidx/presentation/index.qmd

---
title: "General Feature index (GFidx)"
subtitle: "Space-efficient and fast index for querying large General Feature Format (GFF) files"
author: ["<redacted>"]
date: "2024-08-01"
format:
    revealjs:
      transition: fade
      theme: sakura.scss
      slideNumber: true
      embed-resources: true
csl: american-chemical-society.csl
bibliography: presentation.bib
---

## Introduction {.smaller}

- General Feature Format (GFF) files are used to store genomic features and annotations. The GFF3 format is a widely used standard for representing genomic feature hierarchies.

:::: {.columns}

::: {.column width="50%"}
![Sequence Ontology Relationship](http://sequenceontology.org/img/a_slice_of_so.jpg)
:::

::: {.column width="50%"}
Each feature in a GFF file is represented as a line of tab-separated fields. The fields include:

- The range and location of the feature
- The Sequence Ontology (SO) [@noauthor_httpwwwsequenceontologyorg_nodate] term of the feature
- The Sequence Ontology (SO) parent of the feature
- Other attributes such as gene name, gene type, aliases, etc.
:::
::::


## Motivation {.smaller}

- GFF files can be very large, making it difficult to query them efficiently:
- In the Human GENCODE GFF3 file, there are 3.7 million features. A lot of genes have more than 400 features associated with them.

:::: {.columns}

::: {.column width="50%"}
  - Often only a subset of the features are needed for analysis.
    - Inefficient to load the entire file into memory.
    - In a cluster environment, the whole file may be split across multiple nodes
  - Looking up a feature by ID or attribute requires a linear scan of the file, which is time and resource consuming.
:::
::: {.column width="50%"}
  - Example pipelines and operations include:
    - Gene-centric analysis: large number of ID -> feature queries
    - RNA alternative splicing analysis: large number of exon -> gene queries
    - Genome browser: range to feature queries
    - General queries: Ancestors, descendants and overlaps of a feature
:::

::::

## Existing Solutions

- Heuristic-based solutions:
  - Assume that features are ordered in a way that related features are close to each other.
    - No guarantee that features are ordered in this way
    - Only solves the problem for looking up by relation
- Relational databases:
  - Load the GFF file into a database
    - Requires predefined schema for the attributes
    - Querying by range or name/ID prefix is still slow

## GFidx - General Feature index {.smaller}

![GFidx structure](diagrams/all_indices.svg)

- A range index is first built to save the approximate location of each feature in the GFF file. This allows further indexing to only save a reference to the feature in the GFF file instead of the whole feature.
- Two additional types of indices are built:
  - Attribute indices: for looking up features by ID or attribute prefix
  - Relation tree: for looking up features by parent-child relationships

## Methodology - Lookup by range {.smaller}

:::: {.columns}
::: {.column width="40%"}
- For each sequence, split the range into non-overlapping intervals and record the location of the first and last feature that starts in each interval in the GFF file.
- Split the intervals recursively until the number of features in each interval is small enough, forming a tree structure.
- Some extra data may be returned but by controlling the granularity of the tree, a balance between the amount of unnecessary data and the size of the index can be achieved.
:::
::: {.column width="60%"}
![Range index structure](diagrams/range_index.svg){style="width: 100%"}
:::
::::

## Methodology - Lookup by attribute {.smaller}

- Instead of recording the IDs and attributes by feature, we sort them into a Trie structure, and only record the location of the corresponding feature in the GFF file.
  - Fast lookup by ID or attribute prefix.
  - Space-efficient: the long ENSEMBL gene IDs are stored only once.
  - Fast to build: Only one pass through the GFF file is needed.

![](diagrams/attr_trie.svg){style="width: 100%"}

## Methodology - Lookup by relation {.smaller}

- Parent-child relationships are stored in a tree structure.
  - Each node contains the range of the feature and a list of children.
  - The tree is built by iterating through the GFF file and adding each feature to the corresponding parent node in the tree.
  - Since GFF3 requires all relations to be in the Sequence Ontology structure, the tree structure can be pre-built and reused for different queries.

![](diagrams/relation_tree.svg){style="width: 100%"}

## Results {.smaller}

- A full index of the Human GENCODE [@frankish_gencode_2019] GFF3 file (1.62 GB decompressed) can be built using 3 processors in less than 20 seconds.
- All queries are done in less than 0.2 seconds. Demo queries include:
  - All features of CDCA8 (Cell Division Cycle Associated 8) gene - 34 features, 600 $\mu$s, 272 KB downloaded.
  - All genes in the SLC35 family (Nucleotide Sugar Transporter Family) - 2,926 features, 128 ms, 22 MB downloaded.
  - All genes in chr3 from 650,000 to 1,500,000 bp - 243 features, 1.70 ms, 120 KB downloaded.
- More effort is needed to optimize the space efficiency of the index:
  - Simply using an off-the-shelf binary format, the uncompressed index containing all relavant attributes is 268 MB.
  - Most of the space is used by the relation index, nearly 100 MB.

## Conclusion

- GFidx has a potential to be a useful tool for both in-memory and over-the-network querying of large GFF files in certain pipelines and applications.
- Web-based services could benefit from GFidx by only downloading the relevant parts of the GFF file.
- Further space optimization and parallelization could be done especially on the trie building and relation tree building steps.

## Future Work {.smaller}

- More research on common queries and operations on GFF files.
- More flexibility on the granularity of indexes to save time and space.
- More efficient encoding of feature locations.
- Compression of the index using BGZF (used in BAM [@noauthor_bam_nodate] files).
- Parallelization of the trie and tree building.
- A WebAssembly build for use in web applications such as IGV Genome Browser: only download the file range that is currently being viewed.
- Support for features with multiple parents (allowed in GFF3 but rarely used).
- Test suites for real-world GFF3 files.
- Integration with existing bioinformatics tools and pipelines.

## References

::: {#refs}
:::