Filter module

The Filter module performs four tasks: extracting soft-clipped sequences, detecting telomere motifs, extracting reads that contain those motifs, and pre-processing the extracted reads for assembly. TeloComp Filter_1 outputs BAM files containing soft-clipped sequences that extend beyond the chromosomal ends, for both ONT and HiFi reads. TeloComp Filter_2 first identifies the main telomere sequence types, displaying the top 10 on the screen and saving the remaining types to a TXT file. After the user selects the desired telomere types, Filter_2 extracts the corresponding reads and writes them in FASTA format to separate ONT and HiFi directories. Finally, the processed data are written to the trim_L and trim_R directories.

Terminal Overhang Read Filtering

The first step of the Filter module extracts soft-clipped sequences that extend beyond the chromosomal ends of the genome.

# optional arguments:
#   -h, --help   show this help message and exit
#   --genome     Input genome FASTA file.
#   --fai        Input genome index (FAI) file.
#   --ont        Input ONT data file (optional).
#   --hifi       Input HiFi data file (optional).
#   --threads    Number of threads to use with minimap2.
#   --motifs     A list of telomeric repeat motifs to use for filtering (optional).
#   --max_break  Maximum tolerated break length for soft clipping.
#   --min_clip   Minimum soft-clip length.
#   --Ob         Output path for the filtered ONT BAM.
#   --Hb         Output path for the filtered HiFi BAM.

$ telocomp_Filter_1 --genome genome.fasta \
                    --fai genome.fasta.fai \
                    --ont ont.fq.gz \
                    --hifi hifi.fastq.gz \
                    --threads 50 \
                    --Ob ont_out.bam --Hb hifi_out.bam
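
The idea of a soft-clipped read overhanging a chromosome end can be illustrated with a minimal Python sketch. This is not TeloComp's implementation: the function names, the default thresholds, and the reading of `max_break` as "largest allowed distance between the alignment and the chromosome end" and of `min_clip` as "smallest soft clip that counts" are assumptions made for illustration only.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '1200S3000M' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(ops):
    """Number of reference bases consumed by the alignment."""
    return sum(n for n, op in ops if op in "MDN=X")


def overhang_side(pos, cigar, chrom_len, max_break=50, min_clip=100):
    """Return 'L', 'R', or None for a single alignment.

    pos is the 1-based leftmost mapped position (SAM POS field).
    max_break and min_clip are hypothetical defaults, not TeloComp's.
    """
    ops = parse_cigar(cigar)
    if not ops:
        return None
    end = pos + reference_span(ops) - 1  # 1-based last aligned reference base
    left_clip = ops[0][0] if ops[0][1] == "S" else 0
    right_clip = ops[-1][0] if ops[-1][1] == "S" else 0
    # Left overhang: long leading soft clip, alignment starts at the chromosome start.
    if left_clip >= min_clip and pos - 1 <= max_break:
        return "L"
    # Right overhang: long trailing soft clip, alignment ends at the chromosome end.
    if right_clip >= min_clip and chrom_len - end <= max_break:
        return "R"
    return None
```

For example, `overhang_side(1, "1200S3000M", 1_000_000)` classifies the read as a left-end overhang, while the same CIGAR string aligned in the middle of the chromosome is rejected.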

Telomeric Motif Detection and Read Filtering

The second step of the Filter module detects, extracts, and processes reads containing the telomere motifs of interest. It takes the BAM files produced by Filter_1 as input.

# optional arguments:
#    -h, --help       show this help message and exit
#    --ont_bam        Input ONT BAM
#    --hifi_bam       Input HiFi BAM
#    -o, --out_dir    Output directory
#    -c, --coverage   Coverage level (0 to 100) used to trim reads
#    -p, --parallels  Number of reads processed in parallel (default: 5)
#    --min_ratio      Ratio of the original genome sequence length to the read
#                     length (default: 0.2)

$ telocomp_Filter_2 --ont_bam ont_out.bam \
                    --hifi_bam hifi_out.bam \
                    -o output_dir/ \
                    -c 100 -p 10 --min_ratio 0.2

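# When running non-interactively (for example, as a job submitted to a compute node
# or as a background process), pipe the telomere-type selection through standard input: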
$ echo "1" | telocomp_Filter_2 --ont_bam ont_out.bam \
                    --hifi_bam hifi_out.bam \
                    -o output_dir/ \
                    -c 100 -p 10 --min_ratio 0.2
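
The motif-ranking step described above can be sketched in Python as follows. The source does not describe how TeloComp detects motifs, so this is only one possible approach: it looks for short units repeated in tandem, merges rotations and reverse complements into one canonical form, and ranks motifs by the number of reads that contain them. The function names and the parameters `k`, `min_copies`, and `top` are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(motif):
    """Smallest rotation of the motif or of its reverse complement.

    TTTAGGG, TTAGGGT and CCCTAAA all describe the same repeat, so they
    are mapped to a single representative string.
    """
    rc = motif.translate(COMPLEMENT)[::-1]
    return min(m[i:] + m[:i] for m in (motif, rc) for i in range(len(m)))


def tandem_motifs(seq, k=7, min_copies=3):
    """Return canonical k-mers found as >= min_copies consecutive copies in seq."""
    found = set()
    window = k * min_copies
    for i in range(len(seq) - window + 1):
        unit = seq[i:i + k]
        # Skip homopolymers and units containing ambiguous bases such as N.
        if len(set(unit)) < 2 or not set(unit) <= set("ACGT"):
            continue
        if seq[i:i + window] == unit * min_copies:
            found.add(canonical(unit))
    return found


def rank_motifs(reads, k=7, min_copies=3, top=10):
    """Rank motifs by the number of reads in which they occur in tandem."""
    counts = Counter()
    for seq in reads:
        counts.update(tandem_motifs(seq.upper(), k, min_copies))
    return counts.most_common(top)
```

Counting each motif once per read, rather than once per copy, keeps a single very long telomeric read from dominating the ranking. A fixed `k` is a simplification; telomere repeat units vary in length between species, so a fuller version would scan several values of `k`.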