Filter module
The Filter module performs four tasks: extracting soft-clipped sequences, detecting telomere motifs, extracting reads that contain those motifs, and pre-assembly processing of the extracted reads. TeloComp Filter_1 outputs BAM files containing soft-clipped sequences that extend beyond the chromosomal ends, for both ONT and HiFi reads. TeloComp Filter_2 first identifies the main telomere sequence types, displaying the top 10 on the screen and saving the remaining types to a TXT file. After the user selects the desired telomere types, TeloComp Filter_2 extracts the corresponding reads and writes them in FASTA format to separate ONT and HiFi directories. Finally, the processed data are written to the trim_L and trim_R directories.
Terminal Overhang Read Filtering
The first step of the Filter module extracts soft-clipped sequences that extend beyond the chromosomal ends of the genome.
# optional arguments:
# -h, --help show this help message and exit
# --genome Input genome FASTA file.
# --fai Input genome index (FAI) file.
# --ont Input ONT data file (optional).
# --hifi Input HiFi data file (optional).
# --threads Number of threads to use with minimap2.
# --motifs A list of telomeric repeat motifs to use for filtering (optional).
# --max_break Maximum tolerated break length for soft clipping.
# --min_clip Minimum soft-clip length.
# --Ob Output path for the filtered ONT BAM.
# --Hb Output path for the filtered HiFi BAM.
$ telocomp_Filter_1 --genome genome.fasta \
--fai genome.fasta.fai \
--ont ont.fq.gz \
--hifi hifi.fastq.gz \
--threads 50 \
--Ob ont_out.bam --Hb hifi_out.bam
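To make the idea of a soft-clipped sequence extending beyond a chromosomal end concrete, here is a minimal Python sketch. It is not TeloComp's implementation: the function name `terminal_overhangs` and the `end_window` parameter are hypothetical, and `min_clip` only loosely mirrors the `--min_clip` option. It parses a SAM CIGAR string and returns the clipped bases only when the alignment reaches the start or end of the chromosome.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases.
_REF_CONSUMING = set("MDN=X")


def terminal_overhangs(cigar, seq, ref_start, chrom_len, min_clip=1, end_window=0):
    """Return (left, right) soft-clipped sequences that overhang a chromosome end.

    ref_start is the 0-based leftmost aligned reference position.
    end_window (hypothetical) is how far from the chromosome end the
    alignment may stop and still count as terminal.
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard-clipped bases are not present in SEQ, so drop those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    ref_end = ref_start + sum(n for n, op in ops if op in _REF_CONSUMING)

    left = right = ""
    # Left overhang: leading soft clip on an alignment that starts at the chromosome start.
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip and ref_start <= end_window:
        left = seq[:ops[0][0]]
    # Right overhang: trailing soft clip on an alignment that ends at the chromosome end.
    if (len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip
            and chrom_len - ref_end <= end_window):
        right = seq[len(seq) - ops[-1][0]:]
    return left, right


if __name__ == "__main__":
    # A read whose first 5 bases hang off the left end of a 1000 bp chromosome.
    print(terminal_overhangs("5S10M", "TTAGG" + "A" * 10, ref_start=0, chrom_len=1000))
```

Reads whose soft clips fall in the chromosome interior return two empty strings, which is the case a terminal-overhang filter would discard.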
Telomeric Motif Detection and Read Filtering
The second step of the Filter module detects, extracts, and processes reads that contain the telomere motifs of interest. It takes the BAM files as input.
# optional arguments:
# -h, --help show this help message and exit
# --ont_bam Input ONT BAM
# --hifi_bam Input HiFi BAM
# -o, --out_dir Output directory
# -c, --coverage The coverage parameter ranges from 0 to 100 and is used to trim reads
# according to the selected coverage level
# -p, --parallels Degree of parallelism for read processing, default=5
# --min_ratio Ratio of the original genome sequence length to the read
# length, default=0.2
$ telocomp_Filter_2 --ont_bam ont_out.bam \
--hifi_bam hifi_out.bam \
-o output_dir/ \
-c 100 -p 10 --min_ratio 0.2
# When running as a submitted job on a compute node or in the background
# (no interactive terminal), pipe the selection in on standard input:
$ echo "1" | telocomp_Filter_2 --ont_bam ont_out.bam \
--hifi_bam hifi_out.bam \
-o output_dir/ \
-c 100 -p 10 --min_ratio 0.2
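The ranking of telomere sequence types can be illustrated with a small Python sketch. This is an assumption about how such a ranking might work, not TeloComp's actual algorithm: the functions `canonical` and `rank_tandem_motifs` and the default unit length `k=7` are hypothetical. It counts k-mers that are immediately repeated in tandem and merges rotations and reverse complements, so that TTTAGGG and CCCTAAA are reported as a single type.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Smallest string among all rotations of the k-mer and of its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(s[i:] + s[:i] for s in (kmer, rc) for i in range(len(s)))


def rank_tandem_motifs(seqs, k=7, top=10):
    """Rank k-mers that occur as adjacent tandem copies, most frequent first."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 2 * k + 1):
            unit = seq[i:i + k]
            # Count the unit only if the next k bases repeat it exactly
            # and it contains no ambiguous bases such as N.
            if unit == seq[i + k:i + 2 * k] and set(unit) <= set("ACGT"):
                counts[canonical(unit)] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    reads = ["CCCTAAA" * 5, "TTTAGGG" * 3]
    for motif, n in rank_tandem_motifs(reads):
        print(motif, n)
```

With `top=10` the function returns at most ten entries, analogous to the top 10 types shown on screen; a real tool would also need to tolerate sequencing errors within repeats, which this sketch ignores.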