Extract Module
Assembly-free Module for Extracting the Longest or Shortest Telomeric Reads
The Extract module is for cases where telomeric reads need to be extracted directly, without assembly. It extracts sequences that lie beyond the chromosome ends and contain telomere motifs, then integrates them directly into the original genome.
# optional arguments:
# -h, --help show this help message and exit
# --Max_length Extract longest reads
# --Min_length Extract shortest reads
# --dir_ont Directory containing ONT files
# --dir_hifi Directory containing HiFi files
# -L , --lgsreads Long-read sequencing data
# -W , --wgs1 Path to WGS reads (read 1)
# -w , --wgs2 Path to WGS reads (read 2)
# -N , --NextPolish Path to NextPolish tool
# -t , --threads Number of threads to use (default: 20)
# --polish Perform polishing with NextPolish
# Parameters of telocomp_Complement
# --dir_Max Select the telomere reads obtained by polishing the longest reads to
# add to the genome
# --dir_Min Select the telomere reads obtained by polishing the shortest reads
# to add to the genome
# -m , --motif Telomeric repeats sequences, e.g., plant: CCCTAAA(TTTAGGG), animal:
# TTAGGG(CCCTAA), etc.
# -M , --motif_num Input the number of bases of the telomere motif
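The -m and -M options are related: -M is the number of bases in the motif given to -m, and the help text lists each motif alongside its reverse complement. As a minimal sketch (not part of the tool; the revcomp helper is a hypothetical name introduced here, and it assumes an uppercase A/C/G/T motif), both values can be derived in the shell:

```shell
# Hypothetical helper, not part of the tool: reverse-complement a DNA motif.
# Assumes the motif contains only uppercase A, C, G, T.
revcomp() {
    s=$1
    out=""
    # Reverse the string one character at a time (POSIX sh, no external tools).
    while [ -n "$s" ]; do
        first=${s%"${s#?}"}   # first character of s
        s=${s#?}              # drop the first character
        out="$first$out"
    done
    # Complement each base.
    printf '%s\n' "$out" | tr 'ACGT' 'TGCA'
}

motif="CCCTAAA"
motif_len=${#motif}   # value to pass to -M
echo "motif=$motif length=$motif_len revcomp=$(revcomp "$motif")"
# prints: motif=CCCTAAA length=7 revcomp=TTTAGGG
```

For the animal motif TTAGGG the same helper gives CCCTAA and a length of 6.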
Extracting the longest telomeric reads
In this step, the longest reads are directly extracted and the results are saved to the MaxLength_L and MaxLength_R directories. The sequences used for genome completion are obtained by merging the FASTA files from MaxLength_L and MaxLength_R into MaxLength_NP.
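To inspect what was extracted, it can help to list the length of each sequence in an output FASTA file. The sketch below is an illustration only, not part of the tool: fasta_lengths is a hypothetical helper, and it assumes plain (uncompressed) FASTA input where each record starts with a ">" header line.

```shell
# Hypothetical helper, not part of the tool: print "name<TAB>length"
# for every record in a plain-text FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: emit the previous record, then start counting again.
            if (seen) print name "\t" len
            name = substr($1, 2)   # header up to the first space, without ">"
            len = 0
            seen = 1
            next
        }
        { len += length($0) }      # sequence lines may be wrapped
        END { if (seen) print name "\t" len }
    ' "$1"
}
```

Piping the result through sort -k2,2nr puts the longest sequences first. The same check applies to the MinLength_L and MinLength_R output of the shortest-read step.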
# No polishing
$ telocomp_maxmin --Max_length \
--dir_ont algn_output_ont \
--dir_hifi algn_output_hifi
# Polishing
$ telocomp_maxmin --Max_length \
--dir_ont algn_output_ont \
--dir_hifi algn_output_hifi \
-L HiFi.fastq.gz \
-W WGS_f1.fq.gz \
-w WGS_r2.fq.gz \
--polish \
-N /PATH/NextPolish -t 50
Extracting the shortest telomeric reads
In this step, the shortest reads are directly extracted and the results are saved to the MinLength_L and MinLength_R directories. The sequences used for genome completion are obtained by merging the FASTA files from MinLength_L and MinLength_R into MinLength_NP.
# No polishing
$ telocomp_maxmin --Min_length \
--dir_ont algn_output_ont \
--dir_hifi algn_output_hifi
# Polishing
$ telocomp_maxmin --Min_length \
--dir_ont algn_output_ont \
--dir_hifi algn_output_hifi \
-L HiFi.fastq.gz \
-W WGS_f1.fq.gz \
-w WGS_r2.fq.gz \
--polish \
-N /PATH/NextPolish -t 50
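The polishing runs depend on several input files (the long reads passed to -L and the paired WGS reads passed to -W and -w). A small pre-flight check can catch a wrong path before a long job starts. This is a sketch under stated assumptions, not part of the tool: require_files is a hypothetical helper, and it only tests that each path exists and is non-empty.

```shell
# Hypothetical helper, not part of the tool: succeed only if every
# argument is an existing, non-empty file; report the rest on stderr.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be chained in front of a polishing command, for example require_files HiFi.fastq.gz WGS_f1.fq.gz WGS_r2.fq.gz && telocomp_maxmin ..., so that the extraction starts only when all inputs are present.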
Complement part
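The Complement part takes the longest or shortest reads extracted in the previous step and integrates them into the corresponding positions of the genome. Use --dir_Max to add the reads from the longest-read extraction and --dir_Min to add those from the shortest-read extraction. The -m option gives the telomere motif and -M its length in bases (7 for CCCTAAA).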
$ telocomp_Complement --dir_Max -G /PATH/test_sequence.fasta -m CCCTAAA -M 7
$ telocomp_Complement --dir_Min -G /PATH/test_sequence.fasta -m CCCTAAA -M 7