rsRNA DATABASE

Data Introduction

In conventional RNA-sequencing analysis, rRNA-derived sequences have usually been discarded because they were considered by-products of rRNA degradation. In addition, rDNA is usually not included in the reference genome assembly, and even when it is, some information may be lost because the reference transcriptome does not cover all possible variations in rDNA. Many functional small non-coding RNAs derived from rRNA or its precursors (rsRNAs) exist widely in eukaryotes, but in previous studies they were often mistaken for other types of small non-coding RNA, such as miRNA or piRNA. Despite these difficulties, there is growing evidence that rsRNAs may be stable biomolecules that play a functional role in cells.

Relevant single-end sRNA-seq datasets were retrieved from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database. All sequencing data were quality-checked, processed and filtered with FastQC, Cutadapt and fastp. The adapter sequence content of all samples was below 1%.

Reads were first compared against other non-coding RNA databases, namely Ensembl, miRBase, GtRNAdb, Rfam and piRBase. Reads were mapped to these databases using Bowtie with a maximum of one mismatch, and the reads that mapped were removed. The rDNA reference sequences were downloaded from the National Center for Biotechnology Information (NCBI) database: 5S rRNA (NR_023363.1), 12S rRNA (NR_137294.1), 16S rRNA (NR_137295.1) and 45S rRNA (NR_145819.1). The reads retained from the previous step were then mapped to these rDNA reference sequences using Bowtie with a maximum of one mismatch.
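The two-stage logic described above (first subtract reads that match other non-coding RNAs, then assign the remaining reads to rDNA) can be illustrated with a small sketch. This is not Bowtie and not the pipeline actually used here: it replaces the aligner with a naive ungapped scan that tolerates at most one substitution, and the function names `hits_with_mismatches` and `two_stage_filter`, as well as the toy sequences in the usage note, are hypothetical.

```python
def hits_with_mismatches(read, reference, max_mismatches=1):
    """Return True if `read` matches some ungapped window of `reference`
    with at most `max_mismatches` substitutions (a toy stand-in for an aligner)."""
    n = len(read)
    for start in range(len(reference) - n + 1):
        mismatches = 0
        for a, b in zip(read, reference[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches in this window, try the next one
        else:
            return True  # window scanned completely within the mismatch budget
    return False


def two_stage_filter(reads, other_ncrna, rdna, max_mismatches=1):
    """Drop reads matching any sequence in `other_ncrna`, then assign the
    remaining reads to the first rDNA reference they match.

    `other_ncrna` and `rdna` are dicts mapping a name to a sequence (assumed layout).
    Returns a dict mapping each kept read to its rDNA reference name."""
    kept = {}
    for read in reads:
        # Stage 1: remove reads explained by another non-coding RNA class.
        if any(hits_with_mismatches(read, ref, max_mismatches)
               for ref in other_ncrna.values()):
            continue
        # Stage 2: keep only reads that map to an rDNA reference.
        for name, ref in rdna.items():
            if hits_with_mismatches(read, ref, max_mismatches):
                kept[read] = name
                break
    return kept
```

For example, with the made-up references `{"mir": "ACGTACGTAC"}` and `{"5S": "GGGTTTCCCAAAGGG"}`, the read `ACGTACGT` is removed in the first stage, while `GTTTCCCA` (exact match) and `GTTTCGCA` (one mismatch) are both assigned to `5S`. A real aligner additionally handles strand, quality values and multi-mapping, which this sketch ignores.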

Following the miRNA identification approach of miRDeep and miRDeep2, which is based on factors such as expression level, positional distribution and sequence length, we collapsed ("degenerated") the different isoforms of the same rsRNA sequence. For each sample, all sequences were sorted by expression level from high to low and taken one by one as the target sequence. All remaining sequences whose start position differed from that of the target sequence by no more than 4 bases and whose length differed by ≤ 20% of the target sequence length were regarded as reasonable offset sequences of the target. Their read counts were added to those of the target sequence, and the offset sequences were then removed, completing the degeneracy of that target sequence.

All degenerate sequences were integrated into one table. Read counts were first log-transformed to reduce the magnitude of the data and then normalized within each sample to express the proportion of each sequence. The normalized values of the same sequence were summed across all samples to represent the average expression level of that sequence (its weight), and only sequences with a weight > 1 were retained. This yielded 21 5S rsRNAs, 26 12S rsRNAs, 52 16S rsRNAs and 692 45S rsRNAs. Finally, the filtered sequences were recovered using the same screening conditions (start position differing by no more than 4 bases and length difference ≤ 20% of the target sequence), and 791 sequences were identified.
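The isoform merging and weighting steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the database: records are assumed to be `(sequence, start, count)` tuples for one sample on one rDNA reference, each sequence is assumed to occur once per sample, the logarithm is assumed to be `log2(count + 1)`, and normalization is assumed to mean dividing by the per-sample sum. The source does not specify the log base or the exact scaling, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def collapse_isoforms(records, max_offset=4, max_len_frac=0.20):
    """Greedily merge offset isoforms into the most abundant sequence.

    `records` is a list of (sequence, start, count) tuples for one sample."""
    # Process sequences from highest to lowest read count.
    remaining = sorted(records, key=lambda r: r[2], reverse=True)
    collapsed = []
    while remaining:
        seq, start, count = remaining.pop(0)  # current target sequence
        leftover = []
        for other_seq, other_start, other_count in remaining:
            close_start = abs(other_start - start) <= max_offset
            close_len = abs(len(other_seq) - len(seq)) <= max_len_frac * len(seq)
            if close_start and close_len:
                count += other_count  # absorb the offset sequence's reads
            else:
                leftover.append((other_seq, other_start, other_count))
        remaining = leftover
        collapsed.append((seq, start, count))
    return collapsed


def sequence_weights(samples):
    """Sum per-sample normalized log counts for each sequence.

    `samples` maps a sample name to its collapsed (sequence, start, count) list."""
    weights = defaultdict(float)
    for collapsed in samples.values():
        # Assumed transform: log2(count + 1), then proportion within the sample.
        logged = {seq: math.log2(count + 1) for seq, _start, count in collapsed}
        total = sum(logged.values())
        if total == 0:
            continue
        for seq, value in logged.items():
            weights[seq] += value / total
    return dict(weights)
```

A possible usage would be to call `collapse_isoforms` on each sample, pass the results to `sequence_weights`, and keep the sequences whose weight exceeds 1, for instance `{s for s, w in weights.items() if w > 1}`. Because the target is always the most abundant remaining sequence, the representative of each group is the dominant isoform rather than the longest or leftmost one.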