# Copyright 2009-2020 by Peter Cock. All rights reserved. | |
# | |
# This file is part of the Biopython distribution and governed by your | |
# choice of the "Biopython License Agreement" or the "BSD 3-Clause License". | |
# Please see the LICENSE file that should have been included as part of this | |
# package. | |
"""Bio.SeqIO support for the FASTQ and QUAL file formats. | |
Note that you are expected to use this code via the Bio.SeqIO interface, as | |
shown below. | |
The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute | |
to bundle a FASTA sequence and its PHRED quality data (integers between 0 and | |
90). Rather than using a single FASTQ file, often paired FASTA and QUAL files | |
are used containing the sequence and the quality information separately. | |
The PHRED software reads DNA sequencing trace files, calls bases, and | |
assigns a non-negative quality value to each called base using a logged | |
transformation of the error probability, Q = -10 log10( Pe ), for example:: | |
Pe = 1.0, Q = 0 | |
Pe = 0.1, Q = 10 | |
Pe = 0.01, Q = 20 | |
... | |
Pe = 0.00000001, Q = 80 | |
Pe = 0.000000001, Q = 90 | |
In typical raw sequence reads, the PHRED quality values will be from 0 to 40.
In the QUAL format these quality values are held as space separated text in
a FASTA like file format. In the FASTQ format, each quality value is encoded
with a single ASCII character using chr(Q+33), meaning zero maps to the
character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
quality are then stored in pairs in a FASTA like format.
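As a rough illustration of the two steps described above, the PHRED transform
and the Sanger offset-33 encoding can be sketched as follows. The helper names
phred_from_error and sanger_char are made up for this sketch and are not part
of Bio.SeqIO:

```python
from math import log10


def phred_from_error(pe):
    """Return Q = -10 log10(Pe) for an error probability 0 < pe <= 1."""
    return -10 * log10(pe)


def sanger_char(q):
    """Encode one integer PHRED score (0 to 93) as a Sanger FASTQ character."""
    if not 0 <= q <= 93:
        raise ValueError("Sanger FASTQ holds PHRED scores from 0 to 93")
    return chr(q + 33)


# An error rate of 1 in 100 corresponds to a PHRED score of about 20.
print(round(phred_from_error(0.01)))
# The characters used for PHRED scores 0, 40 and 80.
print(sanger_char(0), sanger_char(40), sanger_char(80))
```

The range check mirrors the 0 to 93 limit quoted above, since chr(93 + 33) is
the tilde, the last printable ASCII character.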
Unfortunately there is no official document describing the FASTQ file format, | |
and worse, several related but different variants exist. For more details, | |
please read this open access publication:: | |
The Sanger FASTQ file format for sequences with quality scores, and the | |
Solexa/Illumina FASTQ variants. | |
P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), | |
M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). | |
Nucleic Acids Research 2010 38(6):1767-1771 | |
https://doi.org/10.1093/nar/gkp1137 | |
The good news is that Roche 454 sequencers can output files in the QUAL format, | |
and sensibly they use PHRED style scores like Sanger. Converting a pair of
FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL | |
files from a Roche 454 SFF binary file, use the Roche off instrument command | |
line tool "sffinfo" with the -q or -qual argument. You can extract a matching | |
FASTA file using the -s or -seq argument instead. | |
The bad news is that Solexa/Illumina did things differently - they have their | |
own scoring system AND their own incompatible versions of the FASTQ format. | |
Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can | |
be negative. PHRED scores and Solexa scores are NOT interchangeable (but a | |
reasonable mapping can be achieved between them, and they are approximately | |
equal for higher quality reads). | |
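To make the difference concrete, here is a small sketch that evaluates both
formulas for the same error probability. The function names phred_score and
solexa_score are illustrative only:

```python
from math import log10


def phred_score(pe):
    """PHRED style score, Q = -10 log10(Pe)."""
    return -10 * log10(pe)


def solexa_score(pe):
    """Solexa style score, Q = -10 log10(Pe / (1 - Pe))."""
    return -10 * log10(pe / (1 - pe))


# Low quality calls (large Pe) differ a lot, high quality calls barely at all.
for pe in (0.75, 0.5, 0.1, 0.001):
    print("Pe=%g PHRED=%.2f Solexa=%.2f" % (pe, phred_score(pe), solexa_score(pe)))
```

For a random base call (Pe = 0.75) the Solexa score is negative while the
PHRED score is still positive, which is why the two scales need different
minimum values.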
Confusingly early Solexa pipelines produced a FASTQ like file but using their | |
own score mapping and an ASCII offset of 64. To make things worse, for the | |
Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the | |
FASTQ file format, this time using PHRED scores (which is more consistent) but | |
with an ASCII offset of 64. | |
i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ | |
file format: The original Sanger PHRED standard, and two from Solexa/Illumina. | |
The good news is that as of CASAVA version 1.8, Illumina sequencers will | |
produce FASTQ files using the standard Sanger encoding. | |
You are expected to use this module via the Bio.SeqIO functions, with the | |
following format names: | |
- "qual" means simple quality files using PHRED scores (e.g. from Roche 454) | |
- "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII | |
offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). | |
These can potentially hold PHRED scores from 0 to 93. | |
- "fastq-sanger" is an alias for "fastq". | |
- "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ | |
files, using Solexa scores with an ASCII offset 64. These can hold Solexa | |
scores from -5 to 62. | |
- "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using | |
PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 | |
to 62. | |
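The list above can be summarised as a small lookup table. This is only a
sketch for decoding a raw quality string by hand. FASTQ_VARIANTS and
decode_quality are hypothetical names, not part of the Bio.SeqIO API:

```python
# (score type, ASCII offset, lowest score, highest score) for each format name.
FASTQ_VARIANTS = {
    "fastq": ("PHRED", 33, 0, 93),
    "fastq-sanger": ("PHRED", 33, 0, 93),
    "fastq-solexa": ("Solexa", 64, -5, 62),
    "fastq-illumina": ("PHRED", 64, 0, 62),
}


def decode_quality(quality_string, variant):
    """Return (score type, list of integer scores) for one quality string."""
    score_type, offset, low, high = FASTQ_VARIANTS[variant]
    scores = [ord(letter) - offset for letter in quality_string]
    if scores and (min(scores) < low or max(scores) > high):
        raise ValueError("Quality string is not valid for %s" % variant)
    return score_type, scores


print(decode_quality(";;3;", "fastq"))
print(decode_quality("ZZXV", "fastq-illumina"))
```

In normal use you would let Bio.SeqIO do this decoding by passing the right
format name to SeqIO.parse.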
We could potentially add support for "qual-solexa" meaning QUAL files which | |
contain Solexa scores, but thus far there isn't any reason to use such files. | |
For example, consider the following short FASTQ file:: | |
@EAS54_6_R1_2_1_413_324 | |
CCCTTCTTGTCTTCAGCGTTTCTCC | |
+ | |
;;3;;;;;;;;;;;;7;;;;;;;88 | |
@EAS54_6_R1_2_1_540_792 | |
TTGGCAGGCCAAGGCCGATGGATCA | |
+ | |
;;;;;;;;;;;7;;;;;-;;;3;83 | |
@EAS54_6_R1_2_1_443_348 | |
GTTGCTTCTGGCGTGGGTGGGGGGG | |
+ | |
;;;;;;;;;;;9;7;;.7;393333 | |
This contains three reads of length 25. From the read length these were | |
probably originally from an early Solexa/Illumina sequencer but this file | |
follows the Sanger FASTQ convention (PHRED style qualities with an ASCII | |
offset of 33). This means we can parse this file with Bio.SeqIO, giving
"fastq" as the format name:
from Bio import SeqIO
for record in SeqIO.parse("Quality/example.fastq", "fastq"):
    print("%s %s" % (record.id, record.seq))
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC | |
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA | |
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG | |
The qualities are held as a list of integers in each record's annotation: | |
print(record) | |
ID: EAS54_6_R1_2_1_443_348 | |
Name: EAS54_6_R1_2_1_443_348 | |
Description: EAS54_6_R1_2_1_443_348 | |
Number of features: 0 | |
Per letter annotation for: phred_quality | |
Seq('GTTGCTTCTGGCGTGGGTGGGGGGG') | |
print(record.letter_annotations["phred_quality"]) | |
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] | |
You can use the SeqRecord format method to show this in the QUAL format: | |
print(record.format("qual")) | |
>EAS54_6_R1_2_1_443_348 | |
26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 | |
24 18 18 18 18 | |
<BLANKLINE> | |
Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): | |
print(record.format("fastq")) | |
@EAS54_6_R1_2_1_443_348 | |
GTTGCTTCTGGCGTGGGTGGGGGGG | |
+ | |
;;;;;;;;;;;9;7;;.7;393333 | |
<BLANKLINE> | |
Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset | |
of 64): | |
print(record.format("fastq-illumina")) | |
@EAS54_6_R1_2_1_443_348 | |
GTTGCTTCTGGCGTGGGTGGGGGGG | |
+ | |
ZZZZZZZZZZZXZVZZMVZRXRRRR | |
<BLANKLINE> | |
You can also get Biopython to convert the scores and show a Solexa style | |
FASTQ file: | |
print(record.format("fastq-solexa")) | |
@EAS54_6_R1_2_1_443_348 | |
GTTGCTTCTGGCGTGGGTGGGGGGG | |
+ | |
ZZZZZZZZZZZXZVZZMVZRXRRRR | |
<BLANKLINE> | |
Notice that this is actually the same output as above using "fastq-illumina" | |
as the format! The reason for this is all these scores are high enough that | |
the PHRED and Solexa scores are almost equal. The differences become apparent | |
for poor quality reads. See the functions solexa_quality_from_phred and | |
phred_quality_from_solexa for more details. | |
If you wanted to trim your sequences (perhaps to remove low quality regions, | |
or to remove a primer sequence), try slicing the SeqRecord objects. e.g. | |
sub_rec = record[5:15] | |
print(sub_rec) | |
ID: EAS54_6_R1_2_1_443_348 | |
Name: EAS54_6_R1_2_1_443_348 | |
Description: EAS54_6_R1_2_1_443_348 | |
Number of features: 0 | |
Per letter annotation for: phred_quality | |
Seq('TTCTGGCGTG') | |
print(sub_rec.letter_annotations["phred_quality"]) | |
[26, 26, 26, 26, 26, 26, 24, 26, 22, 26] | |
print(sub_rec.format("fastq")) | |
@EAS54_6_R1_2_1_443_348 | |
TTCTGGCGTG | |
+ | |
;;;;;;9;7; | |
<BLANKLINE> | |
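The slice positions do not have to be chosen by hand. As a sketch, the
hypothetical helper below (trim_bounds is not a Biopython function) finds the
start and end of the region left after removing low quality bases from both
ends of a list of PHRED scores. The result could then be used as
record[start:end]:

```python
def trim_bounds(qualities, threshold):
    """Return (start, end) after trimming scores below threshold at both ends."""
    start = 0
    end = len(qualities)
    while start < end and qualities[start] < threshold:
        start += 1
    while end > start and qualities[end - 1] < threshold:
        end -= 1
    return start, end


# The PHRED scores of the third read in the example file above.
quals = [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26,
         22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
start, end = trim_bounds(quals, 20)
print(start, end)
```

This simple approach only looks at the ends of the read. A single poor base
in the middle, such as the score of 13 here, is left in place.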
If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: | |
from Bio import SeqIO | |
record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") | |
with open("Quality/temp.qual", "w") as out_handle:
    SeqIO.write(record_iterator, out_handle, "qual")
3 | |
You can of course read in a QUAL file, such as the one we just created: | |
from Bio import SeqIO | |
for record in SeqIO.parse("Quality/temp.qual", "qual"):
    print("%s read of length %d" % (record.id, len(record.seq)))
EAS54_6_R1_2_1_413_324 read of length 25 | |
EAS54_6_R1_2_1_540_792 read of length 25 | |
EAS54_6_R1_2_1_443_348 read of length 25 | |
Notice that QUAL files don't have a proper sequence present! But the quality | |
information is there: | |
print(record) | |
ID: EAS54_6_R1_2_1_443_348 | |
Name: EAS54_6_R1_2_1_443_348 | |
Description: EAS54_6_R1_2_1_443_348 | |
Number of features: 0 | |
Per letter annotation for: phred_quality | |
Undefined sequence of length 25 | |
print(record.letter_annotations["phred_quality"]) | |
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] | |
Just to keep things tidy, if you are following this example yourself, you can | |
delete this temporary file now: | |
import os | |
os.remove("Quality/temp.qual") | |
Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL | |
files. Because the Bio.SeqIO system is designed for reading single files, you | |
would have to read the two in separately and then combine the data. However, | |
since this is such a common thing to want to do, there is a helper iterator | |
defined in this module that does this for you - PairedFastaQualIterator. | |
Alternatively, if you have enough RAM to hold all the records in memory at once, | |
then a simple dictionary approach would work: | |
from Bio import SeqIO | |
reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) | |
for rec in SeqIO.parse("Quality/example.qual", "qual"):
    reads[rec.id].letter_annotations["phred_quality"] = rec.letter_annotations["phred_quality"]
You can then access any record by its key, and get both the sequence and the | |
quality scores. | |
print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) | |
@EAS54_6_R1_2_1_540_792 | |
TTGGCAGGCCAAGGCCGATGGATCA | |
+ | |
;;;;;;;;;;;7;;;;;-;;;3;83 | |
<BLANKLINE> | |
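The same pairing idea can be sketched without Bio.SeqIO, using plain
dictionaries keyed by read identifier. The function to_sanger_fastq and the
toy data are assumptions made for this illustration only:

```python
def to_sanger_fastq(read_id, sequence, phred_scores):
    """Combine one sequence and its integer PHRED scores into FASTQ text."""
    if len(sequence) != len(phred_scores):
        raise ValueError("Need exactly one quality score per base")
    quality = "".join(chr(q + 33) for q in phred_scores)
    return "@%s\n%s\n+\n%s\n" % (read_id, sequence, quality)


# Toy stand-ins for the contents of a FASTA file and a QUAL file.
sequences = {"read1": "ACGT"}
scores = {"read1": [26, 26, 18, 26]}

merged = {
    read_id: to_sanger_fastq(read_id, seq, scores[read_id])
    for read_id, seq in sequences.items()
}
print(merged["read1"])
```

The length check matters because a FASTA and QUAL pair is only meaningful if
every read has one score per base.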
It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are | |
using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, | |
"fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" | |
for the more recent variant), as this cannot be detected reliably | |
automatically. | |
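A short sketch shows why automatic detection is unreliable. It lists which
variants a quality string could legally belong to, based only on the ASCII
range each variant allows. The function plausible_variants is hypothetical
and is not something Bio.SeqIO provides:

```python
def plausible_variants(quality_string):
    """Return the FASTQ variants whose ASCII range covers every character."""
    codes = [ord(letter) for letter in quality_string]
    found = []
    if all(33 <= code <= 126 for code in codes):
        found.append("fastq-sanger")  # PHRED 0 to 93, offset 33
    if all(59 <= code <= 126 for code in codes):
        found.append("fastq-solexa")  # Solexa -5 to 62, offset 64
    if all(64 <= code <= 126 for code in codes):
        found.append("fastq-illumina")  # PHRED 0 to 62, offset 64
    return found


print(plausible_variants("@ABCDEJT^h"))
print(plausible_variants(";;>@BCJT^h"))
```

A string made only of high ASCII characters is valid in all three variants,
so the characters alone cannot tell you which scores were intended.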
To illustrate this problem, let's consider an artificial example: | |
from Bio.Seq import Seq | |
from Bio.SeqRecord import SeqRecord | |
test = SeqRecord(Seq("NACGTACGTA"), id="Test", description="Made up!") | |
print(test.format("fasta")) | |
>Test Made up! | |
NACGTACGTA | |
<BLANKLINE> | |
print(test.format("fastq")) | |
Traceback (most recent call last): | |
... | |
ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). | |
We created a sample SeqRecord, and can show it in FASTA format - but for QUAL | |
or FASTQ format we need to provide some quality scores. These are held as a | |
list of integers (one for each base) in the letter_annotations dictionary: | |
test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] | |
print(test.format("qual")) | |
>Test Made up! | |
0 1 2 3 4 5 10 20 30 40 | |
<BLANKLINE> | |
print(test.format("fastq")) | |
@Test Made up! | |
NACGTACGTA | |
+ | |
!"#$%&+5?I | |
<BLANKLINE> | |
We can check this FASTQ encoding - the first PHRED quality was zero, and this | |
mapped to an exclamation mark, while the final score was 40 and this mapped to
the letter "I": | |
ord('!') - 33 | |
0 | |
ord('I') - 33 | |
40 | |
[ord(letter)-33 for letter in '!"#$%&+5?I'] | |
[0, 1, 2, 3, 4, 5, 10, 20, 30, 40] | |
Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED | |
scores with an offset of 64: | |
print(test.format("fastq-illumina")) | |
@Test Made up! | |
NACGTACGTA | |
+ | |
@ABCDEJT^h | |
<BLANKLINE> | |
And we can check this too - the first PHRED score was zero, and this mapped to | |
"@", while the final score was 40 and this mapped to "h": | |
ord("@") - 64 | |
0 | |
ord("h") - 64 | |
40 | |
[ord(letter)-64 for letter in "@ABCDEJT^h"] | |
[0, 1, 2, 3, 4, 5, 10, 20, 30, 40] | |
Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style | |
FASTQ files look for the same data! Then we have the older Solexa/Illumina | |
format to consider which encodes Solexa scores instead of PHRED scores. | |
First let's see what Biopython says if we convert the PHRED scores into Solexa | |
scores (rounding to one decimal place): | |
for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]:
    print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)))
PHRED 0 maps to Solexa -5.0 | |
PHRED 1 maps to Solexa -5.0 | |
PHRED 2 maps to Solexa -2.3 | |
PHRED 3 maps to Solexa -0.0 | |
PHRED 4 maps to Solexa 1.8 | |
PHRED 5 maps to Solexa 3.3 | |
PHRED 10 maps to Solexa 9.5 | |
PHRED 20 maps to Solexa 20.0 | |
PHRED 30 maps to Solexa 30.0 | |
PHRED 40 maps to Solexa 40.0 | |
Now here is the record using the old Solexa style FASTQ file: | |
print(test.format("fastq-solexa")) | |
@Test Made up! | |
NACGTACGTA | |
+ | |
;;>@BCJT^h | |
<BLANKLINE> | |
Again, this is using an ASCII offset of 64, so we can check the Solexa scores: | |
[ord(letter)-64 for letter in ";;>@BCJT^h"] | |
[-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] | |
This explains why the last few letters of this FASTQ output matched that using | |
the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores | |
are approximately equal. | |
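The Solexa style quality string shown above can be reproduced by hand. The
sketch below is a standalone re-implementation written for illustration (the
names solexa_from_phred and solexa_quality_string are not part of this
module). It follows the mapping documented for solexa_quality_from_phred,
then rounds and applies the ASCII offset of 64:

```python
from math import log10


def solexa_from_phred(phred_quality):
    """Map a non-negative PHRED score to a Solexa score, with a floor of -5."""
    if phred_quality == 0:
        return -5.0
    return max(-5.0, 10 * log10(10 ** (phred_quality / 10.0) - 1))


def solexa_quality_string(phred_qualities):
    """Encode PHRED scores as a Solexa FASTQ quality string (offset 64)."""
    return "".join(
        chr(int(round(solexa_from_phred(q))) + 64) for q in phred_qualities
    )


print(solexa_quality_string([0, 1, 2, 3, 4, 5, 10, 20, 30, 40]))
```

Only the first few characters differ from the "fastq-illumina" output, since
the rounded Solexa and PHRED scores agree from a score of 10 upwards in this
example.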
""" | |
import warnings | |
from math import log | |
from Bio import BiopythonParserWarning | |
from Bio import BiopythonWarning | |
from Bio import BiopythonDeprecationWarning | |
from Bio import StreamModeError | |
from Bio.File import as_handle | |
from Bio.Seq import Seq | |
from Bio.SeqRecord import SeqRecord | |
from .Interfaces import _clean | |
from .Interfaces import _get_seq_string | |
from .Interfaces import SequenceIterator | |
from .Interfaces import SequenceWriter | |
# define score offsets. See discussion for differences between Sanger and | |
# Solexa offsets. | |
SANGER_SCORE_OFFSET = 33 | |
SOLEXA_SCORE_OFFSET = 64 | |
def solexa_quality_from_phred(phred_quality): | |
"""Convert a PHRED quality (range 0 to about 90) to a Solexa quality. | |
PHRED and Solexa quality scores are both log transformations of a | |
probability of error (high score = low probability of error). This function
takes a PHRED score, transforms it back to a probability of error, and | |
then re-expresses it as a Solexa score. This assumes the error estimates | |
are equivalent. | |
How does this work exactly? Well the PHRED quality is minus ten times the | |
base ten logarithm of the probability of error:: | |
phred_quality = -10*log(error,10) | |
Therefore, turning this round:: | |
error = 10 ** (- phred_quality / 10) | |
Now, Solexa qualities use a different log transformation:: | |
solexa_quality = -10*log(error/(1-error),10) | |
After substitution and a little manipulation we get:: | |
solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) | |
However, real Solexa files use a minimum quality of -5. This does have a | |
good reason - a random base call would be correct 25% of the time, | |
and thus have a probability of error of 0.75, which gives 1.25 as the PHRED | |
quality, or -4.77 as the Solexa quality. Thus (after rounding), a random | |
nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. | |
Taken literally, this logarithmic formula would map a PHRED quality of zero
to a Solexa quality of minus infinity. Of course, taken literally, a PHRED | |
score of zero means a probability of error of one (i.e. the base call is | |
definitely wrong), which is worse than random! In practice, a PHRED quality | |
of zero usually means a default value, or perhaps random - and therefore | |
mapping it to the minimum Solexa score of -5 is reasonable. | |
In conclusion, we follow EMBOSS, and take this logarithmic formula but also | |
apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED | |
quality of zero to -5.0 as well. | |
Note this function will return a floating point number, it is up to you to | |
round this to the nearest integer if appropriate. e.g. | |
>>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) | |
80.00 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) | |
50.00 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) | |
19.96 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) | |
9.54 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) | |
3.35 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) | |
1.80 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) | |
-0.02 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) | |
-2.33 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) | |
-5.00 | |
>>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) | |
-5.00 | |
Notice that for high quality reads PHRED and Solexa scores are numerically | |
equal. The differences are important for poor quality reads, where PHRED | |
has a minimum of zero but Solexa scores can be negative. | |
Finally, as a special case where None is used for a "missing value", None | |
is returned: | |
>>> print(solexa_quality_from_phred(None)) | |
None | |
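The substitution step described above can be checked numerically. This sketch
compares the two-step route (PHRED to error probability to Solexa) with the
closed form, ignoring the minimum of -5. The function names are chosen for
this illustration only:

```python
from math import log10


def solexa_via_error(phred_quality):
    """Two steps: recover the error probability, then apply the Solexa formula."""
    error = 10 ** (-phred_quality / 10.0)
    return -10 * log10(error / (1 - error))


def solexa_closed_form(phred_quality):
    """One step: the formula obtained after substitution."""
    return 10 * log10(10 ** (phred_quality / 10.0) - 1)


for q in (1, 5, 10, 20, 40):
    print("%i %.4f %.4f" % (q, solexa_via_error(q), solexa_closed_form(q)))
```

Both routes are only defined for PHRED scores above zero, which is why zero
is handled as a special case.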
""" | |
    if phred_quality is None:
        # Assume None is used as some kind of NULL or NA value; return None
        # e.g. Bio.SeqIO gives Ace contig gaps a quality of None.
        return None
    elif phred_quality > 0:
        # Solexa uses a minimum value of -5, which after rounding matches a
        # random nucleotide base call.
        return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10))
    elif phred_quality == 0:
        # Special case, map to -5 as discussed in the docstring
        return -5.0
    else:
        raise ValueError(
            f"PHRED qualities must be positive (or zero), not {phred_quality!r}"
        )
def phred_quality_from_solexa(solexa_quality): | |
"""Convert a Solexa quality (which can be negative) to a PHRED quality. | |
PHRED and Solexa quality scores are both log transformations of a | |
probability of error (high score = low probability of error). This function
takes a Solexa score, transforms it back to a probability of error, and | |
then re-expresses it as a PHRED score. This assumes the error estimates | |
are equivalent. | |
The underlying formulas are given in the documentation for the sister | |
function solexa_quality_from_phred, in this case the operation is:: | |
phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) | |
This will return a floating point number, it is up to you to round this to | |
the nearest integer if appropriate. e.g. | |
>>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) | |
80.00 | |
>>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) | |
20.04 | |
>>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) | |
10.41 | |
>>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) | |
3.01 | |
>>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) | |
1.19 | |
Note that a solexa_quality less than -5 is not expected and will trigger a
warning, but it will still be converted as per the logarithmic mapping
(giving a number between 0 and 1.19 back).
As a special case where None is used for a "missing value", None is | |
returned: | |
>>> print(phred_quality_from_solexa(None)) | |
None | |
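As a sanity check on the two formulas, the following sketch converts Solexa
scores to PHRED and back again. The names to_phred and to_solexa are local to
this illustration, and the minimum of -5 and the None handling are
deliberately left out:

```python
from math import log10


def to_phred(solexa_quality):
    """Solexa to PHRED, as in the formula above."""
    return 10 * log10(10 ** (solexa_quality / 10.0) + 1)


def to_solexa(phred_quality):
    """PHRED to Solexa, the inverse mapping (valid for PHRED above zero)."""
    return 10 * log10(10 ** (phred_quality / 10.0) - 1)


for qs in (-5, 0, 10, 20, 40):
    phred = to_phred(qs)
    print("Solexa %i -> PHRED %.2f -> Solexa %.2f" % (qs, phred, to_solexa(phred)))
```

The unrounded values round-trip cleanly. Information is only lost once the
scores are rounded to integers for writing to a file.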
""" | |
    if solexa_quality is None:
        # Assume None is used as some kind of NULL or NA value; return None
        return None
    if solexa_quality < -5:
        warnings.warn(
            f"Solexa quality less than -5 passed, {solexa_quality!r}", BiopythonWarning
        )
    return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
def _get_phred_quality(record): | |
"""Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). | |
If there are no PHRED qualities, but there are Solexa qualities, those are | |
used instead after conversion. | |
""" | |
    try:
        return record.letter_annotations["phred_quality"]
    except KeyError:
        pass
    try:
        return [
            phred_quality_from_solexa(q)
            for q in record.letter_annotations["solexa_quality"]
        ]
    except KeyError:
        raise ValueError(
            "No suitable quality scores found in "
            "letter_annotations of SeqRecord (id=%s)." % record.id
        ) from None
# Only map 0 to 93, we need to give a warning on truncating at 93 | |
_phred_to_sanger_quality_str = { | |
qp: chr(min(126, qp + SANGER_SCORE_OFFSET)) for qp in range(0, 93 + 1) | |
} | |
# Only map -5 to 93, we need to give a warning on truncating at 93 | |
_solexa_to_sanger_quality_str = { | |
qs: chr(min(126, int(round(phred_quality_from_solexa(qs)) + SANGER_SCORE_OFFSET))) | |
for qs in range(-5, 93 + 1) | |
} | |
def _get_sanger_quality_str(record): | |
"""Return a Sanger FASTQ encoded quality string (PRIVATE). | |
>>> from Bio.Seq import Seq | |
>>> from Bio.SeqRecord import SeqRecord | |
>>> r = SeqRecord(Seq("ACGTAN"), id="Test", | |
... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) | |
>>> _get_sanger_quality_str(r) | |
'SI?5+!' | |
If as in the above example (or indeed a SeqRecord parsed with Bio.SeqIO),
the PHRED qualities are integers, this function is able to use a very fast | |
pre-cached mapping. However, if they are floats which differ slightly, then | |
it has to do the appropriate rounding - which is slower: | |
>>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", | |
... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) | |
>>> _get_sanger_quality_str(r2) | |
'SI?5+!' | |
If your scores include a None value, this raises an exception: | |
>>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", | |
... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) | |
>>> _get_sanger_quality_str(r3) | |
Traceback (most recent call last): | |
... | |
TypeError: A quality value of None was found | |
If (strangely) your record has both PHRED and Solexa scores, then the PHRED | |
scores are used in preference: | |
>>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", | |
... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], | |
... "solexa_quality":[-5, -4, 0, None, 0, 40]}) | |
>>> _get_sanger_quality_str(r4) | |
'SI?5+!' | |
If there are no PHRED scores, but there are Solexa scores, these are used | |
instead (after the appropriate conversion): | |
>>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", | |
... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) | |
>>> _get_sanger_quality_str(r5) | |
'I?5+$"' | |
Again, integer Solexa scores can be looked up in a pre-cached mapping making | |
this very fast. You can still use approximate floating point scores: | |
>>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", | |
... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) | |
>>> _get_sanger_quality_str(r6) | |
'I?5+$"' | |
Notice that due to the limited range of printable ASCII characters, a
PHRED quality of 93 is the maximum that can be held in a Sanger FASTQ
file (using ASCII 126, the tilde). This function will issue a warning
when higher scores have to be truncated to fit.
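The rounding and truncation described above can be illustrated with a
standalone sketch that works on a plain list of PHRED scores. The name
sanger_quality_string is hypothetical, and unlike this function the sketch
has no cached lookup, no warning and no Solexa fallback:

```python
def sanger_quality_string(phred_qualities):
    """Round each PHRED score, cap at ASCII 126 and join into one string."""
    return "".join(
        chr(min(126, int(round(q)) + 33)) for q in phred_qualities
    )


print(sanger_quality_string([50, 40, 30, 20, 10, 0]))
print(sanger_quality_string([50.0, 40.05, 29.99, 20, 9.55, 0.01]))
print(sanger_quality_string([93, 94, 100]))
```

Scores above 93 all collapse onto the tilde, which is the data loss that the
warning is meant to report.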
""" | |
# TODO - This function works and is fast, but it is also ugly
# and there is considerable repetition of code for the other | |
# two FASTQ variants. | |
try: | |
# These take priority (in case both Solexa and PHRED scores found) | |
qualities = record.letter_annotations["phred_quality"] | |
except KeyError: | |
# Fall back on solexa scores... | |
pass | |
else: | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_phred_to_sanger_quality_str[qp] for qp in qualities) | |
except KeyError: | |
# Could be a float, or a None in the list, or a high value. | |
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
if max(qualities) >= 93.5: | |
warnings.warn( | |
"Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning | |
) | |
# This will apply the truncation at 93, giving max ASCII 126 | |
return "".join( | |
chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) for qp in qualities | |
) | |
# Fall back on the Solexa scores... | |
try: | |
qualities = record.letter_annotations["solexa_quality"] | |
except KeyError: | |
raise ValueError( | |
"No suitable quality scores found in " | |
"letter_annotations of SeqRecord (id=%s)." % record.id | |
) from None | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_solexa_to_sanger_quality_str[qs] for qs in qualities) | |
except KeyError: | |
# Something odd like a float or None, or a value outside the cached range
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
# Must do this the slow way, first converting the Solexa scores into
# PHRED scores:
if max(qualities) >= 93.5: | |
warnings.warn( | |
"Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning | |
) | |
# This will apply the truncation at 93, giving max ASCII 126 | |
return "".join( | |
chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) | |
for qs in qualities | |
) | |
# Only map 0 to 62, we need to give a warning on truncating at 62 | |
assert 62 + SOLEXA_SCORE_OFFSET == 126 | |
_phred_to_illumina_quality_str = { | |
qp: chr(qp + SOLEXA_SCORE_OFFSET) for qp in range(0, 62 + 1) | |
} | |
# Only map -5 to 62, we need to give a warning on truncating at 62 | |
_solexa_to_illumina_quality_str = { | |
qs: chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET) | |
for qs in range(-5, 62 + 1) | |
} | |
def _get_illumina_quality_str(record): | |
"""Return an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). | |
Notice that due to the limited range of printable ASCII characters, a | |
PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ | |
file (using ASCII 126, the tilde). This function will issue a warning | |
in this situation. | |
""" | |
# TODO - This function works and is fast, but it is also ugly
# and there is considerable repetition of code for the other | |
# two FASTQ variants. | |
try: | |
# These take priority (in case both Solexa and PHRED scores found) | |
qualities = record.letter_annotations["phred_quality"] | |
except KeyError: | |
# Fall back on solexa scores... | |
pass | |
else: | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_phred_to_illumina_quality_str[qp] for qp in qualities) | |
except KeyError: | |
# Could be a float, or a None in the list, or a high value. | |
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
if max(qualities) >= 62.5: | |
warnings.warn( | |
"Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning | |
) | |
# This will apply the truncation at 62, giving max ASCII 126 | |
return "".join( | |
chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) for qp in qualities | |
) | |
# Fall back on the Solexa scores... | |
try: | |
qualities = record.letter_annotations["solexa_quality"] | |
except KeyError: | |
raise ValueError( | |
"No suitable quality scores found in " | |
"letter_annotations of SeqRecord (id=%s)." % record.id | |
) from None | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_solexa_to_illumina_quality_str[qs] for qs in qualities) | |
except KeyError: | |
# Something odd like a float or None, or a value outside the cached range
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
# Must do this the slow way, first converting the Solexa scores into
# PHRED scores:
if max(qualities) >= 62.5: | |
warnings.warn( | |
"Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning | |
) | |
# This will apply the truncation at 62, giving max ASCII 126 | |
return "".join( | |
chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) | |
for qs in qualities | |
) | |
# Only map -5 to 62, we need to give a warning on truncating at 62
assert 62 + SOLEXA_SCORE_OFFSET == 126 | |
_solexa_to_solexa_quality_str = { | |
qs: chr(min(126, qs + SOLEXA_SCORE_OFFSET)) for qs in range(-5, 62 + 1) | |
} | |
# Only map 0 to 62, we need to give a warning on truncating at 62
_phred_to_solexa_quality_str = { | |
qp: chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) | |
for qp in range(0, 62 + 1) | |
} | |
def _get_solexa_quality_str(record): | |
"""Return a Solexa FASTQ encoded quality string (PRIVATE). | |
Notice that due to the limited range of printable ASCII characters, a | |
Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ | |
file (using ASCII 126, the tilde). This function will issue a warning | |
in this situation. | |
""" | |
# TODO - This function works and is fast, but it is also ugly
# and there is considerable repetition of code for the other | |
# two FASTQ variants. | |
try: | |
# These take priority (in case both Solexa and PHRED scores found) | |
qualities = record.letter_annotations["solexa_quality"] | |
except KeyError: | |
# Fall back on PHRED scores... | |
pass | |
else: | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_solexa_to_solexa_quality_str[qs] for qs in qualities) | |
except KeyError: | |
# Could be a float, or a None in the list, or a high value. | |
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
if max(qualities) >= 62.5: | |
warnings.warn( | |
"Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning | |
) | |
# This will apply the truncation at 62, giving max ASCII 126 | |
return "".join( | |
chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) for qs in qualities | |
) | |
# Fall back on the PHRED scores... | |
try: | |
qualities = record.letter_annotations["phred_quality"] | |
except KeyError: | |
raise ValueError( | |
"No suitable quality scores found in " | |
"letter_annotations of SeqRecord (id=%s)." % record.id | |
) from None | |
# Try and use the precomputed mapping: | |
try: | |
return "".join(_phred_to_solexa_quality_str[qp] for qp in qualities) | |
except KeyError: | |
# Either no PHRED scores, or something odd like a float or None | |
# or too big to be in the cache | |
pass | |
if None in qualities: | |
raise TypeError("A quality value of None was found") | |
# Must do this the slow way, first converting the PHRED scores into | |
# Solexa scores: | |
if max(qualities) >= 62.5: | |
warnings.warn( | |
"Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning | |
) | |
return "".join( | |
chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) | |
for qp in qualities | |
) | |
# TODO - Default to nucleotide or even DNA? | |
def FastqGeneralIterator(source): | |
"""Iterate over Fastq records as string tuples (not as SeqRecord objects). | |
Arguments: | |
- source - input stream opened in text mode, or a path to a file | |
This code does not try to interpret the quality string numerically. It | |
just returns tuples of the title, sequence and quality as strings. For | |
the sequence and quality, any whitespace (such as new lines) is removed. | |
Our SeqRecord based FASTQ iterators call this function internally, and then | |
turn the strings into SeqRecord objects, mapping the quality string into
a list of numerical scores. If you want to do a custom quality mapping, | |
then you might consider calling this function directly. | |
For parsing FASTQ files, the title string from the "@" line at the start | |
of each record can optionally be omitted on the "+" lines. If it is | |
repeated, it must be identical. | |
The sequence string and the quality string can optionally be split over | |
multiple lines, although several sources discourage this. In comparison, | |
for the FASTA file format line breaks between 60 and 80 characters are | |
the norm. | |
**WARNING** - Because the "@" character can appear in the quality string,
this can cause problems as this is also the marker for the start of
a new sequence. The "+" sign can appear in the quality string as well. Some
sources recommend having no line breaks in the quality to avoid this,
but even that is not enough; consider this example::
@071113_EAS56_0053:1:1:998:236 | |
TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA | |
+071113_EAS56_0053:1:1:998:236 | |
IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III | |
@071113_EAS56_0053:1:1:182:712 | |
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG | |
+ | |
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ | |
@071113_EAS56_0053:1:1:153:10 | |
TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT | |
+ | |
IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 | |
@071113_EAS56_0053:1:3:990:501 | |
TGGGAGGTTTTATGTGGA | |
AAGCAGCAATGTACAAGA | |
+ | |
IIIIIII.IIIIII1@44 | |
@-7.%<&+/$/%4(++(% | |
These are four PHRED encoded FASTQ entries originally from an NCBI source
(given the read length of 36, these are probably Solexa/Illumina reads where
the quality has been mapped onto the PHRED values).
This example has been edited to illustrate some of the nasty things allowed | |
in the FASTQ format. Firstly, on the "+" lines most but not all of the | |
(redundant) identifiers are omitted. In real files it is likely that all or | |
none of these extra identifiers will be present. | |
Secondly, while the first three sequences have been shown without line | |
breaks, the last has been split over multiple lines. In real files any line | |
breaks are likely to be consistent. | |
Thirdly, some of the quality string lines start with an "@" character. For | |
the second record this is unavoidable. However for the fourth sequence this | |
only happens because its quality string is split over two lines. A naive | |
parser could wrongly treat any line starting with an "@" as the beginning of | |
a new sequence! This code copes with this possible ambiguity by keeping | |
track of the length of the sequence which gives the expected length of the | |
quality string. | |
Using this tricky example file as input, this short bit of code demonstrates | |
what this parsing function would return: | |
>>> with open("Quality/tricky.fastq") as handle: | |
... for (title, sequence, quality) in FastqGeneralIterator(handle): | |
... print(title) | |
... print("%s %s" % (sequence, quality)) | |
... | |
071113_EAS56_0053:1:1:998:236 | |
TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III | |
071113_EAS56_0053:1:1:182:712 | |
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ | |
071113_EAS56_0053:1:1:153:10 | |
TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 | |
071113_EAS56_0053:1:3:990:501 | |
TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% | |
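The length-tracking idea described above can be sketched in a few lines.
This is a simplified illustration only, not the code used by this module:
the name iter_fastq_strings is hypothetical, it takes any iterable of lines,
and it skips the checks on repeated titles and whitespace.

```python
# Hypothetical, simplified sketch of length-based record splitting.
def iter_fastq_strings(lines):
    lines = iter(lines)
    line = next(lines, None)
    while line is not None:
        if not line.startswith("@"):
            raise ValueError("Records should start with '@'")
        title = line[1:].rstrip()
        # Collect sequence lines until the "+" line.
        seq = ""
        for line in lines:
            if line.startswith("+"):
                break
            seq += line.rstrip()
        else:
            raise ValueError("Unexpected end of file")
        # Collect quality lines. An "@" line only starts a new record once
        # we already have as many quality characters as sequence letters.
        qual = ""
        line = None
        for line in lines:
            if line.startswith("@") and len(qual) >= len(seq):
                break
            qual += line.rstrip()
        else:
            line = None  # input exhausted, so this was the last record
        if len(qual) != len(seq):
            raise ValueError("Sequence and quality lengths differ")
        yield title, seq, qual
```

A quality line starting with "@" is therefore only treated as a new title
once the expected number of quality characters has been read.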
Finally we note that some sources state that the quality string should | |
start with "!" (which using the PHRED mapping means the first letter always | |
has a quality score of zero). This rather restrictive rule is not widely | |
observed, so it is ignored here. One plus point about this "!" rule
is that (provided there are no line breaks in the quality sequence) it | |
would prevent the above problem with the "@" character. | |
""" | |
try: | |
handle = open(source) | |
except TypeError: | |
handle = source | |
if handle.read(0) != "": | |
raise StreamModeError("Fastq files must be opened in text mode") from None | |
try: | |
try: | |
line = next(handle) | |
except StopIteration: | |
return # Premature end of file, or just empty? | |
while True: | |
if line[0] != "@": | |
raise ValueError( | |
"Records in Fastq files should start with '@' character" | |
) | |
title_line = line[1:].rstrip() | |
seq_string = "" | |
# There will now be one or more sequence lines; keep going until we | |
# find the "+" marking the quality line: | |
for line in handle: | |
if line[0] == "+": | |
break | |
seq_string += line.rstrip() | |
else: | |
if seq_string: | |
raise ValueError("End of file without quality information.") | |
else: | |
raise ValueError("Unexpected end of file") | |
# The title here is optional, but if present must match! | |
second_title = line[1:].rstrip() | |
if second_title and second_title != title_line: | |
raise ValueError("Sequence and quality captions differ.") | |
# This is going to slow things down a little, but assuming | |
# this isn't allowed we should try and catch it here: | |
if " " in seq_string or "\t" in seq_string: | |
raise ValueError("Whitespace is not allowed in the sequence.") | |
seq_len = len(seq_string) | |
# There will now be at least one line of quality data, followed by | |
# another sequence, or EOF | |
line = None | |
quality_string = "" | |
for line in handle: | |
if line[0] == "@": | |
# This COULD be the start of a new sequence. However, it MAY just | |
# be a line of quality data which starts with a "@" character. We | |
# should be able to check this by looking at the sequence length | |
# and the amount of quality data found so far. | |
if len(quality_string) >= seq_len: | |
# We expect it to be equal if this is the start of a new record. | |
# If the quality data is longer, we'll raise an error below. | |
break | |
# Continue - it's just some (more) quality data.
quality_string += line.rstrip() | |
else: | |
if line is None: | |
raise ValueError("Unexpected end of file") | |
line = None | |
if seq_len != len(quality_string): | |
raise ValueError(
"Lengths of sequence and quality values differ for %s (%i and %i)."
% (title_line, seq_len, len(quality_string))
)
# Return the record and then continue... | |
yield (title_line, seq_string, quality_string) | |
if line is None: | |
break | |
finally: | |
if handle is not source: | |
handle.close() | |
class FastqPhredIterator(SequenceIterator): | |
"""Parser for FASTQ files.""" | |
def __init__(self, source, alphabet=None, title2ids=None): | |
"""Iterate over FASTQ records as SeqRecord objects. | |
Arguments: | |
- source - input stream opened in text mode, or a path to a file | |
- alphabet - optional alphabet, no longer used. Leave as None. | |
- title2ids (DEPRECATED) - A function that, when given the title line | |
from the FASTQ file (without the beginning @), will return the id,
name and description (in that order) for the record as a tuple of | |
strings. If this is not given, then the entire title line will be | |
used as the description, and the first word as the id and name. | |
The use of title2ids matches that of Bio.SeqIO.FastaIO. | |
For each sequence in a (Sanger style) FASTQ file there is a matching string | |
encoding the PHRED qualities (integers between 0 and about 90) using ASCII | |
values with an offset of 33. | |
For example, consider a file containing three short reads:: | |
@EAS54_6_R1_2_1_413_324 | |
CCCTTCTTGTCTTCAGCGTTTCTCC | |
+ | |
;;3;;;;;;;;;;;;7;;;;;;;88 | |
@EAS54_6_R1_2_1_540_792 | |
TTGGCAGGCCAAGGCCGATGGATCA | |
+ | |
;;;;;;;;;;;7;;;;;-;;;3;83 | |
@EAS54_6_R1_2_1_443_348 | |
GTTGCTTCTGGCGTGGGTGGGGGGG | |
+ | |
;;;;;;;;;;;9;7;;.7;393333 | |
For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching | |
string encoding the PHRED qualities using ASCII values with an offset of
33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). | |
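To make the offset-33 encoding concrete, here is a minimal sketch that
decodes the quality string of the first read into integers. The names
SANGER_OFFSET, PHRED_BY_CHAR and decode_sanger_quality are hypothetical and
chosen for this example; the precomputed lookup table covers PHRED 0 to 93.

```python
# Hypothetical helper, not this module's implementation.
SANGER_OFFSET = 33  # assumed ASCII offset for Sanger FASTQ

# Precompute the character-to-score table for PHRED 0 to 93 inclusive.
PHRED_BY_CHAR = {chr(q + SANGER_OFFSET): q for q in range(94)}


def decode_sanger_quality(quality_string):
    try:
        return [PHRED_BY_CHAR[char] for char in quality_string]
    except KeyError:
        raise ValueError("Invalid character in quality string") from None


scores = decode_sanger_quality(";;3;;;;;;;;;;;;7;;;;;;;88")
# scores[:3] is [26, 26, 18], since ord(";") is 59 and ord("3") is 51
```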
Using this module directly you might run: | |
>>> with open("Quality/example.fastq") as handle: | |
... for record in FastqPhredIterator(handle): | |
... print("%s %s" % (record.id, record.seq)) | |
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC | |
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA | |
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG | |
Typically however, you would call this via Bio.SeqIO instead with "fastq" | |
(or "fastq-sanger") as the format: | |
>>> from Bio import SeqIO | |
>>> with open("Quality/example.fastq") as handle: | |
... for record in SeqIO.parse(handle, "fastq"): | |
... print("%s %s" % (record.id, record.seq)) | |
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC | |
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA | |
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG | |
If you want to look at the qualities, they are recorded in each record's
per-letter-annotation dictionary as a simple list of integers: | |
>>> print(record.letter_annotations["phred_quality"]) | |
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] | |
The title2ids argument is deprecated. Instead, please use a generator | |
function to modify the records returned by the parser. For example, to | |
store the mean PHRED quality in the record description, use:
>>> from statistics import mean | |
>>> def modify_records(records): | |
... for record in records: | |
... record.description = mean(record.letter_annotations['phred_quality']) | |
... yield record | |
... | |
>>> with open('Quality/example.fastq') as handle: | |
... for record in modify_records(FastqPhredIterator(handle)): | |
... print(record.id, record.description) | |
... | |
EAS54_6_R1_2_1_413_324 25.28 | |
EAS54_6_R1_2_1_540_792 24.52 | |
EAS54_6_R1_2_1_443_348 23.4 | |
""" | |
if alphabet is not None: | |
raise ValueError("The alphabet argument is no longer supported") | |
if title2ids is not None: | |
warnings.warn(
"The title2ids argument is deprecated. Instead, please use a "
"generator function to modify records returned by the parser. "
"For example, to store the mean PHRED quality in the record "
"description, use\n"
"\n" | |
">>> from statistics import mean\n" | |
">>> def modify_records(records):\n" | |
"... for record in records:\n" | |
"... record.description = mean(record.letter_annotations['phred_quality'])\n" | |
"... yield record\n" | |
"...\n" | |
">>> with open('Quality/example.fastq') as handle:\n" | |
"... for record in modify_records(FastqPhredIterator(handle)):\n" | |
"... print(record.id, record.description)\n" | |
"\n", | |
BiopythonDeprecationWarning, | |
) | |
self.title2ids = title2ids | |
super().__init__(source, mode="t", fmt="Fastq") | |
def parse(self, handle): | |
"""Start parsing the file, and return a SeqRecord generator.""" | |
records = self.iterate(handle) | |
return records | |
def iterate(self, handle): | |
"""Parse the file and generate SeqRecord objects.""" | |
title2ids = self.title2ids | |
assert SANGER_SCORE_OFFSET == ord("!") | |
# Originally, I used a list comprehension for each record:
# | |
# qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] | |
# | |
# Precomputing is faster, perhaps partly by avoiding the subtractions. | |
q_mapping = { | |
chr(letter): letter - SANGER_SCORE_OFFSET | |
for letter in range(SANGER_SCORE_OFFSET, 94 + SANGER_SCORE_OFFSET) | |
} | |
for title_line, seq_string, quality_string in FastqGeneralIterator(handle): | |
if title2ids: | |
id, name, descr = title2ids(title_line) | |
else: | |
descr = title_line | |
id = descr.split()[0] | |
name = id | |
record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) | |
try: | |
qualities = [q_mapping[letter] for letter in quality_string] | |
except KeyError: | |
raise ValueError("Invalid character in quality string") from None | |
# For speed, we now use a dirty trick when assigning the
# qualities. We do this to bypass the length check imposed by the
# per-letter-annotations restricted dict (as this has already been | |
# checked by FastqGeneralIterator). This is equivalent to: | |
# record.letter_annotations["phred_quality"] = qualities | |
dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) | |
yield record | |
def FastqSolexaIterator(source, alphabet=None, title2ids=None): | |
r"""Parse old Solexa/Illumina FASTQ like files (which differ in the quality mapping). | |
The optional arguments are the same as those for the FastqPhredIterator. | |
For each sequence in Solexa/Illumina FASTQ files there is a matching string | |
encoding the Solexa integer qualities using ASCII values with an offset | |
of 64. Solexa scores are scaled differently to PHRED scores, and Biopython | |
will NOT perform any automatic conversion when loading. | |
NOTE - This file format is used by the OLD versions of the Solexa/Illumina | |
pipeline. See also the FastqIlluminaIterator function for the NEW version. | |
For example, consider a file containing these five records:: | |
@SLXA-B3_649_FC8437_R1_1_1_610_79 | |
GATGTGCAATACCTTTGTAGAGGAA | |
+SLXA-B3_649_FC8437_R1_1_1_610_79 | |
YYYYYYYYYYYYYYYYYYWYWYYSU | |
@SLXA-B3_649_FC8437_R1_1_1_397_389 | |
GGTTTGAGAAAGAGAAATGAGATAA | |
+SLXA-B3_649_FC8437_R1_1_1_397_389 | |
YYYYYYYYYWYYYYWWYYYWYWYWW | |
@SLXA-B3_649_FC8437_R1_1_1_850_123 | |
GAGGGTGTTGATCATGATGATGGCG | |
+SLXA-B3_649_FC8437_R1_1_1_850_123 | |
YYYYYYYYYYYYYWYYWYYSYYYSY | |
@SLXA-B3_649_FC8437_R1_1_1_362_549 | |
GGAAACAAAGTTTTTCTCAACATAG | |
+SLXA-B3_649_FC8437_R1_1_1_362_549 | |
YYYYYYYYYYYYYYYYYYWWWWYWY | |
@SLXA-B3_649_FC8437_R1_1_1_183_714 | |
GTATTATTTAATGGCATACACTCAA | |
+SLXA-B3_649_FC8437_R1_1_1_183_714 | |
YYYYYYYYYYWYYYYWYWWUWWWQQ | |
Using this module directly you might run: | |
>>> with open("Quality/solexa_example.fastq") as handle: | |
... for record in FastqSolexaIterator(handle): | |
... print("%s %s" % (record.id, record.seq)) | |
SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA | |
SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA | |
SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG | |
SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG | |
SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA | |
Typically however, you would call this via Bio.SeqIO instead with | |
"fastq-solexa" as the format: | |
>>> from Bio import SeqIO | |
>>> with open("Quality/solexa_example.fastq") as handle: | |
... for record in SeqIO.parse(handle, "fastq-solexa"): | |
... print("%s %s" % (record.id, record.seq)) | |
SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA | |
SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA | |
SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG | |
SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG | |
SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA | |
If you want to look at the qualities, they are recorded in each record's | |
per-letter-annotation dictionary as a simple list of integers: | |
>>> print(record.letter_annotations["solexa_quality"]) | |
[25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] | |
These scores aren't very good, but they are high enough that they map | |
almost exactly onto PHRED scores: | |
>>> print("%0.2f" % phred_quality_from_solexa(25)) | |
25.01 | |
Let's look at a faked example read which is even worse, where there are
more noticeable differences between the Solexa and PHRED scores:: | |
@slxa_0001_1_0001_01 | |
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | |
+slxa_0001_1_0001_01 | |
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; | |
Again, you would typically use Bio.SeqIO to read this file in (rather than | |
calling the Bio.SeqIO.QualityIO module directly). Most FASTQ files will
contain thousands of reads, so you would normally use Bio.SeqIO.parse()
as shown above. This example has only one entry, so instead we can
use the Bio.SeqIO.read() function: | |
>>> from Bio import SeqIO | |
>>> with open("Quality/solexa_faked.fastq") as handle: | |
... record = SeqIO.read(handle, "fastq-solexa") | |
>>> print("%s %s" % (record.id, record.seq)) | |
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | |
>>> print(record.letter_annotations["solexa_quality"]) | |
[40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] | |
These quality scores are so low that when converted from the Solexa scheme | |
into PHRED scores they look quite different: | |
>>> print("%0.2f" % phred_quality_from_solexa(-1)) | |
2.54 | |
>>> print("%0.2f" % phred_quality_from_solexa(-5)) | |
1.19 | |
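The numbers above follow from the usual relationship between the two scales:
Solexa scores use the odds Pe / (1 - Pe) where PHRED scores use Pe itself.
The sketch below reproduces the conversion under that assumption. The
function names solexa_to_phred and phred_to_solexa are hypothetical and are
not the functions defined in this module.

```python
import math


# Hypothetical helpers, not this module's own conversion functions.
def solexa_to_phred(solexa_quality):
    # Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
    return 10 * math.log10(10 ** (solexa_quality / 10) + 1)


def phred_to_solexa(phred_quality):
    # Inverse mapping, only defined here for PHRED scores above zero.
    if phred_quality <= 0:
        raise ValueError("PHRED quality must be positive for this sketch")
    return 10 * math.log10(10 ** (phred_quality / 10) - 1)
```

For high scores the added 1 is negligible, which is why the two scales
nearly coincide there and diverge for poor quality bases.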
Note you can use the Bio.SeqIO.write() function or the SeqRecord's format | |
method to output the record(s): | |
>>> print(record.format("fastq-solexa")) | |
@slxa_0001_1_0001_01 | |
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | |
+ | |
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; | |
<BLANKLINE> | |
Note this output is slightly different from the input file as Biopython | |
has left out the optional repetition of the sequence identifier on the "+" | |
line. If you want to use PHRED scores, use "fastq" or "qual" as the
output format instead, and Biopython will do the conversion for you: | |
>>> print(record.format("fastq")) | |
@slxa_0001_1_0001_01 | |
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | |
+ | |
IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" | |
<BLANKLINE> | |
>>> print(record.format("qual")) | |
>slxa_0001_1_0001_01 | |
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 | |
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 | |
1 1 | |
<BLANKLINE> | |
As shown above, the poor quality Solexa reads have been mapped to the | |
equivalent PHRED score (e.g. -5 to 1 as shown earlier). | |
""" | |
if alphabet is not None: | |
raise ValueError("The alphabet argument is no longer supported") | |
q_mapping = { | |
chr(letter): letter - SOLEXA_SCORE_OFFSET | |
for letter in range(SOLEXA_SCORE_OFFSET - 5, 63 + SOLEXA_SCORE_OFFSET) | |
} | |
for title_line, seq_string, quality_string in FastqGeneralIterator(source): | |
if title2ids: | |
id, name, descr = title2ids(title_line) | |
else: | |
descr = title_line | |
id = descr.split()[0] | |
name = id | |
record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) | |
try: | |
qualities = [q_mapping[letter] for letter in quality_string] | |
# DO NOT convert these into PHRED qualities automatically! | |
except KeyError: | |
raise ValueError("Invalid character in quality string") from None | |
# Dirty trick to speed up this line: | |
# record.letter_annotations["solexa_quality"] = qualities | |
dict.__setitem__(record._per_letter_annotations, "solexa_quality", qualities) | |
yield record | |
def FastqIlluminaIterator(source, alphabet=None, title2ids=None): | |
"""Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). | |
The optional arguments are the same as those for the FastqPhredIterator. | |
For each sequence in Illumina 1.3+ FASTQ files there is a matching string | |
encoding PHRED integer qualities using ASCII values with an offset of 64. | |
>>> from Bio import SeqIO | |
>>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") | |
>>> print("%s %s" % (record.id, record.seq)) | |
Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN | |
>>> max(record.letter_annotations["phred_quality"]) | |
40 | |
>>> min(record.letter_annotations["phred_quality"]) | |
0 | |
NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
with an ASCII offset of 64. Solexa and PHRED scores are approximately equal,
but only for high quality reads. If you have an old Solexa/Illumina file with
negative Solexa scores and try to read it as an Illumina 1.3+ file, it will fail:
>>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") | |
Traceback (most recent call last): | |
... | |
ValueError: Invalid character in quality string | |
NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. | |
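Because Illumina 1.3+ and Sanger files both hold PHRED scores and differ
only in the ASCII offset, a quality string can be re-encoded by shifting
each character. The sketch below illustrates this; the helper name
illumina_to_sanger_quality is hypothetical and not part of this module.

```python
# Hypothetical helper, not part of this module.
ILLUMINA_OFFSET = 64  # Illumina 1.3 to 1.7 FASTQ
SANGER_OFFSET = 33  # Sanger FASTQ


def illumina_to_sanger_quality(quality_string):
    shift = ILLUMINA_OFFSET - SANGER_OFFSET
    converted = []
    for char in quality_string:
        # Characters below "@" would mean a negative PHRED score.
        if ord(char) < ILLUMINA_OFFSET:
            raise ValueError("Invalid character in quality string")
        converted.append(chr(ord(char) - shift))
    return "".join(converted)
```

In practice, reading with "fastq-illumina" and writing with "fastq" via
Bio.SeqIO achieves the same re-encoding.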
""" | |
if alphabet is not None: | |
raise ValueError("The alphabet argument is no longer supported") | |
q_mapping = { | |
chr(letter): letter - SOLEXA_SCORE_OFFSET | |
for letter in range(SOLEXA_SCORE_OFFSET, 63 + SOLEXA_SCORE_OFFSET) | |
} | |
for title_line, seq_string, quality_string in FastqGeneralIterator(source): | |
if title2ids: | |
id, name, descr = title2ids(title_line) | |
else: | |
descr = title_line | |
id = descr.split()[0] | |
name = id | |
record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) | |
try: | |
qualities = [q_mapping[letter] for letter in quality_string] | |
except KeyError: | |
raise ValueError("Invalid character in quality string") from None | |
# Dirty trick to speed up this line: | |
# record.letter_annotations["phred_quality"] = qualities | |
dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) | |
yield record | |
class QualPhredIterator(SequenceIterator): | |
"""Parser for QUAL files with PHRED quality scores but no sequence.""" | |
def __init__(self, source, alphabet=None, title2ids=None): | |
"""For QUAL files which include PHRED quality scores, but no sequence. | |
For example, consider this short QUAL file:: | |
>EAS54_6_R1_2_1_413_324 | |
26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 | |
26 26 26 23 23 | |
>EAS54_6_R1_2_1_540_792 | |
26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 | |
26 18 26 23 18 | |
>EAS54_6_R1_2_1_443_348 | |
26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 | |
24 18 18 18 18 | |
Using this module directly you might run: | |
>>> with open("Quality/example.qual") as handle: | |
... for record in QualPhredIterator(handle): | |
... print("%s read of length %d" % (record.id, len(record.seq))) | |
EAS54_6_R1_2_1_413_324 read of length 25 | |
EAS54_6_R1_2_1_540_792 read of length 25 | |
EAS54_6_R1_2_1_443_348 read of length 25 | |
Typically however, you would call this via Bio.SeqIO instead with "qual" | |
as the format: | |
>>> from Bio import SeqIO | |
>>> with open("Quality/example.qual") as handle: | |
... for record in SeqIO.parse(handle, "qual"): | |
... print("%s read of length %d" % (record.id, len(record.seq))) | |
EAS54_6_R1_2_1_413_324 read of length 25 | |
EAS54_6_R1_2_1_540_792 read of length 25 | |
EAS54_6_R1_2_1_443_348 read of length 25 | |
Only the sequence length is known, as the QUAL file does not contain | |
the sequence string itself. | |
The quality scores themselves are available as a list of integers | |
in each record's per-letter-annotation: | |
>>> print(record.letter_annotations["phred_quality"]) | |
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] | |
You can still slice one of these SeqRecord objects: | |
>>> sub_record = record[5:10] | |
>>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) | |
EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] | |
As of Biopython 1.59, this parser will accept files with negative quality
scores but will replace them with the lowest possible PHRED score of zero.
This triggers a warning; previously it raised a ValueError exception.
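The handling of negative scores can be pictured with the following sketch,
which reads the score lines of a single QUAL record. The helper name
read_qual_scores is hypothetical, and a plain UserWarning stands in for the
Biopython warning class.

```python
import warnings


# Hypothetical helper, not this module's implementation.
def read_qual_scores(lines):
    scores = []
    for line in lines:
        scores.extend(int(word) for word in line.split())
    if scores and min(scores) < 0:
        warnings.warn(
            "Negative quality score %i found, substituting zero" % min(scores)
        )
        # Clamp every negative value to the lowest PHRED score, zero.
        scores = [max(0, q) for q in scores]
    return scores
```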
""" | |
if alphabet is not None: | |
raise ValueError("The alphabet argument is no longer supported") | |
self.title2ids = title2ids | |
super().__init__(source, mode="t", fmt="QUAL") | |
def parse(self, handle): | |
"""Start parsing the file, and return a SeqRecord generator.""" | |
records = self.iterate(handle) | |
return records | |
def iterate(self, handle): | |
"""Parse the file and generate SeqRecord objects.""" | |
title2ids = self.title2ids | |
# Skip any text before the first record (e.g. blank lines, comments) | |
for line in handle: | |
if line[0] == ">": | |
break | |
else: | |
return | |
while True: | |
if line[0] != ">": | |
raise ValueError(
"Records in QUAL files should start with '>' character"
)
if title2ids: | |
id, name, descr = title2ids(line[1:].rstrip()) | |
else: | |
descr = line[1:].rstrip() | |
id = descr.split()[0] | |
name = id | |
qualities = [] | |
for line in handle: | |
if line[0] == ">": | |
break | |
qualities.extend(int(word) for word in line.split()) | |
else: | |
line = None | |
if qualities and min(qualities) < 0: | |
warnings.warn( | |
"Negative quality score %i found, substituting PHRED zero instead." | |
% min(qualities), | |
BiopythonParserWarning, | |
) | |
qualities = [max(0, q) for q in qualities] | |
# Return the record and then continue... | |
sequence = Seq(None, length=len(qualities)) | |
record = SeqRecord(sequence, id=id, name=name, description=descr) | |
# Dirty trick to speed up this line: | |
# record.letter_annotations["phred_quality"] = qualities | |
dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) | |
yield record | |
if line is None: | |
return # StopIteration | |
raise ValueError("Unrecognised QUAL record format.") | |
class FastqPhredWriter(SequenceWriter): | |
"""Class to write standard FASTQ format files (using PHRED quality scores) (OBSOLETE). | |
Although you can use this class directly, you are strongly encouraged | |
to use the ``as_fastq`` function, or top level ``Bio.SeqIO.write()`` | |
function instead via the format name "fastq" or the alias "fastq-sanger". | |
For example, this code reads in a standard Sanger style FASTQ file | |
(using PHRED scores) and re-saves it as another Sanger style FASTQ file: | |
>>> from Bio import SeqIO | |
>>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") | |
>>> with open("Quality/temp.fastq", "w") as out_handle: | |
... SeqIO.write(record_iterator, out_handle, "fastq") | |
3 | |
You might want to do this if the original file included extra line breaks, | |
which while valid may not be supported by all tools. The output file from | |
Biopython will have each sequence on a single line, and each quality | |
string on a single line (which is considered desirable for maximum | |
compatibility). | |
In this next example, an old style Solexa/Illumina FASTQ file (using Solexa | |
quality scores) is converted into a standard Sanger style FASTQ file using | |
PHRED qualities: | |
>>> from Bio import SeqIO | |
>>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") | |
>>> with open("Quality/temp.fastq", "w") as out_handle: | |
... SeqIO.write(record_iterator, out_handle, "fastq") | |
5 | |
This code is also called if you use the .format("fastq") method of a | |
SeqRecord, or .format("fastq-sanger") if you prefer that alias. | |
Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is | |
encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a | |
warning is issued. | |
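The truncation can be sketched as follows. This is an illustration only:
the helper name encode_sanger_quality is hypothetical, a plain UserWarning
stands in for the Biopython warning class, and the 93.5 threshold is an
assumption chosen so that values rounding to 93 do not warn.

```python
import warnings


# Hypothetical helper, not this module's implementation.
def encode_sanger_quality(phred_scores):
    if phred_scores and max(phred_scores) >= 93.5:
        warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ")
    # ASCII 126 (the tilde) is the highest character used, so cap there.
    return "".join(chr(min(126, int(round(q)) + 33)) for q in phred_scores)
```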
P.S. To avoid cluttering up your working directory, you can delete this | |
temporary file now: | |
>>> import os | |
>>> os.remove("Quality/temp.fastq") | |
""" | |
assert SANGER_SCORE_OFFSET == ord("!") | |
def write_record(self, record): | |
"""Write a single FASTQ record to the file.""" | |
assert self._header_written | |
assert not self._footer_written | |
self._record_written = True | |
# TODO - Is an empty sequence allowed in FASTQ format? | |
seq = record.seq | |
if seq is None: | |
raise ValueError(f"No sequence for record {record.id}") | |
qualities_str = _get_sanger_quality_str(record) | |
if len(qualities_str) != len(seq): | |
raise ValueError( | |
"Record %s has sequence length %i but %i quality scores" | |
% (record.id, len(seq), len(qualities_str)) | |
) | |
# FASTQ files can include a description, just like FASTA files | |
# (at least, this is what the NCBI Short Read Archive does) | |
id = self.clean(record.id) | |
description = self.clean(record.description) | |
if description and description.split(None, 1)[0] == id: | |
# The description includes the id at the start | |
title = description | |
elif description: | |
title = f"{id} {description}" | |
else: | |
title = id | |
self.handle.write(f"@{title}\n{seq}\n+\n{qualities_str}\n") | |
def as_fastq(record): | |
"""Turn a SeqRecord into a Sanger FASTQ formatted string. | |
This is used internally by the SeqRecord's .format("fastq") | |
method and by the SeqIO.write(..., ..., "fastq") function, | |
and under the format alias "fastq-sanger" as well. | |
""" | |
seq_str = _get_seq_string(record) | |
qualities_str = _get_sanger_quality_str(record) | |
if len(qualities_str) != len(seq_str): | |
raise ValueError( | |
"Record %s has sequence length %i but %i quality scores" | |
% (record.id, len(seq_str), len(qualities_str)) | |
) | |
id = _clean(record.id) | |
description = _clean(record.description) | |
if description and description.split(None, 1)[0] == id: | |
title = description | |
elif description: | |
title = f"{id} {description}" | |
else: | |
title = id | |
return f"@{title}\n{seq_str}\n+\n{qualities_str}\n" | |
class QualPhredWriter(SequenceWriter): | |
"""Class to write QUAL format files (using PHRED quality scores) (OBSOLETE). | |
Although you can use this class directly, you are strongly encouraged | |
to use the ``as_qual`` function, or top level ``Bio.SeqIO.write()`` | |
function instead. | |
For example, this code reads in a FASTQ file and saves the quality scores | |
into a QUAL file: | |
>>> from Bio import SeqIO | |
>>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") | |
>>> with open("Quality/temp.qual", "w") as out_handle: | |
... SeqIO.write(record_iterator, out_handle, "qual") | |
3 | |
This code is also called if you use the .format("qual") method of a | |
SeqRecord. | |
P.S. Don't forget to clean up the temp file if you don't need it anymore: | |
>>> import os | |
>>> os.remove("Quality/temp.qual") | |
""" | |
def __init__(self, handle, wrap=60, record2title=None): | |
"""Create a QUAL writer. | |
Arguments: | |
- handle - Handle to an output file, e.g. as returned | |
by open(filename, "w") | |
- wrap - Optional line length used to wrap the lines of quality scores.
Defaults to wrapping at 60 characters. Use zero (or None) for
no wrapping, giving a single long line of scores per record.
- record2title - Optional function to return the text to be | |
used for the title line of each record. By default a | |
combination of the record.id and record.description is | |
used. If the record.description starts with the record.id, | |
then just the record.description is used. | |
The record2title argument is present for consistency with the | |
Bio.SeqIO.FastaIO writer class. | |
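As an illustration of the wrap argument, the sketch below greedily packs
space separated scores into lines no longer than a given width. The helper
name wrap_scores is hypothetical, and the exact boundary handling in this
writer may differ.

```python
# Hypothetical helper, not this writer's implementation.
def wrap_scores(scores, width=60):
    lines = []
    current = ""
    for text in (str(score) for score in scores):
        if not current:
            current = text
        elif len(current) + 1 + len(text) <= width:
            # The next score still fits on this line after a space.
            current += " " + text
        else:
            lines.append(current)
            current = text
    if current:
        lines.append(current)
    return lines
```

With 25 two-digit scores and a width of 60 this gives one line of 20 scores
followed by one line of 5.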
""" | |
super().__init__(handle) | |
# self.handle = handle | |
self.wrap = None | |
if wrap: | |
if wrap < 1: | |
raise ValueError("wrap must be a positive integer, or zero/None for no wrapping")
self.wrap = wrap | |
self.record2title = record2title | |
def write_record(self, record): | |
"""Write a single QUAL record to the file.""" | |
assert self._header_written | |
assert not self._footer_written | |
self._record_written = True | |
handle = self.handle | |
wrap = self.wrap | |
if self.record2title: | |
title = self.clean(self.record2title(record)) | |
else: | |
id = self.clean(record.id) | |
description = self.clean(record.description) | |
if description and description.split(None, 1)[0] == id: | |
# The description includes the id at the start | |
title = description | |
elif description: | |
title = f"{id} {description}" | |
else: | |
title = id | |
handle.write(f">{title}\n") | |
qualities = _get_phred_quality(record) | |
try: | |
# This rounds to the nearest integer. | |
# TODO - can we record a float in a qual file? | |
qualities_strs = [("%i" % round(q, 0)) for q in qualities] | |
except TypeError: | |
if None in qualities: | |
raise TypeError("A quality value of None was found") from None | |
else: | |
raise | |
if wrap and wrap > 5:
# Fast wrapping | |
data = " ".join(qualities_strs) | |
while True: | |
if len(data) <= wrap: | |
self.handle.write(data + "\n") | |
break | |
else: | |
# By construction there must be spaces in the first X chars | |
# (unless we have X digit or higher quality scores!) | |
i = data.rfind(" ", 0, wrap) | |
handle.write(data[:i] + "\n") | |
data = data[i + 1 :] | |
elif wrap: | |
# Safe wrapping | |
while qualities_strs: | |
line = qualities_strs.pop(0) | |
while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < wrap: | |
line += " " + qualities_strs.pop(0) | |
handle.write(line + "\n") | |
else: | |
# No wrapping | |
data = " ".join(qualities_strs) | |
handle.write(data + "\n") | |
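

# Illustrative sketch only, not part of the Bio.SeqIO API: the "safe wrapping"
# branch used by the QUAL writer can be tried on plain strings. The helper
# name and its arguments are hypothetical. It assumes the scores are already
# formatted as strings, and it mirrors the strict "<" comparison of the
# writer, so every line it returns is shorter than ``wrap`` characters
# (unless a single score is itself that long).
def _example_safe_wrap(score_strs, wrap):
    """Greedily pack space separated scores into lines shorter than wrap."""
    remaining = list(score_strs)  # copy, so the caller's list is not consumed
    lines = []
    while remaining:
        line = remaining.pop(0)
        while remaining and len(line) + 1 + len(remaining[0]) < wrap:
            line += " " + remaining.pop(0)
        lines.append(line)
    return lines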


def as_qual(record):
    """Turn a SeqRecord into a QUAL formatted string.

    This is used internally by the SeqRecord's .format("qual")
    method and by the SeqIO.write(..., ..., "qual") function.
    """
    id = _clean(record.id)
    description = _clean(record.description)
    if description and description.split(None, 1)[0] == id:
        title = description
    elif description:
        title = f"{id} {description}"
    else:
        title = id
    lines = [f">{title}\n"]

    qualities = _get_phred_quality(record)
    try:
        # This rounds to the nearest integer.
        # TODO - can we record a float in a qual file?
        qualities_strs = [("%i" % round(q, 0)) for q in qualities]
    except TypeError:
        if None in qualities:
            raise TypeError("A quality value of None was found") from None
        else:
            raise

    # Safe wrapping
    while qualities_strs:
        line = qualities_strs.pop(0)
        while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < 60:
            line += " " + qualities_strs.pop(0)
        lines.append(line + "\n")
    return "".join(lines)


class FastqSolexaWriter(SequenceWriter):
    r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities) (OBSOLETE).

    This outputs FASTQ files like those from the early Solexa/Illumina
    pipeline, using Solexa scores and an ASCII offset of 64. These are
    NOT compatible with the standard Sanger style PHRED FASTQ files.

    If your records contain a "solexa_quality" entry under letter_annotations,
    this is used, otherwise any "phred_quality" entry will be used after
    conversion using the solexa_quality_from_phred function. If neither style
    of quality score is present, an exception is raised.

    Although you can use this class directly, you are strongly encouraged
    to use the ``as_fastq_solexa`` function, or top-level ``Bio.SeqIO.write()``
    function instead. For example, this code reads in a FASTQ file and re-saves
    it as another FASTQ file:

    >>> from Bio import SeqIO
    >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
    >>> with open("Quality/temp.fastq", "w") as out_handle:
    ...     SeqIO.write(record_iterator, out_handle, "fastq-solexa")
    5

    You might want to do this if the original file included extra line breaks,
    which (while valid) may not be supported by all tools. The output file
    from Biopython will have each sequence on a single line, and each quality
    string on a single line (which is considered desirable for maximum
    compatibility).

    This code is also called if you use the .format("fastq-solexa") method of
    a SeqRecord. For example,

    >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
    >>> print(record.format("fastq-solexa"))
    @Test PHRED qualities from 40 to 0 inclusive
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
    +
    hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
    <BLANKLINE>

    Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
    encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
    a warning is issued.

    P.S. Don't forget to delete the temp file if you don't need it anymore:

    >>> import os
    >>> os.remove("Quality/temp.fastq")
    """

    def write_record(self, record):
        """Write a single FASTQ record to the file."""
        assert self._header_written
        assert not self._footer_written
        self._record_written = True

        # TODO - Is an empty sequence allowed in FASTQ format?
        seq = record.seq
        if seq is None:
            raise ValueError(f"No sequence for record {record.id}")
        qualities_str = _get_solexa_quality_str(record)
        if len(qualities_str) != len(seq):
            raise ValueError(
                "Record %s has sequence length %i but %i quality scores"
                % (record.id, len(seq), len(qualities_str))
            )

        # FASTQ files can include a description, just like FASTA files
        # (at least, this is what the NCBI Short Read Archive does)
        id = self.clean(record.id)
        description = self.clean(record.description)
        if description and description.split(None, 1)[0] == id:
            # The description includes the id at the start
            title = description
        elif description:
            title = f"{id} {description}"
        else:
            title = id

        self.handle.write(f"@{title}\n{seq}\n+\n{qualities_str}\n")


def as_fastq_solexa(record):
    """Turn a SeqRecord into a Solexa FASTQ formatted string.

    This is used internally by the SeqRecord's .format("fastq-solexa")
    method and by the SeqIO.write(..., ..., "fastq-solexa") function.
    """
    seq_str = _get_seq_string(record)
    qualities_str = _get_solexa_quality_str(record)
    if len(qualities_str) != len(seq_str):
        raise ValueError(
            "Record %s has sequence length %i but %i quality scores"
            % (record.id, len(seq_str), len(qualities_str))
        )
    id = _clean(record.id)
    description = _clean(record.description)
    if description and description.split(None, 1)[0] == id:
        # The description includes the id at the start
        title = description
    elif description:
        title = f"{id} {description}"
    else:
        title = id
    return f"@{title}\n{seq_str}\n+\n{qualities_str}\n"


class FastqIlluminaWriter(SequenceWriter):
    r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores) (OBSOLETE).

    This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
    using PHRED scores and an ASCII offset of 64. Note these files are NOT
    compatible with the standard Sanger style PHRED FASTQ files which use an
    ASCII offset of 33.

    Although you can use this class directly, you are strongly encouraged to
    use the ``as_fastq_illumina`` or top-level ``Bio.SeqIO.write()`` function
    with format name "fastq-illumina" instead. This code is also called if you
    use the .format("fastq-illumina") method of a SeqRecord. For example,

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
    >>> print(record.format("fastq-illumina"))
    @Test PHRED qualities from 40 to 0 inclusive
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
    +
    hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
    <BLANKLINE>

    Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
    encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
    warning is issued.
    """

    def write_record(self, record):
        """Write a single FASTQ record to the file."""
        assert self._header_written
        assert not self._footer_written
        self._record_written = True

        # TODO - Is an empty sequence allowed in FASTQ format?
        seq = record.seq
        if seq is None:
            raise ValueError(f"No sequence for record {record.id}")
        qualities_str = _get_illumina_quality_str(record)
        if len(qualities_str) != len(seq):
            raise ValueError(
                "Record %s has sequence length %i but %i quality scores"
                % (record.id, len(seq), len(qualities_str))
            )

        # FASTQ files can include a description, just like FASTA files
        # (at least, this is what the NCBI Short Read Archive does)
        id = self.clean(record.id)
        description = self.clean(record.description)
        if description and description.split(None, 1)[0] == id:
            # The description includes the id at the start
            title = description
        elif description:
            title = f"{id} {description}"
        else:
            title = id

        self.handle.write(f"@{title}\n{seq}\n+\n{qualities_str}\n")


def as_fastq_illumina(record):
    """Turn a SeqRecord into an Illumina FASTQ formatted string.

    This is used internally by the SeqRecord's .format("fastq-illumina")
    method and by the SeqIO.write(..., ..., "fastq-illumina") function.
    """
    seq_str = _get_seq_string(record)
    qualities_str = _get_illumina_quality_str(record)
    if len(qualities_str) != len(seq_str):
        raise ValueError(
            "Record %s has sequence length %i but %i quality scores"
            % (record.id, len(seq_str), len(qualities_str))
        )
    id = _clean(record.id)
    description = _clean(record.description)
    if description and description.split(None, 1)[0] == id:
        title = description
    elif description:
        title = f"{id} {description}"
    else:
        title = id
    return f"@{title}\n{seq_str}\n+\n{qualities_str}\n"


def PairedFastaQualIterator(fasta_source, qual_source, alphabet=None, title2ids=None):
    """Iterate over matched FASTA and QUAL files as SeqRecord objects.

    For example, consider this short QUAL file with PHRED quality scores::

        >EAS54_6_R1_2_1_413_324
        26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
        26 26 26 23 23
        >EAS54_6_R1_2_1_540_792
        26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
        26 18 26 23 18
        >EAS54_6_R1_2_1_443_348
        26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
        24 18 18 18 18

    And a matching FASTA file::

        >EAS54_6_R1_2_1_413_324
        CCCTTCTTGTCTTCAGCGTTTCTCC
        >EAS54_6_R1_2_1_540_792
        TTGGCAGGCCAAGGCCGATGGATCA
        >EAS54_6_R1_2_1_443_348
        GTTGCTTCTGGCGTGGGTGGGGGGG

    You can parse these separately using Bio.SeqIO with the "qual" and
    "fasta" formats, but then you'll get a group of SeqRecord objects with
    no sequence, and a matching group with the sequence but not the
    qualities. Because it only deals with one input file handle, Bio.SeqIO
    can't be used to read the two files together - but this function can!
    For example,

    >>> with open("Quality/example.fasta") as f:
    ...     with open("Quality/example.qual") as q:
    ...         for record in PairedFastaQualIterator(f, q):
    ...             print("%s %s" % (record.id, record.seq))
    ...
    EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
    EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
    EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

    As with the FASTQ or QUAL parsers, if you want to look at the qualities,
    they are in each record's per-letter-annotation dictionary as a simple
    list of integers:

    >>> print(record.letter_annotations["phred_quality"])
    [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

    If you have access to data as a FASTQ format file, using that directly
    would be simpler and more straightforward. Note that you can easily use
    this function to convert paired FASTA and QUAL files into FASTQ files:

    >>> from Bio import SeqIO
    >>> with open("Quality/example.fasta") as f:
    ...     with open("Quality/example.qual") as q:
    ...         SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq")
    ...
    3

    And don't forget to clean up the temp file if you don't need it anymore:

    >>> import os
    >>> os.remove("Quality/temp.fastq")
    """
    if alphabet is not None:
        raise ValueError("The alphabet argument is no longer supported")

    from Bio.SeqIO.FastaIO import FastaIterator

    fasta_iter = FastaIterator(fasta_source, title2ids=title2ids)
    qual_iter = QualPhredIterator(qual_source, title2ids=title2ids)

    # Using zip wouldn't load everything into memory, but it also would not
    # catch any extra records found in only one file.
    while True:
        try:
            f_rec = next(fasta_iter)
        except StopIteration:
            f_rec = None
        try:
            q_rec = next(qual_iter)
        except StopIteration:
            q_rec = None
        if f_rec is None and q_rec is None:
            # End of both files
            break
        if f_rec is None:
            # The FASTA iterator ran out first, so the QUAL file is longer
            raise ValueError("QUAL file has more entries than the FASTA file.")
        if q_rec is None:
            # The QUAL iterator ran out first, so the FASTA file is longer
            raise ValueError("FASTA file has more entries than the QUAL file.")
        if f_rec.id != q_rec.id:
            raise ValueError(
                f"FASTA and QUAL entries do not match ({f_rec.id} vs {q_rec.id})."
            )
        if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
            raise ValueError(
                f"Sequence length and number of quality scores disagree for {f_rec.id}"
            )
        # Merge the data....
        f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations[
            "phred_quality"
        ]
        yield f_rec
    # Done
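

# Illustrative sketch only (hypothetical helper, not part of Bio.SeqIO): the
# lock-step pairing used by PairedFastaQualIterator could also be written with
# itertools.zip_longest and a sentinel value. This stays lazy like zip, but
# still notices when one input runs out before the other. Plain identifiers
# stand in for the parsed records here, which is an assumption made to keep
# the example self-contained.
def _example_pair_ids(fasta_ids, qual_ids):
    """Yield identifiers found, in the same order, in both iterables."""
    from itertools import zip_longest

    missing = object()  # sentinel that cannot clash with a real identifier
    for f_id, q_id in zip_longest(fasta_ids, qual_ids, fillvalue=missing):
        if f_id is missing:
            raise ValueError("QUAL input has more entries than the FASTA input.")
        if q_id is missing:
            raise ValueError("FASTA input has more entries than the QUAL input.")
        if f_id != q_id:
            raise ValueError(f"Entries do not match ({f_id} vs {q_id}).")
        yield f_id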


def _fastq_generic(in_file, out_file, mapping):
    """FASTQ helper function where there can't be data loss by truncation (PRIVATE)."""
    # For real speed, don't even make SeqRecord and Seq objects!
    count = 0
    null = chr(0)
    with as_handle(out_file, "w") as out_handle:
        for title, seq, old_qual in FastqGeneralIterator(in_file):
            count += 1
            # map the qual...
            qual = old_qual.translate(mapping)
            if null in qual:
                raise ValueError("Invalid character in quality string")
            out_handle.write(f"@{title}\n{seq}\n+\n{qual}\n")
    return count


def _fastq_generic2(in_file, out_file, mapping, truncate_char, truncate_msg):
    """FASTQ helper function where there could be data loss by truncation (PRIVATE)."""
    # For real speed, don't even make SeqRecord and Seq objects!
    count = 0
    null = chr(0)
    with as_handle(out_file, "w") as out_handle:
        for title, seq, old_qual in FastqGeneralIterator(in_file):
            count += 1
            # map the qual...
            qual = old_qual.translate(mapping)
            if null in qual:
                raise ValueError("Invalid character in quality string")
            if truncate_char in qual:
                qual = qual.replace(truncate_char, chr(126))
                warnings.warn(truncate_msg, BiopythonWarning)
            out_handle.write(f"@{title}\n{seq}\n+\n{qual}\n")
    return count


def _fastq_sanger_convert_fastq_sanger(in_file, out_file):
    """Fast Sanger FASTQ to Sanger FASTQ conversion (PRIVATE).

    Useful for removing line wrapping and the redundant second identifier
    on the plus lines. Will also check the quality string is valid.

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 33)]
        + [chr(ascii) for ascii in range(33, 127)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_solexa_convert_fastq_solexa(in_file, out_file):
    """Fast Solexa FASTQ to Solexa FASTQ conversion (PRIVATE).

    Useful for removing line wrapping and the redundant second identifier
    on the plus lines. Will also check the quality string is valid.

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 59)]
        + [chr(ascii) for ascii in range(59, 127)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_illumina_convert_fastq_illumina(in_file, out_file):
    """Fast Illumina 1.3+ FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).

    Useful for removing line wrapping and the redundant second identifier
    on the plus lines. Will also check the quality string is valid.

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 64)]
        + [chr(ascii) for ascii in range(64, 127)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_illumina_convert_fastq_sanger(in_file, out_file):
    """Fast Illumina 1.3+ FASTQ to Sanger FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 64)]
        + [chr(33 + q) for q in range(0, 62 + 1)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)
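

# Illustrative sketch only (hypothetical helper, not part of Bio.SeqIO): the
# 256 character lookup strings used by the fast converters are indexed by
# code point when passed to str.translate. This assumed example rebuilds the
# Illumina 1.3+ to Sanger table and applies it to a single quality string
# instead of a whole file.
def _example_illumina_to_sanger(qual):
    """Shift an Illumina 1.3+ quality string (offset 64) to Sanger (offset 33)."""
    table = "".join(
        [chr(0)] * 64  # below "@": not valid Illumina 1.3+ quality characters
        + [chr(33 + q) for q in range(0, 62 + 1)]  # PHRED 0 to 62
        + [chr(0)] * (256 - 127)  # DEL and beyond: invalid
    )
    assert len(table) == 256
    converted = qual.translate(table)
    if chr(0) in converted:
        raise ValueError("Invalid character in quality string")
    return converted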


def _fastq_sanger_convert_fastq_illumina(in_file, out_file):
    """Fast Sanger FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion. Will issue a warning if the scores had to be truncated at 62
    (maximum possible in the Illumina 1.3+ FASTQ format).
    """
    # Map unexpected chars to null
    trunc_char = chr(1)
    mapping = "".join(
        [chr(0) for ascii in range(0, 33)]
        + [chr(64 + q) for q in range(0, 62 + 1)]
        + [trunc_char for ascii in range(96, 127)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic2(
        in_file,
        out_file,
        mapping,
        trunc_char,
        "Data loss - max PHRED quality 62 in Illumina 1.3+ FASTQ",
    )
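

# Illustrative sketch only (hypothetical helper, not part of Bio.SeqIO): how
# the placeholder character lets a lossy conversion be detected after
# translation. Sanger scores 63 to 93 have no Illumina 1.3+ character, so
# they are first mapped to a marker, then replaced by the tilde (PHRED 62)
# with a warning. A plain UserWarning is assumed here in place of
# BiopythonWarning so that the example has no Biopython dependency.
def _example_sanger_to_illumina(qual):
    """Shift a Sanger quality string (offset 33) to Illumina 1.3+ (offset 64)."""
    import warnings

    trunc_char = chr(1)  # marker for scores above the Illumina 1.3+ maximum
    table = "".join(
        [chr(0)] * 33  # control characters and space: invalid
        + [chr(64 + q) for q in range(0, 62 + 1)]  # PHRED 0 to 62 fit unchanged
        + [trunc_char] * (127 - 96)  # PHRED 63 to 93 are too large
        + [chr(0)] * (256 - 127)  # DEL and beyond: invalid
    )
    assert len(table) == 256
    converted = qual.translate(table)
    if chr(0) in converted:
        raise ValueError("Invalid character in quality string")
    if trunc_char in converted:
        converted = converted.replace(trunc_char, chr(126))
        warnings.warn("Data loss - max PHRED quality 62 in Illumina 1.3+ FASTQ")
    return converted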


def _fastq_solexa_convert_fastq_sanger(in_file, out_file):
    """Fast Solexa FASTQ to Sanger FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 59)]
        + [
            chr(33 + int(round(phred_quality_from_solexa(q))))
            for q in range(-5, 62 + 1)
        ]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_sanger_convert_fastq_solexa(in_file, out_file):
    """Fast Sanger FASTQ to Solexa FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion. Will issue a warning if the scores had to be truncated at 62
    (maximum possible in the Solexa FASTQ format).
    """
    # Map unexpected chars to null
    trunc_char = chr(1)
    mapping = "".join(
        [chr(0) for ascii in range(0, 33)]
        + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)]
        + [trunc_char for ascii in range(96, 127)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic2(
        in_file,
        out_file,
        mapping,
        trunc_char,
        "Data loss - max Solexa quality 62 in Solexa FASTQ",
    )


def _fastq_solexa_convert_fastq_illumina(in_file, out_file):
    """Fast Solexa FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 59)]
        + [
            chr(64 + int(round(phred_quality_from_solexa(q))))
            for q in range(-5, 62 + 1)
        ]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_illumina_convert_fastq_solexa(in_file, out_file):
    """Fast Illumina 1.3+ FASTQ to Solexa FASTQ conversion (PRIVATE).

    Avoids creating SeqRecord and Seq objects in order to speed up this
    conversion.
    """
    # Map unexpected chars to null
    mapping = "".join(
        [chr(0) for ascii in range(0, 64)]
        + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)]
        + [chr(0) for ascii in range(127, 256)]
    )
    assert len(mapping) == 256
    return _fastq_generic(in_file, out_file, mapping)


def _fastq_convert_fasta(in_file, out_file):
    """Fast FASTQ to FASTA conversion (PRIVATE).

    Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and
    Seq objects in order to speed up this conversion.

    NOTE - This does NOT check the characters used in the FASTQ quality string
    are valid!
    """
    # For real speed, don't even make SeqRecord and Seq objects!
    count = 0
    with as_handle(out_file, "w") as out_handle:
        for title, seq, qual in FastqGeneralIterator(in_file):
            count += 1
            out_handle.write(f">{title}\n")
            # Do line wrapping
            for i in range(0, len(seq), 60):
                out_handle.write(seq[i : i + 60] + "\n")
    return count


def _fastq_convert_tab(in_file, out_file):
    """Fast FASTQ to simple tabbed conversion (PRIVATE).

    Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and
    Seq objects in order to speed up this conversion.

    NOTE - This does NOT check the characters used in the FASTQ quality string
    are valid!
    """
    # For real speed, don't even make SeqRecord and Seq objects!
    count = 0
    with as_handle(out_file, "w") as out_handle:
        for title, seq, qual in FastqGeneralIterator(in_file):
            count += 1
            out_handle.write(f"{title.split(None, 1)[0]}\t{seq}\n")
    return count


def _fastq_convert_qual(in_file, out_file, mapping):
    """FASTQ helper function for QUAL output (PRIVATE).

    Mapping should be a dictionary mapping expected ASCII characters from the
    FASTQ quality string to PHRED quality scores (as strings).
    """
    # For real speed, don't even make SeqRecord and Seq objects!
    count = 0
    with as_handle(out_file, "w") as out_handle:
        for title, seq, qual in FastqGeneralIterator(in_file):
            count += 1
            out_handle.write(f">{title}\n")
            # map the qual... note even with Sanger encoding max 2 digits
            try:
                qualities_strs = [mapping[ascii] for ascii in qual]
            except KeyError:
                raise ValueError("Invalid character in quality string") from None
            data = " ".join(qualities_strs)
            while len(data) > 60:
                # Know quality scores are either 1 or 2 digits, so there
                # must be a space in any three consecutive characters.
                if data[60] == " ":
                    out_handle.write(data[:60] + "\n")
                    data = data[61:]
                elif data[59] == " ":
                    out_handle.write(data[:59] + "\n")
                    data = data[60:]
                else:
                    assert data[58] == " ", "Internal logic failure in wrapping"
                    out_handle.write(data[:58] + "\n")
                    data = data[59:]
            out_handle.write(data + "\n")
    return count


def _fastq_sanger_convert_qual(in_file, out_file):
    """Fast Sanger FASTQ to QUAL conversion (PRIVATE)."""
    mapping = {chr(q + 33): str(q) for q in range(0, 93 + 1)}
    return _fastq_convert_qual(in_file, out_file, mapping)


def _fastq_solexa_convert_qual(in_file, out_file):
    """Fast Solexa FASTQ to QUAL conversion (PRIVATE)."""
    mapping = {
        chr(q + 64): str(int(round(phred_quality_from_solexa(q))))
        for q in range(-5, 62 + 1)
    }
    return _fastq_convert_qual(in_file, out_file, mapping)


def _fastq_illumina_convert_qual(in_file, out_file):
    """Fast Illumina 1.3+ FASTQ to QUAL conversion (PRIVATE)."""
    mapping = {chr(q + 64): str(q) for q in range(0, 62 + 1)}
    return _fastq_convert_qual(in_file, out_file, mapping)
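

# Illustrative sketch only (hypothetical helper, not part of Bio.SeqIO): the
# dictionary style mappings used for QUAL output turn each quality character
# directly into the text of its PHRED score. This assumed example decodes one
# Sanger quality string into a single unwrapped line of scores.
def _example_sanger_qual_line(qual):
    """Return space separated PHRED scores for a Sanger FASTQ quality string."""
    mapping = {chr(q + 33): str(q) for q in range(0, 93 + 1)}
    try:
        return " ".join(mapping[letter] for letter in qual)
    except KeyError:
        raise ValueError("Invalid character in quality string") from None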


if __name__ == "__main__":
    from Bio._utils import run_doctest

    run_doctest(verbose=0)