File size: 29,953 Bytes
b7731cd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
# Copyright 1999-2000 by Jeffrey Chang.  All rights reserved.
# Copyright 2008-2013 by Michiel de Hoon.  All rights reserved.
# Revisions copyright 2011-2016 by Peter Cock. All rights reserved.
# Revisions copyright 2015 by Eric Rasche. All rights reserved.
# Revisions copyright 2015 by Carlos Pena. All rights reserved.
#
# This file is part of the Biopython distribution and governed by your
# choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
# Please see the LICENSE file that should have been included as part of this
# package.

"""Provides code to access NCBI over the WWW.

The main Entrez web page is available at:
http://www.ncbi.nlm.nih.gov/Entrez/

Entrez Programming Utilities web page is available at:
http://www.ncbi.nlm.nih.gov/books/NBK25501/

This module provides a number of functions like ``efetch`` (short for
Entrez Fetch) which will return the data as a handle object. This is
a standard interface used in Python for reading data from a file, or
in this case a remote network connection, and provides methods like
``.read()`` or offers iteration over the contents line by line. See
also "What the heck is a handle?" in the Biopython Tutorial and
Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
The handle returned by these functions can be either in text mode or
in binary mode, depending on the data requested and the results
returned by NCBI Entrez. Typically, XML data will be in binary mode
while other data will be in text mode, as required by the downstream
parser to parse the data.

Unlike a handle to a file on disk from the ``open(filename)`` function,
which has a ``.name`` attribute giving the filename, the handles from
``Bio.Entrez`` all have a ``.url`` attribute instead giving the URL
used to connect to the NCBI Entrez API.

The ``epost``, ``efetch``, and ``esummary`` tools take an "id" parameter
which corresponds to one or more database UIDs (or accession.version
identifiers in the case of sequence databases such as "nuccore" or
"protein"). The Python value of the "id" keyword passed to these functions
may be either a single ID as a string or integer or multiple IDs as an
iterable of strings/integers. You may also pass a single string containing
multiple IDs delimited by commas. The ``elink`` tool also accepts multiple
IDs but the argument is handled differently than the other three. See that
function's docstring for more information.

All the functions that send requests to the NCBI Entrez API will
automatically respect the NCBI rate limit (of 3 requests per second
without an API key, or 10 requests per second with an API key) and
will automatically retry when encountering transient failures
(i.e. connection failures or HTTP 5XX codes). By default, Biopython
does a maximum of three tries before giving up, and sleeps for 15
seconds between tries. You can tweak these parameters by setting
``Bio.Entrez.max_tries`` and ``Bio.Entrez.sleep_between_tries``.

The Entrez module also provides an XML parser which takes a handle
as input.

Variables:

    - email        Set the Entrez email parameter (default is not set).
    - tool         Set the Entrez tool parameter (default is ``biopython``).
    - api_key      Personal API key from NCBI. If not set, only 3 queries per
      second are allowed. 10 queries per seconds otherwise with a
      valid API key.
    - max_tries    Configures how many times failed requests will be
      automatically retried on error (default is 3).
    - sleep_between_tries   The delay, in seconds, before retrying a request on
      error (default is 15).

Functions:

    - efetch       Retrieves records in the requested format from a list of one or
      more primary IDs or from the user's environment
    - epost        Posts a file containing a list of primary IDs for future use in
      the user's environment to use with subsequent search strategies
    - esearch      Searches and retrieves primary IDs (for use in EFetch, ELink,
      and ESummary) and term translations and optionally retains
      results for future use in the user's environment.
    - elink        Checks for the existence of an external or Related Articles link
      from a list of one or more primary IDs.  Retrieves primary IDs
      and relevancy scores for links to Entrez databases or Related
      Articles;  creates a hyperlink to the primary LinkOut provider
      for a specific ID and database, or lists LinkOut URLs
      and Attributes for multiple IDs.
    - einfo        Provides field index term counts, last update, and available
      links for each database.
    - esummary     Retrieves document summaries from a list of primary IDs or from
      the user's environment.
    - egquery      Provides Entrez database counts in XML for a single search
      using Global Query.
    - espell       Retrieves spelling suggestions.
    - ecitmatch    Retrieves PubMed IDs (PMIDs) that correspond to a set of
      input citation strings.

    - read         Parses the XML results returned by any of the above functions.
      Alternatively, the XML data can be read from a file opened in binary mode.
      Typical usage is:

          >>> from Bio import Entrez
          >>> Entrez.email = "[email protected]"
          >>> handle = Entrez.einfo() # or esearch, efetch, ...
          >>> record = Entrez.read(handle)
          >>> handle.close()

       where record is now a Python dictionary or list.

    - parse        Parses the XML results returned by those of the above functions
      which can return multiple records - such as efetch, esummary
      and elink. Typical usage is:

          >>> handle = Entrez.esummary(db="pubmed", id="19304878,14630660", retmode="xml")
          >>> records = Entrez.parse(handle)
          >>> for record in records:
          ...     # each record is a Python dictionary or list.
          ...     print(record['Title'])
          Biopython: freely available Python tools for computational molecular biology and bioinformatics.
          PDB file parser and structure class implemented in Python.
          >>> handle.close()

      This function is appropriate only if the XML file contains
      multiple records, and is particular useful for large files.

    - _open        Internally used function.

"""

import time
import warnings
import io
from urllib.error import URLError, HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen, Request


email = None
max_tries = 3
sleep_between_tries = 15
tool = "biopython"
api_key = None


# XXX retmode?
def epost(db, **keywds):
    """Post a file of identifiers for future use.

    Posts a file containing a list of UIs for future use in the user's
    environment to use with subsequent search strategies.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EPost

    :returns: Handle to the results.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"
    variables = {"db": db}
    variables.update(keywds)
    request = _build_request(cgi, variables, post=True)
    return _open(request)


def efetch(db, **keywords):
    """Fetch Entrez results which are returned as a handle.

    EFetch retrieves records in the requested format from a list or set of one or
    more UIs or from user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
    >>> print(handle.readline().strip())
    LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
    >>> handle.close()

    This will automatically use an HTTP POST rather than HTTP GET if there
    are over 200 identifiers as recommended by the NCBI.

    **Warning:** The NCBI changed the default retmode in Feb 2012, so many
    databases which previously returned text output now give XML.

    :returns: Handle to the results.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    variables = {"db": db}
    variables.update(keywords)
    request = _build_request(cgi, variables)
    return _open(request)


def esearch(db, term, **keywds):
    """Run an Entrez search and return a handle to the results.

    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", idtype="acc")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> int(record["Count"]) >= 2
    True
    >>> "EF590893.1" in record["IdList"]
    True
    >>> "EF590892.1" in record["IdList"]
    True

    :returns: Handle to the results, which are always in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    variables = {"db": db, "term": term}
    variables.update(keywds)
    request = _build_request(cgi, variables)
    return _open(request)


def elink(**keywds):
    """Check for linked external articles and return a handle.

    ELink checks for the existence of an external or Related Articles link
    from a list of one or more primary IDs;  retrieves IDs and relevancy
    scores for links to Entrez databases or Related Articles; creates a
    hyperlink to the primary LinkOut provider for a specific ID and
    database, or lists LinkOut URLs and attributes for multiple IDs.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ELink

    Note that ELink treats the "id" parameter differently than the other
    tools when multiple values are given. You should generally pass multiple
    UIDs as a list of strings or integers. This will provide a "one-to-one"
    mapping from source database UIDs to destination database UIDs in the
    result. If multiple source UIDs are passed as a single comma-delimited
    string all destination UIDs will be mixed together in the result.

    This example finds articles related to the Biopython application
    note's entry in the PubMed database:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> pmid = "19304878"
    >>> handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_pubmed")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> print(record[0]["LinkSetDb"][0]["LinkName"])
    pubmed_pubmed
    >>> linked = [link["Id"] for link in record[0]["LinkSetDb"][0]["Link"]]
    >>> "14630660" in linked
    True

    This is explained in much more detail in the Biopython Tutorial.

    :returns: Handle to the results, by default in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
    variables = {}
    variables.update(keywds)
    request = _build_request(cgi, variables, join_ids=False)
    return _open(request)


def einfo(**keywds):
    """Return a summary of the Entrez databases as a results handle.

    EInfo provides field names, index term counts, last update, and
    available links for each Entrez database.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EInfo

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> record = Entrez.read(Entrez.einfo())
    >>> 'pubmed' in record['DbList']
    True

    :returns: Handle to the results, by default in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"
    variables = {}
    variables.update(keywds)
    request = _build_request(cgi, variables)
    return _open(request)


def esummary(**keywds):
    """Retrieve document summaries as a results handle.

    ESummary retrieves document summaries from a list of primary IDs or
    from the user's environment.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESummary

    This example discovers more about entry 19923 in the structure
    database:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> handle = Entrez.esummary(db="structure", id="19923")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> print(record[0]["Id"])
    19923
    >>> print(record[0]["PdbDescr"])
    Crystal Structure Of E. Coli Aconitase B


    :returns: Handle to the results, by default in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    variables = {}
    variables.update(keywds)
    request = _build_request(cgi, variables)
    return _open(request)


def egquery(**keywds):
    """Provide Entrez database counts for a global search.

    EGQuery provides Entrez database counts in XML for a single search
    using Global Query.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EGQuery

    This quick example based on a longer version from the Biopython
    Tutorial just checks there are over 60 matches for 'Biopython'
    in PubMedCentral:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> handle = Entrez.egquery(term="biopython")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> for row in record["eGQueryResult"]:
    ...     if "pmc" in row["DbName"]:
    ...         print(int(row["Count"]) > 60)
    True

    :returns: Handle to the results, by default in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi"
    variables = {}
    variables.update(keywds)
    request = _build_request(cgi, variables)
    return _open(request)


def espell(**keywds):
    """Retrieve spelling suggestions as a results handle.

    ESpell retrieves spelling suggestions, if available.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESpell

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> record = Entrez.read(Entrez.espell(term="biopythooon"))
    >>> print(record["Query"])
    biopythooon
    >>> print(record["CorrectedQuery"])
    biopython

    :returns: Handle to the results, by default in XML format.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi"
    variables = {}
    variables.update(keywds)
    request = _build_request(cgi, variables)
    return _open(request)


def _update_ecitmatch_variables(keywds):
    # XML is the only supported value, and it actually returns TXT.
    variables = {"retmode": "xml"}
    citation_keys = (
        "journal_title",
        "year",
        "volume",
        "first_page",
        "author_name",
        "key",
    )

    # Accept pre-formatted strings
    if isinstance(keywds["bdata"], str):
        variables.update(keywds)
    else:
        # Alternatively accept a nicer interface
        variables["db"] = keywds["db"]
        bdata = []
        for citation in keywds["bdata"]:
            formatted_citation = "|".join(
                [citation.get(key, "") for key in citation_keys]
            )
            bdata.append(formatted_citation)
        variables["bdata"] = "\r".join(bdata)
    return variables


def ecitmatch(**keywds):
    """Retrieve PMIDs for input citation strings, returned as a handle.

    ECitMatch retrieves PubMed IDs (PMIDs) that correspond to a set of input
    citation strings.

    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ECitMatch

    Short example:

    >>> from Bio import Entrez
    >>> Entrez.email = "[email protected]"
    >>> citation_1 = {"journal_title": "proc natl acad sci u s a",
    ...               "year": "1991", "volume": "88", "first_page": "3248",
    ...               "author_name": "mann bj", "key": "citation_1"}
    >>> handle = Entrez.ecitmatch(db="pubmed", bdata=[citation_1])
    >>> print(handle.read().strip().split("|"))
    ['proc natl acad sci u s a', '1991', '88', '3248', 'mann bj', 'citation_1', '2014248']
    >>> handle.close()

    :returns: Handle to the results, by default in plain text.
    :raises urllib.error.URLError: If there's a network error.
    """
    cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi"
    variables = _update_ecitmatch_variables(keywds)
    request = _build_request(cgi, variables, ecitmatch=True)
    return _open(request)


def read(handle, validate=True, escape=False, ignore_errors=False):
    """Parse an XML file from the NCBI Entrez Utilities into python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    The handle must be in binary mode. This allows the parser to detect the
    encoding from the XML file, and to use it to convert all text in the XML
    to the correct Unicode string. The functions in Bio.Entrez to access NCBI
    Entrez will automatically return XML data in binary mode. For files,
    please use mode "rb" when opening the file, as in

        >>> from Bio import Entrez
        >>> handle = open("Entrez/esearch1.xml", "rb")  # opened in binary mode
        >>> record = Entrez.read(handle)
        >>> print(record['QueryTranslation'])
        biopython[All Fields]
        >>> handle.close()

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    If escape is True, all characters that are not valid HTML are replaced
    by HTML escape characters to guarantee that the returned strings are
    valid HTML fragments. For example, a less-than sign (<) is replaced by
    &lt;. If escape is False (default), the string is returned as is.

    If ignore_errors is False (default), any error messages in the XML file
    will raise a RuntimeError. If ignore_errors is True, error messages will
    be stored as ErrorElement items, without raising an exception.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler

    handler = DataHandler(validate, escape, ignore_errors)
    record = handler.read(handle)
    return record


def parse(handle, validate=True, escape=False, ignore_errors=False):
    """Parse an XML file from the NCBI Entrez Utilities into python objects.

    This function parses an XML file created by NCBI's Entrez Utilities,
    returning a multilevel data structure of Python lists and dictionaries.
    This function is suitable for XML files that (in Python) can be represented
    as a list of individual records. Whereas 'read' reads the complete file
    and returns a single Python list, 'parse' is a generator function that
    returns the records one by one. This function is therefore particularly
    useful for parsing large files.

    Most XML files returned by NCBI's Entrez Utilities can be parsed by
    this function, provided its DTD is available. Biopython includes the
    DTDs for most commonly used Entrez Utilities.

    The handle must be in binary mode. This allows the parser to detect the
    encoding from the XML file, and to use it to convert all text in the XML
    to the correct Unicode string. The functions in Bio.Entrez to access NCBI
    Entrez will automatically return XML data in binary mode. For files,
    please use mode "rb" when opening the file, as in

        >>> from Bio import Entrez
        >>> handle = open("Entrez/pubmed1.xml", "rb")  # opened in binary mode
        >>> records = Entrez.parse(handle)
        >>> for record in records:
        ...     print(record['MedlineCitation']['Article']['Journal']['Title'])
        ...
        Social justice (San Francisco, Calif.)
        Biochimica et biophysica acta
        >>> handle.close()

    If validate is True (default), the parser will validate the XML file
    against the DTD, and raise an error if the XML file contains tags that
    are not represented in the DTD. If validate is False, the parser will
    simply skip such tags.

    If escape is True, all characters that are not valid HTML are replaced
    by HTML escape characters to guarantee that the returned strings are
    valid HTML fragments. For example, a less-than sign (<) is replaced by
    &lt;. If escape is False (default), the string is returned as is.

    If ignore_errors is False (default), any error messages in the XML file
    will raise a RuntimeError. If ignore_errors is True, error messages will
    be stored as ErrorElement items, without raising an exception.

    Whereas the data structure seems to consist of generic Python lists,
    dictionaries, strings, and so on, each of these is actually a class
    derived from the base type. This allows us to store the attributes
    (if any) of each element in a dictionary my_element.attributes, and
    the tag name in my_element.tag.
    """
    from .Parser import DataHandler

    handler = DataHandler(validate, escape, ignore_errors)
    records = handler.parse(handle)
    return records


def _open(request):
    """Make an HTTP request to Entrez, handling errors and enforcing rate limiting (PRIVATE).

    Does some simple error checking and will try again after certain types of errors, up to
    ``max_retries`` times. This function also enforces the "up to three queries per second
    rule" to avoid abusing the NCBI servers (this limit is increased to 10 if using an API key).

    :param req_or_cgi: A Request object returned by ``_build_request``.
    :type req_or_cgi: urllib.request.Request
    :returns: Handle to HTTP response as returned by ``urllib.request.urlopen``. Will be wrapped in
        an ``io.TextIOWrapper`` if its content type is plain text.
    :rtype: http.client.HTTPResponse or io.TextIOWrapper
    :raises urllib.error.URLError: Errors raised by ``urlopen`` past the maximum number of retries.
    """
    # NCBI requirement: At most three queries per second if no API key is provided.
    # Equivalently, at least a third of second between queries
    # Using just 0.333333334 seconds sometimes hit the NCBI rate limit,
    # the slightly longer pause of 0.37 seconds has been more reliable.
    delay = 0.1 if _has_api_key(request) else 0.37
    current = time.time()
    wait = _open.previous + delay - current
    if wait > 0:
        time.sleep(wait)
        _open.previous = current + wait
    else:
        _open.previous = current

    for i in range(max_tries):
        try:
            handle = urlopen(request)
        except HTTPError as exception:
            # Reraise if the final try fails
            if i >= max_tries - 1:
                raise
            # Reraise if the exception is triggered by a HTTP 4XX error
            # indicating some kind of bad request, UNLESS it's specifically a
            # 429 "Too Many Requests" response. NCBI seems to sometimes
            # erroneously return 429s even when their rate limit is
            # honored (and indeed even with the rate-limit-related fudging
            # higher up in this function in place), so the best we can do is
            # treat them as a serverside error and try again after sleeping
            # for a bit.
            if exception.code // 100 == 4 and exception.code != 429:
                raise
        except URLError:
            # Reraise if the final try fails
            if i >= max_tries - 1:
                raise
            # Treat as a transient error and try again after a brief delay:
            time.sleep(sleep_between_tries)
        else:
            break

    subtype = handle.headers.get_content_subtype()
    if subtype == "plain":
        url = handle.url
        handle = io.TextIOWrapper(handle, encoding="UTF-8")
        handle.url = url
    return handle


_open.previous = 0


def _build_request(cgi, params=None, post=None, ecitmatch=False, join_ids=True):
    """Build a Request object for an E-utility.

    :param str cgi: base URL for the CGI script to access.
    :param params: Mapping containing options to pass to the CGI script. Keys must be strings.
    :type params: dict or None
    :param bool post: Whether to use the HTTP POST method rather than GET. By default (``post=None``),
        POST is used if the URL encoded parameters would be over 1000 characters long, as is
        suggested in the E-Utilities documentation.
    :param bool ecitmatch: Don't URL-encode pipe ("|") characters, this is expected by the ecitmatch
        tool.
    :param bool join_ids: Passed to ``_construct_params``.
    :returns: A request object ready to be passed to ``_open``.
    :rtype: urllib.request.Request
    """
    params = _construct_params(params, join_ids=join_ids)

    params_str = urlencode(params, doseq=True)
    if ecitmatch:
        params_str = params_str.replace("%7C", "|")

    # By default, post is None. Set to a boolean to over-ride length choice:
    if post is None and len(params_str) > 1000:
        post = True

    # NCBI prefers an HTTP POST instead of an HTTP GET if there are more than about 200 IDs
    if post is None and "id" in params:
        idcount = params["id"].count(",") + 1
        if idcount >= 200:
            post = True

    if post:
        return Request(cgi, data=params_str.encode("utf8"), method="POST")
    else:
        return Request(cgi + "?" + params_str, method="GET")


def _construct_params(params, join_ids=True):
    """Construct/format parameter dict for an Entrez request.

    :param params: User-supplied parameters.
    :type params: dict or None
    :param bool join_ids: If True and the "id" key of ``params`` is a list
        containing multiple UIDs, join them into a single comma-delimited string.
    :returns: Parameters with defaults added and keys with None values removed.
    :rtype: dict
    """
    if params is None:
        params = {}

    # Tell Entrez that we are using Biopython (or whatever the user has
    # specified explicitly in the parameters or by changing the default)
    params.setdefault("tool", tool)

    # Tell Entrez who we are
    params.setdefault("email", email)
    params.setdefault("api_key", api_key)

    # Remove None values from the parameters
    for key, value in list(params.items()):
        if value is None:
            del params[key]

    # Warn if email not set
    if "email" not in params:
        warnings.warn(
            """
            Email address is not specified.

            To make use of NCBI's E-utilities, NCBI requires you to specify your
            email address with each request.  As an example, if your email address
            is [email protected], you can specify it as follows:
               from Bio import Entrez
               Entrez.email = '[email protected]'
            In case of excessive usage of the E-utilities, NCBI will attempt to contact
            a user at the email address provided before blocking access to the
            E-utilities.""",
            UserWarning,
        )

    # Format "id" parameter properly
    if join_ids and "id" in params:
        params["id"] = _format_ids(params["id"])

    return params


def _format_ids(ids):
    """Convert one or more UIDs to a single comma-delimited string.

    Input may be a single ID as an integer or string, an iterable of strings/ints,
    or a string of IDs already separated by commas.
    """
    if isinstance(ids, int):
        # Single integer, just convert to str
        return str(ids)

    if isinstance(ids, str):
        # String which represents one or more IDs joined by commas
        # Remove any whitespace around commas if they are present
        return ",".join(id.strip() for id in ids.split(","))

    # Not a string or integer, assume iterable
    return ",".join(map(str, ids))


def _has_api_key(request):
    """Check if a Request has the api_key parameter set, to set the rate limit.

    Works with GET or POST requests.
    """
    if request.method == "POST":
        return b"api_key=" in request.data
    return "api_key=" in request.full_url


if __name__ == "__main__":
    from Bio._utils import run_doctest

    run_doctest()