Downloading FASTA files from NCBI Entrez

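NCBI Entrez is an integrated, text-based search and retrieval system covering more than 30 databases built for biological purposes.
The Bio.Entrez module in the Biopython package lets you use this system easily from Python. Incidentally, Bio.Entrez appears to be built on the Entrez Programming Utilities (a.k.a. EUtils).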

I had only ever used Biopython for processing fastq files, but this looks useful.
So let's give it a try.

Identifying yourself to NCBI

NCBI provides Entrez as a unified interface for accessing its databases.
Biopython ships a wrapper around the Entrez API, and using it means your requests automatically follow Entrez's usage conventions.

To use it, first tell NCBI who you are.

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"

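Set your email address in Entrez.email before sending any query. Without it Biopython emits a warning, and NCBI asks that every request carry one. If your usage exceeds the limits, NCBI will reportedly contact you at this address before blocking access.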

Listing the available databases

In the browser: [screenshot]

As with raw EUtils calls, results generally come back as XML, which you read from the returned handle.

handle = Entrez.einfo()
result = handle.read()
print(result)

<?xml version="1.0" encoding="UTF-8" ?>                                                             
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>                                                                                       
<DbList>                                                                                            
                                                                                                    
        <DbName>pubmed</DbName>                                                                     
        <DbName>protein</DbName>                                                                    
        <DbName>nuccore</DbName>                                                                    
        <DbName>ipg</DbName>                                                                        
        <DbName>nucleotide</DbName>                                                                 
        <DbName>nucgss</DbName>                                                                     
        <DbName>nucest</DbName>                                                                     
        <DbName>structure</DbName>                                                                  
        <DbName>sparcle</DbName>                                                                    
        <DbName>genome</DbName>                                                                     
        <DbName>annotinfo</DbName>                                                                  
        <DbName>assembly</DbName>                                                                   
        <DbName>bioproject</DbName>                                                                 
        <DbName>biosample</DbName>                                                                  
        <DbName>blastdbinfo</DbName>                                                                
        <DbName>books</DbName>                                                                      
        <DbName>cdd</DbName>                                                                        
        <DbName>clinvar</DbName>                                                                    
        <DbName>clone</DbName>                                                                      
        <DbName>gap</DbName>                                                                        
        <DbName>gapplus</DbName>                                                                    
        <DbName>grasp</DbName>                                                                      
        <DbName>dbvar</DbName>                                                                      
        <DbName>gene</DbName>                                                                       
        <DbName>gds</DbName>                                                                        
        <DbName>geoprofiles</DbName>                                                                
        <DbName>homologene</DbName>                                                                 
        <DbName>medgen</DbName>                                                                     
        <DbName>mesh</DbName>                                                                       
        <DbName>ncbisearch</DbName>                                                                 
        <DbName>nlmcatalog</DbName>                                                                 
        <DbName>omim</DbName>                                                                       
        <DbName>orgtrack</DbName>                                                                   
        <DbName>pmc</DbName>                                                                        
        <DbName>popset</DbName>                                                                     
        <DbName>probe</DbName>                                                                      
        <DbName>proteinclusters</DbName>                                                            
        <DbName>pcassay</DbName>                                                                    
        <DbName>biosystems</DbName>                                                                 
        <DbName>pccompound</DbName>                                                                 
        <DbName>pcsubstance</DbName>                                                                
        <DbName>pubmedhealth</DbName>                                                               
        <DbName>seqannot</DbName>                                                                   
        <DbName>snp</DbName> 
</DbList>      
               
</eInfoResult> 

To parse this into Python objects, use Entrez.read().
The list of databases is stored in a dictionary under the key DbList.

handle = Entrez.einfo()        
result = Entrez.read(handle)   
print(result['DbList'])

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
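
To get a feel for what Entrez.read() is doing here, below is a rough standard-library sketch that pulls the DbName entries out of a shortened copy of the XML. This is not how Biopython implements it (Entrez.read handles every EUtils result type, not just this one); list_databases and SAMPLE_XML are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the eInfo XML shown above
# (XML declaration, DOCTYPE and most entries left out for brevity).
SAMPLE_XML = """<eInfoResult>
<DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nuccore</DbName>
</DbList>
</eInfoResult>
"""


def list_databases(xml_text):
    """Return a dict shaped like the Entrez.read() result: {'DbList': [...]}."""
    root = ET.fromstring(xml_text)  # root element is <eInfoResult>
    names = [node.text for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


result = list_databases(SAMPLE_XML)
print(result["DbList"])  # ['pubmed', 'protein', 'nuccore']
```

In real code Entrez.read() is the better choice; the sketch is only meant to show that the parsed result is the same data as the XML, reshaped into a dictionary.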

Searching Entrez

In the browser: [screenshot]

Suppose you want to search by a keyword you are interested in. For example, if you want sequences for

the matK gene of orchids in the subfamily Cypripedioideae

you would do the following.

Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are                
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")  
result = Entrez.read(handle) 
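
The term string is just field-tagged clauses joined with AND. If you build such queries in a loop, a tiny helper can keep them readable. build_term below is a hypothetical helper of my own, not something Biopython provides, and it only covers the simple case where every clause is ANDed together.

```python
def build_term(clauses):
    """Join (value, field) pairs into an Entrez query like 'x[Orgn] AND y[Gene]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


term = build_term([("Cypripedioideae", "Orgn"), ("matK", "Gene")])
print(term)  # Cypripedioideae[Orgn] AND matK[Gene]
```

The resulting string can be passed as the term argument of Entrez.esearch exactly like the hand-written one above.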

Number of matching records

print(result["Count"]) 

542

IDs (UIDs) of the matching GenBank records

print(result["IdList"])                                                                    

['1746542926', '1746542924', '1746542922', '1746542920', '1746542918', '1746542916', '1746542914', '1746542912', '1746542910', '1746542908', '1746542906', '1746542904', '1746542902', '1746542900', '1746542898', '1746542896', '1746542894', '1746542892', '1746542890', '1746542888']

By default at most 20 IDs are returned, so to get all of them you need to specify retmax as shown below.

handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
result = Entrez.read(handle)
# If fewer IDs came back than there are hits, search again with retmax set to the hit count
if int(result['RetMax']) < int(result['Count']):
    handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]",
                            retmax=result['Count'])
    result = Entrez.read(handle)
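
Asking for everything in one request is fine for a few hundred hits, but for larger result sets it is probably kinder to page through them. Here is a sketch, assuming the EUtils retstart parameter (the offset of the first hit to return) is passed to esearch as a keyword argument in the same way as retmax. batch_windows and search_all_ids are hypothetical names, the batch size of 200 is arbitrary, and Entrez.email is assumed to be set already.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that cover `total` hits in chunks."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def search_all_ids(term, db="nucleotide", batch_size=200):
    """Collect every ID for `term` by paging through esearch (needs network)."""
    # Imported here so that batch_windows can be tried without Biopython installed.
    from Bio import Entrez

    # First request: only used to learn the total number of hits.
    handle = Entrez.esearch(db=db, term=term)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for retstart, retmax in batch_windows(total, batch_size):
        handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids


# The 542 hits from the search above would be fetched in three requests.
windows = list(batch_windows(542, 200))
print(windows)  # [(0, 200), (200, 200), (400, 142)]
```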

Fetching complete records from Entrez

In the browser: [screenshot]

Let's look at the details of the very first GenBank ID in the hits.
To view the details, use Bio.Entrez.efetch.

handle = Entrez.efetch(db="nucleotide", id="1746542926", rettype="gb", retmode="text")
print(handle.read())

LOCUS       MN419894                2471 bp    DNA     linear   INV 25-SEP-2019
DEFINITION  Plasmodium falciparum isolate PA1876 chloroquine resistance
            transporter (crt) gene, partial cds.
ACCESSION   MN419894
VERSION     MN419894.1
KEYWORDS    .
SOURCE      Plasmodium falciparum (malaria parasite P. falciparum)
  ORGANISM  Plasmodium falciparum
            Eukaryota; Sar; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
            Plasmodiidae; Plasmodium; Plasmodium (Laverania).
REFERENCE   1  (bases 1 to 2471)
  AUTHORS   Zhao,Y., Liu,Z., Soe,M.T., Wang,L., Soe,T.N., Wei,H., Than,A.,
            Aung,P.L., Li,Y., Zhang,X., Hu,Y., Wei,H., Zhang,Y., Burgess,J.,
            Siddiqui,F.A., Menezes,L., Wang,Q., Kyaw,M.P., Cao,Y. and Cui,L.
  TITLE     Genetic Variations Associated with Drug Resistance Markers in
            Asymptomatic Plasmodium falciparum Infections in Myanmar
  JOURNAL   Genes (Basel) 10 (9), E692 (2019)
   PUBMED   31505774
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 2471)
  AUTHORS   Zhao,Y., Liu,Z., Soe,M.T., Wang,L., Soe,T.N., Wei,H., Than,A.,
            Aung,P.L., Li,Y., Zhang,X., Hu,Y., Wei,H., Zhang,Y., Burgess,J.,
            Siddiqui,F.A., Menezes,L., Wang,Q., Kyaw,M.P., Cao,Y. and Cui,L.
  TITLE     Direct Submission
  JOURNAL   Submitted (05-SEP-2019) Department of Immunology, College of Basic
            Medical Sciences, China Medical University, No.77 Puhe Road,
            Shenyang, Liaoning 110122, China
COMMENT     ##Assembly-Data-START##
            Sequencing Technology :: Sanger dideoxy sequencing
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..2471
                     /organism="Plasmodium falciparum"
                     /mol_type="genomic DNA"
                     /strain="Paletwa"
                     /isolate="PA1876"
                     /db_xref="taxon:5833"
     gene            <1..>2471
                     /gene="crt"
                     /locus_tag="PF3D7_0709000"
                     /note="Pfcrt"
     mRNA            join(<1..91,192..460,561..733,834..966,1067..1138,
                     1239..1314,1411..1493,1594..1644,1745..1801,1902..1994,
                     2095..2139,2240..2294,2395..>2471)
                     /gene="crt"
                     /locus_tag="PF3D7_0709000"
                     /product="chloroquine resistance transporter"
     CDS             join(1..91,192..460,561..733,834..966,1067..1138,
                     1239..1314,1411..1493,1594..1644,1745..1801,1902..1994,
                     2095..2139,2240..2294,2395..2471)
                     /gene="crt"
                     /locus_tag="PF3D7_0709000"
                     /codon_start=1
                     /product="chloroquine resistance transporter"
                     /protein_id="QEQ91169.1"
                     /translation="MKFASKKNNQKNSSKNDERYRELDNLVQEGNGSRLGGGSCLGKC
                     AHVFKLIFKEIKDNIFIYILSIIYLSVCVIETIFAKRTLNKIGNYSFVTSETHNFICM
                     IMFFIVYSLFGNKKGNSKERHRSFNLQFFAISMLDACSVILAFIGLTRTTGNIQSFVL
                     QLSIPINMFFCFLILRYRYHLYNYLGAVIIVVTIALVEMKLSFETQEENSIIFNLVLI
                     SSLIPVCFSNMTREIVFKKYKIDILRLNAMVSFFQLFTSCLILPVYTLPFLKELHLPY
                     NEIWTNIKNGFACLFLGRNTVVENCGLGMAKLCDDCDGAWKTFALFSFFNICDNLITS
                     YIIDKFSTMTYTIVSCIQGPATAIAYYFKFLAGDVVIEPRLLDFVTLFGYLFGSIIYR
                     VGNIILERKKMRNEENEDSEGELTNVDSIITQ"
     gap             92..191
                     /estimated_length=unknown
     gap             461..560
                     /estimated_length=unknown
     gap             734..833
                     /estimated_length=unknown
     gap             967..1066
                     /estimated_length=unknown
     gap             1139..1238
                     /estimated_length=unknown
     gap             1494..1593
                     /estimated_length=unknown
     gap             1645..1744
                     /estimated_length=unknown
     gap             1802..1901
                     /estimated_length=unknown
     gap             1995..2094
                     /estimated_length=unknown
     gap             2140..2239
                     /estimated_length=unknown
     gap             2295..2394
                     /estimated_length=unknown
ORIGIN      
        1 atgaaattcg caagtaaaaa aaataatcaa aaaaattcaa gcaaaaatga cgagcgttat
       61 agagaattag ataatttagt acaagaagga annnnnnnnn nnnnnnnnnn nnnnnnnnnn
      121 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
      181 nnnnnnnnnn natggctcac gtttaggtgg aggttcttgt cttggtaaat gtgctcatgt
      241 gtttaaactt atttttaaag agattaagga taatattttt atttatattt taagtattat
      301 ttatttaagt gtatgtgtaa ttgaaacaat ttttgctaaa agaactttaa acaaaattgg
      361 taactatagt tttgtaacat ccgaaactca caactttatt tgtatgatta tgttctttat
      421 tgtttattcc ttatttggaa ataaaaaggg aaattcaaaa nnnnnnnnnn nnnnnnnnnn
      481 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
      541 nnnnnnnnnn nnnnnnnnnn gaacgacacc gaagctttaa tttacaattt tttgctatat
      601 ccatgttaga tgcctgttca gtcattttgg ccttcatagg tcttacaaga actactggaa
      661 atatccaatc atttgttctt caattaagta ttcctattaa tatgttcttc tgctttttaa
      721 tattaagata tagnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
      781 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnatatcac
      841 ttatacaatt atctcggagc agttattatt gttgtaacaa tagctcttgt agaaatgaaa
      901 ttatcttttg aaacacaaga agaaaattct atcatattta atcttgtctt aattagttcc
      961 ttaattnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1021 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnncctg tatgcttttc
     1081 aaacatgaca agggaaatag tttttaaaaa atataagatt gacattttaa gattaaatnn
     1141 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1201 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnngc tatggtatcc tttttccaat
     1261 tgttcacttc ttgtcttata ttacctgtat acacccttcc atttttaaaa gaacnnnnnn
     1321 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1381 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn ttcatttacc atataatgaa atatggacaa
     1441 atataaaaaa tggtttcgca tgtttattct tgggaagaaa cacagtcgta gagnnnnnnn
     1501 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1561 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnaattgtg gtcttggtat ggctaagtta
     1621 tgtgatgatt gtgacggagc atggnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1681 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1741 nnnnaaaacc ttcgcattgt tttccttctt taacatttgt gataatttaa taaccagcta
     1801 tnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nattatcgac aaattttcta
     1921 ccatgacata tactattgtt agttgtatac aaggtccagc aacagcaatt gcttattact
     1981 ttaaattctt agccnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     2041 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnggtgat
     2101 gttgtaatag aaccaagatt attagatttc gtaactttgn nnnnnnnnnn nnnnnnnnnn
     2161 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     2221 nnnnnnnnnn nnnnnnnnnt ttggctacct atttggttct ataatttacc gtgtaggaaa
     2281 tattatctta gaaannnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     2341 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnngaaaaa
     2401 aaatgagaaa tgaagaaaat gaagattccg aaggagaatt aaccaacgtc gattcaatta
     2461 ttacacaata a
//                                                            

Printing the PubMed-linked annotation information from a SeqRecord object

from Bio import Medline, SeqIO

handle = Entrez.efetch(db="nucleotide", id="1746542926", rettype="gb", retmode="text")
rec_list = list(SeqIO.parse(handle, 'gb'))
rec = rec_list[0]
refs = rec.annotations['references']
for ref in refs:
    # Only references that carry a PubMed ID can be looked up in PubMed
    if ref.pubmed_id != '':
        print(ref.pubmed_id)
        handle = Entrez.efetch(db="pubmed", id=[ref.pubmed_id],
                                rettype="medline", retmode="text")
        records = Medline.parse(handle)
        for med_rec in records:
            for k, v in med_rec.items():
                print('%s: %s' % (k, v))

31505774
PMID: 31505774
OWN: NLM
STAT: MEDLINE
DCOM: 20200116
LR: 20200116
IS: 2073-4425 (Electronic) 2073-4425 (Linking)
VI: 10
IP: 9
DP: 2019 Sep 9
TI: Genetic Variations Associated with Drug Resistance Markers in Asymptomatic Plasmodium falciparum Infections in Myanmar.
LID: E692 [pii] 10.3390/genes10090692 [doi]
AB: The emergence and spread of drug resistance is a problem hindering malaria elimination in Southeast Asia. In this study, genetic variations in drug resistance markers of Plasmodium falciparum were determined in parasites from asymptomatic populations located in three geographically dispersed townships of Myanmar by PCR and sequencing. Mutations in dihydrofolate reductase (pfdhfr), dihydropteroate synthase (pfdhps), chloroquine resistance transporter (pfcrt), multidrug resistance protein 1 (pfmdr1), multidrug resistance-associated protein 1 (pfmrp1), and Kelch protein 13 (k13) were present in 92.3%, 97.6%, 84.0%, 98.8%, and 68.3% of the parasites, respectively. The pfcrt K76T, pfmdr1 N86Y, pfmdr1 I185K, and pfmrp1 I876V mutations were present in 82.7%, 2.5%, 87.5%, and 59.8% isolates, respectively. The most prevalent haplotypes for pfdhfr, pfdhps, pfcrt and pfmdr1 were 51I/59R/108N/164L, 436A/437G/540E/581A, 74I/75E/76T/220S/271E/326N/356T/371I, and 86N/130E/184Y/185K/1225V, respectively. In addition, 57 isolates had three different point mutations (K191T, F446I, and P574L) and three types of N-terminal insertions (N, NN, NNN) in the k13 gene. In total, 43 distinct haplotypes potentially associated with multidrug resistance were identified. These findings demonstrate a high prevalence of multidrug-resistant P. falciparum in asymptomatic infections from diverse townships in Myanmar, emphasizing the importance of targeting asymptomatic infections to prevent the spread of drug-resistant P. falciparum.
FAU: ['Zhao, Yan', 'Liu, Ziling', 'Soe, Myat Thu', 'Wang, Lin', 'Soe, Than Naing', 'Wei, Huanping', 'Than, Aye', 'Aung, Pyae Linn', 'Li, Yuling', 'Zhang, Xuexing', 'Hu, Yubing', 'Wei, Haichao', 'Zhang, Yangminghui', 'Burgess, Jessica', 'Siddiqui, Faiza A', 'Menezes, Lynette', 'Wang, Qinghui', 'Kyaw, Myat Phone', 'Cao, Yaming', 'Cui, Liwang']
AU: ['Zhao Y', 'Liu Z', 'Soe MT', 'Wang L', 'Soe TN', 'Wei H', 'Than A', 'Aung PL', 'Li Y', 'Zhang X', 'Hu Y', 'Wei H', 'Zhang Y', 'Burgess J', 'Siddiqui FA', 'Menezes L', 'Wang Q', 'Kyaw MP', 'Cao Y', 'Cui L']
AD: Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. yzhao90@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zlliu87@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. dr.myatthusoe@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. lwang95@cmu.edu.cn. Department of Public Health, Ministry of Health and Sports, Nay Pyi Taw 15011, Myanmar. thannaingsoe@mohs.gov.mm. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. hpwei@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. ayethan1957@gmail.com. Myanmar Health Network Organization, Yangon 11211, Myanmar. pyaelinnag@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ylli88@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zhangxuexing@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ybhu@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. hcwei@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zymh@cmu.edu.cn. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. jessicaburge@health.usf.edu. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. faiza@health.usf.edu. 
Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. lmenezes@health.usf.edu. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. qhwang@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. kyaw606@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ymcao@cmu.edu.cn. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. lcui@health.usf.edu.
AUID: ['ORCID: 0000-0001-9208-929X', 'ORCID: 0000-0003-4635-3701', 'ORCID: 0000-0002-8338-1974']
LA: ['eng']
GR: ['U19AI089672/National Institute of Allergy and Infectious Diseases/International']
PT: ['Journal Article', 'Research Support, N.I.H., Extramural']
DEP: 20190909
PL: Switzerland
TA: Genes (Basel)
JT: Genes
JID: 101551097
RN: ['0 (Antimalarials)', '0 (Multidrug Resistance-Associated Proteins)', '0 (Protozoan Proteins)', 'EC 1.5.1.3 (Tetrahydrofolate Dehydrogenase)', 'EC 2.5.1.15 (Dihydropteroate Synthase)', 'Y49M64GZ4Q (multidrug resistance-associated protein 1)']
SB: IM
MH: ['Antimalarials/*pharmacology', 'Dihydropteroate Synthase/genetics', '*Drug Resistance, Multiple', 'Humans', 'Malaria/epidemiology/*parasitology', 'Multidrug Resistance-Associated Proteins/genetics', 'Myanmar', 'Plasmodium falciparum/drug effects/*genetics/pathogenicity', '*Polymorphism, Genetic', 'Protozoan Proteins/genetics', 'Tetrahydrofolate Dehydrogenase/genetics']
PMC: PMC6770986
OTO: ['NOTNLM']
OT: ['*Plasmodium falciparum', '*asymptomatic infection', '*drug resistance genes', '*haplotypes', '*multidrug resistance']
EDAT: 2019/09/12 06:00
MHDA: 2020/01/17 06:00
CRDT: ['2019/09/12 06:00']
PHST: ['2019/08/02 00:00 [received]', '2019/08/31 00:00 [revised]', '2019/09/04 00:00 [accepted]', '2019/09/12 06:00 [entrez]', '2019/09/12 06:00 [pubmed]', '2020/01/17 06:00 [medline]']
AID: ['genes10090692 [pii]', '10.3390/genes10090692 [doi]']
PST: epublish
SO: Genes (Basel). 2019 Sep 9;10(9). pii: genes10090692. doi: 10.3390/genes10090692.
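
Dumping every field gets long quickly. Since the loop above treats each Medline record like a dictionary (it calls .items() on it), a small filter is enough to print only the fields you care about. This is a sketch: summarize and WANTED are hypothetical names of mine, and the sample dictionary just copies a few values from the output above so that the example runs without a network connection.

```python
# Medline field codes to keep: PubMed ID, title, publication date, journal abbreviation.
WANTED = ("PMID", "TI", "DP", "TA")


def summarize(med_rec, keys=WANTED):
    """Return only the requested fields that are present in the record."""
    return {key: med_rec[key] for key in keys if key in med_rec}


# Stand-in for a parsed record, with values copied from the output above (no TI on purpose).
sample = {
    "PMID": "31505774",
    "DP": "2019 Sep 9",
    "TA": "Genes (Basel)",
    "PL": "Switzerland",
}
summary = summarize(sample)
print(summary)  # {'PMID': '31505774', 'DP': '2019 Sep 9', 'TA': 'Genes (Basel)'}
```

Missing fields are skipped instead of raising KeyError, which is convenient because not every PubMed record carries every field.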

Fetching FASTA-format data from Entrez

Specifying rettype="fasta" returns the data in FASTA format.

handle = Entrez.efetch(db="nucleotide", id="1434742847", rettype="fasta", retmode="text")
print(handle.read())                                                                       

>MF543506.1 Cypripedium calceolus voucher CYCAOL02-210813 maturase K (matK) gene, partial cds; chloroplast
AATTATGTGTCAGATCTACTAATACCCCATCCCATCCATCTGGAAATCTTGGTTCAAATCCTGCAATGCT                              
GGATCAAGGATGTTCCTTCTTTGCATTTATTGCGATTGCTTTTCCACGAATATCATTATTTTAATAGTCT                              
CATTACTTCAAAAAAAAGCATTTACGCCTTTTCAAGAATAAAGAAAAGATTCCTTTGGTTCCTATATAAT                              
TCTTATGTATATGAATGCGAATATCTATTCCATTTTCTTCGTAAACAGTCTTCTTATTTACGATCAACAT                              
CTTCTGGAGTGTTTCTTGAGCGAACACATTTCTATGTAAAAATAGAACATCTTATAGTAGTGTGTTGTAA                              
TTCTTTTCATAGGATCCTATGCTTTCTCAAGGATCCTTTCATGCATTATGTTCGATATCAAGGAAAAGCA                              
ATTCTGGCTTCAAAGGGAACTCTTATTCTGATGAAGAAATGGAAATTTCATCTTGTTAATTTTTGGCAAT                              
CTTATTTGCACTTTTGGTCTCAACCGTATAGGATCCATATAAAGCAATTATACAACTATTCCTTCTCTTT                              
TCTGGGGTATTTTTCAAGTGTACTAGAAAATCATTTGGTAGTAAGAAATCAAATGCTAGAGAATTCATTT                              
CTAATAAATATTATGACTAAGAAATTAGATACCATAGCCCCAGTTATTTCTCTTATTGGATCATTGTCGA                              
AAGCTCAATTTTGTACTGTATTGGGCCATCCTATTAGTAAACCGATCTGGACCGATTTATCGGATTCTGA                              
TATTCTTGATCGATTTTGCCGGATATGTAGAAATCTTTGTCGTTATCACAGCGGATCCTCAAAA                                    

Building on this, you can write a script like the one below that downloads
the corresponding FASTA file only when it does not already exist locally.

import os
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
filename = "gi_186972394.fasta"
if not os.path.isfile(filename):
    # Download only when the file is not already on disk
    net_handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="fasta", retmode="text")
    with open(filename, "w") as out_handle:
        out_handle.write(net_handle.read())
    net_handle.close()
    print("Saved")
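
If you need this for more than one ID, the same "download only if missing" idea can be wrapped in a function. A sketch under a few assumptions: ensure_fasta, fetch_fasta_text and fake_fetch are hypothetical names, the gi_<id>.fasta naming simply follows the script above, and the fetcher is passed in as an argument so that the caching logic can be tried offline with a stand-in.

```python
import os
import tempfile


def fetch_fasta_text(uid):
    """Download one nucleotide record as FASTA text (needs network and Entrez.email)."""
    from Bio import Entrez  # imported lazily; the demo below never calls this function

    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def ensure_fasta(uid, directory, fetch=fetch_fasta_text):
    """Return the path of the FASTA file for `uid`, downloading it only if absent."""
    path = os.path.join(directory, f"gi_{uid}.fasta")
    if not os.path.isfile(path):
        with open(path, "w") as out_handle:
            out_handle.write(fetch(uid))
    return path


# Offline demonstration: a stand-in fetcher that records how often it is called.
calls = []


def fake_fetch(uid):
    calls.append(uid)
    return f">fake_{uid}\nACGT\n"


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_fasta("186972394", tmp, fetch=fake_fetch)
    second = ensure_fasta("186972394", tmp, fetch=fake_fetch)
    with open(first) as fh:
        content = fh.read()

print(calls)  # ['186972394'] (the second call found the file and skipped the download)
```

With the default fetch argument the function goes to NCBI, so Entrez.email should be set beforehand just as in the script above.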

Summary

  • With Bio.Entrez you can do almost everything the NCBI search widget can do.
  • You can also get records in FASTA format, and write them straight to a file.