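NCBI Entrez is an integrated, text-based search and retrieval system covering more than 30 biological databases.
With the Bio.Entrez module of the Biopython package, you can use this system from Python with very little effort.
Incidentally, Bio.Entrez is a wrapper around the Entrez Programming Utilities (a.k.a. EUtils).
I had only ever used Biopython for processing fastq files, but this looks genuinely useful.
So let's give it a try.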
User authentication
NCBI provides Entrez as a unified interface for accessing its databases.
Biopython ships a wrapper for the Entrez API, and requests sent through it automatically follow the Entrez usage conventions.
Before using it, identify yourself to NCBI.
    from Bio import Entrez
    Entrez.email = "A.N.Other@example.com"
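Register your address with Entrez.email before sending any query. Without it, Biopython issues a warning and NCBI may reject or block your requests. If your usage exceeds the limits, NCBI reportedly contacts you at this address before blocking access.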
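To make the "usage conventions" a little more concrete, here is a rough standard-library sketch of the kind of URL such a request amounts to. The `email` and `tool` query parameters are part of the E-utilities interface; the helper name `build_eutils_url` and the tool name `my_script` are made up for illustration, and what Bio.Entrez does internally may differ in detail.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, email, tool="my_script", **params):
    """Build an E-utilities URL that identifies the caller (hypothetical helper)."""
    query = {"tool": tool, "email": email}
    query.update(params)  # e.g. db=..., term=...
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(query)


url = build_eutils_url("esearch", "A.N.Other@example.com",
                       db="nucleotide", term="matK[Gene]")
print(url)
```

Nothing is sent over the network here; the point is only that every request carries an identifier NCBI can use to reach you.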
Listing the available databases
As with raw EUtils, results generally come back as XML, and the returned value is held in a handle object.
    handle = Entrez.einfo()
    result = handle.read()
    print(result)

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
    <eInfoResult>
    <DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nuccore</DbName>
        <DbName>ipg</DbName>
        <DbName>nucleotide</DbName>
        <DbName>nucgss</DbName>
        <DbName>nucest</DbName>
        <DbName>structure</DbName>
        <DbName>sparcle</DbName>
        <DbName>genome</DbName>
        <DbName>annotinfo</DbName>
        <DbName>assembly</DbName>
        <DbName>bioproject</DbName>
        <DbName>biosample</DbName>
        <DbName>blastdbinfo</DbName>
        <DbName>books</DbName>
        <DbName>cdd</DbName>
        <DbName>clinvar</DbName>
        <DbName>clone</DbName>
        <DbName>gap</DbName>
        <DbName>gapplus</DbName>
        <DbName>grasp</DbName>
        <DbName>dbvar</DbName>
        <DbName>gene</DbName>
        <DbName>gds</DbName>
        <DbName>geoprofiles</DbName>
        <DbName>homologene</DbName>
        <DbName>medgen</DbName>
        <DbName>mesh</DbName>
        <DbName>ncbisearch</DbName>
        <DbName>nlmcatalog</DbName>
        <DbName>omim</DbName>
        <DbName>orgtrack</DbName>
        <DbName>pmc</DbName>
        <DbName>popset</DbName>
        <DbName>probe</DbName>
        <DbName>proteinclusters</DbName>
        <DbName>pcassay</DbName>
        <DbName>biosystems</DbName>
        <DbName>pccompound</DbName>
        <DbName>pcsubstance</DbName>
        <DbName>pubmedhealth</DbName>
        <DbName>seqannot</DbName>
        <DbName>snp</DbName>
    </DbList>
    </eInfoResult>
To parse this XML into a Python object, use Entrez.read().
The list of databases is stored in the resulting dictionary under the key DbList.
    handle = Entrez.einfo()
    result = Entrez.read(handle)
    print(result['DbList'])

    ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure',
     'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo',
     'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds',
     'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim',
     'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems',
     'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy',
     'biocollections', 'unigene', 'gencoll', 'gtr']
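To get a feel for what the parsing step does conceptually, here is a small offline sketch that turns a trimmed copy of the XML shown above into the same kind of dictionary, using only the standard library. The helper `parse_db_list` is hypothetical and is not how Entrez.read() is implemented; the real function handles many more result types.

```python
import xml.etree.ElementTree as ET

# A shortened eInfoResult document, trimmed from the output shown above.
EINFO_XML = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""


def parse_db_list(xml_text):
    """Collect every <DbName> under <DbList> into a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    names = [node.text for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


result = parse_db_list(EINFO_XML)
print(result["DbList"])
```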
Searching Entrez
Next, searching by a keyword of interest. For example, to find sequences of the matK gene from orchids of the subfamily Cypripedioideae, run the following.
    Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
    handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
    result = Entrez.read(handle)
Number of hits (these are nucleotide records, not papers)

    print(result["Count"])

    542

GenBank IDs of the hits

    print(result["IdList"])

    ['1746542926', '1746542924', '1746542922', '1746542920', '1746542918',
     '1746542916', '1746542914', '1746542912', '1746542910', '1746542908',
     '1746542906', '1746542904', '1746542902', '1746542900', '1746542898',
     '1746542896', '1746542894', '1746542892', '1746542890', '1746542888']
By default at most 20 IDs are returned, so to get all of them you need to pass retmax, as below.
    handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
    rec_list = Entrez.read(handle)
    # If the result was truncated, search again with retmax raised to the total hit count.
    if int(rec_list['RetMax']) < int(rec_list['Count']):
        handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]",
                                retmax=rec_list['Count'])
        rec_list = Entrez.read(handle)
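For very large result sets, an alternative to one big retmax is to page through the hits with retstart and retmax, both of which esearch accepts. The sketch below is one way to do that under some assumptions: `fetch_all_ids`, `entrez_search` and `fake_search` are names invented here, the page size is arbitrary, and the demo runs against a made-up in-memory result so that it works offline.

```python
def fetch_all_ids(search, page_size=500):
    """Collect every ID by calling search(retstart=..., retmax=...) page by page."""
    ids = []
    retstart = 0
    while True:
        result = search(retstart=retstart, retmax=page_size)
        ids.extend(result["IdList"])
        retstart += page_size
        # Stop on an empty page or once we have walked past the total count.
        if not result["IdList"] or retstart >= int(result["Count"]):
            break
    return ids


def entrez_search(retstart, retmax):
    """Real search; needs Biopython and network access (not called in this demo)."""
    from Bio import Entrez
    Entrez.email = "A.N.Other@example.com"
    handle = Entrez.esearch(db="nucleotide",
                            term="Cypripedioideae[Orgn] AND matK[Gene]",
                            retstart=retstart, retmax=retmax)
    result = Entrez.read(handle)
    handle.close()
    return result


# Offline stand-in that mimics the shape of an esearch result (made-up IDs).
FAKE_IDS = [str(n) for n in range(1, 46)]


def fake_search(retstart, retmax):
    return {"Count": str(len(FAKE_IDS)),
            "IdList": FAKE_IDS[retstart:retstart + retmax]}


all_ids = fetch_all_ids(fake_search, page_size=20)
print(len(all_ids))
```

To run it for real, pass `entrez_search` instead of `fake_search`.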
Fetching full records from Entrez
Let's look at the details of the first GenBank ID in the hit list.
To view a full record, use Bio.Entrez.efetch.
    handle = Entrez.efetch(db="nucleotide", id="1746542926", rettype="gb", retmode="text")
    print(handle.read())
    LOCUS       MN419894                2471 bp    DNA     linear   INV 25-SEP-2019
    DEFINITION  Plasmodium falciparum isolate PA1876 chloroquine resistance
                transporter (crt) gene, partial cds.
    ACCESSION   MN419894
    VERSION     MN419894.1
    KEYWORDS    .
    SOURCE      Plasmodium falciparum (malaria parasite P. falciparum)
      ORGANISM  Plasmodium falciparum
                Eukaryota; Sar; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
                Plasmodiidae; Plasmodium; Plasmodium (Laverania).
    REFERENCE   1  (bases 1 to 2471)
      AUTHORS   Zhao,Y., Liu,Z., Soe,M.T., Wang,L., Soe,T.N., Wei,H., Than,A.,
                Aung,P.L., Li,Y., Zhang,X., Hu,Y., Wei,H., Zhang,Y., Burgess,J.,
                Siddiqui,F.A., Menezes,L., Wang,Q., Kyaw,M.P., Cao,Y. and Cui,L.
      TITLE     Genetic Variations Associated with Drug Resistance Markers in
                Asymptomatic Plasmodium falciparum Infections in Myanmar
      JOURNAL   Genes (Basel) 10 (9), E692 (2019)
       PUBMED   31505774
      REMARK    Publication Status: Online-Only
    REFERENCE   2  (bases 1 to 2471)
      AUTHORS   Zhao,Y., Liu,Z., Soe,M.T., Wang,L., Soe,T.N., Wei,H., Than,A.,
                Aung,P.L., Li,Y., Zhang,X., Hu,Y., Wei,H., Zhang,Y., Burgess,J.,
                Siddiqui,F.A., Menezes,L., Wang,Q., Kyaw,M.P., Cao,Y. and Cui,L.
      TITLE     Direct Submission
      JOURNAL   Submitted (05-SEP-2019) Department of Immunology, College of Basic
                Medical Sciences, China Medical University, No.77 Puhe Road,
                Shenyang, Liaoning 110122, China
    COMMENT     ##Assembly-Data-START##
                Sequencing Technology :: Sanger dideoxy sequencing
                ##Assembly-Data-END##
    FEATURES             Location/Qualifiers
         source          1..2471
                         /organism="Plasmodium falciparum"
                         /mol_type="genomic DNA"
                         /strain="Paletwa"
                         /isolate="PA1876"
                         /db_xref="taxon:5833"
         gene            <1..>2471
                         /gene="crt"
                         /locus_tag="PF3D7_0709000"
                         /note="Pfcrt"
         mRNA            join(<1..91,192..460,561..733,834..966,1067..1138,
                         1239..1314,1411..1493,1594..1644,1745..1801,1902..1994,
                         2095..2139,2240..2294,2395..>2471)
                         /gene="crt"
                         /locus_tag="PF3D7_0709000"
                         /product="chloroquine resistance transporter"
         CDS             join(1..91,192..460,561..733,834..966,1067..1138,
                         1239..1314,1411..1493,1594..1644,1745..1801,1902..1994,
                         2095..2139,2240..2294,2395..2471)
                         /gene="crt"
                         /locus_tag="PF3D7_0709000"
                         /codon_start=1
                         /product="chloroquine resistance transporter"
                         /protein_id="QEQ91169.1"
                         /translation="MKFASKKNNQKNSSKNDERYRELDNLVQEGNGSRLGGGSCLGKC
                         AHVFKLIFKEIKDNIFIYILSIIYLSVCVIETIFAKRTLNKIGNYSFVTSETHNFICM
                         IMFFIVYSLFGNKKGNSKERHRSFNLQFFAISMLDACSVILAFIGLTRTTGNIQSFVL
                         QLSIPINMFFCFLILRYRYHLYNYLGAVIIVVTIALVEMKLSFETQEENSIIFNLVLI
                         SSLIPVCFSNMTREIVFKKYKIDILRLNAMVSFFQLFTSCLILPVYTLPFLKELHLPY
                         NEIWTNIKNGFACLFLGRNTVVENCGLGMAKLCDDCDGAWKTFALFSFFNICDNLITS
                         YIIDKFSTMTYTIVSCIQGPATAIAYYFKFLAGDVVIEPRLLDFVTLFGYLFGSIIYR
                         VGNIILERKKMRNEENEDSEGELTNVDSIITQ"
         gap             92..191
                         /estimated_length=unknown
         gap             461..560
                         /estimated_length=unknown
         gap             734..833
                         /estimated_length=unknown
         gap             967..1066
                         /estimated_length=unknown
         gap             1139..1238
                         /estimated_length=unknown
         gap             1494..1593
                         /estimated_length=unknown
         gap             1645..1744
                         /estimated_length=unknown
         gap             1802..1901
                         /estimated_length=unknown
         gap             1995..2094
                         /estimated_length=unknown
         gap             2140..2239
                         /estimated_length=unknown
         gap             2295..2394
                         /estimated_length=unknown
    ORIGIN
            1 atgaaattcg caagtaaaaa aaataatcaa aaaaattcaa gcaaaaatga cgagcgttat
           61 agagaattag ataatttagt acaagaagga annnnnnnnn nnnnnnnnnn nnnnnnnnnn
          121 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
          181 nnnnnnnnnn natggctcac gtttaggtgg aggttcttgt cttggtaaat gtgctcatgt
          241 gtttaaactt atttttaaag agattaagga taatattttt atttatattt taagtattat
          301 ttatttaagt gtatgtgtaa ttgaaacaat ttttgctaaa agaactttaa acaaaattgg
          361 taactatagt tttgtaacat ccgaaactca caactttatt tgtatgatta tgttctttat
          421 tgtttattcc ttatttggaa ataaaaaggg aaattcaaaa nnnnnnnnnn nnnnnnnnnn
          481 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
          541 nnnnnnnnnn nnnnnnnnnn gaacgacacc gaagctttaa tttacaattt tttgctatat
          601 ccatgttaga tgcctgttca gtcattttgg ccttcatagg tcttacaaga actactggaa
          661 atatccaatc atttgttctt caattaagta ttcctattaa tatgttcttc tgctttttaa
          721 tattaagata tagnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
          781 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnatatcac
          841 ttatacaatt atctcggagc agttattatt gttgtaacaa tagctcttgt agaaatgaaa
          901 ttatcttttg aaacacaaga agaaaattct atcatattta atcttgtctt aattagttcc
          961 ttaattnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1021 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnncctg tatgcttttc
         1081 aaacatgaca agggaaatag tttttaaaaa atataagatt gacattttaa gattaaatnn
         1141 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1201 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnngc tatggtatcc tttttccaat
         1261 tgttcacttc ttgtcttata ttacctgtat acacccttcc atttttaaaa gaacnnnnnn
         1321 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1381 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn ttcatttacc atataatgaa atatggacaa
         1441 atataaaaaa tggtttcgca tgtttattct tgggaagaaa cacagtcgta gagnnnnnnn
         1501 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1561 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnaattgtg gtcttggtat ggctaagtta
         1621 tgtgatgatt gtgacggagc atggnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1681 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1741 nnnnaaaacc ttcgcattgt tttccttctt taacatttgt gataatttaa taaccagcta
         1801 tnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         1861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nattatcgac aaattttcta
         1921 ccatgacata tactattgtt agttgtatac aaggtccagc aacagcaatt gcttattact
         1981 ttaaattctt agccnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         2041 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnggtgat
         2101 gttgtaatag aaccaagatt attagatttc gtaactttgn nnnnnnnnnn nnnnnnnnnn
         2161 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         2221 nnnnnnnnnn nnnnnnnnnt ttggctacct atttggttct ataatttacc gtgtaggaaa
         2281 tattatctta gaaannnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
         2341 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnngaaaaa
         2401 aaatgagaaa tgaagaaaat gaagattccg aaggagaatt aaccaacgtc gattcaatta
         2461 ttacacaata a
    //
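When more than one record is needed, efetch can take several IDs at once as a comma-separated string, so the IDs from a search can be fetched in batches rather than one by one. A minimal sketch follows; `chunked` and `fetch_genbank_batches` are hypothetical helpers, the batch size of 100 is an arbitrary choice, and only the offline batching logic is exercised here.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_genbank_batches(ids, batch_size=100):
    """Yield GenBank text per batch; needs Biopython and network access."""
    from Bio import Entrez
    Entrez.email = "A.N.Other@example.com"
    for batch in chunked(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="gb", retmode="text")
        try:
            yield handle.read()
        finally:
            handle.close()


# Offline check of the batching logic, using the first IDs from the search above.
ids = ["1746542926", "1746542924", "1746542922", "1746542920", "1746542918"]
batches = list(chunked(ids, 2))
print(batches)
```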
Printing the annotations that carry a PubMed ID from a SeqRecord object
    from Bio import Medline, SeqIO

    handle = Entrez.efetch(db="nucleotide", id="1746542926", rettype="gb", retmode="text")
    rec_list = list(SeqIO.parse(handle, 'gb'))
    rec = rec_list[0]
    refs = rec.annotations['references']
    for ref in refs:
        # Only references linked to a PubMed entry can be looked up.
        if ref.pubmed_id != '':
            print(ref.pubmed_id)
            handle = Entrez.efetch(db="pubmed", id=[ref.pubmed_id],
                                   rettype="medline", retmode="text")
            records = Medline.parse(handle)
            for med_rec in records:
                for k, v in med_rec.items():
                    print('%s: %s' % (k, v))
    31505774
    PMID: 31505774
    OWN: NLM
    STAT: MEDLINE
    DCOM: 20200116
    LR: 20200116
    IS: 2073-4425 (Electronic) 2073-4425 (Linking)
    VI: 10
    IP: 9
    DP: 2019 Sep 9
    TI: Genetic Variations Associated with Drug Resistance Markers in Asymptomatic Plasmodium falciparum Infections in Myanmar.
    LID: E692 [pii] 10.3390/genes10090692 [doi]
    AB: The emergence and spread of drug resistance is a problem hindering malaria elimination in Southeast Asia. In this study, genetic variations in drug resistance markers of Plasmodium falciparum were determined in parasites from asymptomatic populations located in three geographically dispersed townships of Myanmar by PCR and sequencing. Mutations in dihydrofolate reductase (pfdhfr), dihydropteroate synthase (pfdhps), chloroquine resistance transporter (pfcrt), multidrug resistance protein 1 (pfmdr1), multidrug resistance-associated protein 1 (pfmrp1), and Kelch protein 13 (k13) were present in 92.3%, 97.6%, 84.0%, 98.8%, and 68.3% of the parasites, respectively. The pfcrt K76T, pfmdr1 N86Y, pfmdr1 I185K, and pfmrp1 I876V mutations were present in 82.7%, 2.5%, 87.5%, and 59.8% isolates, respectively. The most prevalent haplotypes for pfdhfr, pfdhps, pfcrt and pfmdr1 were 51I/59R/108N/164L, 436A/437G/540E/581A, 74I/75E/76T/220S/271E/326N/356T/371I, and 86N/130E/184Y/185K/1225V, respectively. In addition, 57 isolates had three different point mutations (K191T, F446I, and P574L) and three types of N-terminal insertions (N, NN, NNN) in the k13 gene. In total, 43 distinct haplotypes potentially associated with multidrug resistance were identified. These findings demonstrate a high prevalence of multidrug-resistant P. falciparum in asymptomatic infections from diverse townships in Myanmar, emphasizing the importance of targeting asymptomatic infections to prevent the spread of drug-resistant P. falciparum.
    FAU: ['Zhao, Yan', 'Liu, Ziling', 'Soe, Myat Thu', 'Wang, Lin', 'Soe, Than Naing', 'Wei, Huanping', 'Than, Aye', 'Aung, Pyae Linn', 'Li, Yuling', 'Zhang, Xuexing', 'Hu, Yubing', 'Wei, Haichao', 'Zhang, Yangminghui', 'Burgess, Jessica', 'Siddiqui, Faiza A', 'Menezes, Lynette', 'Wang, Qinghui', 'Kyaw, Myat Phone', 'Cao, Yaming', 'Cui, Liwang']
    AU: ['Zhao Y', 'Liu Z', 'Soe MT', 'Wang L', 'Soe TN', 'Wei H', 'Than A', 'Aung PL', 'Li Y', 'Zhang X', 'Hu Y', 'Wei H', 'Zhang Y', 'Burgess J', 'Siddiqui FA', 'Menezes L', 'Wang Q', 'Kyaw MP', 'Cao Y', 'Cui L']
    AD: Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. yzhao90@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zlliu87@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. dr.myatthusoe@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. lwang95@cmu.edu.cn. Department of Public Health, Ministry of Health and Sports, Nay Pyi Taw 15011, Myanmar. thannaingsoe@mohs.gov.mm. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. hpwei@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. ayethan1957@gmail.com. Myanmar Health Network Organization, Yangon 11211, Myanmar. pyaelinnag@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ylli88@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zhangxuexing@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ybhu@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. hcwei@cmu.edu.cn. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. zymh@cmu.edu.cn. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. jessicaburge@health.usf.edu. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. faiza@health.usf.edu. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. lmenezes@health.usf.edu. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. qhwang@cmu.edu.cn. Myanmar Health Network Organization, Yangon 11211, Myanmar. kyaw606@gmail.com. Department of Immunology, College of Basic Medical Sciences, China Medical University, Shenyang 110122, China. ymcao@cmu.edu.cn. Department of Internal Medicine, Morsani College of Medicine, University of South Florida, 3720 Spectrum Boulevard, Tampa, FL 33612, USA. lcui@health.usf.edu.
    AUID: ['ORCID: 0000-0001-9208-929X', 'ORCID: 0000-0003-4635-3701', 'ORCID: 0000-0002-8338-1974']
    LA: ['eng']
    GR: ['U19AI089672/National Institute of Allergy and Infectious Diseases/International']
    PT: ['Journal Article', 'Research Support, N.I.H., Extramural']
    DEP: 20190909
    PL: Switzerland
    TA: Genes (Basel)
    JT: Genes
    JID: 101551097
    RN: ['0 (Antimalarials)', '0 (Multidrug Resistance-Associated Proteins)', '0 (Protozoan Proteins)', 'EC 1.5.1.3 (Tetrahydrofolate Dehydrogenase)', 'EC 2.5.1.15 (Dihydropteroate Synthase)', 'Y49M64GZ4Q (multidrug resistance-associated protein 1)']
    SB: IM
    MH: ['Antimalarials/*pharmacology', 'Dihydropteroate Synthase/genetics', '*Drug Resistance, Multiple', 'Humans', 'Malaria/epidemiology/*parasitology', 'Multidrug Resistance-Associated Proteins/genetics', 'Myanmar', 'Plasmodium falciparum/drug effects/*genetics/pathogenicity', '*Polymorphism, Genetic', 'Protozoan Proteins/genetics', 'Tetrahydrofolate Dehydrogenase/genetics']
    PMC: PMC6770986
    OTO: ['NOTNLM']
    OT: ['*Plasmodium falciparum', '*asymptomatic infection', '*drug resistance genes', '*haplotypes', '*multidrug resistance']
    EDAT: 2019/09/12 06:00
    MHDA: 2020/01/17 06:00
    CRDT: ['2019/09/12 06:00']
    PHST: ['2019/08/02 00:00 [received]', '2019/08/31 00:00 [revised]', '2019/09/04 00:00 [accepted]', '2019/09/12 06:00 [entrez]', '2019/09/12 06:00 [pubmed]', '2020/01/17 06:00 [medline]']
    AID: ['genes10090692 [pii]', '10.3390/genes10090692 [doi]']
    PST: epublish
    SO: Genes (Basel). 2019 Sep 9;10(9). pii: genes10090692. doi: 10.3390/genes10090692.
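Dumping every field is noisy. Since each record yielded by Medline.parse behaves like a dictionary keyed by the field tags seen above (PMID, TI, AU and so on), you can pick out just the fields you care about. Here is a small sketch; `summarize_medline` is a hypothetical helper, and the sample dictionary is hand-copied from the output above with the author list shortened to three names.

```python
def summarize_medline(record):
    """Return a one-line citation from a Medline-style dict (hypothetical helper)."""
    authors = record.get("AU", [])
    lead = authors[0] if authors else "Unknown author"
    if len(authors) > 1:
        lead += " et al."
    date = record.get("DP", "")
    year = date.split()[0] if date else "n.d."
    return "%s (%s) %s %s [PMID %s]" % (
        lead, year, record.get("TI", ""), record.get("TA", ""),
        record.get("PMID", "?"))


# Values copied from the record printed above; the AU list is shortened.
sample = {
    "PMID": "31505774",
    "DP": "2019 Sep 9",
    "TI": ("Genetic Variations Associated with Drug Resistance Markers in "
           "Asymptomatic Plasmodium falciparum Infections in Myanmar."),
    "AU": ["Zhao Y", "Liu Z", "Soe MT"],
    "TA": "Genes (Basel)",
}

summary = summarize_medline(sample)
print(summary)
```

Using `.get()` with defaults matters in practice because not every record has every field.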
Fetching FASTA data from Entrez
If you specify rettype="fasta", the data comes back in FASTA format.
    handle = Entrez.efetch(db="nucleotide", id="1434742847", rettype="fasta", retmode="text")
    print(handle.read())

    >MF543506.1 Cypripedium calceolus voucher CYCAOL02-210813 maturase K (matK) gene, partial cds; chloroplast
    AATTATGTGTCAGATCTACTAATACCCCATCCCATCCATCTGGAAATCTTGGTTCAAATCCTGCAATGCT
    GGATCAAGGATGTTCCTTCTTTGCATTTATTGCGATTGCTTTTCCACGAATATCATTATTTTAATAGTCT
    CATTACTTCAAAAAAAAGCATTTACGCCTTTTCAAGAATAAAGAAAAGATTCCTTTGGTTCCTATATAAT
    TCTTATGTATATGAATGCGAATATCTATTCCATTTTCTTCGTAAACAGTCTTCTTATTTACGATCAACAT
    CTTCTGGAGTGTTTCTTGAGCGAACACATTTCTATGTAAAAATAGAACATCTTATAGTAGTGTGTTGTAA
    TTCTTTTCATAGGATCCTATGCTTTCTCAAGGATCCTTTCATGCATTATGTTCGATATCAAGGAAAAGCA
    ATTCTGGCTTCAAAGGGAACTCTTATTCTGATGAAGAAATGGAAATTTCATCTTGTTAATTTTTGGCAAT
    CTTATTTGCACTTTTGGTCTCAACCGTATAGGATCCATATAAAGCAATTATACAACTATTCCTTCTCTTT
    TCTGGGGTATTTTTCAAGTGTACTAGAAAATCATTTGGTAGTAAGAAATCAAATGCTAGAGAATTCATTT
    CTAATAAATATTATGACTAAGAAATTAGATACCATAGCCCCAGTTATTTCTCTTATTGGATCATTGTCGA
    AAGCTCAATTTTGTACTGTATTGGGCCATCCTATTAGTAAACCGATCTGGACCGATTTATCGGATTCTGA
    TATTCTTGATCGATTTTGCCGGATATGTAGAAATCTTTGTCGTTATCACAGCGGATCCTCAAAA
Building on this, you can write a script that downloads the corresponding FASTA file only when it does not already exist locally.
    import os
    from Bio import Entrez

    Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
    filename = "gi_186972394.fasta"
    if not os.path.isfile(filename):
        # Download only when the file is not already on disk.
        net_handle = Entrez.efetch(db="nucleotide", id="186972394",
                                   rettype="fasta", retmode="text")
        with open(filename, "w") as out_handle:
            out_handle.write(net_handle.read())
        net_handle.close()
        print("Saved")
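The same download-if-missing idea can be wrapped in a reusable function. The sketch below is one possible shape, not part of Biopython: `fetch_cached`, `fetch_from_entrez` and `fake_fetch` are names invented here, and the demo uses a made-up record and a temporary directory so that it runs without network access.

```python
import os
import tempfile


def fetch_cached(filename, fetch):
    """Return the file's text, calling fetch() only if the file is missing."""
    if not os.path.isfile(filename):
        text = fetch()
        with open(filename, "w") as out_handle:
            out_handle.write(text)
    with open(filename) as in_handle:
        return in_handle.read()


def fetch_from_entrez():
    """Real download; needs Biopython and network access (not called here)."""
    from Bio import Entrez
    Entrez.email = "A.N.Other@example.com"
    handle = Entrez.efetch(db="nucleotide", id="186972394",
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


# Offline demo: count how often the (fake) download actually runs.
calls = []


def fake_fetch():
    calls.append(1)
    return ">demo made-up record\nACGT\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    first = fetch_cached(path, fake_fetch)   # file missing: fetches and saves
    second = fetch_cached(path, fake_fetch)  # file present: reads from disk

print(len(calls))
```

For real use, call `fetch_cached("gi_186972394.fasta", fetch_from_entrez)`.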
Summary
- With Bio.Entrez you can do almost everything the search box on the NCBI website can do.
- You can also fetch records as FASTA and write them straight to a file.