Friday, October 06, 2017

Notes on finding georeferenced sequences in GenBank

Notes on how many georeferenced DNA sequences there are in GenBank, and how many could potentially be georeferenced.

BCT	Bacterial sequences
PRI	Primate sequences
ROD	Rodent sequences
MAM	Other mammalian sequences
VRT	Other vertebrate sequences
INV	Invertebrate sequences
PLN	Plant and Fungal sequences
VRL	Viral sequences
PHG	Phage sequences
RNA	Structural RNA sequences
SYN	Synthetic and chimeric sequences
UNA	Unannotated sequences

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=nucleotide	search the nucleotide database
&term=ddbj embl genbank with limits[filt]
NOT transcriptome[All Fields]	ignore transcriptome data
NOT mRNA[filt]	ignore mRNA data
NOT TSA[All Fields]	ignore TSA
NOT scaffold[All Fields]	ignore scaffolds
AND src lat lon[prop]	include records that have the source feature "lat_lon"
AND 2010/01/01:2010/12/31[pdat]	restrict to this publication date range
AND gbdiv_pri[PROP]	restrict search to the PRI division (primates)
AND srcdb_genbank[PROP]	needed if we query by division, see NBK49540
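
As a rough illustration, the query above could be assembled and run from Python along the following lines. This is only a sketch: the function names are hypothetical, and the `retmode=json` and `rettype=count` parameters (asking ESearch for just a count in JSON) are additions that are not part of the query listed above. The JSON field names read in `fetch_count` are likewise an assumption about the response layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(division: str, year: int) -> str:
    """Assemble the Entrez query term for one GenBank division and one year."""
    return (
        "ddbj embl genbank with limits[filt]"
        " NOT transcriptome[All Fields]"   # ignore transcriptome data
        " NOT mRNA[filt]"                  # ignore mRNA data
        " NOT TSA[All Fields]"             # ignore TSA
        " NOT scaffold[All Fields]"        # ignore scaffolds
        " AND src lat lon[prop]"           # records with a lat_lon source feature
        f" AND {year}/01/01:{year}/12/31[pdat]"
        f" AND gbdiv_{division.lower()}[PROP]"
        " AND srcdb_genbank[PROP]"         # needed when querying by division
    )


def build_url(division: str, year: int) -> str:
    """Build the full ESearch URL (hypothetical helper)."""
    params = {
        "db": "nucleotide",
        "term": build_term(division, year),
        "retmode": "json",   # assumption: ask for JSON rather than the default XML
        "rettype": "count",  # assumption: only the hit count is wanted
    }
    return ESEARCH + "?" + urlencode(params)


def fetch_count(division: str, year: int) -> int:
    """Run the search and return the number of matching records."""
    with urlopen(build_url(division, year)) as response:
        payload = json.load(response)
    # Assumed response layout: {"esearchresult": {"count": "..."}}
    return int(payload["esearchresult"]["count"])
```

Calling `fetch_count("PRI", 2010)` would then query the primate division for 2010; looping over the division codes and years gives one count per table cell. NCBI rate-limits E-utilities requests, so a short pause between calls is advisable.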

Number of nucleotide sequences in GenBank that have a latitude and longitude, by year and division.

Date	PRI	ROD	MAM	VRT	INV	PLN
2010/01/01	412725529551926927174
2011/01/01	3711204816017657784947968
2012/01/01	658034214216968406027314
2013/01/01	297349761107647041123435
2014/01/01	1529044761145986807614018
2015/01/01	17452719831784336353835501
2016/01/01	58261512631489875789322813
2017/01/01	93817581017107127506628180

Number of nucleotide sequences in GenBank, by year and division, that lack a latitude and longitude but do have the country field and hence could be georeferenced.

Date	PRI	ROD	MAM	VRT	INV	PLN
2010/01/01	666026545534326666257756692
2011/01/01	399832666210337177401598664
2012/01/01	5377559072835533286945103379
2013/01/01	1092848058013663736971995817
2014/01/01	9727349267515991377816135372
2015/01/01	89226774139646057885867167337
2016/01/01	64303384108606223895711145111
2017/01/01	11474352049124115991219109747
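
The query behind this second set of counts is not spelled out in these notes. A plausible variant, sketched below, negates the "lat_lon" property and requires a country instead. Note that `src country[prop]` is an assumption made by analogy with `src lat lon[prop]`, and the function name is hypothetical.

```python
def build_country_only_term(division: str, year: int) -> str:
    """Entrez term for records that have a country but no lat_lon (a sketch)."""
    return (
        "ddbj embl genbank with limits[filt]"
        " NOT transcriptome[All Fields]"
        " NOT mRNA[filt]"
        " NOT TSA[All Fields]"
        " NOT scaffold[All Fields]"
        " NOT src lat lon[prop]"    # exclude records that already have coordinates
        " AND src country[prop]"    # ASSUMPTION: property name guessed by analogy
        f" AND {year}/01/01:{year}/12/31[pdat]"
        f" AND gbdiv_{division.lower()}[PROP]"
        " AND srcdb_genbank[PROP]"
    )


# Example: the term for rodents in 2015
print(build_country_only_term("ROD", 2015))
```

The resulting string would be passed as the `term` parameter to the same `esearch.fcgi` endpoint with `db=nucleotide`.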