Do-it-yourself Genetics: Running a Structure analysis

November 20, 2009

Thanks to the European Genetics and Anthropology Blog for this DIY Genetics experiment:

If you bought a genome-wide scan at 23andme and/or deCODEme, then you have access to your raw data, which gives you the option of going beyond the bio-geographic analyses offered by these companies. For example, you can use various programs to compare yourself to publicly available samples from around the world. Structure is one of the more popular tools for this sort of thing, so here’s a guide to setting up a quick analysis using Structure and a data sheet from Kosoy et al. 2009:


– Download the 2.3.2 Beta version of Structure, with the graphical front end, which makes things a lot easier.

– Extract the following 125 SNPs from your raw data. Actually, 128 are listed on that sheet, but only 125 are currently available at 23andme, although that’s not a problem.

– Convert the genotypes to integers, as per the instruction sheet above. For example, if you’re AG for rs731257, then convert that to 12 (i.e. A=1, G=2). The three missing SNPs, as well as any no-calls, should be listed as 55.

– Download the sample data sheet, and add yourself to it. Make sure you look exactly like all the other samples on there, so you’ll need to add the various tags that precede the genotypes. For example, instead of “EURA CEU CEPH1334.10 1”, try something like “EURA POL Myself 1”, or if you’re African American then maybe “AFR AME Myself 2”.

– Start Structure and load up the data sheet by going to “File” and then “New Project”. Fill in the necessary fields in the Project Wizard, such as: Number of individuals 639; Ploidy of data 1; Number of loci 128; Missing data value 55. Then tick the following boxes: “Row of marker names”, “Data file stores data for individuals in a single line”, “Individual ID for each individual”, “Putative population origin for each individual”, “USEOFPOPINFO selection flag”, and finally “Sampling location information”.

– Define the parameter set (i.e. go to “Parameter Set” and then “New”). The length of the burn-in period should be at least 10,000 and the number of MCMC reps about 50,000. Of course, to save time you can reduce both, especially if you’re not too worried about a bit of noise. On the other hand, if you want to minimize noise as much as possible, then go up to something like a burn-in of 100,000 and 500,000 reps. But be warned, runs like this can take days.

– Press the “!” button, specify the number of clusters (K) you’d like to divide the samples into, and click “OK”. Alternatively, you can let the program work its way through a range of K values: Project > Start a Job > pick the parameter set > specify the K range (for example, from K2 to K6) > press the “Start” button.
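The data preparation steps above (extracting the SNPs, converting genotypes to integers, and building your own row for the data sheet) can also be scripted instead of done by hand. Here is a rough Python sketch, not a tested tool. It assumes the raw data export is tab-separated with rsid, chromosome, position and genotype columns, that comment lines start with “#”, that no-calls appear as “--”, and that the data sheet is space-separated. The function names, the example rsids other than rs731257, and the position values are made up for illustration. Only the A=1, G=2 codes come from the example above, so the codes for C and T need to be filled in from the instruction sheet.

```python
import io

MISSING = "55"  # missing-data value used in this guide


def read_raw_genotypes(handle):
    """Parse a raw data export into {rsid: genotype}.

    Assumption: columns are rsid, chromosome, position, genotype,
    and lines starting with '#' are comments.
    """
    genotypes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 4:
            genotypes[fields[0]] = fields[3]
    return genotypes


def encode_genotype(genotype, allele_codes):
    """Turn a genotype such as 'AG' into an integer string such as '12'.

    SNPs absent from the raw data (None) and no-calls ('--') become MISSING.
    An allele with no code raises an error rather than being guessed.
    """
    if genotype is None or "-" in genotype:
        return MISSING
    unknown = [a for a in genotype if a not in allele_codes]
    if unknown:
        raise ValueError(f"no integer code for allele(s) {unknown} in {genotype!r}")
    return "".join(allele_codes[a] for a in genotype)


def build_row(labels, marker_order, genotypes, allele_codes):
    """Build one data-sheet line: the leading tags, then one code per marker."""
    codes = [encode_genotype(genotypes.get(rsid), allele_codes)
             for rsid in marker_order]
    return " ".join(list(labels) + codes)


# Only A=1 and G=2 are stated in the guide; add C and T from the instruction sheet.
allele_codes = {"A": "1", "G": "2"}

# Tiny made-up export: the positions and the second rsid are placeholders.
raw = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs731257\t7\t1000\tAG\n"
    "rs_hypothetical_1\t1\t2000\t--\n"
)

# Marker order must match the header row of the sample data sheet.
marker_order = ["rs731257", "rs_hypothetical_1", "rs_hypothetical_2"]

genotypes = read_raw_genotypes(raw)
row = build_row(["EURA", "POL", "Myself", "1"], marker_order, genotypes, allele_codes)
print(row)
```

With the real data sheet, marker_order would hold all 128 marker names in the same order as the sheet’s header row, and the printed line would be pasted in alongside the other samples. The three SNPs missing from the raw data then come out as 55 automatically.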

Here are my results at K4. Obviously, if you’re of overwhelmingly European origin, it’s unlikely you’ll get anything below 99% European/West Eurasian with these 125 markers. Much larger sets of SNPs are needed to get more detailed admixture estimates, and to break down the intra-West Eurasian and intra-European components.

Indeed, if you’re good with Excel and Access then it’s even possible to go up to something like 500,000 SNPs. HapMap and HGDP samples are available online, although the latter are presented in a somewhat different format from the 23andme raw data, which is a real pain because reconciling the two takes a lot of work. Also, there are other settings you can try to see how they affect the results, like turning on LOCPRIOR, which tells Structure the putative origins of the samples. You can use different data formats too, examples of which are shown on the Structure home page.

Roman Kosoy et al., Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America, Human Mutation 2009, Volume 30 Issue 1, Pages 69 – 78, doi: 10.1002/humu.20822

Hubisz M. J., Falush D., Stephens M., Pritchard J. K., Inferring weak population structure with the assistance of sample group information, Molecular Ecology Resources 2009, doi: 10.1111/j.1755-0998.2009.02591.x


