Saturday, November 4, 2017

Evaluating new Big Y changes


Big Y conversion in process



I have been seeing many questions regarding changes to Family Tree DNA's Big Y results or the inability to access them. Here we will see why some people cannot access results, why others may be frustrated at their new results, and what we can do to understand what we have so far. 

I will be referring to my brother's Big Y test results as "my" results.


Background to Big Y testing



Let's review a little about how the Big Y test is processed. As you may remember, your DNA consists of two strands of DNA coiled into a double helix. The strands run in opposite directions.  One is called the forward strand, and the other is the reverse strand. The strands are connected by base pairs (bp) which are the As, Cs, Gs, and Ts that form your DNA sequence. All of these can be numbered to show their position on the chromosome.




During the testing process, your DNA is not read in one continuous stretch. Instead, your DNA is broken into random fragments. The test then reads these fragments from each end. Some fragments are read many more times than others. For example, one of your fragments may have been read two times, and another 56 times. Unfortunately, not all of the reads may give the same result. So a variant that was read consistently many times will be reported as a high quality SNP, while one read only a few times with conflicting results will be considered a much less reliable SNP.
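To make the idea of read depth concrete, here is a minimal sketch in Python that stacks up a handful of reads and counts what was seen at each position. The reads, positions, and function name are invented for illustration; they are not real Big Y data, and this is not how FTDNA actually processes results.

```python
from collections import Counter, defaultdict

# Hypothetical toy reads: (start position, bases read). Not real Big Y data.
reads = [
    (100, "ACGTAC"),
    (102, "GTACGG"),
    (103, "TACGGA"),
    (103, "TTCGGA"),  # disagrees with the other reads at position 104
]

def pile_up(reads):
    """Count which bases were read at each position covered by the reads."""
    columns = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            columns[start + offset][base] += 1
    return columns

columns = pile_up(reads)
for position in (100, 104):
    counts = columns[position]
    depth = sum(counts.values())
    print(position, "was read", depth, "time(s):", dict(counts))
```

In this toy example position 100 is covered by only one read, while position 104 is covered by four reads that do not all agree. Real results have the same two problems: uneven depth and conflicting reads.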

After all the fragments are read, they must be reassembled, mapped to the human genome reference sequence, and given a precise location. Differences between your DNA results and the reference sequence are then reported. 

The human genome reference sequence is continually improving. Big Y results were formerly compared against the human genome reference sequence known as hg19, which corresponds to Build 37 (GRCh37).



What happened to my Big Y results?


When  FTDNA began the conversion process, all haplogroup designations were rolled back to what they were before Big Y testing.  In the image below my results are at the bottom with Cairns and another Thompson. Before the Big Y, I had tested the single SNP R-DF13, and Cairns and the other Thompson had tested positive for R-FGC11134.  




  
Our former Big Y results disappeared.



Big changes to Big Y



For the past several weeks, Family Tree DNA has been remapping all Big Y results to the most recent human genome reference sequence, hg38. This means that many of the SNP position numbers have changed. FTDNA is also adding a new Y-chromosome browser and a new matching system.

The remapping to hg38 will lead to more accurate identification of SNPs and even the discovery of new SNPs. The Y-chromosome browser will show more information about the test results, and the new matching system will lead to more accurate matches.

We will examine new SNP discoveries and the new Y-chromosome browser. We will not look at the new matching system because not all Big Y tests have been converted.



Former Big Y results


In the previous version of the Big Y results page, there were three tabs: Known SNPs, Novel Variants, and Matching. Your novel variants are newly-discovered SNPs that have not yet been seen by Family Tree DNA in any other tester. Family Tree DNA doesn't name SNPs until they are shared, so these novel variants are identified by their position number on the Y-chromosome.

In the former Big Y version I had three novel variants. The position numbers were based on the old hg19 human reference sequence.




Not all Big Y tests have been converted


My Big Y conversion is now complete. My haplogroup designation has changed, but the haplogroups for Kits N116392 and 34484 have not. All three of us ordered the Big Y, but their conversions have not yet been completed, so they do not yet show up on my list of matches.





New SNP discoveries in Big Y Tests


On the new Big Y results page, the tab names have changed.  Instead of Known SNPs and Novel Variants, they are now called Named Variants and Unnamed Variants. The named variants are shared with others; the unnamed variants are, so far, unique to you.




When I click the Unnamed Variants tab, I can see the new Y-chromosome browser at the top. I now have six unnamed or novel variants instead of three as shown in the previous report. The position numbers have all changed. These position numbers are based on the hg38 human reference sequence:





Family Tree DNA does not list the old hg19 position numbers, so I can't tell from the table which variants were previously reported and which are new. I had to use other resources to convert them.

hg38 position = hg19 position

11321844 = 13477520

11514480 = 13670156

11649109 = did not exist

12144610 = 14265316

19139783 = 21301669

56831461 = 58977608

Positions 11514480, 12144610, and 19139783 appeared in my former Novel Variants table as positions 13670156, 14265316, and 21301669. I had submitted my hg19 Big Y results to Full Genomes Corp (FGC) and to YFull. Position 11321844 did not appear in my original Big Y Novel Variants table, but it was recognized as a SNP and named by Full Genomes Corp. So these four Unnamed Variants, 11514480, 12144610, 19139783, and 11321844, appear to be genuine.

Positions 11649109 and 56831461 were not previously recognized as new SNPs by FTDNA, FGC, YFull, or any other analyst. These two need to be examined.
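The sorting I did by hand can be sketched in a few lines of Python. The position numbers come from my conversion table and my former Novel Variants table; the variable and function names are made up for this example.

```python
# hg38 -> hg19 positions, copied from the conversion table above.
# None marks the position that did not exist in hg19.
HG38_TO_HG19 = {
    11321844: 13477520,
    11514480: 13670156,
    11649109: None,
    12144610: 14265316,
    19139783: 21301669,
    56831461: 58977608,
}

# hg19 positions that appeared in my former Novel Variants table.
FORMER_NOVEL_HG19 = {13670156, 14265316, 21301669}

def split_by_history(hg38_to_hg19, former_hg19):
    """Separate hg38 positions into previously reported and newly reported."""
    previously, new = [], []
    for hg38, hg19 in hg38_to_hg19.items():
        (previously if hg19 in former_hg19 else new).append(hg38)
    return previously, new

previously_reported, newly_reported = split_by_history(HG38_TO_HG19, FORMER_NOVEL_HG19)
print("Previously reported by FTDNA:", previously_reported)
print("New in the hg38 report:", newly_reported)

# How far each position moved between the two reference sequences.
offsets = {hg38: hg19 - hg38 for hg38, hg19 in HG38_TO_HG19.items() if hg19 is not None}
for hg38, shift in offsets.items():
    print(hg38, "shifted by", shift)
```

The shifts are not all the same (2155676 for the first two positions, but 2120706, 2161886, and 2146147 for the others), so you cannot convert positions by subtracting one fixed number. A proper coordinate conversion tool is needed.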


Why do I have two new SNPs that were not previously recognized?


Position 11649109: I have submitted all of my unnamed SNPs to YSeq so that they can be verified by Sanger sequencing. YSeq accepted all but one: they rejected position 11649109 because it was located in a "high repetitive region." 

Position 56831461: FGC previously reported that in my results, this was a low quality SNP and that it had been seen in two prior scientific studies.


Mapping to the hg19 human reference sequence


Although FTDNA and FGC give all SNPs a quality rating, YFull shows the specific reason for their rating. Here is what was reported about position 56831461 in my YFull hg19 analysis:




The above screen states that position 56831461 was formerly called position 58977608. It was read 53 times in my Big Y test. 34 times I had a T reported for this position, and 19 times I had a C reported (which is the ancestral value). Because the reads disagreed so often, this SNP was rejected by all analysts.

If this position had been identified as a valid SNP by YFull, I would have been able to see it in their chromosome browser where it would have had 53 segments aligned: 34 segments would have had Ts, and 19 would have had the ancestral value. The chromosome browser would have appeared similar to the image below. Here the cursor is pointing to a position that had been read seven times; six of them showed an A in this position, and one showed a C.




If any of the above segments had been misaligned to the less accurate hg19 human reference sequence, the new hg38 reference sequence would show a different result.


Mapping to the hg38 human reference sequence


After the new mapping to the hg38 reference sequence, Family Tree DNA now rates position 56831461 as a high quality SNP. FTDNA's new Y-chromosome browser indicates that this position was read 18 times, and all of them were T. 




Perhaps some of the previous segments were more correctly mapped to a different location. We do not yet have access to the new BAM files, so we can't compare these results to the old BAM files to see why these changes occurred.
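The difference between the two mappings is easier to see as a fraction of reads. This is a small Python sketch using only the read counts quoted above for this position; the function name is my own, and no analyst's actual quality formula is implied.

```python
def derived_fraction(derived_reads, total_reads):
    """Share of reads at a position that show the derived value."""
    return derived_reads / total_reads

# Read counts for the same SNP under each reference sequence.
hg19_fraction = derived_fraction(34, 53)  # hg19 position 58977608: 34 T out of 53 reads
hg38_fraction = derived_fraction(18, 18)  # hg38 position 56831461: 18 T out of 18 reads

print(f"hg19 mapping: {hg19_fraction:.0%} of reads were T")
print(f"hg38 mapping: {hg38_fraction:.0%} of reads were T")
print("Reads no longer aligned here:", 53 - 18)
```

Under hg19 only about 64% of the reads showed a T, while under hg38 every remaining read does. The 35 reads that are no longer aligned at this position are the ones that would be worth examining once the new BAM files are available.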

In FTDNA's new Y-chromosome browser the forward and reverse strands are color-coded. Notice also that in this browser image at position 56831423 the calls are identified with a blue color and no letter. If you click on the blue column you can see that this column is not a change from one base to another. These are insertions that appear in my sequence.


SNP may not be novel at all


If I look at position 56831461 at YBrowse.org, I see that it has been given two previous SNP names. This agrees with the FGC report that this SNP had been seen twice.  



So this SNP may not be truly novel, but is only a new SNP in the FTDNA database.


How can we verify that any new SNPs are genuine?

  • See if any of the current Unnamed Variants are shared by other testers when the Big Y conversions are finished.
  • Have your results further analyzed. For example, YFull has announced that they will convert any previously-submitted kit from the hg19 reference sequence to the hg38 reference sequence and provide a new analysis for only $15. I will definitely be ordering that.
  • Have your novel SNPs verified by Sanger sequencing. The least expensive way to do this is to submit each new SNP to YSeq using "Wish A SNP." Use the hg38 position numbers. Then order a test at YSeq for your new SNPs and submit a DNA sample.


Using the new Big Y chromosome browser for Named Variants


In the Named Variants table, the SNPs are listed in alphabetical order. In the image below the first SNP is A1207.




The Reference Column contains the ancestral value from the hg38 Human Genome Reference Sequence. The Genotype column shows my derived value.

If you click on the name of any SNP, you will be taken to the Y-chromosome browser. I clicked on A1207:




SNP A1207 is shown at position 10631919. When I click anywhere in that column a black reference box will appear to the right of the column. This box tells me that at position 10631919 the reference sequence has a G, and I have a T. It does not tell me how many times this position was read, but we can count down the column to find out. Move the scroll bar at the bottom of the chromosome browser all the way to the right to access the vertical scroll bar.




You can also zoom out in your Internet browser to see all segments at once.




We can count the number of segments in the column for Position 10631919. According to the chromosome browser this position was read 34 times, and all reads showed a T in my results. 


"Derived" vs "Mismatch"


It appears in the above chromosome browser that I have several SNPs in the same short region. I can click on any of these locations in the browser to get more information.  But if I click in the column for position 10631929 (ten positions to the right of 10631919), I notice that the Type does not say "Derived"; it says "Mismatch".  




The only column that has the notation "Derived" is the column for position 10631919. It is also the only column for which we can determine the SNP name (A1207). 

What does "Mismatch" mean?  Looking at the browser, position 10631929 sure looks like a genuine SNP, but that position is not on my list of Unnamed Variants.  So if it's a real SNP, it must be in my list of Named Variants.  Unfortunately, the chromosome browser only has the position numbers, and the Named Variants table only has the SNP names.

I wish all the tools were in one location so that this was not such a cumbersome process, but we currently need to use multiple tools for our evaluations. I can use YBrowse.org to look up known SNPs and find out more about them. YBrowse indicates that position 10631929 has been named BY23083.





Do I have SNP BY23083? When I go back to the Named Variants in my Big Y Results, I can enter BY23083 in the SNP Name Search Box.




This SNP immediately shows up in my list of Named Variants:





If I click that SNP name, I will be taken back to the Y-chromosome browser. Clicking anywhere in the 10631929 column, we see that now the Type is Derived instead of Mismatch. 




When we looked at SNP A1207, position 10631919, the black information box indicated that this position was "Derived." On the screen for position 10631929 (SNP BY23083), position 10631919 (SNP A1207) is now listed as "Mismatch."




As we can see from the above browser images, only the SNP that is named at the top of each browser screen will be listed as "Derived." All others on that screen will be listed as "Mismatch."
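The behavior I observed can be written as a simple rule. This Python sketch is only my model of what the browser appears to do, based on the two SNPs examined above; the names are invented, and it is not FTDNA's actual code.

```python
# SNP names and hg38 positions for the two SNPs examined above.
NAMED_VARIANTS = {"A1207": 10631919, "BY23083": 10631929}

def info_box_type(column_position, selected_snp):
    """Type shown for a column that differs from the reference,
    given the SNP that was clicked in the Named Variants table."""
    if column_position == NAMED_VARIANTS[selected_snp]:
        return "Derived"
    return "Mismatch"

for selected in NAMED_VARIANTS:
    for name, position in NAMED_VARIANTS.items():
        print(f"Opened {selected}: column {position} ({name}) shows",
              info_box_type(position, selected))
```

In other words, "Mismatch" does not mean the SNP is doubtful. It only means that the column is not the one you opened the browser to look at.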


Evaluation of New Big Y results


It is still too early to tell the full impact of the Big Y conversion because we can't yet compare all of the people who match us, and we don't have access to the BAM files. The SNP names, hg38 position numbers, and hg19 position numbers are not fully cross-referenced, so understanding the recent changes can be frustrating.

But this change has huge potential. We should soon be discovering new SNPs, learning more about them, and finding more accurate matches. We won't see any changes to the current system until after all the results are processed, but we can make recommendations a few at a time.


Suggested improvements to Big Y Results


Although we will have many suggestions in the near future, here are a few that can make the current results easier to use.


Unnamed Variants table:

  • Include hg38 and hg19 position numbers



Named Variants table:

  • Include the SNP name and its hg38 position



Y-chromosome browser:

  • Include SNP names and positions in the black information boxes 


Family Tree DNA has stated its commitment to making our results easier to evaluate.  I look forward to seeing how much we will soon learn!

6 comments:

Carl Oehmann said...

Linda,
Can you tell me how you got to the chromosome browser display on YFull? I have tried both Firefox and Chrome browsers but can only find the SNP information display.
Thanks

Linda Jonas said...

The YFull chromosome browser is available for your novel SNPs by clicking the BAM button. See https://ultimatefamilyhistorians.blogspot.com/2017/10/big-changes-to-yfull.html

Michael Clarke said...

Very informative, thank you.
Can you explain what were the "other resources" you used to convert the position numbers?

Linda Jonas said...

I used the LiftOver utility at https://genome.ucsc.edu/cgi-bin/hgLiftOver

WayneK said...

In your examples you highlight "SNPs" identified in the 10M range. The 10-millions are proximal to or inside the centromere. The majority of these reads should never have been mapped to the Y. You would appear to be looking at results which are not relevant. Once FTDNA understands their mistake you should expect to see those scrubbed from results.

FTDNA reports darn near everything as "HIGH". Take that rating with a spoonful of salt.

Linda Jonas said...

Yes, everything was rated as HIGH. That's why I like YFull--they report even low quality SNPs, and everything is online.