Friday, October 27, 2017

Big Changes to YFull

On October 9, I posted the Big Y Update from Family Tree DNA. The announcement stated that FTDNA was updating all Big Y results from the older human genome reference sequence, hg19, to the most recent and more accurate reference hg38. This means that YFull, which interprets Big Y results, would have to include references to both hg19 and hg38 and be ready to process new tests mapped to hg38.

A few days later, on October 13, I wrote What are the benefits of YFull? Since then, YFull has not only added the hg38 conversion but has also added more tools to make its interpretation service more valuable. A lot has happened in less than two weeks!

Understanding Next Generation Sequencing

To take full advantage of the new enhancements, it is helpful to understand a little about how Next Generation Sequencing tests, like the Big Y, are processed. As you may remember, your DNA consists of two strands coiled into a double helix. The strands run in opposite directions. One is called the forward strand, and the other is the reverse strand. The strands are connected by base pairs (bp), which are made up of the As, Cs, Gs, and Ts that form your DNA sequence. All of these can be numbered to show their position on the chromosome.

During the testing process, your DNA is not read in one continuous stretch. Instead, your DNA is broken into random fragments. The test then reads these fragments from each end. Some positions are read many more times than others. For example, one position may have been read two times, and another 56 times. Unfortunately, not all of the reads may give the same result. So a position that was read 92 times with the same result each time will be reported as a high quality SNP, while one read five times with differing results will be considered a much less reliable SNP. We will examine one of these "low quality" SNPs below.

After all the fragments are read, they must be reassembled, mapped to the human genome reference sequence, and given a precise location. Differences between your DNA results and the reference sequence are then reported. The human genome reference sequence is continually improving. Big Y results were formerly compared against the human genome reference sequence known as hg19 (Build 37). They are now compared against hg38 (Build 38), and many of the position numbers have changed.

So let's put this basic knowledge into practice.

Updates to YFull

In my October 13 blog post What are the benefits of YFull? I showed the following image of Novel SNPs from my YFull results:

Less than two weeks later, the same page looked like this:

Comparing the new version of my Novel SNPs screen with the previous version above, you will notice three changes. First, the first line, which contained position 7285772, now says 7285772 - 7417731 Hg38. Next, there is a blue BAM icon at the far right of every line. Finally, on the sixth line there is now an orange check mark next to the letter G.

hg19 and hg38

The new screen shows both hg19 and hg38 positions. The first number is the hg19 position, and the second is the hg38 position. Even though the hg38 positions are shown, this does not mean that my results were mapped to hg38. We will see later how we can tell that my results are mapped to hg19. However, YFull will be accepting new FTDNA hg38 results.
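If you copy these combined labels into a spreadsheet or script, it can help to split them into their two coordinates. Here is a minimal Python sketch. The function name split_positions is my own invention, and the label format is assumed from the single example on my screen, not from any YFull documentation.

```python
import re

def split_positions(label):
    """Split a label like '7285772 - 7417731 Hg38' into (hg19, hg38) integers."""
    # Assumed format: hg19 position, a dash, hg38 position, then the text 'Hg38'.
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*hg38\s*", label, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized position label: {label!r}")
    return int(match.group(1)), int(match.group(2))

hg19_pos, hg38_pos = split_positions("7285772 - 7417731 Hg38")
print(hg19_pos, hg38_pos)  # 7285772 7417731
```

The gap between the two numbers is not the same everywhere on the chromosome, so a position should be looked up, not converted by adding a fixed amount.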

Browse Raw Data

On the right of the Novel SNP screen are blue BAM links so that I can view any of these positions in my BAM file. I will be examining one of my low quality SNPs. We are now viewing the "Low qual" tab. I want to see why the second line in the image below (position 23096690) is considered to be low quality.

Before we use the BAM link, start by clicking the magnifier button at the left of position 23096690. The yellow magnifier is a link to view information about that position in a tabular format. It is the identical information that is obtained from the "Browse raw data" link in the gray menu on the left of the screen.

Here is what you see when you click the yellow magnifier or the Browse raw data link.

In the table above, after the chromosome positions you find a line for "Reads." Here the Reads are reported as 5. This means that position 23096690 was read five times in my Big Y test. The next line is Position data: 4T, 1G, which tells us that four of the reads indicated that I had a T in this position, and one read indicated that I had a G. This gives me a probability of error of 0.28, which is why this SNP is considered to be low quality.
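To make the Position data line concrete, here is a small Python sketch that tallies the reads from a string such as "4T, 1G". The function name tally_reads and the string format are assumptions based on my own table. The sketch does not reproduce the probability of error reported by YFull, because I do not know how that figure is calculated.

```python
from collections import Counter

def tally_reads(position_data):
    """Turn a string like '4T, 1G' into a Counter such as {'T': 4, 'G': 1}."""
    counts = Counter()
    for item in position_data.split(","):
        item = item.strip()
        # Each item is assumed to be a count followed by one base letter.
        count, base = int(item[:-1]), item[-1].upper()
        counts[base] += count
    return counts

counts = tally_reads("4T, 1G")
total = sum(counts.values())
majority_base, majority_count = counts.most_common(1)[0]
print(total, majority_base, majority_count / total)  # 5 T 0.8
```

Four of five reads agreeing sounds good, but with only five reads in total a single disagreement carries a lot of weight.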

The next two lines give very important information. My Sample allele was a T in this position (this is called the derived value). The Reference allele (hg19) was a G, so at position 23096690 the human genome reference sequence had a G (which is called the ancestral value). The Reference (hg19) allele also indicates that my sample was previously mapped to the old hg19 reference, not to the new hg38 reference.

Using the BAM browser

Now, knowing the information from the Browse raw data table, let's click on the new BAM link to view this information in a Y-chromosome browser.

You will be brought to the Y-chromosome browser screen below:

At the top of the screen you will see the range of base pairs shown in your browser. Here the range is from base pair (bp) 23096615 - 23096765. My SNP 23096690 will be shown within this 150 bp range. After the range is a list of browser styles: compact, 1, 2, 3, 4, 5, and 6. You are seeing the compact version on this first screen. This compact style doesn't seem to be very informative unless you remember that YFull always uses the color green for A, blue for C, orange for G, and red for T. So you don't even need to see the letters to know that I matched the reference sequence except for a few calls for T where the ancestral value was G.
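The range and the color code can be written down in a few lines of Python. This is only a sketch: the 75 bp on each side of the SNP is inferred from this one screen and is not a documented YFull setting, and the names BASE_COLORS and browser_window are my own.

```python
# Base colors as described for the YFull browser in this post.
BASE_COLORS = {"A": "green", "C": "blue", "G": "orange", "T": "red"}

def browser_window(position, flank=75):
    """Return the (start, end) range shown around a position."""
    # flank=75 is inferred from the one screen shown here, not a documented setting.
    return position - flank, position + flank

start, end = browser_window(23096690)
print(start, end)          # 23096615 23096765
print(BASE_COLORS["T"])    # red
print(BASE_COLORS["G"])    # orange
```

In the compact style, then, my derived T calls appear as red marks in a column where the ancestral G would be orange.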

After seeing the compact version of the SNP browser, we will click on each style to see the differences.  Here is Style 1:

Style 2 is much more informative, showing the five reads on the forward and reverse strands. Family Tree DNA in its new Y-chromosome browser color codes the forward and reverse strands. YFull uses capital letters or lower case letters on each strand to distinguish between forward and reverse. Remember that in the Browse raw data table above, position 23096690 had five reads: four were T (derived) and one was G (ancestral). Here you see five lines representing the five times that position 23096690 was read. The four derived Ts are highlighted in yellow.

Style 3 is similar to style 2 with the four Ts again highlighted in yellow.

Style 4 very dramatically shows where my test result differed from the reference sequence.

Style 5 is an enlarged image of the compact version style.

In Style 6 it is very difficult to find where my results differ from the reference sequence.  I might only use this if I were to conduct a "Find the SNP" contest with someone!

Here is Style 6 again with the arrow pointing to the column where my four Ts are different from the one ancestral G.

I find only Styles 2, 3, and 4 to be useful to me, but it's all a matter of preference.

The little check mark is big news

Now that we've seen the new hg38 reference additions and examined a SNP in the new BAM Y-chromosome browser, let's see the final addition, which is the tiny orange check mark. This could be easily ignored, but it may be the most exciting change of all. If we click the orange check mark, we can see that this SNP has been tested by Sanger sequencing at YSeq. Sanger sequencing is a method by which we can verify the validity of a SNP. YSeq is a company that will conduct Sanger sequencing on your SNPs at a reasonable cost. One person was tested for this SNP at YSeq, but his result was negative (meaning he has the ancestral value at this position).

On the Anthrogenica forum, user REWM posted the following image:

Instead of an orange check mark, this one is green. This SNP has also been tested by YSeq using Sanger sequencing, but in this case nine people were tested for the SNP. Two of them were positive, meaning they had a derived value in this position. The green check mark indicates that a SNP has been found in this position using Sanger sequencing.
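My reading of the two check mark colors can be summarized in a short Python sketch. The rule is inferred only from the two examples above (one tested with none positive gives orange; nine tested with two positive gives green). It is not an official YFull definition, and the function name check_mark is hypothetical.

```python
def check_mark(tested, positive):
    """Return the check mark color suggested by the two examples in this post."""
    if tested < 0 or positive < 0 or positive > tested:
        raise ValueError("counts must satisfy 0 <= positive <= tested")
    if tested == 0:
        return None  # no Sanger results, so no check mark
    # Assumption: any positive result turns the mark green.
    return "green" if positive > 0 else "orange"

print(check_mark(1, 0))  # orange
print(check_mark(9, 2))  # green
```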

These new SNP verification notices are wonderful news because we can now prove whether any of our questionable SNPs are valid and show the results on YFull. Many ambiguous quality and low quality SNPs have been proven to be genuine SNPs with Sanger sequencing. Here's how to take full advantage of this:
  • Submit your novel SNPs (including any of your best, acceptable, ambiguous, and low quality SNPs) to YSeq through their Wish a SNP program. Use the hg38 position numbers. It will cost you one dollar per SNP. 
  • YSeq will let you know whether each SNP qualifies for Sanger Sequencing, and if so, they will make all qualifying SNPs available for testing. If you want to test several SNPs, you can then design your own SNP panel through "Wish a Panel" to bring the cost down.
  • Order your test and submit a DNA sample to YSeq to verify any of your SNPs. 
  • In addition to receiving your results from YSeq, the validation will then appear on YFull.  

Thomas Krahn of YSeq has stated that YSeq and YFull are examining ways to better integrate their systems.  This means there are a lot more changes in store!

Consider submitting your BAM file to YFull

I have heard some people say that you don't "need" to submit your results to YFull because you can get good interpretations from other services, including haplogroup administrators. This is true: you don't "need" to submit your results to anybody. But YFull's services are hard to beat, and they just keep getting better. YFull has many useful tools, including SNP dating, reporting of about 500 STRs, and more. I examined a few of these in What are the benefits of YFull? In addition, many scientific studies rely on information obtained by consulting YFull. One of the many reasons I submit results to YFull is that if scientists want to discover more about my general branch of the human Y tree, I want my specific branch to be a part of it.

As more test results are submitted to the database, the interpretations are getting even better. Please consider submitting your results, too.

Update: See Great updates to YFull


Karim said...

This was an excellent blog post. I've benefited much from reading your series on BigY and YFull. Keep it coming.

Unknown said...

How can I privately email you about my FTDNA admin project?

Linda Jonas said...

Use the new Contact Form on this page. I look forward to hearing from you.

Martin Prather said...

How does one subscribe to your posts?

Linda Jonas said...

Martin, To subscribe, look at the links on the right side of the screen. Select either "Follow by email" or "Subscribe to Ultimate Family Historians." If you select Subscribe click Posts and choose one of the feed readers.

Thank you!

James F said...

Very well written

WayneK said...

You are correct in that people don't "have" to submit their results to YFull. YFull is filling a need for a number of haplogroup projects which never developed the basic comparison technology on their own. Those groups which have the technology and a larger number of analyzed submissions lower the value provided by YFull. There is very little incentive to pay for a service which doesn't add additional results to what is provided via the free services.

Some of the analysis software is open source and available for other haplogroup projects to get off of the YFull analysis treadmill.

Linda Jonas said...

My brother's haplogroup is one of those that has excellent analysis, and it is absolutely free. James Kane uses the BAM files and has been mapping to hg38 for a long time. He places people on a tree, does SNP dating, etc. Yet I still submitted to Full Genomes Corp and to YFull. I want those results in as many places as possible. This is the same reason that I take DNA tests from more than one DNA company, and why I have more than one online family tree. I want as many people as possible to find these results and to be able to make matches. For me, it's more than just getting an analysis. It's important to help others (including scientists) find the analysis. I encourage people to submit their results to as many places as they can.

Ahmad said...

Thank you for this wonderful information.

When talking about SNP features, we hear these terms: Low quality, Unstable, and Poor coverage.

What are the definitions of these?

I have 19 SNPs not reported by FTDNA, but when I submitted to YFull, they were categorised as ambiguous quality. The average number of reads is 2. I did ask YSEQ to make them available for testing. My question is, are they reliable in creating more branches in my family tree?