Thursday, May 17, 2018

Fine Scaling Genetic Admixture: 23andMe vs AncestryDNA



Some genetic genealogists and genealogists pan genetic ancestry or ethnicity admixture estimates. They see them as not useful and wouldn't mind if ethnicity prediction were discontinued. On the other hand, I see genetic ancestry estimates as part of the genealogy cycle of life.

I understand that matching to genetic relatives is a natural progression from traditional genealogy methodology and practices, but genetic ancestry estimates are a natural extension of DNA relative matching and both can be used synergistically to achieve optimal results.

I'm sure by now you've seen or heard about 23andMe's new Recent Ancestor Locations feature, added to its Ancestry Composition tool, or last year's AncestryDNA Genetic Communities, included with its Ethnicity Estimates. Each touts unprecedented granularity in genetic ancestry estimates and the addition of over 100 regions or reference populations.

Thus 23andMe ($99 US; $199 w/Health) and AncestryDNA ($99 US) have figured out a way to make their ethnicity admixture tools more relevant by "marrying" genetic ancestry to our genetic relatives! Both DNA testing market leaders have been able to achieve this matrimony by "fine-scaling" their respective ethnicity admixture offerings, which I will review for this blog. 

  • PINK ELEPHANT IN THE ROOM: On May 11, 2018, 23andMe hit AncestryDNA with a patent infringement and false advertising lawsuit [see here]. Based upon the bones of the case, my analysis may seem to implicate a party. However, that is unintentional, as I had been working on this blog before the lawsuit was filed. I hope the legal issue is amicably resolved between the parties.
What is fine-scaling?
Fine-scaling is usually performed by population geneticists for disease mapping studies and study of population structure in humans "to reveal the history of migration and divergence that shaped our species diversity." [see Novembre et al]

In the same manner, 23andMe and AncestryDNA utilize this approach to fine-scale our genetic or "ethnicity" admixture. Both DNA companies' fine-scale offerings are similar in methodology (relying on living people's DNA and genealogy), purpose (revealing more about the recent ancestry and ethnic origins we inherited), and limitations, the last of which leads to two common misconceptions:
  • Your fine-scale results are NOT an Ethnicity Admixture update. It is an accretion, addition, amplification, augmentation, enhancement, enrichment, improvement, magnification, modification, rectification, reinforcement of your ethnicity estimates. It is NOT an update. For example, 23andMe still offers the same 31 global regions, and AncestryDNA still offers the same 22 regions. This is why your admixture percentages have remained roughly the same even with fine-scaling. 
  • Your fine-scale results won't necessarily be wrong or inaccurate. Fine-scaling as applied to genetic admixture tools involves subjectivity. That is, fine-scaling in part relies on customer-submitted biogeographical and genealogical information, so you can't assume your new results are wrong. As a genealogy researcher you should be verifying your DNA relatives' pedigree information anyway. Also, you must consider history, culture, migration patterns, and the fact that country borders and the way people self-identify may have changed over time. If you biologically match people from unexpected places, the onus is on you to find out how you connect to them.
Prototype for Fine-scaling of Genetic Admixture Estimates

23andMe was the first to market, on July 8, 2010, with a sort of fine-scaling genetic admixture product via its former "Countries of Ancestry" (CoA) tool, which was revolutionary and ahead of its time. CoA set the basis for the fine-scaling enhancement we're seeing today with 23andMe's Recent Ancestor Locations and AncestryDNA's Genetic Communities. However, CoA was never integrated with (or married to) 23andMe's Ancestry Composition. According to the ISOGG Wiki:
"Countries of Ancestry (formerly known as Ancestry Finder) was a feature of the 23andMe Personal Genome Service. It was accessed from Ancestry Labs on the 23andMe user interface. Ancestry Finder Lab utilized the data collected from 23andMe customers in the survey entitled, "Where Are You From?" to chart the birth country of your autosomal DNA matches' grandparents. The purpose was to attempt to give an overview of your ethnic origins by exploring those of your matches."
and
"You have an Ancestry Finder match when you share DNA with another 23andMe customer (ie, you have a Relative Finder match) and that person has completed our Where Are You From? ancestry survey. Here we show an example match on chromosome 20 of length 14.3 centiMorgans; the matched person said that all four of their grandparents were born in Sweden, which corresponds to all four stripes being colored green. You can mouse over the table rows or the segments themselves to explore your matches."
Here is what my Countries of Ancestry looked like:
As you can see in the screenshot above: 
  • the chart at the top lists my genetic matches and the country where all four of their grandparents were born (Guinea).
  • the chart on the bottom is a vertical chromosome display showing the location of my matches. When the cursor hovers over a colored match on the chart, additional information about the match pops up, such as length of segment and grandparents' birth places. 
  • a box also opens showing the chromosome location where my match shares DNA with me and the segment length (12.2 cM). The match is on chromosome 7. After contacting the match I learned he was from the Fulani tribe in the Futa Jallon region of Guinea-Conakry.
Further, 23andMe's CoA had advanced control tools with which you could adjust the Number of Grandparents, Segment Length, and Melting Pot Nations (like the US and Canada), and Filter to show Public Matches. With CoA you could connect with all of your genetic matches. You could also look at the CoAs of everyone who "shared" their DNA results with you, kin or not. In this manner I also learned that my Guinea match shared DNA at the same location with my maternal grandfather's first cousin, her child, as well as someone from Jamaica.
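
For readers who like to see the logic spelled out, here is a minimal Python sketch of the kind of tally CoA produced. It is not 23andMe's code. The Match record, the countries_of_ancestry function, and the default 5 cM cutoff are all hypothetical names and values chosen for illustration; only the idea of filtering by segment length and by how many grandparents were born in the same country comes from the tool's controls described above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Match:
    """One DNA match plus their survey answers (hypothetical record layout)."""
    name: str
    chromosome: int
    segment_cm: float
    grandparent_countries: tuple  # birth countries of the four grandparents


def countries_of_ancestry(matches, min_grandparents=4, min_segment_cm=5.0):
    """Count matches per country, keeping only matches that pass both filters.

    min_grandparents and min_segment_cm mimic the 'Number of Grandparents'
    and 'Segment Length' controls; the default values are assumptions.
    """
    tally = Counter()
    for match in matches:
        if match.segment_cm < min_segment_cm:
            continue  # shared segment too short to count
        # Most frequently reported birth country among the four grandparents.
        country, count = Counter(match.grandparent_countries).most_common(1)[0]
        if count >= min_grandparents:
            tally[country] += 1
    return tally


matches = [
    Match("A", 7, 12.2, ("Guinea",) * 4),
    Match("B", 20, 14.3, ("Sweden",) * 4),
    Match("C", 3, 4.0, ("Guinea",) * 4),  # dropped: segment under 5 cM
    Match("D", 1, 9.0, ("Jamaica", "Jamaica", "United Kingdom", "Ghana")),
]
print(countries_of_ancestry(matches))
```

Lowering min_grandparents to 2 would let match D count toward Jamaica, which is roughly what relaxing the Number of Grandparents control did in the tool.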

I was able to use 23andMe's CoA to quickly identify the nationalities of my genetic cousins and knew the chromosome location they matched me on. However, CoA was not integrated with 23andMe's Ancestry Composition, so I had to perform manual side-by-side visual comparisons with CoA, Ancestry Composition's chromosome painting tool, and GEDmatch [see my Ethnicity Chromosome Mapping blog]. CoA was discontinued in 2015 after 23andMe customers raised concerns about privacy, many of which were unfounded.

Now for the review! For each DNA company (23andMe and AncestryDNA) there will be a Results, Methodology and Analysis section, followed by my Conclusion:

23andMe's Recent Ancestor Locations
At the start of 2018 23andMe floated the idea of reinventing its former Countries of Ancestry tool by incorporating it with its Ancestry Composition (genetic admixture estimates) tool. 23andMe's new feature is called Recent Ancestor Locations (RAL), which boasted 150+ new regions added to its Ancestry Composition report.

By April, when 23andMe rolled out RAL [see here], some people misinterpreted it as being an Ancestry Composition update or an upgrade to its V5 chip platform. It was neither, and the confusion was in part due to the way 23andMe advertised RAL to the public (i.e., as an Ancestry Composition update).

RESULTS

When 23andMe customers opened their new Ancestry Composition results, some of them saw sub-categories below a genetic ancestry category. For example, under the British & Irish category some customers saw United Kingdom, while others may have seen Ireland. Here's a new Ancestry Composition with RAL report for my African-American relative:

As you can see, my relative only received an RAL (United Kingdom) for the British & Irish category. Yes, not seeing any RALs for his West African admixture is shocking.

Next, if you clicked on the Scientific Details link, you were taken to another page showing your Ancestry Composition's 31 population categories, each with RALs and five dots next to them:

The five dots indicate the "strength of the match" (more in the Methodology section) to the RAL. Since my relative's results show an RAL under his British & Irish admixture, one or more of his dots will be shaded next to United Kingdom. My relative has one dot filled in, which indicates that he has a low-strength connection to United Kingdom and none with Ireland (the other RAL listed under the British & Irish category).

METHODOLOGY

At present 23andMe has not released a white paper on its RAL feature but offers explanations on the "Scientific Details" page of your Ancestry Composition report. 

23andMe first explains how your ancestral breakdown is calculated:
23andMe's explanation above is extremely important because it directly states that your Ancestry Composition reports were NOT updated. It also explains that the 31 ancestral reference populations are not necessarily related to or "in" you but rather that your DNA resembles (not matches) DNA from a specific reference population.
Next 23andMe explains how it determines your Recent Ancestor Locations and "Match Strength":


Accordingly if you match (or share DNA with) 5 or more people (reference individuals) from a specific country, excluding your close relatives, you will be assigned one of the 150+ Recent Ancestor Locations. The RAL will appear under the corresponding Ancestry Composition report's 31 global populations (ethnicity categories).
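
The five-match rule is easy to picture as a simple count. Below is a minimal Python sketch under stated assumptions: assign_rals, its input format, and the sample IDs are hypothetical, and 23andMe has not published its implementation. Only the threshold of five matches and the exclusion of close relatives come from the explanation above.

```python
from collections import Counter

MIN_MATCHES = 5  # the "5 or more people" threshold described above


def assign_rals(ibd_matches, close_relatives=frozenset()):
    """Return the countries that qualify as Recent Ancestor Locations.

    ibd_matches: iterable of (person_id, country) pairs, one per reference
    individual you share DNA with (hypothetical input format).
    close_relatives: person_ids to leave out of the count.
    """
    counts = Counter()
    seen = set()
    for person_id, country in ibd_matches:
        if person_id in close_relatives or person_id in seen:
            continue  # skip close relatives and duplicate entries
        seen.add(person_id)
        counts[country] += 1
    return sorted(c for c, n in counts.items() if n >= MIN_MATCHES)


sample = (
    [(f"uk{i}", "United Kingdom") for i in range(5)]
    + [(f"ie{i}", "Ireland") for i in range(4)]
    + [("cousin", "Ireland")]
)
print(assign_rals(sample, close_relatives={"cousin"}))
```

In this made-up sample the Ireland count only reaches five if the close cousin is included, so excluding close relatives leaves United Kingdom as the only RAL.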

The RAL reference individuals are customers who completed the Family Origins survey, which asks customers to fill out their grandparents' birthplaces. 23andMe utilized the survey information to organize the reference individuals into 150+ RAL clusters.

What are the dots about? If you received an RAL in your 23andMe Ancestry Composition report, you also received colored dots (one to five) in a "Strength" category. The colored dots in the "Strength" category are based on the IBD segments that were detected between you and the reference individuals. The more IBD segments you share with the reference individuals, the better your chance of receiving all five dots. In the example I posted in the Results section, my relative has a weaker connection to the United Kingdom RAL, but he still matches at least five individuals, excluding close relatives, from that United Kingdom RAL.

ANALYSIS

When 23andMe's Recent Ancestor Locations (RAL) was released, some customers found their Ancestry Composition results to be more specific and in accordance with their known genealogies and genetic ancestry inheritance. However, others were deeply wary about their RAL and the dot system (match strength) being inaccurate, as well as the RAL reference persons submitting erroneous information about their grandparents. Most shockingly, very few African-Americans and other people of recent African descent received any African RALs.

As explained earlier, fine-scaling relies on information from clusters of related people, including those biologically connected to us. 23andMe fine-scales its genetic admixture product by culling information from customers who filled out a "Where Are You From?" survey and then using this information to establish an RAL. In order to receive an RAL you are required to share DNA with five reference individuals, excluding close relatives. This 5-person requirement allows for high confidence that you and your DNA matches are actually related and share recent ancestry from a specific biogeographical location.

While it's possible for RAL reference persons to submit erroneous information about their grandparents, we must remember that we do share a biological relationship with the ones showing up as an RAL in our Ancestry Composition report. Notably, 23andMe's survey does not even ask where all 8 great-grandparents were born, which may differ from our grandparents' birthplaces. Therefore we must be careful about dismissing an RAL outright without further genealogical investigation.

Unfortunately the reference individuals comprising the RAL's are not easily contactable. This is a significant departure from 23andMe's retired Countries of Ancestry tool where you could contact any match directly from the tool's interface. Since the RAL is based on our DNA relatives, we have to manually comb through our DNA relatives list to find matches who've actually listed the birthplaces of all four grandparents—you can access this information with the "DNA comparison" feature on your "DNA Relatives" list. But there is no way of telling if your DNA relative was utilized for an RAL.


Interestingly, 23andMe's 150+ RALs include numerous clusters for the Native American category. There are 22 Native American RALs representing countries in the Caribbean, Central America, Mexico and South America, and none from North America. Some customers with Native American admixture were thrilled about this and even made the false assumption that these 22 RALs represented tribal origins. The problem is that most of the reference persons in these "Native American" RALs are multiethnic (African, Amerindian and European), and I'm not sure if we actually share Native American or other admixture with them.

By contrast, 23andMe's RAL only has five clusters for Ancestry Composition's West African and East African categories. Most shockingly, very few African-Americans and other people of African descent received any African RALs. This is very disappointing because there are plenty of customers of African descent in 23andMe's database due to recruiting initiatives like its 2018 Global Genetics Project, 2016 African Genetics Project, Roots Into The Future Project, and African American Sequencing Project.

In addition, 23andMe customers fortunate enough to experience Countries of Ancestry had recent African matches with all four grandparents being born in one country such as Angola, Guinea, Guinea Bissau, Senegal, Mali, Mozambique and Madagascar. Yet NONE of these populations are included in the 150+ RALs. Of note, 23andMe's first version of RAL assigned Senegal, Mozambique and Madagascar to Ancestry Composition's "Broadly Sub-Saharan African" category. Then 23andMe removed them altogether. 23andMe subsequently added Cabo Verde, an ethnically mixed population, as an RAL to its West African category.

23andMe's new RALs short-change people of African descent and reveal a more complex problem that must be resolved first. 23andMe must make changes (read: update) to its Ancestry Composition tool for better RAL integration. Here are my recommendations:

  • With respect to Ancestry Composition's Sub-Saharan African population clusters, 23andMe has enough data and reference populations to organize the sub-regions with more granularity and specificity. I propose the African categories/clusters be: Northern West African, Central West African, Southern West African, Central African, Southeast African, South African, Northeast African, and North African & Arabian. Then new RALs could be created for each cluster using customers already in the database.
  • Ancestry Composition's West African sub-category can be broken down because populations there are genetically distinct from each other—for example the Fulani (Guinea, Senegal, Sierra Leone) are distinct from the Akan (Ghana, Ivory Coast), who are distinct from the Igbo/Yoruba (Nigeria), etc. In other words West Africa can be split into five or more distinct biogeographic regions or clusters [see Bryc et al] and even more RALs. 
  • Ancestry Composition's African Hunter-Gatherer category needs to be split between the "Pygmy" populations (Cameroon/Congo) and the Khoisan (South African), as they are scientifically proven to be genetically distinct from each other; they should not be organized by lifestyle (hunter-gatherer). A Central African and a South African category would correct this issue. As well, the East African category should be split between Southeast African (mostly Bantu) and Northeast African (non-Bantu).
  • Finally remove the term "Sub Saharan African" and just use "African." Most of 23andMe's competitors have stopped using the term because it has no genetic value and is offensive to some people of African descent. [see discussion here; here and here]

ANCESTRYDNA's GENETIC COMMUNITIES

In March 2017 AncestryDNA released its new fine-scale genetic admixture feature, Genetic Communities (GC), where you are matched to biogeographical regions and people in those places who share the same ancestries or regional history. AncestryDNA's GCs are vast in scope and complexity; at publication there are over 350 GCs from all over the world. Since AncestryDNA released Genetic Communities early last year, a lot has been published about it. As such, and for brevity, I will only focus on the most salient points concerning AncestryDNA's GC and the issues that need to be improved today.
RESULTS

With AncestryDNA's GC you can see and access your results in numerous ways from your Ethnicity Estimate or DNA Matches page. When you go to your Ethnicity Estimate, for example, you are able to see specific "regions" (Genetic Communities) that you share with both genetic relatives and non-relatives. Here are my results:


As you can see at the bottom of my page there is a "Migrations" section and it contains two genetic communities: South Carolina African-Americans and Pennsylvania Settlers.

When accessing my AncestryDNA GC from my DNA Matches page it lists sub-groups of my two GC's, namely South Carolina Pee Dee Country African-Americans, Savannah River Basin African Americans (all for South Carolina African-Americans) and Poconos & North Jersey Settlers (for Pennsylvania Settlers):

My AncestryDNA's GC are also accessible when I click on a category for my Ethnicity Estimate:
Looking at my Ethnicity Estimates from the perspective of my assigned AncestryDNA GC, I can see that several of my DNA matches in the GC share several ethnicity admixture categories with me:


I can also see at the bottom of the page that I share my South Carolina African-American GC with 851 DNA matches. The orange area shows the extent of the GC.

Included with AncestryDNA's GC is a "DNA Story" presented as a time-line that contains historical and genealogical information about the GC and the people in them, as well as an interactive map showing the history of migrations of the peoples in the GC:


METHODOLOGY

AncestryDNA's GC basically works by utilizing a machine-learning algorithm to cluster living individuals that share DNA due to specific and recent shared history, and comparing this to AncestryDNA customer genealogical information. Thus AncestryDNA's GC reveals fine-scale population structure in its customers' Ethnicity Estimates due to historical migration patterns. You can read the full description of the process in the AncestryDNA Genetic Communities White Paper.

The methodology by which AncestryDNA identifies and assigns individuals to GCs is best summed up in the white paper's conclusion:

As AncestryDNA's GC white paper elucidates:
  • "AncestryDNA identifies groups of customers that likely descend from immigrants participating in a particular wave of migration (e.g. Irish fleeing the Great Famine), or customers that descend from ancestral populations that have remained in the same geographic location for many generations (e.g. the early settlers of the Appalachian Mountains)"; and 
  • "Following the identification of these clusters of individuals in the entire network, we can then assign any AncestryDNA customer to one or more of these clusters based on their IBD with other AncestryDNA members. Such assignment can provide a customer with insight into their recent ancestral history, in some cases traceable back to a historical event."
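
To make the two quoted steps concrete, here is a small Python sketch: first find clusters in a network of customers linked by shared IBD, then assign a customer to any cluster where they share IBD with enough members. This is a toy illustration, not AncestryDNA's algorithm. Connected components stand in for the white paper's clustering method, and find_clusters, assign_customer, and the min_links threshold are hypothetical names and values.

```python
from collections import defaultdict


def find_clusters(ibd_pairs):
    """Group customers into clusters from pairs that share IBD.

    Uses plain connected components as a simple stand-in for the
    clustering described in the white paper (an assumption).
    """
    graph = defaultdict(set)
    for a, b in ibd_pairs:
        graph[a].add(b)
        graph[b].add(a)

    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk of everyone reachable from start
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def assign_customer(customer_ibd, clusters, min_links=2):
    """Return indexes of clusters where the customer shares IBD with at
    least min_links members (min_links is an illustrative threshold)."""
    shared = set(customer_ibd)
    return [i for i, members in enumerate(clusters)
            if len(members & shared) >= min_links]


clusters = find_clusters([("a", "b"), ("b", "c"), ("x", "y")])
print(assign_customer({"a", "c", "x"}, clusters))
```

Note that a customer can land in a cluster without being closely related to everyone in it, which is the same caveat that applies to real Genetic Communities.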
ANALYSIS 

AncestryDNA Genetic Communities rely on Identical By Descent (IBD) or genetic connections among its 10 million customers, as well as customers' Ethnicity Estimates and genealogical information. In turn AncestryDNA GC reveals finer details about our Ethnicity Estimates as well as the recent and ancestral migration patterns of our genetic relatives. There may be some truth in the notion that everyone is related to each other by "six degrees of separation." 

The greatest strengths of AncestryDNA's GC product are that it's inclusive of people with multiple ancestries and biogeographic origins...and that there are just so many GCs (350+). You can also access GC from several different points in your AncestryDNA results, which should be desirable for novice customers. AncestryDNA's GC is fully integrated with its Ethnicity Estimates and DNA Match lists. And you can actually contact every DNA match in your GC. Together these attributes help us learn more specific information about our recent genetic ancestry.

Having so much accessibility and integration comes at a cost. That is, you are not related to everyone in your assigned AncestryDNA GC; remember, GCs include people you're actually related to and non-relatives sharing the same biogeographical origins and migration patterns with you. Furthermore, you will have DNA matches in your GC with whom you won't share the ethnicity admixture clusters listed with the GC. Invariably, just because someone matches and shares a GC with you doesn’t mean that’s how you are genetically related to them.

Since AncestryDNA's GCs are in part based on customer-submitted genealogical information, we're at the mercy of family trees, many of them with scant, erroneous or private information. However, our fears are mitigated by the fact that AncestryDNA also examines genetic connections among its 10 million customers to determine GCs. A brilliant move.

Of course, if AncestryDNA implemented a chromosome browser we could determine with more precision how and if we match genetic relatives in our GCs. Despite a continuous petition for a chromosome browser, AncestryDNA has refused to add one. AncestryDNA's GCs have two other important drawbacks:

First, of all of AncestryDNA's 350+ GCs, inclusive of African-American clusters, there is NONE from AFRICA. This is disheartening because many African-American customers consider AncestryDNA's Ethnicity Estimate tool to have the best African breakdown on the market, often corresponding to recent African DNA matches connected to them via the Trans-Atlantic Slave trade. What's more, there are a lot of Africans in AncestryDNA's 10 million+ database, so there should be GCs for them too.

Second and finally, almost no one's AncestryDNA GCs have updated since the feature was first introduced in 2017. I still have the same old two genetic communities. Also, many customers are not assigned GCs that they should be connected to. For example, I've several recent ancestral lines from Virginia, and many of my DNA-tested relatives with roots in Virginia have a Virginia GC, but I don't. It appears AncestryDNA's machine learning algorithms are stagnant and in desperate need of lubrication.

CONCLUSION

23andMe's Recent Ancestor Locations and AncestryDNA's Genetic Communities attempt to marry ethnicity estimates with our biological relatives to reveal our ethnic origins and genetic ancestry with more specificity and precision. These new genetic admixture estimate enhancements are based upon the principles of fine-scaling genetic admixture and integrate DNA connections shared between customers with customer-submitted genealogical information.

Customers' misinterpretation of 23andMe's and AncestryDNA's fine-scale features (they are not admixture calculator updates) has led to unnecessary disappointment. But as genealogists we have an ethical obligation to verify the genealogical information of our matches.

Still, there are problems with 23andMe's and AncestryDNA's fine-scaling methodology and algorithms:

With 23andMe's Recent Ancestor Locations we are not able to directly contact our relatives utilized as reference persons unless we do a grueling manual search of our matches. This has led to customer paranoia about whether reference persons are submitting the wrong information about their grandparents. What's more, people of African descent were deprived because they have virtually no African RALs. 23andMe will have to rethink its 5-person matching requirement for RAL assignment and actually update its Ancestry Composition so that more customers will receive an RAL.

With AncestryDNA's Genetic Communities, we can contact the DNA matches in our GCs, which are fully integrated and accessible from any place in our DNA reports. This allows customers to focus on their relatives, family pedigrees and migratory history. I also like the fact that most of the GCs are multi-ethnic and connect to several biogeographical regions. Yet AncestryDNA does not seem to adequately reflect the connections among its 10 million customers and 350+ GCs. And while there are many African-American GCs, there are no continental African GCs, so again people of African descent are treated unfairly.

Fine-scaling of our genetic admixture is of course not what we ultimately wanted; we desire an actual ethnicity admixture update. Yet fine-scaling genetic admixture is certainly a move in the right direction. Sort of like an engagement rather than a marriage. Thus fine-scaling does exactly what it is intended to do: match us to living people who share similar biogeography and history with us in order to reveal relevant specificity about our genetic ancestry.

And once the Pink Elephant leaves the room, hopefully other DNA companies like FamilyTreeDNA, LivingDNA and MyHeritage will follow suit with a fine-scaling application for their genetic admixture and DNA matching products.

###END###


2 comments:

  1. Very interesting article. I actually love Ancestry's Genetic Communities/Migrations and 23&me's Recent Ancestor Locations, but neither are exactly perfect.

  2. Gosh TL Dixon.so much information here. I will have to read more than once. Family stories say that my African ancestor came from Guinea my Ancestry DNA says Nigeria Senegal and a little Ghana. My European ancestry from family stories is a German Dutch Jewish store keeper Ancestry DNA is present day France and Germany. So pretty close
    I've uploaded to Gedmatch to learn more. But have never gotten much matches. All matches are 4 generations or further back.
    Thank you for all you do.
