Dr. Juan Martinez Cruzado- Biology and Genetics
by Means of mtDNA Phylogeographic Analysis
The haplogroup identities of 800 mtDNAs randomly and systematically selected to be representative of the population of Puerto Rico were determined by restriction fragment length polymorphism (RFLP), revealing maternal ancestries in this highly mixed population of 61.3% Amerindian, 27.2% sub-Saharan African,
and 11.5% West Eurasian. West Eurasian frequencies were low in all 28 municipalities sampled, and displayed no geographic patterns.
Thus, a statistically significant negative correlation was observed between the Amerindian and African frequencies of the municipalities. In addition, a statistically highly significant geographic pattern was observed for Amerindian and African mtDNAs. In a scenario in which Amerindian mtDNAs prevailed
on either side of longitude 66°16′ West, Amerindian mtDNAs were more frequent west of longitude 66°16′ West than east of it, and the opposite was true for African
mtDNAs. Haplogroup A had the highest frequency among Amerindian samples (52.4%), suggesting its predominance among the native Taínos.
Principal component analysis showed that the sub-Saharan African fraction had a strong affinity to West Africans. In addition, the magnitudes of the Senegambian and Gulf of Guinea components in Puerto Rico were between those of Cape Verde and São Tomé. Furthermore, the West Eurasian component
did not conform to European haplogroup frequencies. HVR-I sequences of haplogroup U samples revealed a strong North African influence
among West Eurasian mtDNAs and a new sub-Saharan African clade. Am J Phys Anthropol 128: 131–155, 2005.
© 2005 Wiley-Liss, Inc.
Recent technical advances have facilitated the discovery of genetic polymorphisms
in the human population, many of which are useful as markers for prehistoric migrations that gave rise to continental and regional populations. Continental-population histories were reconstructed using Y-chromosome markers, which are paternally inherited (Hurles
et al., 1998; Rosser et al., 2000; Bamshad et al., 2001; Hammer et al., 2001; Karafet et al., 2001; Kayser et al., 2001; Malaspina et al., 2001; Underhill et al., 2001; Bortolini et al., 2002, 2003; Cruciani et al., 2002; Lell et al., 2002; Pereira et al.,
2002; Semino et al., 2002; Zerjal et al., 2002, 2003; Zegura et al., 2004), and mtDNA markers, which are inherited maternally (Merriwether and Ferrell, 1996; Comas et al., 1998; Starikovskaya et al., 1998; Richards et al., 2000, 2002; Forster et al., 2001;
Kaestle and Smith, 2001; Malhi et al., 2001; Torroni et al., 2001a,b; Keyeux et al., 2002; Oota et al., 2002; Salas et al., 2002; Schurr and Wallace, 2002; Yao et al., 2002a,b; Kong et al., 2003), usually finding remarkable differences in sex migration
histories. In this study, we developed a hierarchical strategy that makes use of haplogroup-defining mtDNA restriction markers to identify maternal biological ancestries in a sample set randomly and systematically selected to be representative of the Puerto
Rico population, a mixed Caribbean population of three principal components: Amerindian, sub-Saharan African, and West Eurasian. With some notable exceptions, most haplogroups are regarded as continent-specific. Thus, determining the haplogroup to which a
mtDNA belongs usually identifies the mtDNA biological ancestry. The HVR-I sequence was used when biological ancestry could not be determined through restriction marker analysis.
The biological ancestries of a mixed people have implications
in their population genetics and thus in public health. In terms of mtDNA ancestry, studies of European and North American populations related particular West Eurasian haplogroups to higher frequencies of some diseases such as Alzheimer’s (Hutchin and
Cortopassi, 1995), Leber hereditary optic neuropathy (Johns and Berman, 1991; Brown et al., 1997; Hofmann et al., 1997; Lamminen et al., 1997; Torroni et al., 1997; Howell et al., 2003), Wolfram syndrome and sudden infant death syndrome (Hoffman et al., 1997),
and some conditions such as asthenozoospermia and nonasthenozoospermia (Ruiz-Pesini et al., 2000). Further, it was shown that the 10394 Dde I state plays a protective role against Parkinson’s disease, and that its effect is stronger when it is combined with other polymorphisms that are specific to haplogroups J and K (van der
Walt et al., 2003). Thus, the characterization of the mtDNA pool of any population may be instrumental in determining risk factors for various diseases and conditions.
In addition, biological ancestries imply human
migration routes that shed light on the possible origins of introduced fauna and flora, including agricultural varieties. Moreover, biological ancestries play a fundamental role in population history which, as one of the main categories of cultural history,
is essential to explain the social systems and behavioral guidelines that rule all aspects of social life. Population history considers population growth in relation to geographic regions, biological ancestries, and admixture, and thus plays a central
role in the cultural development of a people (Fernández-Méndez, 1970).
It is estimated that 60,000–600,000 Arawak-speaking Taíno Indians lived in Puerto Rico when it was discovered for the Europeans by Christopher
Columbus in 1493 (Abbad, 1959; Fernández-Méndez, 1970). Traditional history tells us that they were decimated by war, hunger, disease, and emigration, such that they had totally disappeared by the end of the 16th century. The vast majority of
Spanish settlers were single men, and mixing with Indian women commenced fully upon colonization in 1506. The Spanish Crown took measures to increase the number of “white” people on the island, including ordering “white” Christian
female slaves to be sent to Puerto Rico in 1512. However, the 1530 census reported that only 57 of the 369 “white” men on the island were married to “white” women. Such “whites” were a minority. The census reported 335 “black”
female slaves and 1,168 “black” male slaves, and a total of 1,148 Indians, both genders included (Brau, 1904). By this time, the base of the Puerto Rican economy was shifting from gold mine exploitation to sugar cultivation. African slaves became
the cornerstone of the sugar industry.
Traditional history includes abundant evidence of the widely dispersed geographic origins of the sub-Saharan African peoples who were brought to the Americas, spanning Cape Verde on the northwest
edge of sub-Saharan Africa to Mozambique and the island of Madagascar in the southeast. The arrival in Puerto Rico of people from various African regions can be confirmed by traditional festivals and other activities held in the names of African gods and by
the use of words that can be found only in particular African regions. However, the lack of a classification system for slaves by tribe or even by geographic region during the Atlantic slave trade leaves doubts concerning the relative contribution of the different
continental regions (Álvarez-Nazario, 1974).
Slaves were first brought to Puerto Rico in 1508 by its conquistador, Juan Ponce de León. These were residents of the Iberian Peninsula, many of North African, Senegambian, or
Guinean origin (Álvarez-Nazario, 1974); others were Greek, Slavic, or Turkish (Thomas, 1997), and others Jewish (Díaz-Soler, 2000). The capture of sub-Saharan Africans with the goal of providing Spanish and Portuguese colonies in the Americas
with a labor force, first in the search for gold and later in sugar plantations, started in 1518 (Díaz-Soler, 2000). Up to the beginning of the second half of the 16th century, almost all slaves originated in Senegambia and Guinea (Alegría, 1985).
The island of São Tomé, with slaves acquired mainly from the Gulf of Guinea, was an important supplier thereafter. Throughout the 16th century, and with the exception of one on the west coast, all 13 sugar mills in Puerto Rico were east of the
La Plata River, which streams along longitude 66°16′ West (Gelpí-Baíz, 2000).
The Portuguese were the legal
source of African slaves until 1640, at which time Spain suspended all contracts in retaliation for the revolution that removed their Spanish rulers. The resulting shortage of slave labor provoked the collapse of the sugar industry, starting a period of subsistence
economy that lasted for a century and a half until the Crown suspended all taxes and source restrictions on the slave trade in 1789. The poor state of the economy hindered the importation of slaves, and the tax collected upon their sale made the illegal
trade their main source. The illegal slave trade was circuitous in that the main slave sources were the Dutch colony of Curaçao and the English colony of Jamaica, in that order. Slaves brought from the Gold Coast (Ghana) were the most common in these
colonies at the time. The illegal Puerto Rican harbors were located on the west and south coasts, where most of the island population lived (Álvarez-Nazario, 1974). The only legal harbor was far away in San Juan, the capital, and few legal immigrants
made it to Puerto Rico during these times.
The importation of slaves increased dramatically as a consequence of the land and tax reforms of the last decades of the 18th century, and approximately two-thirds of all slaves ever brought
to Puerto Rico arrived from that point in time until the abolition of slavery in 1873 (Álvarez-Nazario, 1974). By then, the African harbors most used by slave traders extended from the Gold Coast to Angola (Thomas, 1997).
This wave of slaves found Puerto Rico mainly inhabited by criollos, Puerto Rico natives who were the product
of centuries of admixture and generations living under a subsistence economy with little or no Spanish government intervention (Fernández-Méndez, 2000).
The Spanish Empire started to crumble at the
beginning of the 19th century, and an 1815 royal decree permitted the settlement in Puerto Rico of foreign Catholics with their wealth and slaves. Thus, wealthy “white” refugees and other immigrants from Europe and the Americas made it to Puerto
Rico in great numbers, stimulating the economy by developing the sugar industry in the coastal plains and the coffee and tobacco industries in the mountains. International treaties banned importation of slaves directly from Africa north of the Equator in 1817
and south of it in 1820. However, enforcement of the treaties was ineffective south of the Equator, where the Portuguese had bountiful slave factories. Thus, the illegal Angolan trade became substantial in the 19th century. Larger sources of Africans were probably
the West Indies, because trade within the Caribbean was not banned and because escapees arriving to Puerto Rico were granted freedom. In this respect, migrations from the then Danish-ruled island of Saint Thomas, which acquired its slaves mainly from the Gold
Coast to the Slave Coast in the Bight of Benin (Thomas, 1997), were a major source (Álvarez-Nazario, 1974).
Our results conform to most accounts of traditional history, but not at all with the extermination of the Taíno
people as early as the 16th century, thus showing that population genetics has a lot to offer studies on Caribbean population history. It is important to note that neglected people rarely contribute to traditional history, and a great part of the cultural
development of the Puerto Rican people occurred in the “darkness” of history, far away from the capital, as did the illegal trade that kept their subsistence economy alive.
In the interest of greater clarity, we often
refer to Puerto Ricans carrying mtDNAs of Amerindian, African, or West Eurasian origin by such terms as Amerindians, Africans, and West Eurasians. However, it is important to keep in mind that we are referring to a thoroughly mixed population composed of people
of a single culture and whose phenotypes do not predict individual mtDNA ancestries.
SUBJECTS, MATERIALS, AND METHODS
A random sample of 872 housing units representative
of the island of Puerto Rico was selected using a sampling frame developed by the Center for Applied Social Research (University of Puerto Rico at Mayagüez) for survey research in Puerto Rico, based on the 1990 Census of Population and Housing. Excluding
the island municipalities of Vieques and Culebra from the sampling frame, 28 of the 76 municipalities in Puerto Rico were selected (Fig. 1), as per the following description. The eight most populated municipalities were selected with probability equal to one.
Each was assigned a number of housing units proportional to its estimated population size, based on a total of 872 housing units for the entire island. To select the remaining 20 municipalities, the remainder of the island was divided into five geographical
regions. Four municipalities from each region were selected at random with a probability proportional to estimated population size, while stratifying by estimated population size. They were assigned an equal number of housing units, proportional to the estimated
population size of the geographic region they represented.
Thirty percent of the census tracts within each municipality were selected at random, with probability proportional to estimated population size. Having established
an estimated number of housing units for each census tract based on the number of housing units for the municipality and the relative population sizes of the selected census tracts, census blocks were selected within them so that each would contribute an expected
eight households to the sample. In the field, housing units were chosen by systematic sampling, with a preestablished random starting point for each block. This means that the actual number of housing units obtained from each block could be greater or smaller
than initially expected, depending on how the number of housing units in it had changed since 1990. An adult was selected at random from each housing unit. Participation in the project was agreed to by appropriate informed consent.
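The block-level step can be illustrated with a minimal sketch. It assumes a hypothetical block whose housing units are listed in order and a fixed sampling interval; the function name, the interval, and the unit counts are illustrative and do not come from the study.

```python
import random

def systematic_sample(units, interval, rng=None):
    """Pick every `interval`-th unit, beginning at a random offset.

    `units` is an ordered list of housing units in one census block.
    The offset plays the role of the preestablished random starting point.
    """
    rng = rng or random.Random()
    start = rng.randrange(interval)  # random start in [0, interval)
    return units[start::interval]

# Hypothetical block: 40 units listed in 1990, sampled at 1-in-5.
block_1990 = list(range(40))
sample_1990 = systematic_sample(block_1990, 5, random.Random(0))

# The same block after growing to 52 units, same interval.
block_now = list(range(52))
sample_now = systematic_sample(block_now, 5, random.Random(0))
```

With a 1-in-5 interval, a block of 40 units always yields eight units, whereas the same block grown to 52 units yields 10 or 11, which is why the number obtained per block can differ from the number expected.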
Sample collection and DNA extraction were performed as in Martínez-Cruzado et al. (2001). Thereafter, a 200-µl aliquot from each 500-µl sample was purified, using the QIAamp DNA Mini Kit (Qiagen). To each aliquot, 36 µl of 60 mM Tris-HCl (pH 8.0), 60 mM Na EDTA (pH 8.0), 0.6 M NaCl, 0.24 mM DTT, and 12% SDS were added, followed
by 250 µl of buffer AL and 250 µl of 100% ethanol.
The aliquots were vortexed thoroughly, transferred to a spin column, and spun at 8,000 rpm for 1 min. The filter was washed by adding 300 µl
of buffer AW1, spinning at 8,000 rpm for 1 min, adding 300 µl of buffer AW2, and spinning at 14,000 rpm for 5 min. DNA was eluted from
the filter into two 100-µl aliquots. The eluate aliquots were kept at −80°C as backups until the end of the study.
Except for the cycling conditions (see below) and that 1.5 U of Taq DNA polymerase were used in each amplification reaction, the DNA amplification, restriction digestion, and agarose gel electrophoresis procedures were performed as in Martínez-Cruzado et
al. (2001). The amplification reactions were usually subjected to one cycle of 2.5 min at 94°C, 32 cycles of 30 sec at 94°C, 1 min at 54°C, and 70 sec at 72°C, and one cycle of 10 min at 72°C. Primer annealing was achieved
at 52°C to amplify the diagnostic site for macroparagroup L, and at 56°C to amplify the sites diagnostic for haplogroups G and L3d.
Haplogroup identification strategy and
Studies involving high-resolution restriction analysis (Ballinger et al., 1992; Torroni et al., 1992, 1993a,b, 1994a–d, 1996, 1997; Chen et al., 1995, 2000), analyses of the complete sequence of
mitochondrial chromosomes (Kong et al., 2003; Reidla et al., 2003), or complete (Herrnstadt et al., 2002) or partial (Silva et al., 2002) sequences of their coding region showed that all haplogroups are virtually monomorphic for the 10394 DdeI and 10397 AluI sites, with the exception
of haplogroup K. Thus, the a priori determination of the state of these sites quickly reduces the number of candidate haplogroups to which an unknown mtDNA may belong. Because these sites are close to each other, the 10394 DdeI/10397 AluI motif (hereafter referred to as the
motif) can be easily determined from a single amplicon.
Thus, each mtDNA sample was first
tested for its motif. Depending on the result, each sample was then tested for the markers diagnostic for all haplogroups known to share its motif. The haplogroups, their motifs, their defining markers, and the primers used are shown in Table 1. Haplogroups
that are defined by two or more markers invariably share at least one of them with some other haplogroup. Thus, tests on unshared haplogroup markers were performed only when the samples showed the shared ones. The two markers that define haplogroup L1b
were tested on all (+/−) motif samples, as each by itself defines another (+/−) motif haplogroup. L is a macroparagroup, a large group of mtDNAs including several haplogroups and other paraphyletic mtDNAs (Chen et al., 1995; Salas et al., 2002). Among others, it includes haplogroup
L2 (here further subdivided into L2a and L2* to pool subhaplogroups L2b, L2c, and L2d) and subhaplogroups L1b and L1c. All other L haplogroups and paraphyletic mtDNAs were included in paragroup L0 (Mishmar et al., 2003).
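The logic of testing shared markers first can be sketched for the two L1b-defining sites. The sketch assumes, as the marker profiles reported for L1b and L3e suggest, that the presence of both 3592 HpaI and 2349 DpnII indicates L1b, that 2349 DpnII alone indicates L3e, and that 3592 HpaI alone indicates some other lineage within macroparagroup L. The function name and return labels are illustrative, and Table 1 remains the authority on marker definitions.

```python
def classify_l_sample(has_3592_hpai, has_2349_dpnii):
    """First-pass call from two restriction-site states (True = site present).

    Assumed profiles (reconstructed, see lead-in):
      both sites present  -> L1b
      only 2349 DpnII     -> L3e
      only 3592 HpaI      -> another L lineage, resolved by further markers
      neither             -> unresolved at this step
    """
    if has_3592_hpai and has_2349_dpnii:
        return "L1b"
    if has_2349_dpnii:
        return "L3e"
    if has_3592_hpai:
        return "L, needs further markers"
    return "unresolved, test remaining markers or sequence HVR-I"

calls = [classify_l_sample(a, b) for a in (True, False) for b in (True, False)]
```

Because each site is scored independently, a single missed 3592 HpaI site would turn an L1b call into an L3e call, which is the error source quantified later in the paper.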
The testing of markers for all haplogroups within each motif group served as a quality-control measure, as it allowed us to detect false positives. In the few instances in which the mtDNA tested positive for no haplogroup-defining
markers, its identity was determined by the sequence of its HVR-I and confirmed by restriction analysis. Thus, false negatives were also detected, and the likelihood of any error involving false haplogroup positives, false haplogroup negatives, or motif group
misdiagnoses could be estimated experimentally. Such estimates were used to calculate the probabilities of any number of samples being misdiagnosed. Because all tests were performed independently, the likelihood that any two errors were committed in analyzing
the same sample could be calculated based on the multiplicative rule of probability.
Amplicons to be sequenced were purified using the High Pure PCR Product Purification Kit (Roche Applied Science), as instructed by the
manufacturer. Automated sequencing was performed at the New Jersey Medical School Molecular Resource Facility (University of Medicine and Dentistry of New Jersey), using an Applied Biosystems (ABI) model 3100 capillary sequencer after cycle sequencing with
Dye Terminator mix version 2.0.
Biological ancestry determination and data analysis
Biological ancestries were inferred from haplogroup identity. Because only nine women of Asian ancestry were
reported living in Puerto Rico in 1899 (Sanger et al., 1900), mtDNAs of haplogroups belonging to both the New World and Asia were assumed to be of Amerindian origin unless participant interviews revealed otherwise.
Data analysis was
performed using the program SPSS 10.0.5 for Windows. To determine whether variation in participation rates or changes in population size occurring since 1990 in the sampled municipalities would lead to biased estimates of the parameters, we devised a weighting
scheme. Through these weights, the number of samples provided by each municipality was adjusted so that it would be equal to the number expected by applying the original sampling proportions to the final sample size. The weights for municipality samples
($W_m$) were a function of the sampling proportion of the municipality ($P_m$), the final total obtained sample size ($n$), and the number of samples provided by the municipality
($n_m$), so that $W_m = P_m n / n_m$.
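As a minimal sketch of this weighting, the function below evaluates W_m = P_m n / n_m. The municipality values in the example are hypothetical and chosen only to show the direction of the adjustment.

```python
def municipality_weight(p_m, n, n_m):
    """Weight W_m = P_m * n / n_m for one municipality.

    p_m: original sampling proportion of the municipality
    n:   final total sample size
    n_m: number of samples the municipality actually provided
    """
    return p_m * n / n_m

# Hypothetical municipality: designed to supply 5% of the sample,
# but only 32 of the final 800 samples came from it.
w = municipality_weight(0.05, 800, 32)
```

Here the design expects 0.05 × 800 = 40 samples, so each of the 32 samples obtained is weighted by 40/32 = 1.25. A municipality that supplied more samples than expected would receive a weight below one.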
A triangular graphic of ancestry distribution among municipalities was constructed using MATLAB. A projected plane representing a linear function of form $w = f(X, Y, Z)$, in which plotted population dots were defined as the end of vectors with form $w = X\mathbf{i} + Y\mathbf{j} + Z\mathbf{k}$, where X, Y, and Z represented Amerindian, African, and West Eurasian frequencies,
respectively, was produced. The sum of X, Y, and Z was equal to one. Their magnitudes were a function of the 30° and 60° angles. Vectors i, j, and k were their respective unit vectors in the positive directions of the
coordinate axes x, y, and z.
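A common way to draw such a triangle is to project the three frequencies onto two plotting coordinates. The sketch below uses the standard equilateral-triangle projection and is not the authors' MATLAB code. The placement of the vertices (Amerindian at the lower left, African at the lower right, West Eurasian at the top) is an assumption made for illustration.

```python
import math

def ternary_to_xy(amerindian, african, west_eurasian):
    """Project three frequencies onto an equilateral triangle of side 1.

    Assumed vertices: Amerindian (0, 0), African (1, 0),
    West Eurasian (cos 60 deg, sin 60 deg).
    Inputs are normalized so that they sum to one.
    """
    total = amerindian + african + west_eurasian
    y_frac = african / total
    z_frac = west_eurasian / total
    angle = math.radians(60)
    x = y_frac + z_frac * math.cos(angle)
    y = z_frac * math.sin(angle)
    return x, y

# Island-wide frequencies reported in the abstract.
island_xy = ternary_to_xy(0.613, 0.272, 0.115)
```

A municipality with purely Amerindian mtDNAs would fall on the lower-left vertex, and the low West Eurasian frequencies keep every point close to the base of the triangle.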
To illustrate the geographic distribution of Amerindian mtDNA frequencies, municipalities were listed in order according to such frequencies and divided into 12 categories by creating a new
category every time that the difference between two municipalities was 1.6% or more. Divisions were drawn halfway between the frequencies of such municipalities.
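The category rule can be sketched as follows, assuming frequencies expressed in percent. The function name and the example values are hypothetical; only the 1.6% threshold and the halfway placement of divisions come from the text.

```python
def category_cut_points(frequencies, min_gap=1.6):
    """Return division points for gap-based frequency categories.

    Frequencies (in percent) are sorted, and a division is drawn halfway
    between two neighbors whenever they differ by min_gap or more.
    The number of categories is len(cut points) + 1.
    """
    ordered = sorted(frequencies)
    cuts = []
    for lower, upper in zip(ordered, ordered[1:]):
        if upper - lower >= min_gap:
            cuts.append((lower + upper) / 2)
    return cuts

# Hypothetical municipal frequencies (percent), not data from the study.
cuts = category_cut_points([45.5, 40.0, 50.0, 41.0, 45.0])
```

In this example the gaps of 4.0 and 4.5 points each open a new category, giving divisions at 43.0 and 47.75 and therefore three categories.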
Principal component (PC) analyses were performed using
the POPSTR program of Henry Harpending (University of Utah). They were based on population haplogroup frequencies, and included only populations with 17 samples or more. Sub-Saharan African mtDNAs were classified as follows. Macroparagroup L was divided
into haplogroup L2 (further subdivided into L2a and L2*), subhaplogroups L1b and L1c, and paragroup L0 to pool all other haplogroups and paraphyletic mtDNAs within the macroparagroup. Paragroup L3A (Salas et al., 2002) was divided into L3b, L3d, L3e, L3f,
L3g, and L3*. We designated U5b2 as a sub-Saharan African clade with the HVR-I sequence 16189-16192-16270-16320. Taken from one source were Shona (n = 17),
Tongas (20), Shangaan (22), Chopi (27), Chwabo (20), Lomwe (20), Makonde (19), Makhwa (20), Ndau (19), Nyungwe (20), Nyanja (20), Ronga (21), Sena (21), and Tswa (19) from Mozambique (Salas et al., 2002), Brazil (65) (Alves-Silva
et al., 2000), Bubi (45), São Tomé (49) (Mateu et al., 1997), Mandenka (118) (Graven et al., 1995), Serer (23), a group of other Senegalese tribes (48), a pool of Mauritanian and West Saharan tribes (24) (Rando et al., 1998), Tuareg
(22), Yoruba (33), Hausa (20), Fulbe (60), Turkana (37), Somalia (27), Kikuyu (22) (Watson et al., 1997), Nubia (46) (Krings et al., 1999), Khwe (31) (Chen et al., 2000), and the southeastern islands of the Cape Verde Archipelago (169)
(Brehm et al., 2002). From two sources were Biaka (34) and Mbuti (35) Pygmies (Chen et al., 1995; Watson et al., 1997), Wolof (66) (Chen et al., 1995; Rando et al., 1998), and !Kung (62) (Watson et al., 1997; Chen et al., 2000). For West Eurasians,
mtDNAs were classified as belonging to H, V, HV, (pre-HV)1, J, T, I, W, X, M, N, R, K, U*, U2, U5*, U5(a b), U6, and U(others) to pool the remaining
clades (U1, U3, U4, and U7). Populations were obtained from Rando et al. (1998) (23 Moroccan non-Berbers and 58 Moroccan Berbers), Brakez et al. (2001) (37 Moroccan Souss Valley inhabitants), and Richards et al. (2000). This last
group of authors compiled data from several authors concerning 13 populations from North Africa and the Near East, as well as several populations from Europe. They classified the European populations into 10 geographic regions, and
we observe those same classifications here. Amerindian populations were divided into 12 geographic regions and “Others.” These were three from eastern North America (Mohawk (123) (Merriwether and Ferrell, 1996) and Ojibwa
from Manitoulin Island (33) and northern Ontario (28) (Scozzari et al., 1997)), five from the Great Plains (Cheyenne/Arapaho (35), Sisseton/Wapheton Sioux (45), Turtle Mountain Chippewa (28) and Wisconsin Chippewa (62) (Malhi et al., 2001),
and Siouan (34) (Lorenz and Smith, 1996)), six from the North American Southeast (Choctaw (27) (Lorenz and Smith, 1996), Creek (39) and Seminole (40) (Weiss and Smith, 2003), Oklahoma Muskoke (70) (Merriwether and Ferrell, 1996), and Oklahoma Red Cross
Cherokee (19) and Stillwell Cherokee (37) (Malhi et al., 2001)), 15 from the North American Southwest (Akimal O’odham (43), Apache (38), Delta Yuman (23), Navajo (64), North Paiute/Shoshoni (94), Pai Yuman (27), River Yuman (22), Tauno O’odham
(37), Zuni (26) (Malhi et al., 2003), California Penutian (17), Havasupai/Hualapai/Yavapai/Mojave (18), Jemez (36), Pima (37), Quechuan/Cocopa (23), and Washo (28) (Lorenz and Smith, 1996)), four from Mesoamerica (Maya (26), Mixtec (29), Nahua/Cora
(32) (Lorenz and Smith, 1996), and North Central Mexico (199) (Green et al., 2000)), eight from eastern Central America (Bribri-Cabecar (24) (Torroni et al., 1993a), Emberá (Panamá) (44), Wounan (31) (Kolman and Bermingham, 1997), Guatuso (20),
Teribe (20) (Torroni et al., 1994d), Huetar (27) (Santos et al., 1994), Kuna (63) (Batista et al., 1995), and Ngöbé (46) (Kolman et al., 1995)), 14 from western Colombia and Ecuador including the Andes (Cayapa (94) (Rickards et al., 1999), Chimila
(34), Guambiano (23), Guane-Butaregua (33), Ijka-Arhuaco (40), Kogui (30), Paez (31), Tule-Cuna (29), Waunana (30), Yuco-Yukpa (88) (Keyeux et al., 2002), Emberá (Colombia) (41), Ingano (52), Wayuu (59), and Zenu (69) (Mesa et al., 2000; Keyeux et al.,
2002)), nine from Colombia east of the Andes (Coreguaje (19), Curripaco (17), Guahibo-Sikuani (23), Guayabero (24), Huitoto (22), Murui-Muinane (18), Nukak (20), Piaroa (18) (Keyeux et al., 2002), and Tucano (71) (Mesa et al., 2000; Keyeux et al., 2002)),
seven from the Amazon (Belén (Brazil) (81) (Batista dos Santos et al., 1999), Brazilian North (26) (Alves-Silva et al., 2000), Gaviao (27), Xavante (25), Zoró (30) (Ward et al., 1996), Ticuna (28) (Torroni et al., 1993a), and Yanomami (97)
(Merriwether and Ferrell, 1996)), nine from the Peruvian, Bolivian, and Chilean highlands around Lake Titicaca (Atacameño (50) (Merriwether et al., 1995), Chimane (40), Ignaciano (21), Mosetén (19), Movima (22), Trinitario (33), Yuracaré
(27) (Bert et al., 2001), Aymara (98), and Quechua (51) (Merriwether and Ferrell, 1996; Bert et al., 2001)), six from northern Argentina (Mataco from the provinces of Chaco (28), Formosa (44), and Salta (55), Pilaga (40), and Toba from the provinces
of Chaco (28) and Formosa (26) (Demarchi et al., 2001)), and five from southern South America (Huilliche (89) (Merriwether and Ferrell, 1996), Mapuche-Argentina (50) (Bailliet et al., 1994), Mapuche-Chile (156) (Merriwether et al., 1995; Moraga et al., 2000),
Pehuenche (204) (Merriwether and Ferrell, 1996; Moraga et al., 2000), and Yaghan (21) (Moraga et al., 2000)). Two “Other” populations were Bella Coola (36) (Lorenz and Smith, 1996) and Brazilian Southeast (33) (Alves-Silva et al., 2000).
Haplogroup diversity for the Amerindian mtDNAs was calculated using the method of Tajima (1989), $h = \left(1 - \sum_i x_i^2\right) n/(n - 1)$, where $x_i$ is the frequency of each haplogroup and $n$ is the sample size.
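The diversity estimate can be sketched directly from haplogroup counts. The counts in the example are hypothetical and are used only to show the arithmetic.

```python
def haplogroup_diversity(counts):
    """Haplogroup diversity h = (1 - sum(x_i^2)) * n / (n - 1).

    counts: number of samples observed in each haplogroup.
    x_i is each haplogroup's frequency and n the total sample size.
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two samples are required")
    sum_sq = sum((c / n) ** 2 for c in counts)
    return (1 - sum_sq) * n / (n - 1)

# Hypothetical sample: two haplogroups with five samples each.
h_even = haplogroup_diversity([5, 5])
# Hypothetical sample: all ten samples in a single haplogroup.
h_single = haplogroup_diversity([10])
```

For the evenly split sample the sum of squared frequencies is 0.5, so h = 0.5 × 10/9 ≈ 0.556, while a sample made up of a single haplogroup has h = 0.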
All selected housing units were identified between August 6, 1999 and March 19, 2000. Based on the 1990 Census of Population and Housing, a total of 872 housing units was selected. This translated into 1,067 because
of housing growth through the decade. Eighty-one housing units were uninhabited. From the 986 remaining housing units, 876 selected individuals were contacted. Exactly 800 of these agreed to participate, for a response rate of 81.1% based on the 986 selected
individuals. The sampling procedure results for each municipality and region are detailed in Table 2.
Haplogroup identification data quality
The haplogroup identification strategy described above
allowed the detection of misdiagnoses of both motif and haplogroup-defining marker identities, and thus an estimation of the probability that any misdiagnoses may have gone undetected. The largest margin of error lies within the (+/−) motif group. Initially, all (+/−) samples were tested,
among others, for the 3592 HpaI and 2349 DpnII markers but not for markers 9070 TaqI and 16389 HinfI, which are necessary to discriminate L1c and L2, respectively, from all other mtDNAs within L (Table 1). Thus, the samples belonging to L1b (+3592 HpaI/+2349 DpnII) and L3e (−3592 HpaI/+2349 DpnII) were quickly identified, while the samples with the +3592 HpaI/−2349
DpnII profile had to be subjected to a second round of tests for markers
9070 TaqI and 16389 HinfI.
HVR-I sequencing of those samples with no haplogroup-defining markers showed that the +3592 HpaI
motif of one of the 79 samples with the +3592 HpaI/−2349 DpnII profile initially went undetected. This gave us an experimental estimate of 1/79 for the frequency with
which the +3592 HpaI motif went undetected. Using such a frequency and a base of 49 L1b samples, we
calculated a probability of 53.6% that none of the 38 samples identified as belonging to haplogroup L3e (−3592 HpaI/+2349
DpnII) actually belong to L1b (+3592 HpaI/+2349 DpnII). Using bases of 50, 51, and 52 L1b samples,
we calculated probabilities of 33.9%, 10.9%, and 2.4% that one, two, or three samples identified as L3e actually belong to L1b.
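The quoted percentages are consistent with a simple binomial model, sketched below. This is a reconstruction under stated assumptions rather than the authors' documented calculation: each true L1b sample is assumed to lose its 3592 HpaI call independently with probability 1/79, and the base for k missed samples is taken as 49 + k, as the bases of 49 to 52 in the text suggest. The function and parameter names are illustrative.

```python
from math import comb

def prob_k_missed(k, detected_l1b=49, miss_rate=1 / 79):
    """Probability that exactly k true L1b samples were scored as L3e.

    Assumes independent misses at `miss_rate` per sample and a true
    L1b count of detected_l1b + k (the "base" used for each k).
    """
    n = detected_l1b + k
    return comb(n, k) * miss_rate ** k * (1 - miss_rate) ** (n - k)

# Probabilities for zero to three misclassified samples.
probabilities = [prob_k_missed(k) for k in range(4)]
```

Under these assumptions the four values round to 53.6%, 33.9%, 10.9%, and 2.4%, matching the figures reported above.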
There are two other scenarios by which misdiagnoses could occur. One is the combination of a
misdiagnosis of the sample motif group with a false positive for a haplogroup-defining marker. The other is the occurrence of both a false negative and a false positive for haplogroup-defining markers with the same sample. Based on the detection of 11 motif
misdiagnoses (three samples misdiagnosed as (−/−), seven as (+/−), and one as (+/+)), six false positives (one each for the markers corresponding to A, D, HV, L, L3b, and J/T), 11 false negatives (three each for the markers of A
and J/T, two for that of C,
and one each for those of HV, L, and L3b), the number of samples belonging to each motif group (377 (−/−), 233 (+/−), and 190 (+/+)), and the number of samples belonging to each
haplogroup (Table 3), we estimate that the probability that no misdiagnoses were made under either of these two scenarios is 86.0%, and that the probability that two or more misdiagnoses were made is insignificant.
Table 3 shows the distribution by municipality of all haplogroups found, their frequencies, and their biological origin. Only six of the 800 samples were confirmed, through HVR-I sequencing, as having 10394 DdeI/10397 AluI motifs different from those corresponding to
their haplogroups (Table 1). Specifically, we found two L3e, one L2*, one L2a, and one L1c sample to have (−/−) instead of
(+/−) motifs. In addition, one haplogroup C sample had a motif different from the (+/+) expected.
On six occasions, samples were confirmed as having more than one haplogroup-defining marker. The 8616 Dpn II marker that characterizes haplogroup L3d was found in one L0 and one L1c sample. Furthermore, two L1b samples had the 4216 Nla III marker that characterizes haplogroups J and T. These were all regarded as belonging to macroparagroup L because of the known stability of the L-defining 3592 Hpa I marker. The true haplogroup identities of the remaining two samples were determined from their HVR-I sequences. One haplogroup H sample had the 9-bp deletion between the tRNA Lys and COII genes that characterizes haplogroup B, and one haplogroup C sample had the 7598 Hha I mutation that characterizes Asian haplogroup E. Their respective HVR-I sequences were 16093-16362 and 16221-16223-16261-16298-16325-16327. Thus, the first lacked the transitions at positions 16189 and 16217 that characterize haplogroup B (Ginther et al., 1993; Horai et al., 1993), and the second possessed the haplogroup C-specific transitions at positions 16298 and 16327 as well as the Amerindian-specific transition at 16325 (Torroni et al., 1993b).

TABLE 2. Sampling procedure results categorized by region

Region   Municipality   Number of      Uninhabited    Inhabited      Agreed to     Selected, not  Declined to
                        housing units  housing units  housing units  participate   contacted      participate
Metro    Arecibo        33             3              30             26 (86.7%)    2 (6.7%)       2 (6.7%)
         Bayamón        60             7              53             43 (81.1%)    6 (11.3%)      4 (7.5%)
         Caguas         40             3              37             30 (81.1%)    7 (18.9%)      0
         Carolina       51             3              48             39 (81.3%)    4 (8.3%)       5 (10.4%)
         Guaynabo       23             2              21             16 (76.2%)    5 (23.8%)      0
         Mayagüez       33             3              30             26 (86.7%)    0              4 (13.3%)
         Ponce          37             3              34             27 (79.4%)    0              7 (20.6%)
         San Juan       118            5              113            78 (69.0%)    22 (19.5%)     13 (11.5%)
         Subtotal       395            29             366            285 (77.9%)   46 (12.6%)     35 (9.6%)
North    Florida        39             6              33             29 (87.9%)    4 (12.1%)      0
         Toa Baja       28             1              27             22 (81.5%)    1 (3.7%)       4 (14.8%)
         Vega Alta      50             3              47             38 (80.9%)    4 (8.5%)       5 (10.6%)
         Vega Baja      41             6              35             25 (71.4%)    7 (20.0%)      3 (8.6%)
         Subtotal       158            16             142            114 (80.3%)   16 (11.3%)     12 (8.5%)
East     Humacao        72             4              68             51 (75.0%)    11 (16.2%)     6 (8.8%)
         Loíza          46             1              45             37 (82.2%)    4 (8.9%)       4 (8.9%)
         Patillas       26             2              24             21 (87.5%)    1 (4.2%)       2 (8.3%)
         San Lorenzo    43             3              40             31 (77.5%)    7 (17.5%)      2 (5.0%)
         Subtotal       187            10             177            140 (79.1%)   23 (13.0%)     14 (7.9%)
South    Guayanilla     24             6              18             17 (94.4%)    0              1 (5.6%)
         Juana Díaz     23             1              22             19 (86.4%)    2 (9.1%)       1 (4.5%)
         Peñuelas       13             4              9              9 (100%)      0              0
         Yauco          27             2              25             22 (88.0%)    0              3 (12.0%)
         Subtotal       87             13             74             67 (90.5%)    2 (2.7%)       5 (6.8%)
West     Aguadilla      26             2              24             23 (95.8%)    1 (4.2%)       0
         Hormigueros    33             1              32             28 (87.5%)    2 (6.3%)       2 (6.3%)
         Moca           27             2              25             23 (92.0%)    0              2 (8.0%)
         San Sebastián  29             2              27             23 (85.2%)    1 (3.7%)       3 (11.1%)
         Subtotal       115            7              108            97 (89.8%)    4 (3.7%)       7 (6.5%)
Central  Barranquitas   38             2              36             30 (83.3%)    5 (13.9%)      1 (2.8%)
         Cayey          31             1              30             22 (73.3%)    7 (23.3%)      1 (3.3%)
         Corozal        29             1              28             23 (82.1%)    4 (14.3%)      1 (3.6%)
         Jayuya         27             2              25             22 (88.0%)    3 (12.0%)      0
         Subtotal       125            6              119            97 (81.5%)    19 (16.0%)     3 (2.5%)
Total                   1,067          81             986            800 (81.1%)   110 (11.2%)    76 (7.7%)
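The percentages in Table 2 appear to be taken relative to the number of inhabited housing units in each row, not the total number of units. The following minimal Python sketch illustrates that reading for three rows copied from the table; the function name `outcome_rates` and the dictionary layout are illustrative choices, not part of the original study.

```python
# Three rows of Table 2: (inhabited units, agreed, selected but not contacted, declined).
rows = {
    "San Juan": (113, 78, 22, 13),
    "Aguadilla": (24, 23, 1, 0),
    "Barranquitas": (36, 30, 5, 1),
}


def outcome_rates(inhabited, agreed, not_contacted, declined):
    """Return the three outcome percentages of inhabited units, to one decimal."""
    if agreed + not_contacted + declined != inhabited:
        raise ValueError("the three outcomes must add up to the inhabited units")
    return tuple(
        round(100 * n / inhabited, 1) for n in (agreed, not_contacted, declined)
    )


rates = {name: outcome_rates(*counts) for name, counts in rows.items()}
for name, (agreed, not_contacted, declined) in rates.items():
    print(f"{name}: {agreed}% agreed, {not_contacted}% not contacted, {declined}% declined")
```

Rows whose percentages fall exactly on a rounding tie (for example 2 of 32 units, 6.25%) may differ in the last digit from the printed table, because Python's `round` rounds ties to the nearest even digit.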
Samples that did not test positive for any haplogroup-defining marker were identified by sequencing their HVR-I as well as specific sites in their coding regions. Nine (+/-) mtDNAs were classified as L3* for having transitions at sites 10873 and 12705. The HVR-I sequences of two (-/-) mtDNAs not having transitions at 10873 or 12705 were 16288-16311 and 16126-16189-16362. The first (-/-) mtDNA had a transition at site 11719 but not at 16223, and was thus classified as belonging to R. The transitions at sites 16126 and 16362 showed that
the second (-/-) mtDNA belonged to JT or (pre-HV)1
(Macaulay et al., 1999). The absence of a transition at 11719 showed that it belonged to (pre-HV)1 (Richards et al., 2000). Finally, the HVR-I sequence of one (+/+) sample that did not exhibit any haplogroup-defining marker was 16086-16183-16189-16223-16278-16298-16325-16327.
Thus, it contained the 16223, 16298, 16325, and 16327 transitions specific for Native American haplogroup C, and transitions 16183, 16189, 16223, and 16278, which are found in most haplogroup X mtDNAs (Brown et al., 1998).
However, it possessed the 10397 Alu I motif specific to macrohaplogroup
M, to which haplogroup C but not haplogroup X belongs. This motif was shown to be very stable (Kivisild et al., 2002; Kong et al., 2003), and we thus regarded this mtDNA as belonging to haplogroup C. One haplogroup C mtDNA lacking the
13262 Alu I marker was previously described from the Amazonian Makiritare (Torroni et al., 1993a),
and (+/+) mtDNAs lacking defining markers for haplogroups C and D seem to be common in Colombia (Keyeux et al., 2002; Rodas et al., 2003). No mtDNAs belonging to haplogroups E, F, G, I, M, N, JT, W, or X were found in our set of 800 samples.
Haplogroup U subdivisions
Among all haplogroups found here, U is the only one that was reported in significant numbers in more than one continental region (Torroni et al., 1996). It was thus necessary to study
such mtDNAs in more detail to identify their biological origin. The HVR-I sequences of the 27 samples belonging to haplogroup U segregate them into 10 types (Table 4). Although haplogroup U is mostly regarded as a West Eurasian haplogroup, it is apparent that
nine of these samples originate from sub-Saharan Africa. All share the same sequence type, which has not been found in Europe or the Near East despite the thousands of samples from these areas for which HVR-I was sequenced (Alves-Silva et al., 2000; Richards
et al., 2000; Finnilä et al., 2001; Malyarchuk et al., 2002). However, it was found in one out of 60 Fulbe sequences (Watson et al., 1997), and in one of 38 and 23 Wolof and Serer sequences, respectively (Rando et al., 1998). We classify it as a member of
clade U5b* because of its 16189, 16192, and 16270 motif (Richards et al., 2000). Its distinction is the addition of a transition at position 16320. We designate it as clade U5b2 to represent a sub-Saharan African clade with a transition at 16320 as its signature.
Eleven samples seem to originate from North Africa and the Canary Islands. Two samples sharing the same sequence exhibit the 16163 motif, which is diagnostic for the Native Canarian-specific clade U6b (Rando et al., 1999). Nine samples segregate
into three North African sequence types. The most common type (16224-16270), comprising seven samples, may correspond to the 16093-16224-16270 type of apparently North African ancestry that was found in two Canarian Islands (Pinto et al., 1996; Rando et al.,
1999), because our sequencing reactions did not extend to the left of the 16154 site in these samples. No other mtDNAs have been found with the 16093-16224-16270 or the 16224-16270 sequence types elsewhere. Of the two remaining North African sequence types,
one (16224-16261-16270) may have derived directly from the most common type, as it differs from it at only one site. The remaining one has been found mainly in North Africa, but also in the Near East and sub-Saharan Africa. Its highest frequency was reported
in the Berber-speaking Mozabites of northern Algeria: 10 out of 85 samples (Côrte-Real et al., 1996). Other populations with lower frequencies are Moroccan Berbers and non-Berbers (Pinto et al., 1996; Rando et al., 1998), Egyptians (Krings et al., 1999),
Syrians (Richards et al., 2000), and some East and West African tribes (Watson et al., 1997). It was also found in two of 54 samples from Portugal (Côrte-Real et al., 1996), but we believe its presence in the Iberian Peninsula is due to migrations related
to the slave trade.
Two samples share the motif 16189-16362. They likely belong to the U2 clade, which is characterized by the 16051 motif (Kivisild et al., 1999; Macaulay et al., 1999), a site to which our sequencing reaction did not extend.
However, most West Eurasian U2 mtDNAs, but not other haplogroup U clades, present substitutions at positions 16129 and 16362. Since these samples do not present motifs that would classify them under any other clade, but possess the 16362 transition, they likely
belong to clade U2. Clade U2 is virtually absent in North Africa and is found in the Near East at somewhat higher frequencies than in Europe. However, the precise sequence type is found at a higher frequency in the Iberian Peninsula than in any Near Eastern
population except the Kurdish (Richards et al., 2000). PC analysis does not assign the Puerto Rican West Eurasian population a decisively higher affinity to the Kurdish or the European Mediterranean Western Region population (see below). Thus, we can only
conclude that these samples should originate either in the Iberian Peninsula or the Near East.
The remaining four sequence types, encompassing only five samples, are likely of European origin. One differs from the CRS only by a transition
at position 16192. Although the exact sequence type has not been reported elsewhere, it is regarded as European in origin because of the instability of the 16192 site in haplogroup U mtDNAs and the fairly high frequency of the otherwise resulting sequence
type (CRS) in the Iberian Peninsula. The remaining three European sequence types are either particular U5b* types common only throughout Europe or belong to subclade U5a1a, which evolved in Europe (Richards et al., 2000).
The HVR-I sequences of the nine (+/-) samples for which no haplogroup-specific markers were found
are shown in Table 5. They segregate into eight sequence types sharing the 16223-16311 motif. Three of the sequence types possess the 16209 transition diagnostic of the L3f clade and the 16292 transition of subclade L3f1 (Salas et al.,
2002). Only these three sequences showed a 1-bp deletion in the 5-bp T-stretch that runs from 15940–15944 in the CRS. This deletion is not the result of errors in the CRS (Andrews et al., 1999); it may play a significant role
in RNA translation efficiency, as it makes the TΨC arm loop of the tRNA Thr only
two nucleotides long, and may become a useful phylogenetic marker for group L3* clades or to further subdivide subclade L3f1.
Another sequence type contains the 16293T-16355-16362 motif of clade L3g. The four remaining
sequence types encompass five samples and cannot be grouped into any L3* clade. These thus remain classified as L3*. However, four of these five samples seem to be not too distantly related, as they all share the 16256A transversion. Two of them also share
transitions at positions 16129 and 16362.
Geographic distribution of mtDNAs by biological ancestry
Little change is observed when biological ancestry
frequencies are corrected by sample weight. Frequencies and 95% confidence intervals of 61.0 ± 3.4% Amerindian, 27.5 ± 3.1% African, 11.4 ± 2.2% West Eurasian, and 0.1 ± 0.2% Asian (Table 3) are corrected to 61.3 ± 3.4% Amerindian,
27.2 ± 3.1% African, 11.5 ± 2.2%
West Eurasian, and 0.0% Asian (Table 6).
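The text does not state how the 95% confidence intervals were obtained. As a hedged illustration, the normal-approximation (Wald) interval for a proportion, with n = 800 samples, gives half-widths that match the uncorrected values quoted above. The function name `wald_half_width_pct` and the choice of the Wald formula are assumptions made for this sketch, not the authors' documented procedure.

```python
import math


def wald_half_width_pct(percent, n, z=1.96):
    """Half-width, in percentage points, of a Wald interval for a proportion.

    percent: observed frequency in percent; n: sample size;
    z: normal quantile (1.96 for a 95% interval).
    """
    p = percent / 100
    return 100 * z * math.sqrt(p * (1 - p) / n)


# Uncorrected ancestry frequencies quoted in the text, with n = 800 samples.
frequencies = {
    "Amerindian": 61.0,
    "African": 27.5,
    "West Eurasian": 11.4,
    "Asian": 0.1,
}

for ancestry, pct in frequencies.items():
    half = wald_half_width_pct(pct, 800)
    print(f"{ancestry}: {pct} +/- {half:.1f}%")
```

For frequencies very close to zero, such as the single Asian sample, the Wald interval is known to behave poorly, so the match there should be read as arithmetic agreement only.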
Amerindian mtDNAs are the most common in all municipalities except Loíza, where African mtDNAs are more frequent, and Cayey, where the population is equally divided into African
and Amerindian mtDNAs. Amerindian mtDNA frequencies are 50% or higher in all municipalities except Loíza, San Juan, and Carolina (Table 3).
In addition, West Eurasian frequencies are low in all municipalities (0–17.9%).
Thus, in a triangular graph with axes representing biological ancestries, ancestry frequencies cluster close to the vertex where the Amerindian frequency equals one, and scatter next to the side defined by zero West Eurasian frequency, toward the vertex where
African frequency equals one (Fig. 2). A negative Pearson correlation (r = -0.919) between African and Amerindian frequencies is observed that is
significant at the 0.01 level (two-tailed test). That is, the biological ancestry frequency of municipalities can be virtually described by stating only their African or Amerindian frequencies.
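The reasoning behind this statement can be sketched numerically: when the West Eurasian frequency is small and varies little, the African frequency is close to one minus the Amerindian frequency, so the two are strongly negatively correlated. The frequencies below are invented purely for illustration and are not taken from Table 3; `pearson_r` is a plain implementation of the standard Pearson formula.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL frequencies for five made-up municipalities (not study data).
amerindian = [0.75, 0.68, 0.60, 0.52, 0.40]
west_eurasian = [0.10, 0.12, 0.08, 0.13, 0.11]
# The three ancestry frequencies of a municipality add up to one.
african = [1 - a - w for a, w in zip(amerindian, west_eurasian)]

r = pearson_r(amerindian, african)
print(f"r = {r:.3f}")
```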
Figure 3 divides the
28 sampled municipalities into 12 categories according to their Amerindian mtDNA frequencies, and divides Puerto Rico by longitude 66°16 West, as 12
of the 13 sugar mills that worked throughout the 16th century were built east of it. It can be observed that the three municipalities with the lowest Amerindian frequencies are next to each other in San Juan and further east. Further,
all 11 municipalities east of longitude 66°16 West are among the 14 municipalities with the lowest Amerindian frequencies. There
is a highly significant deviation from the null hypothesis that frequencies for all ancestries are the same east and west of longitude 66°16 West (Pearson χ² = 43.70, df = 2, P < 0.001). χ² tests also
show highly significant deviations from null hypotheses of equal frequencies on each side of longitude 66°16 West for Amerindian
(Pearson χ² = 41.72, df = 1, P < 0.001) and African (Pearson χ² = 34.40, df = 1, P < 0.001) mtDNAs. African
mtDNAs are more frequent in the east than in the west; the reverse is true for Amerindian mtDNAs. No significant difference is found for West Eurasian mtDNAs.
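The east/west comparisons are Pearson chi-square tests on contingency tables of sample counts. The sketch below shows the calculation on an invented 2 x 3 table (region by ancestry), since the underlying counts are not given in this section, and then converts the three reported statistics into P values using the closed-form tail probabilities available for 1 and 2 degrees of freedom. The names `pearson_chi2` and `chi2_p_value` are illustrative, and the counts are hypothetical.

```python
import math


def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi2_p_value(chi2, df):
    """Upper-tail probability of the chi-square distribution for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("closed form implemented only for df = 1 or 2")


# HYPOTHETICAL counts: rows are east and west of the dividing longitude,
# columns are Amerindian, African, and West Eurasian mtDNAs.
example = [[40, 40, 10], [80, 20, 10]]
chi2, df = pearson_chi2(example)
print(f"chi2 = {chi2:.2f}, df = {df}, P = {chi2_p_value(chi2, df):.2g}")

# Statistics reported in the text, checked against the 0.001 threshold.
reported = [(43.70, 2), (41.72, 1), (34.40, 1)]
p_values = [chi2_p_value(stat, d) for stat, d in reported]
```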
Interestingly, the geographic distribution by biological
ancestry does not fit expectations based on traditional history that place Amerindians fleeing to the mountains and African slaves working in sugar plantations on the coasts. The three municipalities with the highest Amerindian frequencies are coastal (Fig.
3), and χ² tests show that Amerindian frequencies in noncoastal municipalities
are not significantly higher than those in coastal ones, and that African frequencies are not significantly higher in coastal than noncoastal municipalities.