Arabidopsis Bay-0 × Sha GEM Haplotype Map



Supplemental Data to the paper:
Marilyn A.L. West, Hans van Leeuwen, Alexander Kozik, Daniel J. Kliebenstein, R. W. Doerge, Dina A. St.Clair, Richard W. Michelmore 2006.
High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Research 16: 787-795.



Summary

We detected 1431 genes that exhibited a two-fold or greater expression difference between the Arabidopsis thaliana accessions Bay-0 and Shahdara (Sha), using the microarray data described in our paper (West et al. 2006). Of these 1431 genes, 324 showed non-overlapping parental distributions of expression values across the 16 GeneChips for each genotype. These 324 potential gene expression markers (GEMs) were evaluated using replicated gene expression data from 148 Bay-0 × Sha F9 RIL progeny to determine whether the expression levels of this set of genes can be used to predict genotype. After applying several filtering criteria we obtained 188 GEMs corresponding to 187 Arabidopsis genes (At5g09810 is represented by two Affymetrix probesets). These 188 GEMs were integrated with the 38 microsatellite reference markers from the Bay-0 × Shahdara RIL population (Loudet et al. 2002).





Methods and Results


For the 324 genes with non-overlapping parental distributions, the maximum expression value of the lower-expressing genotype (Max) and the minimum expression value of the higher-expressing genotype (Min) were used to genotype the 148 RILs (see Figure below). We termed this method the 'parental min-max method', as described in our paper. We also detected GEMs using another method, the 'RIL distribution method', as described in our paper and presented here.

Part A: Slicing scheme to adjust Min and Max values and allele assignment for the GEMs.

If RIL gene expression values fall between the parental distributions, the RIL genotypes are ambiguous and are scored as missing data. To minimize the number of missing data scores, we adjusted the distributions used for allele assignments with a “slicing” scheme. If the Bay-0 and Sha gene expression value distributions were far apart, with a gap greater than 1.1, one slice was added to the appropriate end of each parental distribution to bring the parental distributions closer together. This step was repeated until the adjusted gap was equal to 1.1. For gene expression distributions where Sha is the higher-expressing parent, slices were iteratively added to the lower end of the Sha distribution to decrease the Sha Min value, resulting in an adjusted Sha Min value. Likewise, slices were added to the upper end of the Bay-0 distribution to increase the Bay-0 Max value, resulting in an adjusted Bay-0 Max value.

If the Bay-0 and Sha parental distributions were close together, with a gap less than 1.1, slices were subtracted iteratively from each parental distribution, until the gap was equal to 1.1.
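The slicing adjustment can be sketched in Python as follows. This is a minimal illustration and not the code of Affy_ELP_Translator_V017_Numeric.py: the function name adjust_bounds, the slice size of 0.05, and the reading of "gap" as the difference between the higher parent's Min and the lower parent's Max are all assumptions. Only the target gap of 1.1 is taken from the text.

```python
def adjust_bounds(low_max, high_min, target_gap=1.1, slice_size=0.05):
    """Return adjusted (low_max, high_min) whose gap equals target_gap.

    low_max  : Max of the lower-expressing parent
    high_min : Min of the higher-expressing parent
    slice_size is a hypothetical value; the source does not state it.
    """
    if high_min <= low_max:
        raise ValueError("parental distributions must not overlap")
    while abs((high_min - low_max) - target_gap) > 1e-9:
        # excess > 0: parents too far apart; excess < 0: parents too close
        excess = (high_min - low_max) - target_gap
        # never move further than needed to reach the target gap
        step = min(slice_size, abs(excess) / 2.0)
        if excess > 0:
            low_max += step    # add a slice to the lower parent
            high_min -= step   # add a slice to the higher parent
        else:
            low_max -= step    # subtract a slice from the lower parent
            high_min += step   # subtract a slice from the higher parent
    return low_max, high_min
```

In this sketch both boundaries move by the same amount in each iteration, so the midpoint between the two parental distributions is preserved.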




For expression markers where the Bay-0 allele gave higher expression values than the Sha allele in the parental control microarrays, RILs were assigned a Sha genotype if the gene expression value was less than Sha Max, and a Bay-0 genotype if the gene expression value was greater than Bay-0 Min. RILs with a gene expression value between Sha Max and Bay-0 Min were scored as "missing data" (dash "-" in data file). When the Sha allele gave higher expression values than the Bay-0 allele, genotypes were assigned in the same way with the roles of the two parents reversed. A Bay-0 or Sha genotype was assigned only when both replicate microarrays gave identical genotypes. If one replicate had missing data, the RIL was also scored as missing data (-) for the corresponding datapoint.
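The allele assignment rules above can be expressed as a short Python sketch. The function names call_chip and call_ril are hypothetical, and the example values are invented for illustration; the genotype codes "A" (Shahdara), "B" (Bay-0) and "-" (missing data) follow the coding used in the data files.

```python
MISSING = "-"

def call_chip(value, low_max, high_min, low_allele, high_allele):
    """Genotype call for a single microarray.

    low_max  : (adjusted) Max of the lower-expressing parent
    high_min : (adjusted) Min of the higher-expressing parent
    """
    if value < low_max:
        return low_allele
    if value > high_min:
        return high_allele
    return MISSING  # value falls in the gap between the parents

def call_ril(rep1, rep2, low_max, high_min, low_allele, high_allele):
    """Genotype call for a RIL from its two replicate microarrays.

    A genotype is assigned only if both replicates give the same call;
    otherwise (including one missing replicate) the RIL is missing data.
    """
    g1 = call_chip(rep1, low_max, high_min, low_allele, high_allele)
    g2 = call_chip(rep2, low_max, high_min, low_allele, high_allele)
    return g1 if g1 == g2 else MISSING

# Hypothetical marker where Bay-0 ("B") is the higher-expressing parent:
# Sha Max = 2.0, Bay-0 Min = 3.1
genotype = call_ril(1.8, 1.9, low_max=2.0, high_min=3.1,
                    low_allele="A", high_allele="B")
```

For a marker where Sha is the higher-expressing parent, the same functions apply with low_allele="B" and high_allele="A".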

The custom Python script Affy_ELP_Translator_V017_Numeric.py was used to perform the slicing scheme and the allele assignments. Five input files were used for this script:


1. probe intensity range [ ELP_INPUT1_AFFY_RANGE_sw_only_stats_mod.txt ]:
columns 7 and 8 contain the Min and Max values for the Bay-0 parental microarrays;
columns 9 and 10 contain the Min and Max values for the Sha parental microarrays.
2. Affymetrix ID to Arabidopsis ID conversion [ ELP_INPUT2_AFFY_ID_sw_affy_ath_conversion.txt ]:
first column - Affymetrix probe set ID; second column - Arabidopsis gene ID.
3. expression values [ ELP_INPUT3_AFFY_VALUES_00__sw_only_all_643.tab ]:
scaled gene expression values for the RILs and four (2 Bay-0 + 2 Sha) parental microarrays. Affymetrix probe sets are in rows; chip IDs are in columns.
4. genotyping data for MS molecular markers [ ELP_INPUT4_MARKERS_MS_Molec_Markers.tab ]:
genotyping data for microsatellite molecular markers. RIL IDs are in rows; marker IDs are in columns.
5. RIL keys (conversion RIL ID - Chip ID) [ ELP_INPUT5_CHIP_ID_sw_chip_keys148_RILs_BaySha.txt ]:
first column - numerical order; second column - RIL ID; third column - chip ID of biological replicate 1; fourth column - chip ID of biological replicate 2.

This script created the output file ELP_OUTPUT_643SW.exp_master.tab, which contains the genotype scores assigned from the gene expression data. This file was converted into the Master locus file ath_elp_july_2005.ril.loc by removing the last four columns, which hold the genotyping data for the parental accessions (Bay-0 and Sha). In parallel, the file ath_elp_july_2005.all.loc, containing data for the RILs as well as the four parental (Bay-0 and Sha) controls, was created for use in graphical genotyping (see below). The other files created by this script are for debugging purposes.


Part B: Filtering of GEMs.

The Master locus file was processed by the Python MadMapper program http://cgpdb.ucdavis.edu/XLinkage/MadMapper/ (version V248) to filter the dataset. Markers were removed if they had >10% missing data, or if they displayed pronounced allele distortion (>1:3; the expectation of allele segregation in a RIL population is 1:1). This Python program (Python_MadMapper_V248_RECBIT_007.py) generated a "clean" locus file ath_elp_july_2005.ril_good.loc with the 188 GEMs which was used for further mapping studies.
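The two filtering criteria can be illustrated with a small Python sketch. This is not MadMapper code: the function name keep_marker is hypothetical, and the reading of ">1:3" as "the major allele is more than three times as frequent as the minor allele among scored RILs" is an assumption.

```python
def keep_marker(genotypes, max_missing=0.10, max_ratio=3.0):
    """Return True if a marker passes both filters.

    genotypes : sequence of "A", "B" or "-" calls, one per RIL
    max_missing : maximum tolerated fraction of missing data (10%)
    max_ratio : maximum tolerated major:minor allele ratio (assumed 3:1)
    """
    calls = list(genotypes)
    n = len(calls)
    if n == 0:
        return False
    a = calls.count("A")
    b = calls.count("B")
    missing = n - a - b
    if missing / n > max_missing:
        return False              # too much missing data
    minor, major = sorted((a, b))
    if minor == 0:
        return False              # marker does not segregate
    return major / minor <= max_ratio
```

For example, a marker scored as 70 "A", 70 "B" and 8 "-" across the 148 RILs passes both filters, whereas a marker scored as 120 "A" and 28 "B" would be removed for allele distortion under the assumed threshold.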

(Step by step procedures of data and file handling can be found here: 00_README_STEP_BY_STEP_ELP_PROCESSING.txt)


Part C. Calculation of genetic distances.

The integrated set of 226 GEMs and microsatellite markers was analysed with JoinMap 3.0 (Van Ooijen and Voorrips, 2001) to group the markers into linkage groups and calculate genetic distances between the markers in each linkage group. The default JoinMap options were used. The result can be seen in Table 1.


Table 1. Characteristics of an integrated genetic map derived from the Bay-0 × Shahdara RIL population (148 RILs).
                      Markers¹        Size (cM)   Marker density (cM)   Maximum gap (cM)
Linkage Group 1       52 (43+9)       89.2        1.72                  8.57
Linkage Group 2       27 (20+7)       65.2        2.42                  8.63
Linkage Group 3       49 (43+6)       72.5        1.48                  7.32
Linkage Group 4       44 (36+8)       72.8        1.65                  14.67
Linkage Group 5       54 (46+8)       93.8        1.74                  14.83
All linkage groups    226 (188+38)    393.5       1.74                  14.83
¹ In parentheses: the number of our GEMs and the number of microsatellite markers (Loudet et al. 2002), respectively.
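The marker density column appears to be the map size divided by the number of markers. This is an inference from the numbers in Table 1, not something stated in the text. A quick Python check with values copied from the table (the names TABLE1 and marker_density are ours):

```python
# (number of markers, size in cM, reported marker density in cM)
TABLE1 = {
    "Linkage Group 1": (52, 89.2, 1.72),
    "Linkage Group 2": (27, 65.2, 2.42),
    "Linkage Group 3": (49, 72.5, 1.48),
    "Linkage Group 4": (44, 72.8, 1.65),
    "Linkage Group 5": (54, 93.8, 1.74),
    "All linkage groups": (226, 393.5, 1.74),
}

def marker_density(n_markers, size_cm):
    """Average spacing between markers, assumed to be size / marker count."""
    return size_cm / n_markers

for name, (n, size, reported) in TABLE1.items():
    computed = marker_density(n, size)
    print(f"{name}: computed {computed:.3f} cM, reported {reported} cM")
```

Under this assumption the computed values agree with the reported densities to within about 0.01 cM; the small differences are presumably due to rounding of the map sizes.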



Part D. Data Visualization

The genotype scores of the 148 RILs for the five linkage groups were used to calculate pairwise distances between markers using the MadMapper software (http://cgpdb.ucdavis.edu/XLinkage/MadMapper/, Python_MadMapper_V248_RECBIT_007.py). The CheckMatrix software (http://cgpdb.ucdavis.edu/XLinkage/MadMapper/, py_matrix_2D_V248_RECBIT.py) was then used to create a graphical genotyping map and a heat map of linkage values.
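As an illustration of what a pairwise comparison between two markers involves, the sketch below computes the fraction of RILs in which two markers carry different alleles, ignoring RILs with missing data at either marker. This is a generic measure chosen for illustration; it is not taken from MadMapper, whose actual scoring is not described on this page, and the function name is hypothetical.

```python
def recombination_fraction(marker1, marker2):
    """Fraction of informative RILs with different alleles at two markers.

    marker1, marker2 : sequences of "A", "B" or "-" calls in the same RIL order.
    Returns None if no RIL is scored at both markers.
    """
    informative = [(x, y) for x, y in zip(marker1, marker2)
                   if x != "-" and y != "-"]
    if not informative:
        return None
    recombinants = sum(1 for x, y in informative if x != y)
    return recombinants / len(informative)
```

Tightly linked markers give values close to 0, while unlinked markers in a RIL population are expected to give values around 0.5.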

For the visualization of all five linkage groups together, physical positions of the genes corresponding to the GEMs were obtained from the Arabidopsis annotation Version 4, TIGR release May 2003 (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PREVIOUS_RELEASE_VERSIONS/release4.tar.gz). Physical positions of the 38 reference microsatellite markers previously mapped in this RIL population by Loudet et al. (2002) were obtained by BLAST searches (Altschul et al. 1997) of the PCR primer sequences of each microsatellite marker against the same annotation release. For the visualization of the individual linkage groups, genetic distances were obtained as described above in Part C.


Order of the markers according to genetic maps constructed with JoinMap:
[Figure table, one row for each of Linkage Groups 1 to 5. Columns: CheckMatrix heat map of linkage values; Graphical genotype map of RILs¹; Circular map of linkage between markers².]
¹ The last four columns are two Sha and two Bay-0 controls, respectively.
² For the circular maps a linkage cutoff value of 0.9 was used.



Genetic map constructed with JoinMap. All five linkage groups are concatenated together:
[Figure table, one row for all five linkage groups. Columns: CheckMatrix heat map of linkage values; Graphical genotype map of RILs¹; Circular map of linkage between markers².]
¹ The last four columns are two Sha and two Bay-0 controls, respectively.
² For the circular maps a linkage cutoff value of 0.9 was used.



Order of the markers according to the physical order of the genes in the Columbia genome:
[Figure table, one row for all five linkage groups. Columns: CheckMatrix heat map of linkage values; Graphical genotype map of RILs¹; Circular map of linkage between markers².]
¹ The last four columns are two Sha and two Bay-0 controls, respectively.
² For the circular maps a linkage cutoff value of 0.9 was used.



Part E. Data files

Table 2. Data files of haplotypes and genetic distances of the map of the Bay-0 × Shahdara RIL population (148 RILs).
                                  Haplotypes¹                        Genetic Distances (cM)²
Linkage Group 1                   ath_gem_july_2005_LG_1_JM.loc      ath_gem_july_2005_LG_1_JM.map
Linkage Group 2                   ath_gem_july_2005_LG_2_JM.loc      ath_gem_july_2005_LG_2_JM.map
Linkage Group 3                   ath_gem_july_2005_LG_3_JM.loc      ath_gem_july_2005_LG_3_JM.map
Linkage Group 4                   ath_gem_july_2005_LG_4_JM.loc      ath_gem_july_2005_LG_4_JM.map
Linkage Group 5                   ath_gem_july_2005_LG_5_JM.loc      ath_gem_july_2005_LG_5_JM.map
All linkage groups (genetic)³     ath_gem_july_2005_LG_ALL_JM.loc    ath_gem_july_2005_LG_ALL_JM.map⁴
All linkage groups (physical)⁵    ath_elp_july_2005_Phys.loc         ath_elp_july_2005_Phys.map⁶
¹ First column contains marker IDs. First row contains column numbers, second row contains Loudet's RIL numbers, third row contains the microarray ID of replicate 1, fourth row contains the microarray ID of replicate 2. "A" is the Shahdara genotype, "B" is the Bay-0 genotype, and "-" is missing data.
² Genetic distances calculated with JoinMap 3.0, see details in Part C. First column contains marker IDs; second column contains genetic distances (cM).
³ All five linkage groups are concatenated together.
⁴ First column contains marker IDs; second column contains accumulated genetic distances over all linkage groups; third column contains genetic distances per linkage group.
⁵ Based on physical distances, see details in Part D.
⁶ First column contains the linkage group number. Second column contains marker IDs. Third column contains accumulated physical distances (Mb) over all linkage groups. Fourth column contains physical distances (Mb) per linkage group. Fifth column contains the Arabidopsis BAC ID for the gene corresponding to the marker. Sixth column indicates the orientation of the gene in the genome. Seventh column contains physical distances (bp) per linkage group.
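The haplotype (.loc) layout described in the first footnote can be read with a few lines of Python. This is a sketch under assumptions: the function name parse_loc is hypothetical, the files are assumed to be tab-delimited, and each of the four header rows is assumed to carry a label in the first column. The footnote does not confirm either of the last two points.

```python
def parse_loc(lines):
    """Parse haplotype rows laid out as described in the first footnote.

    Rows 1 to 4 are headers (column numbers, RIL numbers, microarray IDs
    of replicates 1 and 2); every later row is a marker ID followed by
    one genotype call ("A", "B" or "-") per RIL.
    Returns (ril_ids, genotypes) where genotypes maps marker ID to calls.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, data = rows[:4], rows[4:]
    ril_ids = header[1][1:]                     # second row: RIL numbers
    genotypes = {row[0]: row[1:] for row in data}
    return ril_ids, genotypes
```

With a real file, the function could be called as parse_loc(open(path)), where path points to one of the .loc files listed in Table 2.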




Part F. Diagonal Dot Plot: Physical positions of GEMs vs Genetic map positions


Like the heat maps in the figures above, this dot plot illustrates that the GEMs map to the physical positions expected from the sequenced Col-0 genome. The few exceptions are described in our paper.






References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Loudet, Chaillou, Camilleri, Bouchez and Daniel-Vedele, 2002. Bay-0 x Shahdara recombinant inbred lines population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. Theoretical and Applied Genetics, vol 104 (6-7), pp 1173-1184. http://www.inra.fr/qtlat/BayxSha/index.htm.

Van Ooijen, J.W. & R.E. Voorrips, 2001. JoinMap® 3.0, Software for the calculation of genetic linkage maps. Plant Research International, Wageningen, the Netherlands. http://www.kyazma.nl/index.php/mc.JoinMap.

Marilyn A.L. West, Hans van Leeuwen, Alexander Kozik, Daniel J. Kliebenstein, R.W. Doerge, Dina A. St.Clair, Richard W. Michelmore 2006. High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Research 16: 787-795.



Last modified on May 30, 2006.

email to: mlwest@ucdavis.edu Marilyn West (Affymetrix GeneChip experiment)

email to: akozik@atgc.org Alexander Kozik (Data processing, Mapping and Visualization)

email to: hvanleeuwen@ucdavis.edu Hans van Leeuwen (Web design, Genetic map construction)