SFPscan: SFP detection and genotyping in a RIL population using microarray data

The SFPscan script can be used for Single Feature Polymorphism (SFP) detection and genotyping of a RIL population. This script was used successfully to construct a high-resolution haplotype for the Arabidopsis Bay-0 × Shahdara RIL population (West et al. 2006, Genome Research 16: 787-795, PubMed link PubMed - National Library of Medicine; Arabidopsis Bay-0 × Sha Haplotype Map).

The Perl script (Version 10) can be downloaded here: SFPscanV10.pl.

The R-script required for the background correction of the .CEL files can be downloaded here: BioC_getBGPM.r.

An input file with ATH1 GeneChip probeset IDs and AGI gene IDs that is required for the Perl script can be downloaded here: ATH1_conv-probeset-genecode.txt. This file should only be used if you work with the Affymetrix ATH1 GeneChip.

Please read the instructions below before using the scripts. Make sure that all your input files and background corrected .CEL files are in the same directory as the SFPscan script. From a Unix/Linux shell or a Windows Command Prompt type the name of the script file followed by the command line arguments (explained below). On the Unix/Linux system you might have to type 'perl' before the name of the script file.

To use the Perl script you will need to have a Perl interpreter installed. On Unix or Linux systems the Perl interpreter is usually available without further installations. On Windows you will need to install a Perl interpreter, for example from http://www.activestate.com/Products/ActivePerl/. Make sure you, or your system administrator, have installed the required Perl packages for the SFPscan script (see the use statements at the top of the script).

To use the R script you will need to install the R-programming environment. For more information see http://www.r-project.org/. Once you have that installed, you will need to install the Bioconductor packages. For more information see http://www.bioconductor.org/.



New:
SFPscan (Version 10) is now also available with a graphical user interface.
The Perl script can be downloaded here: SFPscanGUI_V2.pl.



################################################################################################
################################################################################################
# SFPscanV10.pl
#
# Author: Hans van Leeuwen
#         Dept. of Plant Sciences
#         University of California, Davis
#         Mail Stop 3
#         One Shields Ave
#         Davis, CA  95616
#         E-mail: hvanleeuwen@ucdavis.edu, hans_van_leeuwen@hotmail.com
#
# Date:   September 2005
#
################################################################################################
################################################################################################





################################################################################################
################################################################################################
################################# Program usage ################################################
#
# >SFPscanV10.pl [CONVERSION_FILE] [SETUP_FILE] [CHIP_LIST_FILE] [PARENT_A_CHIP_LIST_FILE] 
#  [PARENT_B_CHIP_LIST_FILE] [GAP_SIZE] [ALLELE_DISTORTION] [UNASSIGNED_SCORES] [OUTPUT] [USE_CD]
#
#  (all input files should be in the same directory as the SFPscanV10.pl file)
#  (all arguments must be entered! even if you use default values; see below)
#
#
# [CONVERSION_FILE] 
#    The conversion file with the Affymetrix probeset IDs and the AGI gene codes, having the format:
#    Probe Set ID	Sequence Derived From
#    245016_at	accD
#    261585_at	At1g01010
#    261568_at	At1g01030
#    261584_at	At1g01040
#    ...etc
#
#    This file is provided with this script: "ATH1_conv-probeset-genecode.txt"
#    Note: Positions in the Col-0 genome  of the probeset sequences from the Affymetrix ATH1 GeneChip
#          were obtained from the Arabidopsis annotation Version 4, TIGR release May 2003.
#
#
#
# [SETUP_FILE] 
#    A file with the experimental setup, first column is RIL code, next columns, rep chip codes.
#    First line should be a header line starting with ; (no space between ';' and 'RILcode')
#    For example: 
#    ;RILcode	SW1	SW2
#    4	R5	R620
#    5	R85	R687
#    8	R221	R495b
#    13	R30	R666
#
#
#
#
# [CHIP_LIST_FILE] 
#    A file with a list of chip files that will be used in this study.
#    this file should contain both the RIL chip file names and the parental chip file names.
#    File should have following format:
#    R5.CEL
#    R85.CEL
#    R221.CEL
#    R30.CEL
#    R82.CEL
#    R270.CEL
#    R76.CEL
# 
#    etc.
#
#
#
# [PARENT_A_CHIP_LIST_FILE] 
# [PARENT_B_CHIP_LIST_FILE] 
#    Two files with the chip file names of the parental chips.
#    File should have following format:
#    R5.CEL
#    R85.CEL
#    R221.CEL
#    R30.CEL
# 
#    etc.
#
#
#
#
# [GAP_SIZE]
#    The minimum distance between the two peaks in the bimodal distribution.
#    This value is calculated as follows: min_SFPdev SFP parent / max_SFPdev nonSFP parent.
#    The default value is 2.0
#    A value greater than 1.0 means that the distributions are separated.
#    This value should be chosen carefully; a too low value may increase the number of markers,
#    but can also increase the number of unassigned alleles; a too high value may give higher
#    higher quality markers, but the number of markers obtained decreases.
#    
#
#
# [ALLELE_DISTORTION]
#    In a RIL population a allele segregation of 50%-50% is expected, 
#    but this can vary in certain populations
#    The default value is 80
#    This means that allele segregations from 80%-20% to 20%-80% are allowed.
#
#
#
# [UNASSIGNED_SCORES]
#    This sets a limit to the number of unassigned allele scores a marker may have.
#    The default value is 10
#    This means that a marker is discarded if more than 10% of the allele calls are unassigned.
#
#
#
# [OUTPUT]     
#    This is a short text (please use numbers, letters and underscores only)
#    that will be used in the output file names.         
#
#
#
# [USE_CD]
#    You have the option to use 'C' and 'D' scores in genotyping (C = NOT A, D = NOT B).
#    If you select this option the following genotyping rules will be used:
#          If half or more than half of the rep chips is genotyped as one parental accession,
#          and the other rep chips as unassigned: use C or D.
#       Type 'YES' to select this option.
#    If you do not select this option the above mentioned chips will be scored as '-'.
#       Type 'NO' if you do not want the 'C' and 'D' scoring to be used.
#    
#
#
#
################################################################################################
################################################################################################






################################################################################################
################################################################################################
################################### Genotyping rules ###########################################
#
#
#            The following rules are used when determining a genotype for a RIL,
#            based on the genotypes of the individual reps:
#            A = parent A, B = parent B, C = NOT A, D = NOT B, - = unassigned
#            If all rep chips are genotyped as one accession: use A or B
#            If half of the rep chips is genotyped for one accession,
#                 and the other half of the rep chips for the other accession: use -
#            If half or more than half of the rep chips is genotyped as one accession,
#                 and the other rep chips as unassigned: use C or D
#            Else, use: - 
#
#
################################################################################################
################################################################################################






################################################################################################
################################################################################################
################################# Preparation .CEL files #######################################
# 
# The .CEL files need to be background corrected seperately under the R programming environment,
# with the Bioconductor package installed. Use the following R-script (copy and paste
# this code (all lines between START SCRIPT and END SCRIPT in a text file
# and then load it into the R-programming environment).
# Make sure to remove the first '#' (only the first!!) in all code lines
# between the START and END lines. 
#
#
################################# START R-SCRIPT ###########################
# ##########################################################################
# #
# # Author: Hans van Leeuwen
# #         Department of Plant Sciences
# #         University of California, Davis
# #         Mail Stop 3
# #         One Shields Ave
# #         Davis, CA  95616
# #         
# #         Email: hvanleeuwen@ucdavis.edu, hans_van_leeuwen@hotmail.com
# #
# # Date:   JUNE 2005
# #
# ##########################################################################
# #
# ##########################################################################
# ######################### IMPORTANT ######################################
# ##########################################################################
# ########### THE LINES MARKED WITH 'CHANGE' MUST  #########################
# ########### BE MODIFIED USING YOUR PATH          #########################
# ##########################################################################
# ######################### IMPORTANT ######################################
# ##########################################################################
#
# ##########################################################################
# # Set work directory to the directory where your original .CEL files are located
# setwd("C:/rw1081/hans_temp") ######### CHANGE ########
# #
# ##########################################################################
# # Read in file with chip ID's for which the CEL files will be read later
# # This file should have format:
# # R5.CEL
# # R85.CEL
# # R221.CEL
# # R277.CEL
# # R30.CEL
# # R82.CEL
# # R270.CEL
# #
# # etc...
# #
# ##### This file should have the filename 'chips.txt' #####################
# chips <- read.table("chips.txt")
# #
# ##########################################################################
# # Load affy package
# library(affy)
# 
# ##########################################################################
# # Open cel files one by one
# for (i in 1:length(chips$V1)){
#   # Set work directory to the directory where your .CEL files are located
#   setwd("C:/rw1081/hans_temp") ######### CHANGE ########
#   data <- ReadAffy(filenames = as.character(chips$V1[i]))
#   data.bg <- bg.correct.rma(data)
#   data.bg.matrix <- pm(data.bg)
#   # Set work directory to the directory where the background corrected files
#   # will be saved. This can not be the same work directory that contains the original .CEL files!!
#   setwd("C:/rw1081/hans_temp/bg_files") ######### CHANGE ########
#   write.table(data.bg.matrix, file = as.character(chips$V1[i]), append = FALSE, quote = TRUE, sep = ",", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE, qmethod = c("escape", "double"))
# } # end for
################################# END R-SCRIPT #############################
#
#
# The obtained files should have a format similar to:
#"471.CEL"
#"244901_at1",363.582352941176
#"244901_at2",124.111764705882
# ...etc
#
#
################################################################################################
################################################################################################





################################################################################################
################################################################################################
################################# Output files  ################################################
#
# The following output files will be created:
#    (the asterisk indicates the text string from the OUTPUT argument provided by the user)
#
# 1. *_SFPscanV10_log.txt
#
#            Logfile, contains possible error messages.
#
#
# 
# 2. *_SFPscanV10_marker_data.txt
#
#            column 1: SFP marker ID
#            column 2: Parental line with the SFP (A/B)
#            column 3: max_SFPdev nonSFP parent
#                      This is the right boundary of the distribution of the
#                      chips that have the genotype of the parent that does 
#                      not contain the SFP.
#            column 4: min_SFPdev SFP parent
#                      This is the left boundary of the distribution of the
#                      chips that have the genotype of the parent that
#                      contains the SFP.
#            column 5: spread
#                      This is the result of: min_SFPdev SFP parent/max_SFPdev nonSFP parent
#                      The user sets the minimum allowed spread.
#                      A spread greater than 1 means that the distributions are separated.
#            column 6: A frequency
#                      This is the frequency of the 'A' allele
#            column 7: B frequency
#                      This is the frequency of the '-' allele
#            column 8: missing data frequency
#                      This is the frequency of unassigned allele calls.
#
#            NOTE: The data in columns 6, 7, 8 is based on data from all chips,
#                  before a genotype is assigned to a RIL based on the individual
#                  genotypes of the reps for that RIL. In other words, these frequencies
#                  can be slightly different in the final marker data, because then
#                  the data for the RILs (and not the individual reps) is used.
#
# 
#
#
# 3. *_SFPscanV10_genotypes_all_chips.txt
#            
#            Contains genotypes for all chips for all markers in following format:
#
#            SFPmarker      RIL        chip     rep	genotype
#            At5g52440-9    Bay-0_2    R473b    1     B
#            At5g52440-9    Bay-0_2    R628     2     B
#            At1g64900-6    Bay-0_2    R473b    1     B
#            At1g64900-6    Bay-0_2    R628     2     B
#            At3g25610-10   30         R493     2     B
#            At3g25610-10   30         R260     1     B
#            At1g54030-7    30         R260     1     B
#            At1g54030-7    30         R493     2     B
# 
#
#
#
# 4. *_SFPscanV10_genotypes_jm.txt
#
#            Contains genotypes for all RILs for all markers, in JoinMap format:
#            (RILs are sorted in numerical order; the actual RIL codes 
#             can be seen in the Excel format file (see below))
#
#            marker14
#            DABAAADAADBADBAADBAABBDAAADDBAADAAABAAABABDAAABBAA
#            marker30
#            AABAABBBAABBAAAABABBAAABBAAAABAAABAAAAAAAAAABBAAA-A
#            marker35
#            BAADAAABADAABAAABABBAAAABABAADAABBDCADADAABAABA-DA
#            etc.
#
#
#
# 
# 5. *_SFPscanV10_genotypes_ex.txt
#
#           Contains genotypes for all RILs for all markers, in Excel format (tabulated):
#           (RILs are sorted in numerical order on first row)
#
#                      4	5	8	13	14	16
#           marker14   D	A	B	A	A	A
#           marker30   A        A       B       A       A       B
#           et. 
#
#
#
# 
# 6. *_SFPscanV10_genotypes_clean_jm.txt
#
#           Contains genotypes for all RILs for markers that passed the post-genotyping
#           filtering process. Data is in JoinMap format.
#
#
#
#
# 
# 7. *_SFPscanV10_genotypes_clean_ex.txt
#
#           Contains genotypes for all RILs for markers that passed the post-genotyping
#           filtering process. Data is in Excel format (tabulated).
#
#
#
################################################################################################
################################################################################################




Last modified: May 26, 2006.

email to: hvanleeuwen@ucdavis.edu Hans van Leeuwen.