SELCON3.FOR --- SELCON, Self-Consistent method for CD analysis, Version 3

Reference: Sreerama, N., Venyaminov, S.Yu. & Woody, R.W., 1999, Estimation of the number of α-helical and β-strand segments in proteins using CD spectroscopy. Protein Science, 8, 370-380.
 

A new version of SELCON3 and two additional programs (CONTIN/LL from Provencher & Glöckner and CDSSTR from W.C. Johnson) are provided in the CDPro software package, along with larger reference sets constructed by combining proteins from three or four research groups. Please use those programs instead of this version of SELCON. Web address for CDPro: http://lamar.colostate.edu/~sreeram/CDPro
AUTHOR:
Narasimha Sreerama
Department of Biochemistry and Molecular Biology
Colorado State University, Fort Collins, CO 80523
E-mail: sreeram@lamar.colostate.edu
Home Page: http://lamar.colostate.edu/~sreeram
ACKNOWLEDGMENTS:

PART OF THE ALGORITHM AND REFERENCE PROTEIN SETS ARE YET TO BE PUBLISHED.

SOME WILL BE PRESENTED AT THE 43RD ANNUAL BIOPHYSICAL SOCIETY MEETING, BALTIMORE, 1999.

THE PROGRAM WILL BE UPDATED IN THE EARLY PART OF 1999 -- IMPLEMENTING A NEW SELECTION RULE DEVELOPED BY Johnson W.C. Jr.(1999) --

IF YOU ENCOUNTER A BUG OR A PROBLEM, OR IF YOU HAVE SUGGESTIONS TO IMPROVE THE METHOD OR PROGRAM, PLEASE SEND AN E-MAIL TO sreeram@lamar.colostate.edu


SUMMARY:

This program calculates the secondary structure content of a protein from its CD spectrum, using the CD spectra of a set of reference proteins whose secondary structure content is known from X-ray diffraction. The method is based on singular value decomposition (SVD) of the CD data matrix, used together with the secondary structure matrix (1-3). The locally linearized method of van Stokkum et al. (4) is used to implement the variable selection method (5). The self-consistent method of Sreerama & Woody (6) is the basic algorithm, modified to compare the calculated and experimental CD spectra as in SELCON2 (to be published, 7). The program now also estimates the number of helix and strand segments in the protein examined, following Sreerama et al. (8). The CD spectra (178-260 nm range) are a gift from W.C. Johnson, Jr. The secondary structure assignments follow the DSSP method (9). Two new reference sets of proteins are provided; they are described in references 10 and 11.
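
As a rough guide to the SVD step, the matrix method of references 2 and 3 can be written as below. The symbols are assumptions chosen here for illustration and are not taken from the program source: C is the CD data matrix (wavelengths by reference proteins), F is the secondary structure matrix (structures by reference proteins), c is the CD spectrum of the protein analyzed, f is its estimated vector of fractions, and k is the number of singular values retained.

```latex
\begin{aligned}
F &= X\,C, \qquad C = U\,S\,V^{T} \\
X &\approx F\,V_{k}\,S_{k}^{-1}\,U_{k}^{T} \\
f &= X\,c
\end{aligned}
```

Keeping only the k largest singular values discards the noisy part of C; the steps described under ALGORITHM vary both k and the set of reference proteins.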

REFERENCES:

  1. Forsythe et al., 1977, Computer methods for Mathematical Computations, Prentice-Hall, Englewood Cliffs, NJ.
  2. Hennessey & Johnson, 1981, Information Content in the Circular Dichroism of Proteins, Biochemistry, 20, 1085-1094.
  3. Compton & Johnson, 1986, Analysis of Protein Circular Dichroism Spectra for secondary structures using a simple matrix Multiplication, Anal. Biochem., 155, 155-167.
  4. van Stokkum et al., 1990, Estimation of Protein Secondary Structure and Error Analysis from Circular Dichroism Spectra, Anal. Biochem., 191, 110-118.
  5. Manavalan & Johnson, 1987, Variable Selection Method improves the Prediction of Protein Secondary Structure from CD. Anal. Biochem., 167, 76-85.
  6. Sreerama & Woody, 1993, A Self-Consistent Method for the analysis of Protein Secondary Structure from Circular Dichroism. Anal. Biochem., 209, 32-44.
  7. Sreerama et al., SELCON2: A Program for Estimating Fraction of Secondary Structure from Protein CD Spectra. (Submitted)
  8. Sreerama et al., 1999, Estimation of the number of α-helical and β-strand segments in proteins using CD spectroscopy. Protein Science, 8, 370-380.
  9. Kabsch & Sander, 1983, Dictionary of Protein Secondary Structure: Pattern Recognition of H-bonded and Geometric Features. Biopolymers, 22, 2577-2637.
  10. Sreerama et al., 1999, Inclusion of Denatured Proteins in the Reference Set Improves the Analysis of Protein CD Spectra. Biophysical J., A716.
  11. Johnson, 1999, Analyzing Protein CD for Accurate Secondary Structures. Proteins: Str. Func. Genet., (in Press)
  12. Sreerama & Woody, 1994, Poly(Pro)II helices in Globular Proteins: Identification and Circular Dichroic Analysis. Biochemistry, 33, 10022-10025.


RELATED PUBLICATIONS:

  • Sreerama & Woody, 1994, Combining variable selection and cluster analysis with neural network, ridge regression and self-consistent methods. J. Mol. Biol., 242, 497-507.
  • Greenfield, 1996, Methods to estimate the conformation of proteins from circular Dichroism data. Anal. Biochem. 235, 1-10.

    FILES:

    README      ---- This file (earlier version); please use the HTML version
    INPUT.SMP   ---- Sample INPUT file
    OUTPUT.SMP  ---- Sample OUTPUT file
    BASIS.SMP   ---- Sample BASIS file
    CDOUT.SMP   ---- Sample CDOUT file
    SELCON3.FOR ---- FORTRAN source code of SELCON3 program
    SELCON3.EXE ---- IBM/PC executable of SELCON3 program
    CDDATA.29   ---- File containing CD data of 29 proteins
    SSDATA.29   ---- File containing SS data of 29 proteins
    CDDATA.37   ---- File containing CD data of 37 proteins
    SSDATA.37   ---- File containing SS data of 37 proteins
    CDDATA.23   ---- File containing CD data of 23 proteins
    SSDATA.23   ---- File containing SS data of 23 proteins
    CRDATA.FOR  ---- FORTRAN source code of CRDATA program
    CRDATA.EXE  ---- IBM/PC executable of CRDATA program
    CRDATA.IN   ---- Input file for CRDATA program
    CRDATA.OUT  ---- Output file from CRDATA program


    SECONDARY STRUCTURE ELEMENTS:
    At present there are two sets of secondary structure elements that are used in the analysis. The first set comprises Helix1, Helix2, Sheet1, Sheet2, Turns and Unordered, and is used with the basis sets of 29 and/or 37 proteins. This set is necessary to obtain the number of helical and strand segments in the protein analyzed, since the two subcategories of helix and sheet correspond to the regular and distorted fractions, which are used in determining the number of segments. The second set consists of the secondary structure categories used by Johnson (1999): alpha-helix, 3/10-helix, beta-sheet, PP2 helix, turns and unordered (other) structures. It should be used with the basis set of 23 proteins. For how to activate these options in SELCON3, please see the details of the INPUT file below.

    FILE: SSDATA.ext (from above)

    ext  --->   29; 37     23
    SS   --->   6          6          4
    Str1        Helix1     AHelix     Helix
    Str2        Helix2     3/10Hel    Sheet
    Str3        Sheet1     Sheet      Turns
    Str4        Sheet2     Turns      Unord
    Str5        Turns      PP2
    Str6        Unord      Unord

    Code
    Helix1:    Regular Helix Fraction (Ref. 8) -- H1
    Helix2:    Distorted Helix Fraction (Ref. 8) -- H2
    Sheet1:    Regular Sheet Fraction (Ref. 8) -- S1
    Sheet2:    Distorted Sheet Fraction (Ref. 8) -- S2
    AHelix:    Alpha-Helix Fraction (Ref. 11) -- H
    3/10helix: 3/10-Helix Fraction (Ref. 11) -- G
    Helix:     Alpha-helix and 3/10-helix combined (H1+H2 OR H+G)
    Sheet:     Beta sheet (parallel and anti-parallel combined) -- E (OR S1+S2)
    PP2:       Poly(Pro)II type structure (Refs. 11 and 12) -- P
    Turns:     Beta turns -- T
    Unord:     Unordered (also called random coil, unassigned, unstructured, remainder, etc.) -- O or U
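
The relation between the six-class and four-class descriptions can be sketched in a few lines of Python. This is an illustration only and is not part of SELCON3; the function name and the dictionary layout are assumptions. It covers only the 29/37-protein classes, because the list above does not say where PP2 goes when the 23-protein classes are combined.

```python
def combine_classes(fractions):
    """Collapse H1, H2, S1, S2, T, U fractions into Helix, Sheet, Turns, Unord.

    `fractions` is a dict keyed by the codes listed above (hypothetical layout).
    """
    return {
        "Helix": fractions["H1"] + fractions["H2"],  # regular + distorted helix
        "Sheet": fractions["S1"] + fractions["S2"],  # regular + distorted sheet
        "Turns": fractions["T"],
        "Unord": fractions["U"],
    }


if __name__ == "__main__":
    example = {"H1": 0.40, "H2": 0.20, "S1": 0.05, "S2": 0.05, "T": 0.10, "U": 0.20}
    print(combine_classes(example))
```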



    CURRENT IMPLEMENTATION:
    Maximum wavelength range: 260-178 nm (1 nm intervals). This is the range of the reference protein CD spectra in the 29- and 23-protein sets.
    Maximum wavelength range: 240-185 nm (1 nm intervals). This is the range of the reference protein CD spectra in the 37-protein set.
    The CD data of the protein analyzed should lie within these ranges.
    We recommend that the input CD data cover at least the range 240-190 nm.
    Number of secondary structures: 6
    H1, H2, S1, S2, T, U for the 29- and 37-protein sets, and H, G, E, T, P, O for the 23-protein set

    (For CODES see above)
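
A quick check of a data range against these limits might look like the sketch below. The function and constant names are hypothetical; only the numerical limits come from the text above, and SELCON3 itself may treat out-of-range data differently.

```python
# Wavelength limits (long end, short end) in nm for each reference set.
REFERENCE_RANGES = {29: (260, 178), 23: (260, 178), 37: (240, 185)}


def check_range(wl_begin, wl_end, n_proteins):
    """Return (within_limits, covers_recommended) for data running
    from wl_begin (long wavelength) down to wl_end (short wavelength)."""
    wl_max, wl_min = REFERENCE_RANGES[n_proteins]
    within_limits = wl_min <= wl_end <= wl_begin <= wl_max
    # The recommended minimum coverage is 240-190 nm.
    covers_recommended = wl_begin >= 240 and wl_end <= 190
    return within_limits, covers_recommended


if __name__ == "__main__":
    print(check_range(260, 178, 29))
    print(check_range(260, 178, 37))
```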


    ALGORITHM:

    The original SELCON algorithm has been modified by incorporating the recent advances in the CD analysis (W.C. Johnson). For the methodology and the program written by Dr. W. Curtis Johnson please contact him.

    The CD spectral matrix is constructed with the TEST protein spectrum as the first column. The proteins in the data base form the columns 2 to NPRT+1 in the order of decreasing similarity with the test protein CD. The structure matrix is also arranged in the same order. The initial guess forms the first column of the structure matrix.
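
The arrangement of the matrices can be sketched as follows. The text does not say how similarity is measured, so the root-mean-square difference between spectra used here is an assumption, as are all function names. Spectra are plain lists of delta(e) values on a common wavelength grid.

```python
import math


def rms_difference(a, b):
    """Root-mean-square difference between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def order_by_similarity(test_cd, ref_cd, ref_ss):
    """Arrange reference proteins from most to least similar to the test spectrum.

    Returns the ordering, the CD columns (test spectrum first) and the
    structure columns in the same order. The initial guess that forms the
    first structure column is not added here.
    """
    order = sorted(range(len(ref_cd)),
                   key=lambda i: rms_difference(test_cd, ref_cd[i]))
    cd_columns = [test_cd] + [ref_cd[i] for i in order]
    ss_columns = [ref_ss[i] for i in order]
    return order, cd_columns, ss_columns
```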

    (a) Considering all proteins and 5 SVD coefficients, the helix fraction is determined by an iterative process, giving HelHJ. (HelHJ can form the initial guess, and the rest are determined by similarity with the second column.) The output corresponds to that of the original method of Hennessey & Johnson (2).

    (b) The basis set is now varied from 6 to NPRT+1 proteins, and the number of SVD vectors is varied from 1 to 7 (or the number of basis proteins - 2).

    1. The solutions are screened using the first two selection rules. If the number of solutions is < MinSol, the rules are relaxed to SUM-Rule 0.15; Fraction-Rule -0.055.

    2. If the number of solutions is still smaller than MinSol, MinSol is reduced, down to 1 if necessary.

    3. The process is iterated for self-consistency.

    The results are approximately equal to those of the original SELCON method (6).


    (c) Solutions are screened using the first three selection rules:

    1. The solutions are screened using the three selection rules. If the number of solutions is < MinSol, the rules are relaxed to SUM-Rule 0.15; Fraction-Rule -0.055.

    2. Comparison of calculated and reconstructed spectra, OR of experimental and reconstructed spectra. Only those solutions with good values, 0.25 delta(e) or less, are collected. If very few solutions are obtained, this limit is relaxed up to 0.50.

    3. If the number of solutions is still smaller than MinSol, MinSol is reduced, down to 1 if necessary.
    The results are approximately equal to those of the SELCON2 program (7).
    The output also contains the number of helical and strand segments as estimated from the CD analysis (8).
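
The screening in step (c) can be pictured with the sketch below. It is not the SELCON3 code. The relaxed limits (0.15, -0.055, 0.50) and the 0.25 spectral limit come from the text; the strict sum and fraction limits (0.05 and -0.025) are assumptions, as is the reading of the sum rule as "fractions sum to 1 within a tolerance" and of the fraction rule as "no fraction below a small negative limit".

```python
def passes_rules(fractions, rmsd, sum_tol, frac_min, rmsd_max):
    """Apply the sum rule, the fraction rule and the spectral rule to one solution."""
    return (abs(sum(fractions) - 1.0) <= sum_tol
            and min(fractions) >= frac_min
            and rmsd <= rmsd_max)


def screen(solutions, min_sol,
           strict=(0.05, -0.025, 0.25),    # first two values are assumptions
           relaxed=(0.15, -0.055, 0.50)):  # values quoted in the text
    """Keep solutions passing the strict limits; relax if fewer than min_sol pass.

    `solutions` is a list of (fractions, rmsd) pairs. If the relaxed limits
    still give fewer than min_sol solutions, whatever passed is returned,
    which mirrors reducing MinSol.
    """
    kept = [s for s in solutions if passes_rules(*s, *strict)]
    if len(kept) < min_sol:
        kept = [s for s in solutions if passes_rules(*s, *relaxed)]
    return kept
```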


    (d) THE NEW SELECTION RULE: (11)

    1. The fourth rule is now applied to further limit the solutions to a narrow range determined by the helix content. The solutions at the end of step (c) are screened using the maximum, average and minimum values of helix from the solutions (maxH, aveH, minH) and HelHJ, as follows:

    If      HelHJ > 0.65 -- all solutions with Helix > 0.65 are averaged
    If      0.65 > HelHJ > 0.25 -- all solutions with Helix = (HelHJ+maxH) + or - 0.03
    If      0.25 > HelHJ > 0.15 -- all solutions with Helix = (HelHJ+aveH) + or - 0.03
    If      HelHJ < 0.15 -- all solutions with Helix = (HelHJ+minH) + or - 0.03

    2. If the number of solutions is still smaller than MinSol, MinSol is reduced, down to 1 if necessary.
    The results should be similar to those obtained with the CDSSTR program of Curtis Johnson in most cases. For proteins whose spectra are dissimilar to those in the basis set, CDSSTR performs better because it limits and randomly selects the basis set.
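
A sketch of how the fourth rule might select solutions is given below. Two points are assumptions and are not stated in the text: the window is centred on the average of HelHJ and the chosen helix value, that is (HelHJ + maxH)/2 and so on, since a plain sum could exceed 1; and a HelHJ exactly equal to 0.65, 0.25 or 0.15 falls into the lower branch. The function name is hypothetical.

```python
def helix_window(hel_hj, helix_values, half_width=0.03):
    """Return the helix fractions that survive the fourth selection rule."""
    if hel_hj > 0.65:
        return [h for h in helix_values if h > 0.65]
    if hel_hj > 0.25:
        partner = max(helix_values)                       # maxH
    elif hel_hj > 0.15:
        partner = sum(helix_values) / len(helix_values)   # aveH
    else:
        partner = min(helix_values)                       # minH
    centre = (hel_hj + partner) / 2.0  # ASSUMPTION: average, not sum
    return [h for h in helix_values if abs(h - centre) <= half_width]
```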


    OPTIONS:

    DATA SETS AVAILABLE:

    CDDATA.29 -- CD spectra of 29 proteins (178-260 nm range)
    SSDATA.29 -- Sec. Str. data of 29 proteins, from the Kabsch & Sander method. Sec. Str. are Helix1, Helix2, Strand1, Strand2, Turn, O. This is the preferred set since the data cover 178-260 nm.

    CDDATA.37 -- CD spectra of 37 proteins (for the 185-240 nm range ONLY)
    SSDATA.37 -- Sec. Str. data of 37 proteins: 29 from the CDDATA.29 set and 8 from S. Venyaminov (5 denatured proteins). Use this set particularly for proteins with large unordered contributions or proteins undergoing denaturation.

    CDDATA.23 -- CD spectra of 23 proteins (178-260 nm range). These were selected by Curt Johnson for the CDSSTR program.
    SSDATA.23 -- Sec. Str. data of 23 proteins, from Curt Johnson. Sec. Str. are A-Helix, 3-10 helix, B-sheet, Turn, PP2, O. This set can be used for comparison with results from Curt Johnson's CDSSTR program. The Sec. Str. fractions for this set are different because a different algorithm is used to assign them.

    P.S. Segment analysis WILL NOT be performed with Johnson's 23-protein set, since its secondary structure fractions are derived differently from the Kabsch and Sander method. You WILL also find that the fractions obtained from the 29- and 23-protein basis sets are DIFFERENT.

    Options for Reference Set: (in File:INPUT) --- See below

    Basis_1   Basis_2   Basis_3   SEGMENT

       1         0         0         1      ---   29 proteins; Segments
       0         0         1         1      ---   37 proteins; Segments
       0         1         0         0      ---   23 proteins; NO Segments
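
The flag combinations in the table can be generated with a small helper such as the one below. The helper is hypothetical; it only reproduces the table, with the PRINT flag (whose meaning is not described here) placed first as in the INPUT parameter line and defaulting to 0.

```python
# (Basis_1, Basis_2, Basis_3, SEGMENT) for each reference set, from the table above.
BASIS_FLAGS = {
    29: (1, 0, 0, 1),
    37: (0, 0, 1, 1),
    23: (0, 1, 0, 0),
}


def parameter_line(n_proteins, print_flag=0):
    """Build the 'PRINT Basis_1 Basis_2 Basis_3 SEGMENT' line for the INPUT file."""
    flags = (print_flag,) + BASIS_FLAGS[n_proteins]
    return " ".join(str(f) for f in flags)


if __name__ == "__main__":
    for n in (29, 37, 23):
        print(n, "->", parameter_line(n))
```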


    INPUT FILE: INPUT (Can be created using Program: CRDATA)

    INPUT.SMP -- Sample of INPUT file. (Data in free format; details below)

    # Parameters
    # PRINT Basis_1 Basis_2 Basis_3 SEGMENT
    0 0 0 1 1
    #
    # Title 1 line
    SAMPLE INPUT: Myoglobin CD data (178-260 nm)
    #
    # WL_Begin WL_End Factor
    260.00000 178.00000 1
    #
    # CD DATA (Long-Wavelength to Short-wavelength; 260-178 nm LIMITS)
    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.01 -0.01
    -0.01 -0.01 -0.04 -0.09 -0.15 -0.22 -0.31 -0.42 -0.57 -0.77 -1.01
    -1.30 -1.66 -2.08 -2.56 -3.09 -3.66 -4.27 -4.92 -5.59 -6.23 -6.79
    -7.25 -7.62 -7.91 -8.13 -8.26 -8.30 -8.25 -8.13 -7.98 -7.83 -7.70
    -7.58 -7.45 -7.35 -7.31 -7.35 -7.46 -7.57 -7.59 -7.43 -7.02 -6.26
    -5.11 -3.67 -2.04 -0.09 2.32 5.03 7.84 10.60 13.13 15.21 16.66
    17.61 18.19 18.39 18.14 17.46 16.41 15.17 13.91 12.65 11.40 10.16
    8.97 7.90 7.01 6.25 5.55 4.88
    #
    # IGuess Str1 Str2 Str3 Str4 Str5 Str6
    0
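
For readers who want to inspect an INPUT file outside SELCON3, the layout of the sample can be read with a sketch like the following. It is inferred from the sample alone and is not the program's own reader: lines starting with # are skipped, the data are assumed to be at 1 nm intervals, and the Factor value is returned without being applied because its meaning is not described here. All names are hypothetical.

```python
def parse_input(text):
    """Parse the text of an INPUT file laid out like INPUT.SMP."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    flags = [int(v) for v in lines[0].split()]   # PRINT, Basis_1..3, SEGMENT
    title = lines[1]
    wl_begin, wl_end, factor = (float(v) for v in lines[2].split()[:3])
    n_points = int(round(abs(wl_begin - wl_end))) + 1   # 1 nm steps assumed
    values = [float(v) for ln in lines[3:] for v in ln.split()]
    return {
        "flags": flags,
        "title": title,
        "wl_begin": wl_begin,
        "wl_end": wl_end,
        "factor": factor,
        "cd": values[:n_points],      # long to short wavelength
        "guess": values[n_points:],   # IGuess followed by any Str1..Str6 values
    }
```

With the sample above, the range 260-178 nm gives 83 CD values, which matches the number of values listed.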

    INPUT FILE EXPLAINED:


    TO RUN THE PROGRAM:

    1. Create the CD data. It should be in Delta(Epsilon) units, and you can prepare it in any text editor. The CD data should be within the range: maximum 260-178 nm; minimum 240-190 nm. (It can also be created using the program CRDATA.)

    2. Edit the file INPUT, which is the input file for SELCON3. Leave the lines whose first character is # as they are. Provide the data required as described in the INPUT file description.

    3. Once the INPUT file is created, type SELCON3 in an MS-DOS shell in the directory where the program and the data files exist. (Keep all data files and programs in the same directory.)
    OR click on the icon for SELCON3. (An icon can be created: from File Manager, or Explorer in Windows 95, create a shortcut and place it on the desktop.)
    OR double-click on the program name in File Manager (or Explorer). The program runs silently (opening an MS-DOS shell if Windows File Manager is used) and creates the files OUTPUT and BASIS.

    This program can be compiled and run on any machine with a FORTRAN-77 compiler.


    OUTPUT FILES: OUTPUT and BASIS

    FILE BASIS:

    BASIS: contains the information about the Basis set used--List of proteins, their secondary structures and CD spectra
    Look in BASIS.SMP (Sample File)
     

    FILE OUTPUT:

    OUTPUT: contains the results which are partially explained in the output itself.
    Look in OUTPUT.SMP (Sample OUTPUT file) -- CONTAINS COMMENTS ABOUT RESULTS
     

    FILE CDOUT:

    CDOUT: contains the CD data, digitised, corresponding to the final plot in OUTPUT file. This can
    be imported into any plotting routine. (Suggested by Don Gray and Norma Greenfield). March 8, 99

    MODIFICATIONS:

    One can add new proteins to the reference set. The program is currently dimensioned to handle 44 reference proteins (MxPR = 45 in the PARAMETER statement). Simply add the protein CD data (in the appropriate range: 260-178 nm for the 29- and 23-protein sets, 240-185 nm for the 37-protein set) in the format 15F6.2, together with a name for the added protein. DO NOT change the name of the file. You will also have to add the corresponding Sec. Str. data to the matching SSDATA file in the same way. As noted earlier, the Sec. Str. data for the 29- and 37-protein sets are from DSSP, and those for the 23-protein set are from XTLSTR by W.C. Johnson. These programs are in the public domain.
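
The 15F6.2 layout (15 values per line, each 6 characters wide with 2 decimals) can be produced outside FORTRAN with a short helper like this sketch. The function name is hypothetical, and where the protein name goes in the CDDATA file is not shown because it is not specified here. Values must lie between -99.99 and 999.99 to fit the field; unlike FORTRAN, Python would widen the field rather than print asterisks.

```python
def format_15f6_2(values):
    """Return text lines holding `values` in FORTRAN 15F6.2 layout."""
    lines = []
    for start in range(0, len(values), 15):
        chunk = values[start:start + 15]
        lines.append("".join("%6.2f" % v for v in chunk))
    return lines


if __name__ == "__main__":
    for line in format_15f6_2([0.0, -7.25, 18.39]):
        print(line)
```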

    CRDATA PROGRAM:

    This is a driver program that creates the input data for SELCON3 from any CD data file in ASCII (text) format. The CD data should be within the wavelength ranges required for the basis sets and should be in the same order as the wavelength range given as input. The program asks for the title, the wavelength range and the name of the file containing the CD data. It is assumed that the CD data file has two values per line: wavelength and CD. If the data are in molar ellipticity units, they can be converted to delta(e) units. The output is saved as TEST.DAT, which has to be copied to the file INPUT; one can, of course, edit the file afterwards. To run it, type CRDATA in an MS-DOS shell, double-click on the file name in File Manager/Explorer, or create an icon and click on it.
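
The unit conversion mentioned above follows the standard relation between molar ellipticity and delta(e), [theta] = 3298.2 x delta(e), with [theta] in deg cm2/dmol. The sketch below reads a two-column file as described and applies it. It is not the CRDATA source: the function names are hypothetical, and the exact constant used by CRDATA is not stated here.

```python
THETA_PER_DELTA_EPSILON = 3298.2  # standard factor; CRDATA's own value may differ


def to_delta_epsilon(molar_ellipticity):
    """Convert molar ellipticity (deg cm2/dmol) to delta(e)."""
    return molar_ellipticity / THETA_PER_DELTA_EPSILON


def parse_two_column(text):
    """Read 'wavelength CD' pairs, one pair per line, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            pairs.append((float(parts[0]), float(parts[1])))
    return pairs


if __name__ == "__main__":
    data = parse_two_column("260 0.0\n259 -3298.2\n")
    print([(wl, to_delta_epsilon(cd)) for wl, cd in data])
```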


    FTP INSTRUCTIONS:

    The files listed above can be copied via anonymous ftp @ 129.82.125.151 following the commands below. The executables should be copied using binary transfer. If you encounter any problems please contact sreeram@lamar.colostate.edu.

    The files can also be obtained by clicking on their names (above) and saving them appropriately.

    ftp 129.82.125.151
    login: anonymous
    passwd: your_name
    bin
    cd pub/SELCON3
    mget *.* OR get file_name

    *** This will be updated less frequently in comparison to the Web Page; Please check the web page
    first or contact the author***


    PREVIOUS VERSIONS OF SELCON:

    Previous versions of SELCON (SELCON, SELCON1 and SELCON2) are also available at our anonymous ftp site (129.82.125.151). They are placed in the directory pub/CD_spectroscopy and the oreadme and readme files explain the differences and the data files required for running them. These can also be obtained from the location: http://lamar.colostate.edu/~sreeram/SELCON, by clicking on the relevant files and saving them.
