Reference: Sreerama, N., Venyaminov, S.Yu. & Woody, R.W., 1999, Estimation of the number of alpha-helical and beta-strand segments in proteins using CD spectroscopy. Protein Science, 8, 370-380.
A new version of SELCON3 and two additional programs (CONTIN/LL from Provencher & Glockner and CDSSTR from W.C. Johnson) are provided in CDPro, along with larger reference sets constructed by combining proteins from three or four research groups. Please use those programs instead of this version of SELCON. Web address for CDPro: http://lamar.colostate.edu/~sreeram/CDPro
AUTHOR:
Narasimha Sreerama
Department of Biochemistry and Molecular Biology
Colorado State University, Fort Collins, CO 80523
E-mail: sreeram@lamar.colostate.edu
Home Page: http://lamar.colostate.edu/~sreeram
ACKNOWLEDGMENTS:
- Robert W. Woody, Colorado State University, Fort Collins.
- Sergei Yu. Venyaminov, Mayo Foundation, Rochester.
- W. Curtis Johnson Jr., Oregon State University, Corvallis.
PART OF THE ALGORITHM AND REFERENCE PROTEIN SETS ARE YET TO BE PUBLISHED. SOME WILL BE PRESENTED AT THE 43RD ANNUAL BIOPHYSICAL SOCIETY MEETING, BALTIMORE, 1999.
THE PROGRAM WILL BE UPDATED IN THE EARLY PART OF 1999 -- IMPLEMENTING A NEW SELECTION RULE DEVELOPED BY Johnson W.C. Jr.(1999) --
IF YOU ENCOUNTER A BUG OR A PROBLEM, OR IF YOU HAVE SUGGESTIONS TO IMPROVE THE METHOD OR PROGRAM, PLEASE SEND AN E-MAIL TO sreeram@lamar.colostate.edu
This program calculates the secondary structure content of a protein from its CD spectrum, using the CD spectra of a set of reference proteins whose secondary structure content is known from X-ray diffraction. The method is based on the Singular Value Decomposition of the CD data matrix used with the secondary structure matrix (1-3). The Locally Linearized method of van Stokkum et al. (4) is used to implement the variable selection method (5). The Self-Consistent method of Sreerama & Woody (6) is the basic algorithm, modified to compare the CALCULATED and EXPERIMENTAL CD according to SELCON2 (to be published, 7). The PROGRAM now also estimates the number of SEGMENTS of helix and strand in the protein examined, according to Sreerama et al. (8). The CD spectra (178-260 nm range) are a gift from W.C. Johnson, Jr. The secondary structure is assigned by the DSSP method (9). Two new REFERENCE SETS OF PROTEINS are provided; these are described in references 10 and 11.
1. Forsythe et al., 1977, Computer Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs, NJ.
2. Hennessey & Johnson, 1981, Information Content in the Circular Dichroism of Proteins. Biochemistry, 20, 1085-1094.
3. Compton & Johnson, 1986, Analysis of Protein Circular Dichroism Spectra for Secondary Structures using a Simple Matrix Multiplication. Anal. Biochem., 155, 155-167.
4. van Stokkum et al., 1990, Estimation of Protein Secondary Structure and Error Analysis from Circular Dichroism Spectra. Anal. Biochem., 191, 110-118.
5. Manavalan & Johnson, 1987, Variable Selection Method Improves the Prediction of Protein Secondary Structure from CD. Anal. Biochem., 167, 76-85.
6. Sreerama & Woody, 1993, A Self-Consistent Method for the Analysis of Protein Secondary Structure from Circular Dichroism. Anal. Biochem., 209, 32-44.
7. Sreerama et al., SELCON2: A Program for Estimating Fraction of Secondary Structure from Protein CD Spectra. (Submitted)
8. Sreerama et al., 1999, Estimation of the number of alpha-helical and beta-strand segments in proteins using CD spectroscopy. Protein Science, 8, 370-380.
9. Kabsch & Sander, 1983, Dictionary of Protein Secondary Structure: Pattern Recognition of H-bonded and Geometric Features. Biopolymers, 22, 2577-2637.
10. Sreerama et al., 1999, Inclusion of Denatured Proteins in the Reference Set Improves the Analysis of Protein CD Spectra. Biophysical J., A716.
11. Johnson, 1999, Analyzing Protein CD for Accurate Secondary Structures. Proteins: Str. Func. Genet. (in press)
12. Sreerama & Woody, 1994, Poly(Pro)II Helices in Globular Proteins: Identification and Circular Dichroic Analysis. Biochemistry, 33, 10022-25.
RELATED PUBLICATIONS:
- Sreerama & Woody, 1994, Combining variable selection and cluster analysis with neural network, ridge regression and self-consistent methods. J. Mol. Biol., 242, 497-507.
- Greenfield, 1996, Methods to estimate the conformation of proteins from circular dichroism data. Anal. Biochem., 235, 1-10.
README ---- THIS FILE (earlier version); Please use the HTML version
INPUT.SMP ---- Sample INPUT file
OUTPUT.SMP ---- Sample OUTPUT file
BASIS.SMP ---- Sample BASIS file
CDOUT.SMP ---- Sample CDOUT file
SELCON3.FOR ---- FORTRAN Source Code of SELCON3 Program
SELCON3.EXE ---- IBM/PC Executable of SELCON3 Program
CDDATA.29 ---- File containing CD data of 29 proteins
SSDATA.29 ---- File containing SS data of 29 proteins
CDDATA.37 ---- File containing CD data of 37 proteins
SSDATA.37 ---- File containing SS data of 37 proteins
CDDATA.23 ---- File containing CD data of 23 proteins
SSDATA.23 ---- File containing SS data of 23 proteins
CRDATA.FOR ---- FORTRAN Source code of CRDATA program
CRDATA.EXE ---- IBM/PC Executable of CRDATA program
CRDATA.IN ---- Input file for CRDATA program
CRDATA.OUT ---- Output file from CRDATA program
At present two sets of secondary structure elements are used in the analysis. The first set comprises Helix1, Helix2, Sheet1, Sheet2, Turns and Unordered, and is used with the basis sets of 29 and/or 37 proteins. This set is necessary to obtain the number of helical and strand segments in the protein analyzed, since the two subcategories of helix and sheet correspond to the regular and distorted fractions, which are used in determining the number of segments. The second set is the secondary structure categories used by Johnson (1999), comprising alpha-helix, 3/10-helix, beta-sheet, PP2 helix, turns and unordered (other) structures. This should be used with the basis set of 23 proteins. To activate these options in SELCON3, see the details of the INPUT file below.
FILE : SSDATA.ext (From Above)
ext --->  29; 37    23
SS  --->  6         6         4
Str1      Helix1    AHelix    Helix
Str2      Helix2    3/10Hel   Sheet
Str3      Sheet1    Sheet     Turns
Str4      Sheet2    Turns     Unord
Str5      Turns     PP2
Str6      Unord     Unord
                                                                   Code
Helix1:    Regular Helix Fraction (Ref. 8)                         H1
Helix2:    Distorted Helix Fraction (Ref. 8)                       H2
Sheet1:    Regular Sheet Fraction (Ref. 8)                         S1
Sheet2:    Distorted Sheet Fraction (Ref. 8)                       S2
AHelix:    Alpha-Helix Fraction (Ref. 11)                          H
3/10helix: 3/10-Helix Fraction (Ref. 11)                           G
Helix:     Alpha-helix and 3/10-helix combined                     H1+H2 OR H+G
Sheet:     Beta sheet (parallel and anti-parallel combined)        E (OR S1+S2)
PP2:       Poly(Pro)II type structure (Refs. 11 and 12)            P
Turns:     Beta turns                                              T
Unord:     Unordered (also called random-coil, unassigned,         O, U
           unstructured, remainder, etc.)
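The combinations in the code list (Helix = H1+H2 or H+G; Sheet = S1+S2 or E) can be applied to a result to compare fractions from different basis sets. The following is a minimal Python sketch, not part of SELCON3; the function name and the dictionary layout are illustrative assumptions. PP2 has no counterpart in the four-class scheme, so this sketch simply keeps it as a separate entry.

```python
def combine_classes(fractions):
    """Merge helix and sheet subclasses into combined Helix and Sheet fractions.

    `fractions` maps the one-letter codes of this README to fractions:
    H1, H2, S1, S2, T, U for the 29/37 protein sets, or
    H, G, E, T, P, O for the 23 protein set.
    """
    f = dict(fractions)
    if "H1" in f:
        # 29- or 37-protein set: regular + distorted fractions
        return {
            "Helix": f["H1"] + f["H2"],
            "Sheet": f["S1"] + f["S2"],
            "Turns": f["T"],
            "Unord": f["U"],
        }
    # 23-protein set: alpha-helix + 3/10-helix; PP2 kept separate (assumption)
    return {
        "Helix": f["H"] + f["G"],
        "Sheet": f["E"],
        "Turns": f["T"],
        "PP2": f["P"],
        "Unord": f["O"],
    }
```

For example, a result with H1 = 0.30 and H2 = 0.20 gives a combined Helix fraction of 0.50.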
Maximum Wavelength Range: 260 - 178 nm (1 nm intervals) - This is the Range for CD of reference proteins of 29 and 23 protein set.
Maximum Wavelength Range: 240 - 185 nm (1 nm intervals) - This is the Range for CD of reference proteins of 37 protein set.
The CD data of the protein analyzed should be within these ranges.
We recommend that CD data input be at least in the range 240-190 nm.
Number of Secondary Structures: 6
H1, H2, S1, S2, T, U for the 29 and 37 protein sets AND H, G, E, T, P, O for the 23 protein set (for CODES see above)
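The wavelength limits above can be checked before an INPUT file is prepared. The sketch below is illustrative Python, not code from SELCON3; the names BASIS_LIMITS, check_range and meets_recommendation are assumptions made for this example. It rejects data outside the reference-set range and returns the number of 1 nm data points.

```python
# Reference-set wavelength limits in nm (long, short), as stated in this README.
BASIS_LIMITS = {29: (260, 178), 23: (260, 178), 37: (240, 185)}


def check_range(wl_begin, wl_end, n_proteins):
    """Return the number of 1 nm points, or raise if the range is not allowed."""
    wl_max, wl_min = BASIS_LIMITS[n_proteins]
    if wl_begin <= wl_end:
        raise ValueError("data must run from longer to shorter wavelength")
    if wl_begin > wl_max or wl_end < wl_min:
        raise ValueError(
            "range %s-%s nm is outside %s-%s nm for the %s protein set"
            % (wl_begin, wl_end, wl_max, wl_min, n_proteins)
        )
    return int(round(wl_begin - wl_end)) + 1


def meets_recommendation(wl_begin, wl_end):
    """True if the data cover at least the recommended 240-190 nm range."""
    return wl_begin >= 240 and wl_end <= 190
```

For data from 260 to 178 nm with the 29 protein set, check_range returns 83 points.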
The original SELCON algorithm has been modified by incorporating the recent advances in the CD analysis (W.C. Johnson). For the methodology and the program written by Dr. W. Curtis Johnson please contact him.
The CD spectral matrix is constructed with the TEST protein spectrum as the first column. The proteins in the data base form the columns 2 to NPRT+1 in the order of decreasing similarity with the test protein CD. The structure matrix is also arranged in the same order. The initial guess forms the first column of the structure matrix.
(a) Considering all proteins and 5 SVD coefficients, the helix fraction is determined by an iterative process -- HelHJ. (HelHJ can form the initial guess and the rest are determined by similarity with the second column.) The output corresponds to that of the original method of Hennessey & Johnson (2).
(b) Now the basis set is varied from 6 to NPRT+1 and the number of SVD vectors is also varied from 1 to 7 (or basis prot - 2).
1. The solutions are screened using the first two selection rules -- if the number of solutions is < MinSol, the rules are relaxed to SUM-Rule 0.15; Fraction-Rule -0.055.
2. If the number of solutions is still smaller than MinSol, then MinSol is reduced till 1.
3. The process is iterated for self-consistency.
The results are approximately equal to those of the original SELCON method (6).
(c) Solutions are screened using the first three selection rules:
1. The solutions are screened using three selection rules -- if the number of solutions is < MinSol, the rules are relaxed to SUM-Rule 0.15; Fraction-Rule -0.055.
2. Comparison of Calculated spectra and Reconstructed spectra OR Comparison of Experimental spectra and Reconstructed spectra. Only those solutions with good values -- 0.25 delta(e) or less -- are collected. If very few solutions are obtained this is relaxed up to 0.50.
3. If the number of solutions is still smaller than MinSol, then MinSol is reduced till 1.
The results are approximately equal to those of the SELCON2 program (7).
The output also contains the number of HELICAL and STRAND segments as ESTIMATED from CD analysis (8)
(d) THE NEW SELECTION RULE: (11)
1. The fourth rule is now applied to further limit the solutions to a narrow range determined by the helix content. The solutions at the end of step (c) are screened using the maximum and minimum values of helix from the solutions and HelHJ, as follows:
If HelHJ > 0.65 -- all solutions with Helix > 0.65 are averaged
If 0.65 > HelHJ > 0.25 -- all solutions with Helix = (HelHJ+maxH) + or - 0.03
If 0.25 > HelHJ > 0.15 -- all solutions with Helix = (HelHJ+aveH) + or - 0.03
If HelHJ < 0.15 -- all solutions with Helix = (HelHJ+minH) + or - 0.03
2. If the number of solutions is still smaller than MinSol, then MinSol is reduced till 1.
The results should be similar to those obtained with the CDSSTR program of Curtis Johnson for most cases. For proteins with spectra dissimilar to those in the basis set, CDSSTR performs better by limiting and randomly selecting the basis set.
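The screening in steps (b) and (c) can be sketched in Python as follows. This is an illustration only, not the SELCON3 source. Several points are assumptions made for the sketch: "SUM-Rule 0.15" is read as |sum of fractions - 1| <= 0.15, "Fraction-Rule -0.055" as every fraction >= -0.055, the spectral comparison is taken to be a root-mean-square difference in delta(e), and "very few solutions" is taken to mean fewer than MinSol. The strict (unrelaxed) limits are not stated in this README, so the caller must supply them. The helix rule of step (d) is not covered.

```python
import math


def rule_screen(solutions, sum_tol, min_fraction):
    """Keep solutions passing the sum rule and the fraction rule."""
    return [
        s for s in solutions
        if abs(sum(s["fractions"]) - 1.0) <= sum_tol
        and min(s["fractions"]) >= min_fraction
    ]


def rmsd(a, b):
    """Root-mean-square difference between two spectra of equal length."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of points")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def select(solutions, experimental, min_sol, strict,
           relaxed=(0.15, -0.055), cd_limit=0.25, cd_relaxed=0.50):
    """Screen solutions; return (kept solutions, possibly reduced MinSol).

    Each solution is a dict with "fractions" and a reconstructed "spectrum".
    `strict` is the (sum_tol, min_fraction) pair tried first (caller-supplied).
    """
    kept = rule_screen(solutions, *strict)
    if len(kept) < min_sol:
        kept = rule_screen(solutions, *relaxed)      # relaxed sum/fraction rules
    close = [s for s in kept if rmsd(s["spectrum"], experimental) <= cd_limit]
    if len(close) < min_sol:                         # relax the spectral rule
        close = [s for s in kept if rmsd(s["spectrum"], experimental) <= cd_relaxed]
    # If there are still too few solutions, MinSol is reduced (never below 1).
    min_sol = max(1, min(min_sol, len(close)))
    return close, min_sol
```

The relaxed limits (0.15, -0.055, 0.50) and the 0.25 delta(e) limit are the values quoted in the steps above.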
DATA SETS AVAILABLE:
CDDATA.29 -- CD spectra of 29 proteins (178-260 nm range)
SSDATA.29 -- Sec Str data of 29 proteins. From Kabsch & Sander method. Sec. Str. are Helix1, Helix2, Strand1, Strand2, Turn, O. This is the Preferred set since data is between 178-260 nm.
CDDATA.37 -- CD spectra of 37 proteins (For 185-240 nm range ONLY)
SSDATA.37 -- Sec Str data of 37 proteins. 29 from the CDDATA.29 set and 8 from S. Venyaminov (5 denatured proteins). Use this particularly for proteins with large unordered contributions or proteins undergoing denaturation.
CDDATA.23 -- CD spectra of 23 proteins (178-260 nm range). These are selected by Curt Johnson for the CDSSTR program.
SSDATA.23 -- Sec Str data of 23 proteins. From Curt Johnson. Sec. Str. are A-Helix, 3-10 helix, B-sheet, Turn, PP2, O. This set can be used for comparison with results from the CDSSTR program of Curt Johnson. The Sec. Str. fractions for this set are different, as a different algorithm is used for assigning them.
P.S. Segment analysis WILL NOT be performed with Johnson's 23 protein set, since the secondary structure fractions are derived differently from the Kabsch and Sander method. You WILL also find the fractions obtained from the 29 and 23 basis sets DIFFERENT.
Options for Reference Set: (in File: INPUT) --- See below
Basis_1  Basis_2  Basis_3  SEGMENT
   1        0        0        1     --- 29 proteins; Segments
   0        0        1        1     --- 37 proteins; Segments
   0        1        0        0     --- 23 proteins; NO Segments
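The option table can be read as a mapping from the four flags to the reference files used. The Python sketch below is illustrative and not taken from SELCON3; the function name is hypothetical, and requiring exactly one basis flag to be set is an assumption of this sketch (this README does not say what the program does otherwise). Segment analysis is reported as disabled for the 23 protein set, following the P.S. above.

```python
def reference_files(basis_1, basis_2, basis_3, segment):
    """Return (CD file, SS file, segments_enabled) for the given flags."""
    flags = {29: basis_1, 23: basis_2, 37: basis_3}
    chosen = [n for n, on in flags.items() if on == 1]
    if len(chosen) != 1:
        # Assumption: exactly one reference set is selected per run.
        raise ValueError("enable exactly one of Basis_1, Basis_2, Basis_3")
    n = chosen[0]
    # No segment analysis with the 23 protein set, whatever SEGMENT says.
    segments_enabled = (segment == 1) and n != 23
    return "CDDATA.%d" % n, "SSDATA.%d" % n, segments_enabled
```

With flags 1 0 0 1 this gives CDDATA.29 and SSDATA.29 with segment analysis enabled.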
INPUT.SMP -- Sample of INPUT file. (DATA in Free FORMAT; Details BELOW)
# Parameters
# PRINT Basis_1 Basis_2 Basis_3 SEGMENT
0 0 0 1 1
#
# Title 1 line
SAMPLE INPUT: Myoglobin CD data (178-260 nm)
#
# WL_Begin WL_End Factor
260.00000 178.00000 1
#
# CD DATA (Long-Wavelength to Short-wavelength; 260-178 nm LIMITS)
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -0.01 -0.01
-0.01 -0.01 -0.04 -0.09 -0.15 -0.22 -0.31 -0.42 -0.57 -0.77 -1.01
-1.30 -1.66 -2.08 -2.56 -3.09 -3.66 -4.27 -4.92 -5.59 -6.23 -6.79
-7.25 -7.62 -7.91 -8.13 -8.26 -8.30 -8.25 -8.13 -7.98 -7.83 -7.70
-7.58 -7.45 -7.35 -7.31 -7.35 -7.46 -7.57 -7.59 -7.43 -7.02 -6.26
-5.11 -3.67 -2.04 -0.09 2.32 5.03 7.84 10.60 13.13 15.21 16.66
17.61 18.19 18.39 18.14 17.46 16.41 15.17 13.91 12.65 11.40 10.16
8.97 7.90 7.01 6.25 5.55 4.88
#
# IGuess Str1 Str2 Str3 Str4 Str5 Str6
0
Value: 0 -- Option Disabled; 1 -- Option Enabled
Parameters:
PRINT: The parameter for Printing intermediate results (for Debugging)
Basis_1: For CDDATA.29 and SSDATA.29 set, SEGMENT analysis (Recommended)
Basis_2: For CDDATA.23 and SSDATA.23 set. From Curt Johnson's new Program.
Basis_3: For CDDATA.37 and SSDATA.37 set. This includes 5 denatured spectra and the wavelength range is limited to 185-240 nm. (STILL UNDER DEVELOPMENT)
SEGMENT: 1 -- enables calculation of the number of segments; 0 -- disables it
Provide Title in Line 6.
In line 9 provide the range of the CD spectrum and the Factor by which the CD spectrum is multiplied. (The Factor is normally 1, but it can be changed, 0.95 - 1.05, to correct concentration errors.) If your CD data are in Molar Ellipticity, you can use a FACTOR of 0.00030303 (= 1/3300) to convert them to Delta(Epsilon) units.
CD DATA is provided in line 12 -- a TOTAL of (WL_Begin - WL_End + 1) VALUES at intervals of 1 nm. The input is FREE OF FORMAT, and hence you must supply exactly that number of values. You can have ONE CD data point per line, or TWO, or THREE, etc. DO NOT include the wavelength values with the CD values. (Line 12 is in fact a collection of lines containing CD DATA.) (HINT: This can be created from the ASCII or TEXT file of CD data either by cut-and-paste OR by manually entering the CD data - always from longer wavelength to shorter wavelength. If the file contains both wavelength and CD values, you can use the program CRDATA to read the CD values and create the input file for SELCON3.)
Line 15 has the Initial Guess. You can let the Program Make a GUESS. If IGuess = 0, Program makes the guess. If IGuess = 1, Provide values for Str1 to Str6.
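The INPUT layout described above can be generated with a short script. The following Python sketch is not part of the SELCON3 distribution; the function name and arguments are hypothetical. It follows the line layout of INPUT.SMP and checks that the number of CD values equals WL_Begin - WL_End + 1. Placing the six guess values on the same line as IGuess = 1 is an assumption, since the sample file only shows IGuess = 0.

```python
def build_input(title, wl_begin, wl_end, cd, flags=(0, 1, 0, 0, 1),
                factor=1.0, guess=None, per_line=11):
    """Return the text of a SELCON3 INPUT file.

    flags = (PRINT, Basis_1, Basis_2, Basis_3, SEGMENT); cd runs from
    the longer to the shorter wavelength at 1 nm intervals.
    """
    n_expected = int(round(wl_begin - wl_end)) + 1
    if len(cd) != n_expected:
        raise ValueError("expected %d CD values, got %d" % (n_expected, len(cd)))
    lines = [
        "# Parameters",
        "# PRINT Basis_1 Basis_2 Basis_3 SEGMENT",
        " ".join(str(f) for f in flags),
        "#",
        "# Title 1 line",
        title,                                            # line 6
        "#",
        "# WL_Begin WL_End Factor",
        "%.5f %.5f %g" % (wl_begin, wl_end, factor),      # line 9
        "#",
        "# CD DATA (Long-Wavelength to Short-wavelength; 260-178 nm LIMITS)",
    ]
    for i in range(0, len(cd), per_line):                 # "line 12" onwards
        lines.append(" ".join("%.2f" % v for v in cd[i:i + per_line]))
    lines += ["#", "# IGuess Str1 Str2 Str3 Str4 Str5 Str6"]
    if guess is None:
        lines.append("0")                                 # program makes the guess
    else:
        if len(guess) != 6:
            raise ValueError("an initial guess needs six values, Str1 to Str6")
        # Assumption: the six values follow IGuess on the same line.
        lines.append("1 " + " ".join("%.3f" % g for g in guess))
    return "\n".join(lines) + "\n"
```

The returned text can be written to the file INPUT in the directory that holds SELCON3 and the data files.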
1. Create the CD data -- It should be in Delta(Epsilon) units. You can do it from any text editor. The CD data should be within the range, Maximum: 260-178 nm; Minimum: 240-190 nm. (Can be created using program: CRDATA)
2. Edit the file INPUT -- this is the input file for SELCON3. Leave the lines with the first character # as is. Provide the data required as described in the INPUT file description.
3. Once the INPUT file is created, type SELCON3 in a MS-DOS shell -- where the program and the DATA files exist. (Keep all data files and programs in the same directory)
OR Click on the icon for SELCON3 (It can be created; from file-manager or explorer for windows 95 create a short cut and place it on desktop)
OR double-click on the program name from file-manager (or explorer). The program runs silently (opening an MS-DOS shell if the Windows file manager is used) and creates the files OUTPUT and BASIS.
This PROGRAM can be compiled and run on any machine with a FORTRAN-77 COMPILER.
OUTPUT
FILES: OUTPUT and BASIS
FILE BASIS:
BASIS: contains the information about the Basis set used--List of proteins, their secondary structures and CD spectra
Look in BASIS.SMP (Sample File)
FILE OUTPUT:
OUTPUT: contains the results which are partially explained in the output itself.
Look in OUTPUT.SMP (Sample OUTPUT file) -- CONTAINS COMMENTS ABOUT RESULTS
FILE CDOUT:
CDOUT: contains the digitised CD data corresponding to the final plot in the OUTPUT file. This can be imported into any plotting routine. (Suggested by Don Gray and Norma Greenfield). March 8, 99
One can add new PROTEINS to the REFERENCE SET. The program is currently dimensioned to handle 44 reference proteins (MxPR = 45 in the Parameter Statement). Simply add the protein CD data (in the appropriate range: 260-178 nm for the 29 and 23 protein sets and 240-185 nm for the 37 protein set) in the format 15F6.2, and a name for the added protein. DO NOT change the name of the file. You will also have to add the Sec. Str. DATA, in the same way, to the corresponding SSDATA file. As noted earlier, the Sec. Str. DATA for the 29 and 37 protein sets is from DSSP, and that for the 23 protein set is from XTLSTR from W.C. Johnson. These programs are in the public domain.
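The FORTRAN format 15F6.2 means up to 15 values per record, each written in a field 6 characters wide with 2 decimal places. The Python sketch below (not part of the distribution; the function name is hypothetical) writes CD values in that layout. Where the protein name goes in the file is not specified here, so compare with the existing entries in the CDDATA file before adding data.

```python
def format_15f6_2(values):
    """Return lines holding `values` in the FORTRAN layout 15F6.2."""
    lines = []
    for i in range(0, len(values), 15):
        fields = []
        for v in values[i:i + 15]:
            field = "%6.2f" % v
            if len(field) > 6:
                # FORTRAN would print asterisks for a value too wide for F6.2
                raise ValueError("%r does not fit in an F6.2 field" % (v,))
            fields.append(field)
        lines.append("".join(fields))
    return lines
```

For example, the values 0.0, -1.3 and 18.39 are written as the single line "  0.00 -1.30 18.39".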
This is a driver program to create the data for SELCON3 from any CD data file in ASCII or TEXT format. The CD data should be within the wavelength ranges required for the basis sets and should be in the same ORDER as the wavelength range given as INPUT. The program asks for the TITLE, the WAVELENGTH RANGE and the name of the FILE containing the CD data. It is assumed that the CD data file has two values per line, wavelength and CD. If the data are in molar ellipticity units they can be converted to delta(e) units. The output is saved as TEST.DAT, which has to be COPIED to the file INPUT. Of course, one can edit the file. Type CRDATA in an MS-DOS shell, or double-click on the name of the file from file manager/explorer, or create an icon and click on it. The program prompts you for the title, etc. and creates the file TEST.DAT, which you SHOULD copy to the file INPUT before running SELCON3.
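The kind of conversion CRDATA performs can be sketched in Python. This is not the CRDATA source; the function name is hypothetical, and sorting the points from longer to shorter wavelength is a convenience of this sketch rather than documented CRDATA behaviour. Lines that do not start with two numbers are skipped, and molar ellipticity is divided by 3300 as described for the Factor in the INPUT file.

```python
def read_cd_pairs(text, molar_ellipticity=False):
    """Parse 'wavelength CD' lines; return (wl_begin, wl_end, cd_values).

    cd_values run from the longer to the shorter wavelength, in
    delta(e) units if molar_ellipticity is True.
    """
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            pairs.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue                      # header or comment line
    if not pairs:
        raise ValueError("no wavelength/CD pairs found")
    pairs.sort(key=lambda p: p[0], reverse=True)   # long to short wavelength
    wl = [p[0] for p in pairs]
    for a, b in zip(wl, wl[1:]):
        if abs((a - b) - 1.0) > 1e-6:
            raise ValueError("data must be at 1 nm intervals")
    divisor = 3300.0 if molar_ellipticity else 1.0
    return wl[0], wl[-1], [cd / divisor for _, cd in pairs]
```

The three returned values correspond to WL_Begin, WL_End and the CD DATA block of the INPUT file.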
The files listed above can be copied via anonymous ftp from 129.82.125.151 using the commands below. The executables should be copied using binary transfer. If you encounter any problems please contact sreeram@lamar.colostate.edu.
The files can also be obtained by clicking on their names (above) and saving them appropriately.
ftp 129.82.125.151
login: anonymous
passwd: your_name
bin
cd pub/SELCON3
mget *.* OR get file_name
*** This will be updated less frequently than the Web Page; please check the web page first or contact the author ***
Previous versions of SELCON (SELCON, SELCON1 and SELCON2) are also available at our anonymous ftp site (129.82.125.151). They are placed in the directory pub/CD_spectroscopy, and the oreadme and readme files explain the differences and the data files required for running them. They can also be obtained from http://lamar.colostate.edu/~sreeram/SELCON by clicking on the relevant files and saving them.