My summary of CASP1

There are two parts to this missive about the conference I attended in December, 1994: the first meeting on the Critical Assessment of protein Structure Prediction methods (CASP1). The first part deals with my escapades in the sub-continent that is California. The second is about the actual details of the conference.

We arrived in San Francisco on Friday and went to Chinatown for dinner. It's a cool city with a lot of strip bars is all I can say. I'd heard a great deal about the Bay Area, and I finally got to see what it was all about.

The next day we headed out along the coast all the way down to Big Sur, where you encounter a spectacular meeting of the cliffs and the sea. The Asilomar conference centre is located right on the Monterey coast, on pebble beach. My room overlooked the ocean, which was pretty wild. The weather was cool, but the sun shone for the most part and one could take long walks on the beach or drive around the peninsula to see the otters, seals, and other wildlife during the afternoon breaks. Also to be seen are interesting attractions like the Lone Cypress. There were a lot of surfers in wet suits on the beach, but the water was too cold to swim in.

The state is spectacular! It is no wonder that some Californians would like to separate themselves from the rest of the US. They have everything! I didn't get to see the deserts, but once we had explored the Monterey peninsula, we headed out east towards the mountains. We first went to the Sequoia and Kings Canyon National Parks, the home of the giant Sequoia and redwood trees. General Sherman, the largest Sequoia tree, is also the largest living thing we know of. The trees are gigantic (going up to 94 metres with a diameter of 12+ metres) and old (3000+ years). Standing next to these immortal objects, you feel extremely humble, and it is one of the most spectacular things I've ever seen. The view at various spots in the mountains is thrilling. We then went to Yosemite National Park, an example of mountains and valleys primarily shaped by glacial advances. It wasn't too cold, and there wasn't much snow on the roads, but it had snowed a week earlier and the lingering whiteness added to the whole scenic beauty. All the falls along the way were shrouded with ice around the main water stream. The shining mountains and the deeply cleft valleys are a demonstration of nature's sculpting ability.

The mountains are a great place to laze around and relax, basking in the warmth of the sun. As we descended to the Yosemite valley, I saw an extremely lucid display of Jesus rays shining down into the heart of the valley. While I didn't come up with exciting inspirations about protein folding here, I discovered something about the shoes I bought in August: one of them's a size 10 and the other's a 9.5. The last stop we made was at the Moaning Caverns, where a brilliant display of huge stalactites and stalagmites exists.

That's a bloody large tree, mate!

Nyah, nyah! You missed!

We referred to the conference here as the Protein Folding Competition. The difference between the words "structure prediction" and "protein folding" is hardly noticed here at CARB. The way I (we) approach protein folding and structure prediction is the same: understand how a protein folds up in nature and see if you can simulate it. This approach will, at the very least, tell us whether it is possible to predict structure using an algorithm similar to what nature uses, provided we understand how folding happens in our cells. At the conference, however, the two issues were treated as completely separate.

There were three categories at the conference (which was the first time I was presenting something), representing the classes of methods presently used to predict protein 3D structure: homology modelling, where a sequence-sequence relationship is exploited (two sequences with more than 25% identical amino acids in an alignment generally have the same topological fold); threading, where a sequence-structure relationship is exploited (two sequences with less than 25% identity could still adopt the same fold); and ab initio prediction, where no prior information about the structure of the sequence is used.
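To make the 25% rule of thumb concrete, here is a minimal Python sketch of how percent identity over a pairwise alignment might be computed and used to pick a category. The function names, the "-" gap character, and the choice to count only columns where both sequences have a residue are assumptions made for illustration; other definitions of identity use a different denominator, and this is not the procedure any group at the meeting used.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of gap-free alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def suggested_category(identity: float, threshold: float = 25.0) -> str:
    """Apply the rule of thumb from the text: above 25% identity, model by homology."""
    return "homology modelling" if identity > threshold else "threading"


if __name__ == "__main__":
    # Toy aligned sequences, invented for this example.
    a = "MKT-AYIAKQR"
    b = "MKSGAYLA-QR"
    identity = percent_identity(a, b)
    print(f"{identity:.1f}% identity -> {suggested_category(identity)}")
```

The threshold is only a guide: as the threading category shows, sequences below it can still share a fold, so a low identity says "try something else", not "different fold".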

The format was such that at the beginning of each day for a given category, the assessor of the entries would tell us how we did, and then each of the groups would make a presentation on their method. The afternoon consisted of a poster session and a computer demonstration, and in the evening there was another set of talks, mostly on specific topics that weren't touched upon in the morning or that would generate discussion. The good thing about the format was that each speaker, except for the assessor, spoke only for 10-15 minutes.

We submitted three models in the homology modelling category, and Mike James, the assessor for the comparative modelling category, decided not to display the names of the people who submitted the structures. Thus it was hard to judge who did best, but from looking at the numbers, and our own assessments, we either did the best or the second best (I know we did best for at least one of the three models we submitted). The day itself was somewhat disappointing, in that very little progress has been made in the field of comparative modelling. We still couldn't predict loops/insertions accurately, and worse, a lot of groups got their sequence alignments wrong, which is definitely a no-no.

While most groups chose the automated approach, we chose to go for the visual approach, inspecting everything we did at every stage. In the case of the alignments, I managed to spot a couple of misalignments and I corrected them by hand. One of the errors was spotted by direct inspection of the sequence patterns (i.e., I realised that if I moved one stretch of the sequence by ~20 residues, it would make the overall alignment better), and the other was spotted by generating the initial structure, realising the structure didn't make complete sense, re-evaluating the alignment and correcting it. Getting the alignments right is apparently a problem, and the way to get around it is to (i) generate more than one plausible alignment (i.e., two or more alignments that differ non-trivially), and (ii) use structure information. This summarises the little presentation I gave in the evening about why we got all our alignments correct. In an evening session where many people ended up touting their own methods, my assessment of the problem at hand was well received. I also presented a poster on the three structures we modelled. The final results were published in an issue of Proteins: Structure, Function, and Genetics in June 1995.

The second category was threading, and this was probably the most encouraging thing to come out of the competition. Almost all the folds that were in the competition were predicted correctly by at least one group. But playing "name the fold" is useless if a complete structure that is accurate cannot be generated (in fact, one could've bettered everyone else by simply saying "TIM barrel" for every entry, since 8/24 proteins in the competition belonged to that class of folds). And here, people will ultimately run into the same problems that were mentioned in the homology modelling category: loops/insertions and packing of side chains will still be a problem. The ironic thing was that it was shown rather conclusively that one needn't do computationally expensive sequence/structure comparisons to recognise folds; one could do so by simply generating highly sensitive multiple sequence alignments using Hidden Markov Models (HMMs). Contrast this with the message that came out of the homology modelling experiments, where it was emphasised that structural information was necessary to generate more accurate alignments. This is the area where I have been doing the most research in the past year (the modelling I did for this competition was just an aside), so I got quite a bit out of this day's sessions.
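The "TIM barrel for every entry" remark is a majority-class baseline, and a short Python sketch shows the arithmetic. Only the 8-out-of-24 split comes from the text above; the other sixteen labels are placeholders invented for the example (each made distinct so that TIM barrel is the most common fold), and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline(true_folds):
    """Return the most common fold label and the fraction of targets it gets right."""
    if not true_folds:
        raise ValueError("need at least one target")
    label, hits = Counter(true_folds).most_common(1)[0]
    return label, hits / len(true_folds)


if __name__ == "__main__":
    # 8 TIM barrels out of 24 targets, as stated in the text;
    # the remaining 16 labels are placeholders, not real CASP1 data.
    folds = ["TIM barrel"] * 8 + [f"other fold {i}" for i in range(16)]
    label, accuracy = majority_baseline(folds)
    print(f"always guessing '{label}' names {accuracy:.0%} of folds correctly")
```

Naming a third of the folds without looking at a single sequence is the bar any fold-recognition score has to clear before it means much.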

The third category was ab initio prediction, and this ended up being a bit of a mess. Secondary structure prediction, usually looked down upon by real ab initio folders, was what was mostly displayed. It was rather unexciting to learn that we can do a couple of percentage points better (mainly through the use of multiple sequences) than we could 15 years ago. There were only two groups (including us; we submitted two entries for this category but I had nothing to do with them) that did actual simulations of protein folding. Both groups had some success in folding an all-helical peptide.

From an engineering perspective, we seem to have accomplished a bit in terms of predicting structure. We are still a long way from doing so accurately, but fold recognition seems to work to a certain degree. In terms of science, I learnt absolutely nothing. There was too much focus on building structures by whatever methods, and the real physical/biological nature of why a certain method works or doesn't work was not addressed. The ab initio category, which represented the main possibility of the science of protein folding coming through, failed to ignite. Jan summed it up appropriately when he said "the problem is that people treat this as a computer science problem". And this is my biggest complaint and disappointment about the entire conference. There is no intellectual satisfaction when one treats it as an engineering or an applied computer science problem. Sure, there might be a practical solution, but I, for one, don't claim to work on the protein folding problem for the practical implications. All that said, given that this was the first time something like this has ever been organised, it was a tremendous success and gave me plenty of ideas as to how to go about finishing the current project I'm working on, and begin work on my Ph.D. proposal.

Samudrala Computational Biology Research Group (CompBio) || Ram Samudrala || me@ram.org || December 2-16, 1994