Obtaining and Preparing Ligand PDB files

This guideline is taken directly from ligand_interface.py. The examples below can be tested using test_lig.pdb, ATP.mdl, molfile2params, and ligand_interface.py.

PDB files are central to structural bioinformatics and structure prediction. They are most easily obtained from the RCSB, but may contain variability that makes them incompatible with PyRosetta. Ligands are tricky because PyRosetta must know the ResidueType of the ligand. There is no generic ResidueType, and although the chemical information in a ligand file is sufficient to define one, it is generally unreadable to PyRosetta. Presented here are two methods of introducing new (fullatom) ResidueTypes: one temporary (loaded for the script or interpreter session), the other permanent (appended to the database). Both processes start by obtaining the proper .params files.

PyRosetta does not perform any of the initial changes required to improve the accuracy or change the file format of a ligand. When docking a flexible molecule, it is best to use multiple trials with each conformer separately. PyRosetta cannot be used to generate conformers.

Tools

Make sure you have downloaded molfile2params. The data files ATP.mdl and test_lig.pdb can be used to test this procedure. For a complete example, please consult the sample script ligand_interface.py.

Procedural Outline

1) Obtain the ligand .mdl file

-skip this step if the compound is present in PyRosetta (rare)

-refine the chemical data (no PyRosetta tools for this)

-convert the chemical data file to .mdl if necessary (try babel or openbabel)

-if necessary, generate conformers of the compound as separate files

2) Produce a .params file from the .mdl file using the script molfile_to_params.py

-skip this step if the compound is present in PyRosetta (rare)

-this will also yield a .pdb file which may be needed in step 3.

python molfile_to_params.py <MDL filename> -n <ResidueType name>

3) Produce the ligand-protein complex PDB file

-obtain the desired protein PDB file (see pose_structure.py)

-clean the PDB of other undesirable lines (HETATMs, waters, etc.)

-manually insert the lines from the .pdb file produced in step 2 into the protein PDB file after all protein chains**

4) Check the ligand-PDB file to ensure:

-the ligand residue column matches the ResidueType name used in step 2.

-the ligand chain is named "X" (a convention)

-the ligand chain occurs after all protein chains**

5) Load the ligand-protein complex PDB into PyRosetta by:

>Temporarily creating a ResidueTypeSet (Method 1)

-create a ResidueTypeSet using generate_nonstandard_residue_set, providing it a Vector1 of .params filenames

-create an empty Pose object

-load the PDB file data into the pose using pose_from_pdb, providing the ResidueTypeSet as the second argument

>Permanently modifying the chemical database (Method 2)

-if using a new ligand/ResidueType:

-place the new .params file into the following directory (inside the PyRosetta main directory):

/rosetta_database/chemical/residue_type_sets/fa_standard/residue_types

-add the path to the new .params file to the file

/rosetta_database/chemical/residue_type_sets/fa_standard/residue_types.txt

-if the ResidueType is present, but "turned off"

-uncomment (or add) the path to the .params file in

/rosetta_database/chemical/residue_type_sets/fa_standard/residue_types.txt
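The checks in step 4 can be sketched in plain Python without PyRosetta. This is a minimal sketch, not part of ligand_interface.py: the function name check_ligand_pdb is hypothetical, and it assumes the standard fixed-column PDB layout (residue name in columns 18-20, chain ID in column 22).

```python
def check_ligand_pdb(pdb_lines, residue_name, ligand_chain="X"):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper: assumes fixed-column PDB ATOM/HETATM records.
    """
    problems = []
    ligand_seen = False

    def report(message):
        # Report each kind of problem only once.
        if message not in problems:
            problems.append(message)

    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        line = line.rstrip("\n").ljust(22)
        res_name = line[17:20].strip()  # PDB columns 18-20
        chain = line[21]                # PDB column 22
        if chain == ligand_chain:
            ligand_seen = True
            if res_name != residue_name:
                report("chain %r contains residue %r, expected %r"
                       % (chain, res_name, residue_name))
        elif ligand_seen:
            report("chain %r occurs after ligand chain %r"
                   % (chain, ligand_chain))
    if not ligand_seen:
        report("no atoms found in chain %r" % ligand_chain)
    return problems
```

Calling check_ligand_pdb(open("test_lig.pdb").readlines(), "ATP") on a correctly prepared complex should return an empty list; any returned strings describe what to fix before loading the file.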

Obtaining Ligand Data Files

Chemical formats are painfully nonstandard. Depending on your application and resources, there are numerous options (not discussed here) for obtaining data files for ligand compounds. Numerous chemical databases exist online (PubChem etc.), as do tools for building your compound. The specific properties (partial charges, bond lengths and angles, etc.) may be refined using other software*. Depending on the ligand, multiple conformers may be necessary. Different software packages produce different chemical formats, so a conversion tool, such as babel or openbabel (openbabel.org), may be required to convert your file into the .mdl format for use with the PyRosetta database. If starting from an RCSB crystal structure, you can use PyMOL's "Save Molecule" feature to produce an .mdl file of a ligand (the file extension appears as ".mol"). Molecular Networks (http://www.molecular-networks.com/) offers a free online demo for chemical file format conversion (http://www.molecular-networks.com/online_demos/convert_demo).

Converting to Params Files

The script molfile_to_params.py (together with the supporting scripts it needs) is provided for converting an .mdl file into a .params file (required for the PyRosetta database) and a .pdb file. Execute this script from the command line, providing the .mdl file as the first argument and the ResidueType name as option "-n". For the ATP.mdl file provided, the example command-line call would be:

>python molfile_to_params.py ATP.mdl -n ATP

This example will produce two files, "ATP.params" and "ATP_0001.pdb". The .params file is necessary for the PyRosetta database because it defines the "ATP" ResidueType. The .pdb file is produced for grafting the ligand into a PDB file (the next step).
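When several ligands or conformers need .params files, the same call can be driven from Python. The following is a minimal sketch: the helper names are hypothetical, it assumes molfile_to_params.py sits in the working directory and that "python" on the PATH is an interpreter the script supports, and it assumes the output names follow the "ATP.params" / "ATP_0001.pdb" pattern described above.

```python
import subprocess


def params_command(mdl_file, residue_name, script="molfile_to_params.py"):
    """Build the command line shown above as an argument list."""
    return ["python", script, mdl_file, "-n", residue_name]


def expected_outputs(residue_name):
    """File names the conversion is described as producing (assumed pattern)."""
    return [residue_name + ".params", residue_name + "_0001.pdb"]


def make_params(mdl_file, residue_name, script="molfile_to_params.py"):
    """Run the conversion and return the expected output file names."""
    # check_call raises CalledProcessError if the script exits with an error.
    subprocess.check_call(params_command(mdl_file, residue_name, script))
    return expected_outputs(residue_name)
```

For the example files, make_params("ATP.mdl", "ATP") runs the same command as the call shown above.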

Preparing Ligand PDB Files

Now that the ResidueType is defined, the PDB file for ligand interface prediction can be made. If the PDB file already has the ligand present, ensure that its residue name column (PDB file format) is set to the ligand ResidueType name ("ATP" for the example case). It is common practice to rename the ligand chain to "X". If the ligand is not already present in the PDB file, insert it manually (using PyMOL, grep, awk, Python, Biopython, or whatever technique you prefer). The script molfile_to_params.py provides a sample .pdb file for this purpose. Ensure that the final PDB file has the proper ResidueType name and chain ID. As with DNA-protein PDB files, the ligand chain should be last**.
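One way to do the manual insertion in plain Python is sketched below. This is an illustration rather than the method used by ligand_interface.py: the function names are hypothetical, the fixed-column PDB layout is assumed, and dropping every HETATM record from the protein is a blunt form of "cleaning" that also removes any modified residues stored as HETATM.

```python
def relabel_ligand(ligand_lines, residue_name, chain_id="X"):
    """Set the residue name (columns 18-20) and chain ID (column 22)."""
    if len(residue_name) > 3 or len(chain_id) != 1:
        raise ValueError("PDB residue names hold 3 characters, chain IDs 1")
    relabeled = []
    for line in ligand_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, CONECT records, etc.
        line = line.rstrip("\n").ljust(26)
        relabeled.append(line[:17] + residue_name.rjust(3)
                         + line[20] + chain_id + line[22:])
    return relabeled


def graft_ligand(protein_lines, ligand_lines, residue_name, chain_id="X"):
    """Return protein ATOM/TER records followed by the relabeled ligand."""
    protein = [line.rstrip("\n") for line in protein_lines
               if line.startswith(("ATOM", "TER"))]
    # The ligand goes last, after all protein chains.
    return protein + relabel_ligand(ligand_lines, residue_name, chain_id)
```

A typical use would read the cleaned protein file and ATP_0001.pdb with readlines(), call graft_ligand(protein, ligand, "ATP"), and write the returned lines (joined with newlines) to the complex PDB file.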

Loading a Ligand PDB File into PyRosetta

Method 1: Temporarily using generate_nonstandard_residue_set

Inside the relevant script or interpreter, create a non-standard ResidueTypeSet using the method generate_nonstandard_residue_set and use pose_from_pdb to load the data into a Pose object. The method pose_from_pdb is overloaded such that it can accept a Pose (pose), a ResidueTypeSet (residue_set), and a string (filename); it loads the data in the PDB file filename into pose, using residue_set to define any unknown residues. This method is preferred when the ligand is transitory (default). Permanently adding the ligand increases the amount of data held at any time and may slow PyRosetta if too many are added.
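Method 1 can be sketched as follows. The argument order follows the description above (Pose, ResidueTypeSet, filename). The wrapper name load_ligand_pose is hypothetical, and importing these names from a top-level "rosetta" module is an assumption about the PyRosetta release this guide targets; other releases may use a different module name or call signature.

```python
import os


def load_ligand_pose(pdb_filename, params_filenames):
    """Load a ligand-protein complex using a temporary ResidueTypeSet."""
    # Fail early with a clear message if an input file is missing.
    for path in [pdb_filename] + list(params_filenames):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)

    # Assumption: these names are importable from `rosetta`, and
    # rosetta.init() has already been called by the surrounding script.
    from rosetta import (Pose, Vector1,
                         generate_nonstandard_residue_set, pose_from_pdb)

    # Build a ResidueTypeSet that also knows the new ligand ResidueTypes.
    residue_set = generate_nonstandard_residue_set(
        Vector1(list(params_filenames)))

    # Load the complex into an empty Pose using that ResidueTypeSet.
    pose = Pose()
    pose_from_pdb(pose, residue_set, pdb_filename)
    return pose
```

For the example files this would be called as load_ligand_pose("test_lig.pdb", ["ATP.params"]).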

Method 2: Permanently by altering the PyRosetta database

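Place (or copy) the new .params file somewhere in the PyRosetta fullatom chemical database. Inside the PyRosetta main directory (the database directory is named rosetta_database or minirosetta_database, depending on the PyRosetta release), place files within: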

/minirosetta_database/chemical/residue_type_sets/fa_standard/residue_types

You must also add the path to the new ligand in the file:

/minirosetta_database/chemical/residue_type_sets/fa_standard/residue_types.txt

The database has many unused compounds. To activate these, simply uncomment the necessary line in residue_types.txt.
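The two edits for Method 2 (copying the file and listing it in residue_types.txt) can be scripted. This is a minimal sketch with a hypothetical function name; it assumes residue_types.txt holds one path per line, relative to the fa_standard directory, with "#" marking a commented-out entry. Back up the database before modifying it.

```python
import os
import shutil


def register_params(fa_standard_dir, params_file):
    """Copy params_file into residue_types/ and activate its listing."""
    target_dir = os.path.join(fa_standard_dir, "residue_types")
    target = os.path.join(target_dir, os.path.basename(params_file))
    if os.path.abspath(params_file) != os.path.abspath(target):
        shutil.copy(params_file, target)

    # Assumed entry format: a path relative to fa_standard_dir.
    entry = "residue_types/" + os.path.basename(params_file)
    listing = os.path.join(fa_standard_dir, "residue_types.txt")
    with open(listing) as handle:
        lines = handle.read().splitlines()

    for index, line in enumerate(lines):
        # Matches the entry whether it is active or commented out with '#'.
        if line.lstrip("# \t").strip() == entry:
            lines[index] = entry  # uncomment (or leave active)
            break
    else:
        lines.append(entry)  # not listed yet: add it at the end

    with open(listing, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return entry
```

The fa_standard_dir argument is the .../chemical/residue_type_sets/fa_standard directory given above; the function returns the entry it wrote so the change can be logged or reverted.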

When preparing PDB files for docking, remember that the two chains to dock must be part of the same Pose object. This is easily achieved by creating a PDB file which includes both partners. If only interface structure prediction (high-resolution) is used, the PDB file MUST contain the molecules ORIENTED properly for the interface or the sampling will rarely find a proper structure.

Methods for downloading and generically "cleaning" PDB files should accompany future PyRosetta releases.

*PyRosetta DOES NOT perform ANY of these refinements or predictions. Simply creating a molecule and introducing it to PyRosetta will rarely cause an error, although the results may be poor because the compound is inaccurately represented.

**Otherwise the protein may be moved significantly by the protocol. These methods were designed for use with two-body docking problems, and there is a hard-coded definition of "upstream" and "downstream" for docking partners. The position of the upstream docking partner is held constant while the downstream partner is altered by rigid-body perturbations. This does NOT affect the accuracy of predictions but can be an annoyance, since the protein coordinates can change significantly even though its conformation will not.