Rosetta/PyRosetta API reference
Because PyRosetta is a Python interface to the Rosetta macromolecular modeling package, it uses the same classes and functions as Rosetta. The Rosetta API documentation is therefore the primary source for in-depth information on the underlying types and Rosetta functionality. For an API reference of all bound enums, functions, and classes, please see [Rosetta/PyRosetta API reference]
Reading and Searching PyRosetta
PyRosetta is an easy-to-use interface to Rosetta objects and algorithms. Users familiar with Python will understand the basic syntax; however, the sophisticated object structures and dependencies are hidden and specific to Rosetta. Fortunately, the code is under constant development and new features are frequently added to PyRosetta. Unfortunately, this means syntax can change quickly and documentation can become outdated. Luckily, Rosetta is an immense box of tools and tricks for protein manipulation, and solutions to many problems, specific or general, exist within the code. Hunting for these methods and objects can be tricky and requires deeper knowledge of Rosetta architecture, development decisions, and C++ syntax. I hope to outline some tricks for searching Rosetta within PyRosetta, the knowledge you'll need to confidently test Rosetta objects, and a list of troubleshooting scenarios where Rosetta code runs but does not solve the problem it appears to solve.
Finding/Testing Objects and Methods
PyRosetta (specifically IPython) offers tab-completion which significantly speeds up searching Rosetta. With some knowledge of the objects you're hunting for, keywords that an appropriate data structure might use, and Rosetta's general architecture, you will probably be able to find an appropriate method if it exists. When using an object for the first time, use tab-completion to see all available methods.
Help messages, produced using "help <target>" or "<target>?", provide some explanation of the object or method. They also describe the inputs required and output produced (separated by "->"). It will look something like (for Pose.phi?):
phi( (Pose)arg1, (int)seqpos) -> float:
The data type of each input is given in parentheses preceding the input variable name. The output data type(s) are listed plainly after the "->" characters. Object methods include the object itself as the first argument (often indicated by a variable name such as "arg1"). Python supplies this argument automatically when the method is called on an instance, so you can ignore it. For the example above, the method Pose.phi only requires the user to specify one input argument, the "seqpos" integer. Setter methods, display methods, and some others don't return anything explicitly (the output appears as "void", meaning nothing), although some of these methods may return a boolean (True or False) indicating whether they succeeded.
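The role of the leading "arg1" can be seen with a small stand-in class. ToyPose below is hypothetical and is not part of PyRosetta; it only mimics the call shape of Pose.phi so that the sketch runs without Rosetta installed.

```python
class ToyPose:
    """Hypothetical stand-in for Pose; not a PyRosetta class."""

    def __init__(self, phis):
        self._phis = list(phis)

    def phi(self, seqpos):
        # Residues are numbered from 1, mirroring Rosetta.
        return self._phis[seqpos - 1]


pose = ToyPose([-60.0, -120.5, 57.3])

# Called on the instance: Python fills in the first argument ("arg1").
bound_call = pose.phi(2)

# Called on the class: the instance has to be passed explicitly.
unbound_call = ToyPose.phi(pose, 2)

print(bound_call, unbound_call)
```

Both calls reach the same method with the same two arguments, which is why the help message lists the object first even though you normally type only "seqpos".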
Recent updates to Rosetta should include updated help messages for many of the commonly used objects and methods. These generally indicate what objects are "safe" or "guaranteed to work". A lot of Rosetta is accessible within PyRosetta. Other methods generally work, but may not work intuitively or make assumptions you may not want. If you find yourself working at the interpreter with methods lacking extensive help messages, try writing a script to perform all steps with helpful output. Mistakes can still break PyRosetta (crash the IPython instance) and it WILL get tiring to constantly load in the same data (typing "from rosetta import*" will also get tiring).
NOTE: You can paste Python syntax, including an entire script, into the IPython interpreter and it will run. However, quoted strings used as comments (' or ") may cause syntax errors, and such an error does not stop the execution of subsequent lines of code.
When using a new object or method, use tab-completion, help messages, and experimentation to understand what data the object records or requires. Methods which have unintelligible names are probably not useful. Make sure you know how to change an object's data, what it is used for, and the changes or output it produces.
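Outside of IPython, a similar keyword search can be done with Python's built-in dir(). The helper below is a minimal sketch; the name find_methods is my own and is not a PyRosetta function. It is demonstrated on a built-in type so it runs without PyRosetta.

```python
def find_methods(obj, keyword):
    """Return public attribute names of obj that contain keyword."""
    keyword = keyword.lower()
    return [name for name in dir(obj)
            if not name.startswith("_") and keyword in name.lower()]


# Demonstrated on the built-in str type.
matches = find_methods(str, "strip")
print(matches)
```

With PyRosetta loaded, the same helper could be pointed at a pose or mover instance, and help() could then be called on each candidate it lists.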
Developing and Testing New Protocols
PyRosetta is a development tool for accessing and understanding the Rosetta program suite. Rosetta algorithms use a modified MCMC (Markov Chain Monte Carlo) protocol for predicting structures. When performing novel investigations, I recommend writing scripts which contain all relevant protocols in Python methods or classes followed by a main execution. Debugging novel protocols is very difficult and having an unorganized script can confound problems.
The PyMOL_Mover allows developers to easily inspect the inner workings of Rosetta protocols. Remember, Rosetta is an MCMC algorithm and, as such, there is no a priori evidence that the sampling techniques produce realistic intermediate structures. When viewing protocols in PyMOL, intermediates may appear realistic, but they are indicative of important interactions, not necessarily of the structures actually attained.
When debugging, please consult the "Common Problems" section below, the FAQ, and the RosettaCommons forums. Rosetta Movers and protocols are NOT devoid of all problems and often expect input in a specific format. Consult the sample scripts and other Rosetta summaries to ensure that a problem is not due to an ongoing issue.
Building PyRosetta from source
In some cases it might be beneficial to build PyRosetta from source instead of using pre-built binaries. Consider building from source in the following cases:
You need bindings for your own custom C++ code that is not part of the Rosetta master branch.
PyRosetta needs to run under a custom-built Python (for example, if you would like to use Python packages from MacPorts, Fink, or Homebrew, each of which deploys its own version of Python).
You need to use a different version of Python.
You can't install the GLIBC version required by the pre-built binaries.
You are running on a 32-bit system.
1. Installing required packages
Before PyRosetta can be compiled, the following packages need to be installed:
2. Building PyRosetta
After acquiring the Rosetta source package, execute the following commands:
python build.py -j24 --create-package $HOME/my_pyrosetta_package
python setup.py install
To build PyRosetta for a different Python version, replace `python` in the steps above with the appropriate Python executable. For a full list of available command line options, please run 'build.py --help'.
Rosetta has a myriad of coding conventions used to standardize the structure and ease communication, since hundreds of developers work on the code simultaneously. Several words and terms are intimately linked to the problems and paradigms of protein structure prediction and design. We have included a "dictionary" of relevant terms to ease usage; however, these do not elucidate the common naming conventions.
Nothing is completely standardized. There are many exceptions (sadly Pose has a lot of exceptions). If you find one...move on and try to remember it.
2. Object Naming
Objects are named using CamelCase: the first letter of each word is capitalized and individual words are combined to form the object name. Acronyms are treated as words and, as such, only the first letter is capitalized.
3. Method Naming
Methods (of objects) are named using underscores: each word is entirely lower case and words are separated by the character "_". Objects owning other objects have getters and/or setters to these objects named this way.
4. Exposed Methods
In PyRosetta, several methods are exposed and accessible directly. Since these are methods, they are named using the underscoring convention. Some notable exceptions are permitted if specific names are involved (for example, the convention of CA for an alpha carbon).
5. Getters and Setters
An object with getters and setters for specific data uses an overloaded method. Calling the method with no arguments returns the data (the getter) and calling the method with appropriate input sets the data (the setter).
Alternatively, some objects have getters and setters named such that the setter has the name of the getter method preceded by the word "set". This is often seen when a specific element of a larger object is set (typically Conformation and related objects).
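The two getter/setter styles described in this section can be mimicked in plain Python. ToyContainer and its method names below are hypothetical and do not correspond to any Rosetta object; the sketch only shows the two calling patterns.

```python
_UNSET = object()  # sentinel meaning "no argument was passed"


class ToyContainer:
    """Hypothetical stand-in; not a Rosetta or PyRosetta class."""

    def __init__(self):
        self._weight = 1.0
        self._label = "unnamed"

    # Style 1: one overloaded name acts as both getter and setter.
    def weight(self, value=_UNSET):
        if value is _UNSET:
            return self._weight      # no argument: getter
        self._weight = float(value)  # one argument: setter

    # Style 2: a plain getter plus a "set_"-prefixed setter.
    def label(self):
        return self._label

    def set_label(self, value):
        self._label = str(value)


box = ToyContainer()
box.weight(0.5)
box.set_label("docking run")
print(box.weight(), box.label())
```

In C++ the first style is written as two separate methods sharing a name, which is why PyRosetta help messages show more than one signature for such methods.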
6. Overloaded Methods
Overloaded methods will have multiple "C++ signature" lines in their help messages. These can be difficult to read, but each line reading "C++ signature :" (and its following line) marks the end of one overloaded method definition. For example, the MonteCarlo.score_function method is overloaded (as described in 5.) and its help appears as:
The far left numbers indicate which method call is defined, the first (1) or second(2).
If the object constructor is overloaded, the method will appear as "__init__". For example, FoldTree has an overloaded constructor:
7. Rosetta "Size" and "Real"
Within Rosetta, several simple objects are used for basic data structures. If these are seen within PyRosetta help, they can be replaced by their appropriate Python data type.
Size is an int
Real is a double or float (use float from Python)
Lists or similar structures exist as numerous different Array, Vector, and Matrix objects within Rosetta. If these are seen in PyRosetta help, you may make your own instance of the object (if you can't find it, try rosetta.numeric. and tab-completion).
8. 0 Indexing vs 1 Indexing
Rosetta has its roots in FORTRAN so counting is "1-indexed" (the first element is numbered 1). Python on the other hand is "0-indexed" (the first element is numbered 0). One major advantage of PyRosetta is the ability to extract complex structural data into easy-to-handle Python lists. Be careful when extracting data since the first element of a Rosetta object may require a "1" when using its getter method while the data is stored in a Python object requiring a "0" to access the first element. For example:
creates a Python list of the phi torsion angles of the PDB file ('my_favorite.pdb'). The value returned by p.phi(1) is stored in phis[0]. Some Rosetta objects will cause a Segmentation Fault (crashing the Python interpreter) if an improper value is referenced (such as residue 0 or residue Pose.total_residue()+1). All Rosetta objects are 1-indexed, including Vector and Matrix objects.
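The off-by-one bookkeeping can be sketched without PyRosetta. ToyPose below is a hypothetical 1-indexed stand-in for a real Pose (which would normally come from pose_from_pdb); only the phi and total_residue method names are taken from the text above.

```python
class ToyPose:
    """Hypothetical 1-indexed stand-in for Pose; not a PyRosetta class."""

    def __init__(self, phis):
        self._phis = list(phis)

    def total_residue(self):
        return len(self._phis)

    def phi(self, seqpos):
        # Rosetta may crash on an out-of-range residue; this stand-in raises.
        if not 1 <= seqpos <= self.total_residue():
            raise IndexError("residue %d is out of range" % seqpos)
        return self._phis[seqpos - 1]


p = ToyPose([-57.0, -119.0, -63.5, 60.0])

# Loop over Rosetta numbering 1..N and store into a 0-indexed Python list.
phis = [p.phi(i) for i in range(1, p.total_residue() + 1)]

print(phis[0] == p.phi(1))
```

Note the range(1, N + 1): a plain range(N) would ask for residue 0 first and skip the last residue.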
All ScoreType objects are accessible since these objects are the "keys" for EMapVectors (an object returned by some common methods).
The exposed methods "score_type_from_name" and "name_from_score_type" convert between the ScoreType object and its string representation.
10. Common Method Names
Although nonstandard, many similar methods have similar names. Luckily, tab-completion provides an easy way to learn if an object method exists.
Methods returning the number of objects contained:
Methods modifying the object to the input object
.assign( <object> )
Methods returning a complete copy object
Methods clearing, resetting, or unsetting data
11. "fullatom" vs. "centroid"
A Pose has a specific ResidueTypeSet defining which "structural mode" it is in. The only ResidueTypeSets used in PyRosetta are fullatom and centroid. Various keywords are specific to these:
12. User Variables
User variables can be named however you please. Most reference texts either follow the underscoring convention for naming object instances or use a number of common names.
For the record, the names are Rosetta, Python, IPython, PyRosetta, and PyMOL.
A Bit of Python
Python documentation and help online is very nicely done and well written. I am not going to replicate that information here. It is ESSENTIAL that you are comfortable with Python syntax to make use of PyRosetta. Here I will reiterate some minor aspects of the syntax that are helpful for using PyRosetta.
There are two main options for running a script:
-from the commandline: python my_script.py
-from the interpreter: run my_script.py
If you are interested in refining commandline-accessible Python scripts, I suggest using the optparse library packaged with Python (or argparse, which supersedes it in Python 2.7 and later) for managing arguments and options from the commandline.
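As a minimal sketch of commandline option handling, the example below uses argparse, the standard-library successor to optparse. The option names (--pdb, --cycles, --centroid) are invented for illustration and are not PyRosetta flags.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Illustrative options for a PyRosetta-style script.")
    parser.add_argument("--pdb", default="my_favorite.pdb",
                        help="input PDB file name")
    parser.add_argument("--cycles", type=int, default=10,
                        help="number of protocol cycles to run")
    parser.add_argument("--centroid", action="store_true",
                        help="use centroid mode instead of fullatom")
    return parser


# An explicit argument list is parsed here so the sketch runs anywhere;
# a real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["--pdb", "test.pdb", "--cycles", "25"])
print(args.pdb, args.cycles, args.centroid)
```

Running such a script with --help prints the option list, which is a cheap way to document a protocol script for other users.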
You can also directly import the contents of a script into the interpreter with the "import" and "reload" commands (in Python 3, "reload" is provided as importlib.reload). Any "stand-alone" methods in your script can be imported into the interpreter. After using the "import" command to bring a method into the interpreter, the "reload" command must be used to update it if you change the source. Successive calls to "import" will appear to succeed but will not update the methods based on changes to the source. Importing, modifying, and reloading methods can be an efficient way to develop a section of code.
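The difference between a repeated "import" and a "reload" can be demonstrated with a throwaway module. This is a sketch under stated assumptions: it uses importlib.reload (the Python 3 location of "reload"), and the module name my_protocol_demo and function n_cycles are invented for the demonstration.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo free of cached bytecode

workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "my_protocol_demo.py")


def write_module(value):
    """(Re)write the demo module so that n_cycles() returns value."""
    with open(module_path, "w") as handle:
        handle.write("def n_cycles():\n    return %d\n" % value)


write_module(10)
sys.path.insert(0, workdir)
importlib.invalidate_caches()

import my_protocol_demo
first = my_protocol_demo.n_cycles()

write_module(200)               # edit the source on disk
import my_protocol_demo         # appears to succeed, but changes nothing
stale = my_protocol_demo.n_cycles()

importlib.reload(my_protocol_demo)
fresh = my_protocol_demo.n_cycles()

print(first, stale, fresh)
```

Only the value read after the reload reflects the edited source; the second "import" silently reuses the module already held in memory.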
Many Rosetta objects are containers for other data. Constantly writing "for" loops can get tiring and is inefficient for interpreter inquiries. I suggest reading the Python documentation on Python data structures. Specifically, list comprehensions and dictionaries can save A LOT of time when using PyRosetta.
As an example of how useful Python libraries and syntax techniques can be, I will show how to create a list of all PDB files in the current directory using the os library:
which abstractly says "get the os tools, create a variable pdb_files that is a list of all files in the current directory ending in .pdb".
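The listing described above can be written as a single list comprehension. This is a minimal sketch of one way to do it; the variable name pdb_files is taken from the text, and str.endswith is my choice for the suffix test.

```python
import os

# Names of all entries in the current directory that end in ".pdb".
pdb_files = [name for name in os.listdir(os.getcwd())
             if name.endswith(".pdb")]

print(pdb_files)
```

The resulting list can be looped over directly, for example to load and score every structure in a directory.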
Many Unix commands are supported within the IPython interpreter (the plain Python interpreter needs os.chdir, os.listdir, and similar functions instead). You MUST be comfortable with the cd command for changing directories to effectively use PyRosetta...or decide to have ALL files you are using in one directory. The pwd, ls, rm, cp, and mkdir commands are also very useful.
Many tools for computational biology are written in Python but several may only be accessible from the commandline. Some programs can be called from the commandline from a Python script or the interpreter (os.system, subprocess, etc.).
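As a sketch of calling an external program from a script, the example below uses subprocess.run and captures the program's output as a string. The Python executable itself stands in for the external tool so that the example runs anywhere; a real call would list the tool's own name and arguments instead.

```python
import subprocess
import sys

# sys.executable stands in for an external commandline program.
command = [sys.executable, "-c", "print('hello from a child process')"]

# check=True raises an error if the program exits with a failure code.
result = subprocess.run(command, capture_output=True, text=True, check=True)
output = result.stdout.strip()

print(output)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.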
When using the PyMOL_Mover, you may need to check the IP address of the PyRosetta or PyMOL session. Both tools support a Python interpreter and an IP address (string) can be obtained in Python with:
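One common way to look up the local IP address uses the standard socket library. This is a sketch and may not match the snippet originally shown here; the helper name local_ip_address and the loopback fallback are my own additions.

```python
import socket


def local_ip_address():
    """Return this machine's IPv4 address as a string (best effort)."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Host name could not be resolved; fall back to the loopback address.
        return "127.0.0.1"


ip = local_ip_address()
print(ip)
```

On some machines the host name resolves to a loopback address, in which case the address reported by the operating system's network settings should be used instead.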
If you suspect your script has an error but the output passes too quickly for you to catch by eye, you can always run the script on the commandline and direct output into a text file:
python my_script.py > sample_output.txt
Since Rosetta output can be verbose and similar between methods, you may want to insert additional print lines so it is easier to locate your desired output (such as print("=" * 80)).
A Bit of C++
Rosetta is (currently) implemented in C++. PyRosetta users should be capable of accessing and understanding the code without extensive knowledge of C++ syntax. Some computer science terms are used ubiquitously in the Rosetta community and users without an extensive background in programming may find some terms or concepts foreign. Here I intend to expose the basic knowledge that will help you understand Rosetta structure and accompanying help messages where they are related to differences between Python and C++.
Rosetta is abstractly split into three layers or tasks: data containers (such as pose), data assessment (scoring), and data manipulation (moves and protocols). Data container objects are built for efficiency and rarely allow direct access to their information. Since access is indirect, many objects have "getter" methods for extracting data and "setter" methods for overwriting data. Data assessment and data manipulation objects often perform specific functions with variable options. Data manipulation is often applied directly on an input object. This is why many sample scripts produce copies of the original data container.
As noted above, a common convention for getter and setter methods is to use an overloaded method. An overloaded method (or more accurately, overloaded method name) occurs when multiple methods are created with the same name but different inputs. This can be confusing if you are unfamiliar with the concept but they can perform very different functions. When inspecting PyRosetta help messages, overloaded methods will be obvious since they list multiple "C++ signature :" lines and have multiple process explanations (descriptions of input and output, look for the same name preceding "->"). An explanation of the help message convention is above in Conventions 6. When using overloaded methods make SURE you are using the method you want. Although an overloaded method may change functionality, overloaded methods usually occur for getters and setters OR when multiple forms of input are acceptable.
When discussing Rosetta (or other code) architecture, it is common to refer to objects and methods abstractly. However, individual usage always involves an instance of the object or method call. Be aware of this subtlety in language since Pose (a name for the Rosetta object abstractly) and pose (a word for a specific instance of Pose or even the general concept) have different meanings. These nuances are relevant to PyRosetta since Python often hides more complex functionality and has no private data. In C++, an object must be created with a call to its "constructor", which ensures the object is set up to perform its necessary tasks. Since many Rosetta objects depend on one another, it can be difficult to know when objects are being created. Although combining creation and setting saves time, it is generally good practice to create an object with a single command and perform any related functions afterwards with subsequent commands (ensuring that the constructor is called properly). For example, the command:
creates a default ScoreFunction (the command ScoreFunction()), obtains a list of weights (the .weights() method) and accesses the value corresponding to fa_atr. Although this can be done in a single line, it should not. If you try a similar command, an incorrect result will be returned:
returns 0.0 even though 0.8 is the fa_atr weight for a 'standard' ScoreFunction. This error results due to complexities with construction however doing this in two commands where a ScoreFunction instance is created:
returns the proper value and is easier to understand.
A related concept used in Rosetta is the notion of a "Factory" object which returns instances of another object (usually the object of interest). Factory objects are often employed when a subset of options frequently change while other options do not (specifically when the object created may vary or if obtaining the object is more complex than simply constructing an instance of the class). For example, ScoreFunctions are essentially a list of ScoreType weights defining which score terms are relevant. Manually creating an object and setting all the weights can be tedious. A ScoreFunctionFactory can return a full ScoreFunction with pre-set weights. The "get_fa_scorefxn()" method above is actually a wrapper of a method within ScoreFunctionFactory which returns the current standard ScoreFunction. For PyRosetta, most Factory objects are "black-boxed" and work as desired.
There are many other nuances to Rosetta and PyRosetta syntax but many cases are specific.
Rosetta, and thus PyRosetta, has a general architecture which is useful to know when searching for objects and methods. Within PyRosetta, the most important Rosetta libraries to use are:
rosetta.core MANY basic Rosetta objects, chemistry/geometry, scoring
rosetta.numeric Vector and Matrix objects, other numerical applications
rosetta.protocols Movers and protocols
rosetta.utility additional useful objects and methods
If you are searching for specific Vector or Matrix objects, or otherwise suspect that what you are looking for is mathematical and independent of Rosetta data structures, the numeric and utility libraries are good places to look. If you are searching for specific Movers, protocols, or protocol methods, the protocols library is a good place to look. The core is divided into several sections, the most relevant to users being:
rosetta.core.chemical chemical information, residues, and residue types
rosetta.core.conformation protein geometry tools (full Residue information), for manipulation
rosetta.core.kinematics protein geometry tools (internal representation), for manipulation
rosetta.core.pose pose tools, the ultimate container
rosetta.core.scoring score function tools, score terms, and other assessment tools
Within PyRosetta, many objects contained within these libraries are exposed. It can be useful to know where these methods explicitly live. When importing new objects or methods, provide the full path (use "from rosetta.protocols.scoring import Interface", not "from protocols.scoring import Interface"); a few objects are not accessible through tab-completion.
The minirosetta_database accompanying PyRosetta includes all the information on chemistry and scoring used by PyRosetta. When looking for new ScoreFunctions or ResidueTypes, please check the following locations:
/minirosetta_database/chemical/residue_type_sets directory containing all ResidueTypeSet information
/minirosetta_database/chemical/residue_type_sets/fa_standard/residue_sets directory containing the fullatom .params files
/minirosetta_database/chemical/residue_type_sets/centroid/residue_sets directory containing the centroid .params files
/minirosetta_database/scoring/weights directory containing ScoreFunction .wts and .wts_patch files
It is useful to know the general structure of Pose since it is the most common representation of a molecule in Rosetta. This is NOT a proper discussion of the Pose data structure and may be misleading if you wish to understand the full complexity of Rosetta (the Pose data structure, and Rosetta architecture, is outlined in the publication A. Leaver-Fay et al., "ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules," Methods in Enzymology, 487, 545-574 (2011)).
A Pose contains
1) several Residue objects
2) an AtomTree object for chemical connectivity and bonding
3) a FoldTree object for propagating changes in structure
4) a Conformation object for storing geometry information
5) an Energies object for storing score information
6) a PDBInfo object for storing PDB information
(to reiterate, this is a practical abstraction of the Pose structure, NOT the actual structure)
In many applications, MoveMap and FoldTree objects accompany one another for defining a process; however, the Pose does not have a MoveMap. A FoldTree is required for a Pose to understand its full connectivity and which residues depend on one another. MoveMaps can be used in parallel with their corresponding poses and are not integral to the Pose structure. As such, specific MoveMaps and FoldTrees are often provided for an application.
Also, a single Pose defines the entire molecule or molecules of interest but is NOT necessarily a single protein. As such, it is impossible to dock two poses. It IS possible to join two separate poses into a single pose (at this point containing multiple molecules) and perform docking on this joined pose. Be aware of this subtlety since many applications will involve multiple protein chains.
Graphical User Interfaces
1. The PyRosetta Toolkit
The PyRosetta Toolkit is a GUI-addon to PyRosetta for setting up Rosetta filetypes, analyzing results, running protocols, and doing many other molecular modeling and design tasks. It is distributed with PyRosetta in the /GUIs/pyrosetta_toolkit directory. Please see the documentation for setup, use, and tips.
The code is written in Python, using the Tkinter API, which is distributed with Python itself. As such, it is easy to add new menus, windows, and functions to help in your own modeling and design. See GUIs/pyrosetta_toolkit/documentation for an overview of the code, and development details.
The PyRosetta Toolkit is developed by Jared Adolf-Bryfogle from the Dunbrack lab.
There are several ongoing problems in Rosetta that need to be fixed at a very deep level. This is effectively a collection of known problems to inform you that an error may result from underlying problems.
Making a PDB Rosetta-friendly
Rosetta is capable of loading in nearly any PDB; however, there is often information which a Pose object will not want. PyRosetta is set up by default to fail when loading a PDB containing atoms it does not know. You can add to the list of chemicals PyRosetta does know (see ligand_interface.py). Be careful applying this generally since you may load in cumbersome amounts of information you do not want. PyRosetta comes with an extensive list of additional Residue choices but many of these are "turned off" by default. You can add residue types to PyRosetta permanently by adding them to (or removing the comment character from) minirosetta_database/chemical/residue_type_sets/fa_standard/residue_types.txt (or a similar database file). Do NOT turn all of these "on" as it will likely cause memory problems.
Aside from identity, many protocols have difficulty with an input protein taken directly from the PDB. Many problems, such as missing atoms, are handled gracefully by default but others are not. In general, a Pose must be "adapted" for applications involving scoring (practically all). Unfortunately, there is no agreed-upon method for making a PDB "Rosetta-friendly". Common fixes involve high-resolution refinement techniques (see the sample_refinement sample script) utilizing mostly backbone score minimization and sidechain packing. For real applications, please consult the RosettaCommons forums and documentation. Usually three or more rounds of backbone minimization and sidechain packing, including the input sidechains, are sufficient to eliminate clashes without significantly moving the protein backbone or destroying relevant sidechain conformations.
Since Rosetta is not robust to varying PDB files, you may notice additional discrepancies including automatic manipulation of sidechain conformation or default scoring (a new Pose would have an empty Energies object, in fixing some problems this object is updated).
This problem is not exclusive to Rosetta; it arises from the fact that Rosetta models a complex problem and PDB files are not completely standardized. Many small variations can accumulate into a large difference in Rosetta score between the crystal structure and the Rosetta prediction. PDB files can have different atom identifiers and you might want to check that your PDB has been interpreted properly by Rosetta (especially if the PDB has been modified by CHARMM).
Recent work has made PyRosetta more robust to user input; however, many problems can still cause hard-to-catch bugs or a Segmentation Fault which crashes the IPython interpreter. Try to avoid feeding Rosetta objects incorrect values or data structures, as this can cause problems. Likewise, do not get or set Rosetta object data if it is empty. These problems most commonly occur when extracting information from a Pose's Energies and PDBInfo objects. PDBInfo only contains useful information if the pose was filled using pose_from_pdb. Energies is not updated until the pose is successfully scored.
Unfortunately, simple features, such as deleting residues, inserting residues, and directly changing a residue rotamer are currently unsupported Pose features. There is a tool outlined in the ala_scan.py sample script allowing residues to easily mutate. Hopefully these features will be supported in the next PyRosetta release.
Occasionally, you will want to create a copy of data. Python (fortunately) hides some difficult details and automatically creates a lightweight "pointer" object when the assignment operator ("=") is used on an existing object. This (unfortunately) means that creating a copy of data can be tricky. Since we usually want reference copies or "damageable" copies of our data, many Rosetta objects have "assign" or "clone" methods for replicating data. For example, if you try to make a centroid copy of a Pose using:
both pose1 and test_pose would change to centroid since test_pose is a pointer to pose1's data and the change will be applied on pose1. Proper syntax to create two separate poses for this purpose would be:
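The pointer-versus-copy behavior is ordinary Python and can be reproduced without PyRosetta. ToyPose below is a hypothetical stand-in whose assign and clone methods are assumed to behave like the Rosetta methods of the same names; the "mode" attribute is invented for the demonstration.

```python
import copy


class ToyPose:
    """Hypothetical stand-in for Pose; not a PyRosetta class."""

    def __init__(self, mode="fullatom"):
        self.mode = mode

    def assign(self, other):
        # Overwrite this object's data with a copy of the other's data.
        self.mode = copy.deepcopy(other.mode)

    def clone(self):
        # Return a new, independent object holding the same data.
        return copy.deepcopy(self)


pose1 = ToyPose()

# "=" only creates a second name for the same object.
alias = pose1
alias.mode = "centroid"
shared_change = pose1.mode      # pose1 changed too

# assign() copies the data into a separate object.
pose2 = ToyPose()
test_pose = ToyPose()
test_pose.assign(pose2)
test_pose.mode = "centroid"
independent = pose2.mode        # pose2 is untouched

print(shared_change, independent)
```

The "is" operator (alias is pose1) is a quick way to check at the interpreter whether two names refer to the same object.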
Several minimization techniques are available in PyRosetta but all techniques are gradient dependent. As such, score outliers, such as clashes, can severely disrupt minimization steps. When this occurs, the high clash score can be eliminated by many conformational changes, including several which may move the protein away from its global minimum structure. This problem can produce confusing results, especially in docking applications where inter-chain distance may be a degree of freedom. If you suspect minimization problems, check for clashes (high fa_rep or vdw scores, try investigating individual residues with pose.energies().residue_total_energies( resnum )[ <some_energy_term> ]) or eliminate the minimization step and investigate the results.
The traditional sidechain packing used in PyRosetta (such as PackRotamersMover) is NOT a deterministic algorithm. To reduce computation time, each residue is individually optimized and this process occurs in a "random" residue order. The lowest scoring conformation among all rotamers considered for a single residue is selected. As such, this algorithm does NOT optimize the entire structure. This stochasticity rarely causes problems but can result in slightly different scores. Often, two or more successive rounds of packing are sufficient to yield the same rotamer selection. This is often unnecessary since the conformations will be very similar. By default, only rotamers are considered for packing and the original sidechain conformation will be lost. You may "save" a pose's sidechain conformations by using a ReturnSidechainsMover, setting up a resfile to include the original sidechain using the "USE_INPUT_SC" command (set by default in the header or on a per residue basis), or setting a PackerTask to consider them using the .or_include_current(True) method.
Hydrogen Bonds and Hydrogen Bond Scoring
While per-residue score terms can normally be accessed with pose.energies().residue_total_energies( resnum ) or through the EMapVector as pose.energies().residue_total_energies( resnum )[ <some_energy_term> ] , the hydrogen bond terms will not access their correct value unless the ScoreFunction has the proper EnergyMethodOptions set. To make sure your ScoreFunction has this do:
from rosetta.core.scoring.methods import EnergyMethodOptions
emo = EnergyMethodOptions()
emo.hbond_options().decompose_bb_hb_into_pair_energies( True )
scorefxn.set_energy_method_options( emo )
Note: this setting is only required when extracting per-residue ScoreTerm values. Even without these options, the total energy of the pose will be calculated properly.
Additional syntax for obtaining per-hydrogen bond information is included in pose_scoring.py. Briefly, this involves constructing an HBondSet object (found in rosetta.core.scoring.hbonds.HBondSet), which is a list of HBond objects, each with score information (HBond.energy).
hbond_set = HBondSet(pose, True)
Pass False to only calculate bb-bb Hbonds.
Older versions of PyRosetta may require manually extracting the hydrogen bonds:
hbond_set = HBondSet()
fill_hbond_set( pose, False, hbond_set )
The PDBInfo object contains a Remarks object for storing PDB remarks. It does not work in PyRosetta. As mentioned above, many aspects of PDB manipulation are not currently supported by PyRosetta. There are many Python based PDB manipulation tools, such as Biopython, so please use the tools you are most familiar with and combine them with PyRosetta.