Wregex (weighted regular expression) searches for short linear motif (SLiM) candidates in a target dataset of protein sequences by combining a regular expression with an optional Position-Specific Scoring Matrix (PSSM). The regular expression produces the list of SLiM candidates that match the desired conditions, while the PSSM assigns each match a score so that the matches can be ranked and the most promising candidates selected.
Wregex supports all motifs in ELM as regular expressions, as well as a number of custom motifs that include a PSSM. If your motif of interest is not included in the dropdown list of predefined motifs, or if no PSSM is available for it, you can build your own custom motif by following the steps detailed later on. You can then send us your motif definition and we will be happy to include it in the dropdown list of predefined motifs used in the basic search.
To help novice users become familiar with the search form and the types of results obtained, the following presets are provided as example case studies:
The basic search consists of searching a target protein dataset for one or several predefined short linear motifs (SLiMs):
Additionally, external database Connections can be enabled for conducting more specific searches:
The default Wregex behaviour can be customized by changing the Options section:
Finally, the results are displayed as follows:
When a pattern is not available for the motif of interest, or when the user wants to explore a modification of an existing pattern, Wregex offers the possibility of defining a custom pattern. To do so, select Custom in the Motif dropdown list; an input text box will then be shown in which to type (or paste and edit) the custom pattern. This amino acid pattern must be defined using the syntax of the Java Pattern class for regular expressions. The Regulex visualizer may help novice users with defining custom regular expressions. For instance, [LIMA].{2,3}[LIVMF] means any of leucine (L), isoleucine (I), methionine (M) or alanine (A), followed by 2 or 3 ({2,3}) occurrences of any (.) amino acid, and then followed by any of leucine (L), isoleucine (I), valine (V), methionine (M) or phenylalanine (F).
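As a minimal sketch of how the example pattern behaves, the snippet below scans a made-up sequence. It uses Python's `re` module purely for illustration; Wregex itself relies on the Java Pattern class, although the character classes, `.` and the `{2,3}` quantifier used here behave the same way in both. The sequence and the `find_candidates` helper are hypothetical and are not part of Wregex.

```python
import re

# Pattern from the text: one of L/I/M/A, then 2 or 3 residues of any kind,
# then one of L/I/V/M/F.
PATTERN = re.compile(r"[LIMA].{2,3}[LIVMF]")

def find_candidates(sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 1-based and inclusive.
    """
    return [(m.start() + 1, m.end(), m.group()) for m in PATTERN.finditer(sequence)]

# Made-up sequence, for illustration only.
for start, end, text in find_candidates("GSPLKDEVTRRAQQQMPP"):
    print(start, end, text)
```

Note that `finditer` reports only non-overlapping matches and prefers the longer spacer when both lengths fit; how Wregex treats overlapping candidates is not assumed here.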
While not mandatory, Wregex uses regular expression capturing groups to mark regions of the motif. For instance, in the pattern defined above, which consists of two hydrophobic amino acids separated by a short variable-length region, each key hydrophobic amino acid position can be marked as a capturing group of a single amino acid, and the spacing region as a capturing group of 2 or 3 amino acids. Wregex uses these groups to align matches across the capturing groups and to define the Position-Specific Scoring Matrix (PSSM) positions, as explained later on. Capturing groups are enclosed in parentheses, so the updated regular expression using capturing groups would be ([LIMA])(.{2,3})([LIVMF]). If capturing groups have been defined and a PSSM file has previously been built, this PSSM file can optionally be loaded by clicking the Select PSSM button, and scores will then be reported for each of the matches.
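To illustrate what the capturing groups provide, the following sketch splits each match of the grouped pattern into its three regions. It is written in Python for illustration only (Wregex uses the Java Pattern class), and the sequence and the `split_into_groups` helper are hypothetical.

```python
import re

# Same motif as before, with each region wrapped in a capturing group.
GROUPED = re.compile(r"([LIMA])(.{2,3})([LIVMF])")

def split_into_groups(sequence):
    """Return one tuple (first residue, spacer, last residue) per match."""
    return [m.groups() for m in GROUPED.finditer(sequence)]

# Made-up sequence containing one match with a 3-residue spacer
# and one match with a 2-residue spacer.
for groups in split_into_groups("GSPLKDEVTRRAQQMPP"):
    print(groups)
```

Although the two matches differ in length, the first and third groups always hold exactly one residue each, which is what allows matches to be aligned group by group instead of by inserting gaps.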
Finally, a target protein dataset can be selected as previously described and after clicking the Search! button the results obtained with this custom motif will be available.
A motif pattern defined as a regular expression alone can be used to report matches in the target protein sequences, but it does not provide any score with which to rank those matches and select the most promising ones. Wregex lets you optionally complement the regular expression with a Position-Specific Scoring Matrix (PSSM), so that every match of the regular expression in the target protein sequences is also assigned a score. Since motif matches usually vary in amino acid length, Wregex uses the regular expression capturing groups to define the positions of the PSSM instead of inserting "gaps". Using the previous example, the regular expression ([LIMA])(.{2,3})([LIVMF]) has three capturing groups (the regions in parentheses), so it also defines three PSSM positions: the first is the first hydrophobic amino acid, the second is the 2-3 amino acid spacer, and the third is the final hydrophobic amino acid. A custom PSSM that gives a weight for every amino acid in each of the three positions can then be provided for computing a match score.
A training web page is provided so that the user can build a custom PSSM. This requires only the following steps:
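The idea of one PSSM position per capturing group can be sketched as follows. This is not Wregex's actual scoring formula, which is not described here: the rule used (average the weights of the residues inside each group, then sum the per-group averages) is an assumption chosen only to make the example concrete, and every weight, name and sequence below is made up. Python is used for illustration.

```python
import re

GROUPED = re.compile(r"([LIMA])(.{2,3})([LIVMF])")

# Hypothetical weights: one dict per capturing group (PSSM position).
# Residues missing from a dict receive DEFAULT_WEIGHT.
TOY_PSSM = [
    {"L": 2.0, "I": 1.0, "M": 0.5, "A": 0.0},  # first hydrophobic residue
    {"D": 2.0, "E": 2.0, "K": -1.0},           # 2-3 residue spacer
    {"L": 2.0, "V": 1.0, "F": 1.0},            # final hydrophobic residue
]
DEFAULT_WEIGHT = 0.0

def score_match(groups, pssm=TOY_PSSM):
    """ASSUMED rule, not Wregex's: average the residue weights within each
    group, then sum the per-group averages."""
    total = 0.0
    for region, weights in zip(groups, pssm):
        total += sum(weights.get(aa, DEFAULT_WEIGHT) for aa in region) / len(region)
    return total

def ranked_matches(sequence):
    """Return (score, matched_text) pairs, highest score first."""
    scored = [(score_match(m.groups()), m.group()) for m in GROUPED.finditer(sequence)]
    return sorted(scored, reverse=True)

# Made-up sequence, for illustration only.
for score, text in ranked_matches("GSPLKDEVTRRAQQMPP"):
    print(text, score)
```

Averaging within a group is one simple way to keep spacers of different lengths comparable; other choices, such as summing the weights, would rank the same matches differently.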
As previously stated, when the Input motifs button is clicked an intermediate page will be shown to specify the training motifs in the following manner:
If you define a custom motif (custom regex and/or custom PSSM) that can be of interest to others, please contact us and we will be happy to include it in the Wregex dropdown list of motifs to facilitate its use through the basic search form.
The Wregex instance running at https://ehubio.ehu.eus/wregex/ applies some constraints in order to avoid denial of service to other users, namely a limit on the size of the input target protein dataset and a limit on execution time. These constraints can be avoided by running Wregex locally on your computer and changing the default limits.
To run Wregex on your local computer, you need a Java EE 8 (or later) application server. You can then download the *.war binary distributable and copy it into the server's deployments directory. Both the binary file and the Java source code are available from the Downloads page.
You can change the default execution limits by editing the WEB-INF/web.xml file. Please take into account that Wregex uses a predefined set of databases which must be present in your system. You can modify (add/delete) these databases by editing the resources/data/databases.xml file. If you need help with this process please contact us and we will be happy to assist you.