Wregex: Documentation

Overview
Browser compatibility
Examples
Basic Search
Search with a Custom Motif
Building a Custom PSSM
Running Wregex on your Local PC

Overview

Wregex (weighted regular expression) searches for short linear motif (SLiM) candidates in a target dataset of protein amino acid sequences by combining a regular expression with an optional Position-Specific Scoring Matrix (PSSM). The regular expression is used to obtain a list of SLiM candidates matching the desired conditions, while the PSSM is used to compute a score that allows the matches to be ranked so that the most promising candidates can be selected.

Wregex supports all motifs in ELM as regular expressions, as well as a number of custom motifs which include a PSSM. If your motif of interest is not included in the dropdown list of predefined motifs, or if a PSSM is not available for it, you can build your own custom motif following the steps detailed later on. In that case you can send us your motif definition and we will be happy to include it in the dropdown list of predefined motifs used in the basic search.

Browser compatibility

OS	Version	Firefox	Google Chrome	Microsoft Edge	Safari
Linux	Ubuntu 20.04 LTS	105.0	106.0.5249.103	n/a	n/a
macOS	Monterey 12.6	105.0.3	106.0.5249.119	n/a	16.0
Windows	Server 2016	104.0.1	107.0.5304.88	107.0.1418.26	n/a

Examples

To help novice users become familiar with the search form and the types of results obtained, the following presets are provided as example case studies:

Case	Description
Example 1: NES/CRM1 candidates in human proteins of Cajal bodies	This preset illustrates a basic search using the NES/CRM1 motif for searching candidates in a subproteome. We target reviewed human proteins in UniProt that are annotated with the term GO:0015030 (Cajal body). A score threshold of 50 is used to focus on the best candidates.
Example 2: NES/CRM1 candidates with auxiliary NLS motif in Cargo A proteins (Kirli et al., 2015)	This case shows an example of using an auxiliary motif. We use NES/CRM1 as the main motif but we are also interested in the presence of a predicted NLS in the same protein. We target a positive dataset (Cargo A) to corroborate that the presence of an auxiliary NLS motif is significantly higher than in a negative dataset (Non-binder).
Example 3: Combining PLK4 motif matches with PhosphoSitePlus phosphorylations in the human centrosome	This is an example of using PTM information. We use the PLK4 motif to find potential phosphorylation sites and also check if there are already known phosphorylations at those positions. Since this phosphorylation is spatially restricted to structures such as the centrosome, we target a subset of the HPA centrosome proteins that we will use for training a custom PSSM using only phosphorylated matches.

Basic Search

The basic search consists of searching a target protein dataset for one or several predefined short linear motifs (SLiMs):

First of all, the main motif of interest must be selected from the dropdown list. This list is divided into a first part containing entries specific to Wregex (for which both a regular expression and a PSSM are available), and a second part consisting of ELM entries (without a PSSM). Once a motif is selected, a second dropdown list is shown to the right, allowing the user to select a specific motif configuration (different regular expression and/or PSSM) in case several are available. Optionally, this process can be repeated for a second auxiliary motif.
Next, a target protein dataset must be specified. This can be done in multiple ways: by selecting a predefined dataset available in the target dropdown list provided by Wregex, by providing a Gene Ontology (GO) term, by entering a list of protein accessions or gene names, or by uploading a custom fasta file.
After completing the two steps above, a Search! button is finally shown. Clicking this button starts a new search, and a list of candidates is shown when it finishes (if there are any matches).
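
As a rough illustration of what these steps amount to, the following Python sketch reads a target dataset in fasta format and reports the regular expression matches found in each protein. It is not the Wregex implementation: the `read_fasta` helper, the two `demo` entries and the stand-in pattern are assumptions made up for this example, and no PSSM score is computed.

```python
import io
import re

# Stand-in motif pattern, used here only for illustration.
PATTERN = re.compile(r"[LIMA].{2,3}[LIVMF]")

def read_fasta(handle):
    """Yield (header, sequence) pairs from a fasta-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical target dataset with two made-up entries.
FASTA_TEXT = """>demo1 hypothetical protein
GSLKDL
APQ
>demo2 hypothetical protein
GGGPPQ
"""

results = {}
for header, seq in read_fasta(io.StringIO(FASTA_TEXT)):
    accession = header.split()[0]
    # Positions are reported 1-based and inclusive.
    results[accession] = [
        (m.start() + 1, m.end(), m.group()) for m in PATTERN.finditer(seq)
    ]

print(results)
```

Note that `finditer` only returns non-overlapping matches, so this sketch may report fewer candidates than a search that considers every combination.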

Additionally, external database Connections can be enabled for conducting more specific searches:

COSMIC: for each candidate motif found, searches COSMIC for missense mutations and ranks the results according to the impact and recurrence of each mutation.
PhosphoSitePlus/dbPTM: allows selecting the desired PTM types and, for each candidate motif found, searches for the selected PTMs and ranks the results according to the number of PTMs present.

The default Wregex behaviour can be customized by changing the Options section:

The Grouping checkbox configures whether overlapping candidates are reported as a single entry or separated into multiple entries. An overlap occurs when there are several combinations that match the regular expression in the same region of the protein.
The Filter similar checkbox is useful when the user is interested only in the motif sequence. If enabled, only the first result for the same matched sequence is shown.
The Filter without connections checkbox keeps only the results with values for at least one of the external connections selected.
The score Threshold spinner keeps only main motif matches with a PSSM score equal to or larger than the value specified.
The number of Flanking amino acids spinner shows additional amino acids at both sides of each match.
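
The effect of the Grouping option can be pictured with the following Python sketch. It enumerates every slice of a sequence that fully matches a pattern (so that a spacer of 2 and a spacer of 3 starting at the same position both count as combinations) and then merges overlapping combinations into a single entry. The sequence, the `max_len` bound and both helper functions are assumptions for illustration; the way Wregex itself enumerates and groups combinations may differ.

```python
import re

PATTERN = re.compile(r"[LIMA].{2,3}[LIVMF]")

def all_combinations(pattern, seq, max_len):
    """Return every (start, end) slice (0-based, end exclusive) that fully matches."""
    hits = []
    for start in range(len(seq)):
        for end in range(start + 1, min(len(seq), start + max_len) + 1):
            if pattern.fullmatch(seq, start, end):
                hits.append((start, end))
    return hits

def group_overlapping(hits):
    """Merge overlapping hits into [start, end, n_combinations] entries."""
    groups = []
    for start, end in sorted(hits):
        if groups and start < groups[-1][1]:
            # Overlaps the current entry: extend it and count the combination.
            groups[-1][1] = max(groups[-1][1], end)
            groups[-1][2] += 1
        else:
            groups.append([start, end, 1])
    return groups

# Hypothetical sequence, made up for this example.
sequence = "AKLDELVGGGGMPPF"
hits = all_combinations(PATTERN, sequence, max_len=5)
groups = group_overlapping(hits)

print(hits)    # separate entries (grouping disabled)
print(groups)  # merged entries with the number of combinations (grouping enabled)
```

Here `LDEL` and `LDELV` start at the same residue and are merged into one entry with two combinations, while `MPPF` stays as an entry of its own.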

Finally, the results are displayed as follows:

First of all, the total number of results obtained is shown, followed by three buttons that allow downloading these results to a local file on the user's computer in the following formats, respectively:
- Comma Separated Values (CSV): useful for further analysis using any spreadsheet or statistical analysis software.
- Clustal ALN file: can be used to display the alignments of the results using the capturing groups indicated between parentheses () in the regular expression.
- Fasta: amino acid sequences that can also be used for training a PSSM as explained later on.
These results are also displayed in a table with the following columns (among others):
- Protein accession/entry from the fasta header.
- Starting and ending positions of the candidate motif within the protein sequence.
- Candidate motif sequence with the regular expression groups indicated between dashes (-).
- Number of combinations grouped in the same entry (overlapping positions).
- Wregex score (ranging from 0 to 100).
- If there are motif annotations with scores in the input fasta sequences, their score is also displayed as the assay score.
- Percentage of motif sequence within a disordered region annotated in UniProt.
- Number of features within the motif region annotated in UniProt. Hovering the mouse over this number shows the detailed results.
- If enabled, auxiliary motif score.
- If enabled, COSMIC mutated sequence, the mutation impact and the mutation recurrence.
- If enabled, PSP/dbPTM modifications within the candidate motif.
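
To give an idea of how capturing groups can drive an alignment of matches with different lengths, the following Python sketch pads each group to the width of its longest instance. The `align_by_groups` helper, the gap character and the two example matches are assumptions for illustration only; the exact layout of the ALN file produced by Wregex is not reproduced here.

```python
def align_by_groups(group_tuples, gap="-"):
    """Pad every capturing group to the width of its longest instance."""
    n_groups = len(group_tuples[0])
    widths = [max(len(groups[i]) for groups in group_tuples) for i in range(n_groups)]
    return [
        "".join(text.ljust(width, gap) for text, width in zip(groups, widths))
        for groups in group_tuples
    ]

# Two hypothetical matches of ([LIMA])(.{2,3})([LIVMF]), split by capturing group.
matches = [("L", "KD", "L"), ("L", "DEL", "V")]
rows = align_by_groups(matches)

for row in rows:
    print(row)
```

With this padding the first and last hydrophobic positions line up in the same columns even though the spacer lengths differ.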

Search with a Custom Motif

When a pattern is not available for the motif of interest, or when the user is interested in exploring a modification of an existing pattern, Wregex offers the possibility to define a custom pattern. To do so, select Custom in the Motif dropdown list; an input text box will then be shown to type (or paste and edit) the custom pattern. This amino acid pattern must be defined using the syntax of the Java Pattern class for regular expressions. The Regulex visualizer may help novice users with defining custom regular expressions. For instance, [LIMA].{2,3}[LIVMF] means any of leucine (L), isoleucine (I), methionine (M) or alanine (A), followed by 2 or 3 ({2,3}) occurrences of any (.) amino acid, and then followed by any of leucine (L), isoleucine (I), valine (V), methionine (M) or phenylalanine (F).
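
A custom pattern can be tried out on a short sequence before it is used in a search. The sketch below uses Python's `re` module purely for convenience; for this particular pattern its syntax coincides with that of the Java Pattern class. The test sequence and the `find_candidates` helper are made up for this example.

```python
import re

# Pattern from the text: [LIMA], then 2 or 3 of any residue, then [LIVMF].
PATTERN = re.compile(r"[LIMA].{2,3}[LIVMF]")

def find_candidates(pattern, seq):
    """Return (start, end, matched_text) using 1-based inclusive positions."""
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(seq)]

# Hypothetical sequence, for illustration only.
sequence = "GSLKDLAPQ"
candidates = find_candidates(PATTERN, sequence)
print(candidates)
```

The leucine at position 3 opens the match, `KD` fills the 2-residue spacer and the leucine at position 6 closes it; the alanine at position 7 cannot start a match because too few residues follow it.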

While not mandatory, Wregex exploits the use of regular expression capturing groups for marking regions of the motif. For instance, in the pattern defined above consisting of two hydrophobic amino acids separated by a short variable-length region, each key hydrophobic amino acid position can be marked as a capturing group of a single amino acid, and the spacing region as a capturing group of 2 or 3 amino acids. This will be used by Wregex to align matches across the capturing groups and to define the Position-Specific Scoring Matrix (PSSM) positions as explained later on. Capturing groups are marked between parentheses, so the updated regular expression using capturing groups would be ([LIMA])(.{2,3})([LIVMF]). If capturing groups have been defined and a PSSM file has previously been built, this PSSM file can optionally be used by clicking the Select PSSM button, and scores will then be reported for each of the matches.
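
The following Python sketch shows what the capturing groups of this pattern capture for a single match. The sequence is made up, and joining the groups with dashes is only our approximation of how the matched sequence is displayed in the results table.

```python
import re

# Same pattern as above, with one capturing group per motif region.
GROUPED = re.compile(r"([LIMA])(.{2,3})([LIVMF])")

# Hypothetical sequence, for illustration only.
sequence = "GSLKDLAPQ"

match = GROUPED.search(sequence)
groups = match.groups()      # one string per capturing group
dashed = "-".join(groups)    # groups separated by dashes

print(groups)
print(dashed)
```

The first and third groups always hold a single hydrophobic residue, while the second group holds the 2 or 3 spacer residues.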

Finally, a target protein dataset can be selected as previously described and after clicking the Search! button the results obtained with this custom motif will be available.

Building a Custom PSSM

A motif pattern defined as a regular expression alone can be used to report matches in the target protein sequences, but it does not provide any kind of score to rank those matches and select the most promising ones. Wregex allows this regular expression to be optionally complemented with a Position-Specific Scoring Matrix (PSSM), so that any match of the regular expression in the target protein sequences is also assigned a score. Since motif matches usually have a variable amino acid length, Wregex exploits the use of regular expression capturing groups to define the positions that will be used in the PSSM instead of inserting "gaps". Using the previous example, the regular expression ([LIMA])(.{2,3})([LIVMF]) has three capturing groups (regions between parentheses), so it also defines three PSSM positions: the first one consists of the first hydrophobic amino acid, the second one consists of the 2-3 amino acid separation, and the third one consists of the final hydrophobic amino acid. Then, a custom PSSM indicating a weight for every amino acid in each of the three positions can be provided for computing a match score.
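
The idea of one PSSM position per capturing group can be sketched in Python as follows. This is not the scoring formula used by Wregex: the weights are invented, residues without a weight default to zero, each group contributes the average weight of its residues, and the result is not normalised to the 0-100 range of the Wregex score. All of these are assumptions made for the example.

```python
import re

GROUPED = re.compile(r"([LIMA])(.{2,3})([LIVMF])")

# Hypothetical weights: one dictionary per capturing group (PSSM position).
PSSM = [
    {"L": 3.0, "I": 2.0, "M": 1.0, "A": 0.5},
    {"D": 1.0, "E": 1.0, "K": 0.5},
    {"L": 3.0, "I": 2.0, "V": 2.0, "M": 1.0, "F": 1.0},
]
DEFAULT_WEIGHT = 0.0  # assumed weight for residues missing from a position

def score_match(match, pssm):
    """Sum, over the capturing groups, the average weight of each group's residues."""
    total = 0.0
    for group_text, weights in zip(match.groups(), pssm):
        values = [weights.get(aa, DEFAULT_WEIGHT) for aa in group_text]
        total += sum(values) / len(values)
    return total

# Hypothetical sequence, for illustration only.
match = GROUPED.search("GSLKDLAPQ")
score = score_match(match, PSSM)
print(match.group(), score)
```

Averaging within a group keeps a 3-residue spacer from scoring higher than a 2-residue one merely because it is longer, which is one possible way of handling variable-length positions without gaps.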

A training web page is provided so that the user can build a custom PSSM. This requires the following steps:

Enter into the input text box a regular expression with capturing groups to indicate the PSSM positions (see previous section).
Provide an input fasta file with the training sequences by clicking the Input motifs button. An intermediate page will be shown, as explained below.
A table with the Training motifs used to build the PSSM is shown at the end of the page. Any motif match can be excluded from the training process by clicking the recycle bin icon in the Action column of this table. If the removed match overlapped with other match(es), the weight (column Weight') of the remaining match(es) is updated so that the total weight of the motif (column Weight) is shared among the updated number of matches for that motif (column "!").
Download the PSSM by clicking any of the following buttons:
- The Download Coarse PSSM button will not use decimals in the scores, thus rounding the results to coarse weights, which is suitable when the number of training sequences is not very large.
- The Download Fine PSSM button will use one decimal of precision, thus providing finer-grained weights, which can be suitable if the number of training sequences is large enough. If in doubt, both PSSMs can be downloaded and tested against a validation dataset not used for training the PSSM.
This PSSM file can then be selected together with the custom regular expression to search for new motifs as previously explained in the Search with a Custom Motif section.

As previously stated, when the Input motifs button is clicked an intermediate page will be shown to specify the training motifs in the following manner:

First of all, an input fasta file with the training sequences must be uploaded by clicking the Select fasta button. This fasta file does not have any special requirement, although internal annotations will be recognized if the file was generated previously from this training page. By clicking the following button you can download the example fasta file used for this documentation:
After selecting the fasta file, a table with the Input motifs will be displayed. From this table the user can optionally assign a weight to each of the matches, and even re-define the position of the match if it is a part of a larger motif. If this table is modified, the Refresh button must be pressed to show the updated results. Finally, the Matched column displays the number of matches of the custom regex within the motif position range defined. This number will be used to divide the motif weight between the different matches when building the PSSM.
The resulting training motif list can optionally be downloaded as a fasta file by clicking the Download motifs button. This file will have internal annotations of the motif positions and weights so that they can be recovered in the future if it is uploaded by clicking the Select fasta button for refining or continuing the training process.
Once the input motif list has been defined, clicking the Continue with PSSM training button will redirect to the previous training page to download the PSSM file as previously explained.
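
The way a motif weight is divided between its matches can be sketched in Python as follows. Only the division of the weight by the number of matches is taken from the description above; the accumulation of weighted residue counts per capturing group, the helper names, the `max_len` bound and the training motifs are assumptions for illustration and do not reproduce the actual Wregex training procedure.

```python
import re
from collections import defaultdict

GROUPED = re.compile(r"([LIMA])(.{2,3})([LIVMF])")

def matches_in(pattern, seq, max_len):
    """Return every full match of pattern on a slice of seq, overlaps included."""
    found = []
    for start in range(len(seq)):
        for end in range(start + 1, min(len(seq), start + max_len) + 1):
            m = pattern.fullmatch(seq, start, end)
            if m:
                found.append(m)
    return found

def weighted_counts(pattern, motifs, max_len):
    """Accumulate residue counts per capturing group, sharing each motif weight."""
    counts = [defaultdict(float) for _ in range(pattern.groups)]
    for seq, weight in motifs:
        found = matches_in(pattern, seq, max_len)
        for m in found:
            share = weight / len(found)  # motif weight divided between its matches
            for position, text in enumerate(m.groups()):
                for aa in text:
                    counts[position][aa] += share
    return counts

# Hypothetical training motifs as (sequence, weight) pairs.
motifs = [("LDELV", 1.0), ("MPPF", 1.0)]
counts = weighted_counts(GROUPED, motifs, max_len=5)

for position, table in enumerate(counts, start=1):
    print(position, dict(table))
```

In this example `LDELV` contains two overlapping matches (`LDEL` and `LDELV`), so each receives half of the motif weight, whereas the single match in `MPPF` receives the full weight.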

If you define a custom motif (custom regex and/or custom PSSM) that can be of interest to others, please contact us and we will be happy to include it in the Wregex dropdown list of motifs to facilitate its use through the basic search form.

Running Wregex on your Local PC

Wregex execution at https://ehubio.ehu.eus/wregex/ is subject to some constraints, namely a limit on the size of the input target protein dataset and a limit on the execution time, in order to avoid denial of service to other users. These constraints can be avoided by running Wregex locally on your computer and changing the default limits.

To run Wregex on your local computer, you need a Java EE 8 (or later) application server. Then you can download the *.war binary distributable and copy it into your deployments directory. Both the binary file and the Java code can be accessed from the Downloads page.

You can change the default execution limits by editing the WEB-INF/web.xml file. Please take into account that Wregex uses a predefined set of databases which must be present in your system. You can modify (add/delete) these databases by editing the resources/data/databases.xml file. If you need help with this process please contact us and we will be happy to assist you.
