The WMT shared task on sign language translation follows the procedures established by the shared task on machine translation that has been running since 2006. Here we outline the participation process to make it clearer for new participants.
Please note: Participation is about exchanging ideas, not about winning a competition! We encourage broad participation: if you feel that your method is interesting but not state-of-the-art, please participate anyway in order to disseminate it and measure progress. The community may build on your idea and cite your paper, or learn something useful from negative results.
Step 1: System training
Participants download the training data and, optionally, the software of the baseline system.
The training data can be fed into the baseline system we provide in order to train a new system. The baseline is a very basic system with a simple implementation of sequence-to-sequence methods for this task. Participants may modify the baseline system to improve its performance, or use software they have written themselves.
As per WMT standards, participants may use external data from other sources to improve the quality of their systems. In that case, however, the systems are referred to as unconstrained and are marked as such in the final evaluation.
Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition, as are pretrained language models released before February 2022. General tools for video processing such as pose extraction or image feature extraction, and models such as CLIP, are also allowed for a constrained submission.
During the system training step, the participants do not have access to the test data.
Step 2: Processing of the test data
The shared task organizers make the test sets available. Each test set contains only the source side of its translation direction.
The participants use their systems (trained in the previous step) to translate the newly given test set.
The test set should be used only for decoding, i.e. for producing translations with an already trained system. It is meant to be “blind” for the training process: it must not be used as part of the training data, or for other steps such as tuning or data augmentation.
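One way to guard against accidentally mixing test data into training is a simple overlap check on textual sources (for example, the German side). The sketch below is only an illustration: the function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions, not part of the shared task tooling, and it does not cover video sources.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Hash the normalized sentence so large corpora can be compared cheaply."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_sources, test_sources):
    """Return the test sentences whose normalized form also occurs in training."""
    train_prints = {fingerprint(s) for s in train_sources}
    return [s for s in test_sources if fingerprint(s) in train_prints]


# Hypothetical example data
train = ["Guten  Tag", "Wie geht es Ihnen?"]
test = ["guten tag", "Bis morgen."]
leaked = find_overlap(train, test)
```

A non-empty result would suggest that test material has found its way into the training data and should be removed before training.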
Step 3: Submission of the system outputs
Participants submit their system outputs to the WMT submission platform, Ocelot. By submitting system outputs to Ocelot, participants agree that these outputs will be made publicly available later.
Participants may upload up to seven submissions in total, and must mark one of them as the primary submission. Since resources for human evaluation are limited, the organizers will give priority to primary submissions.
When making a submission, participants must declare whether the system is constrained or unconstrained.
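The submission rules above (at most seven submissions, exactly one primary, a constrained or unconstrained label for each) can be checked locally before uploading. The following is a minimal sketch under stated assumptions: the `Submission` class and `validate_submissions` function are hypothetical bookkeeping helpers and have nothing to do with the Ocelot interface.

```python
from dataclasses import dataclass

MAX_SUBMISSIONS = 7  # limit stated in the participation rules


@dataclass
class Submission:
    name: str
    constrained: bool      # False if external data was used
    primary: bool = False  # exactly one submission should set this


def validate_submissions(submissions):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    if len(submissions) > MAX_SUBMISSIONS:
        problems.append(
            f"{len(submissions)} submissions exceed the limit of {MAX_SUBMISSIONS}"
        )
    n_primary = sum(1 for s in submissions if s.primary)
    if n_primary != 1:
        problems.append(f"expected exactly one primary submission, found {n_primary}")
    return problems


# Hypothetical example plan
plan = [
    Submission("baseline-finetuned", constrained=True, primary=True),
    Submission("with-external-data", constrained=False),
]
issues = validate_submissions(plan)
```

Here `issues` is empty because the plan has two submissions and exactly one of them is primary.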
For the DSGS-to-German translation direction, Ocelot produces scores based on automatic metrics. (The platform does not display any automatic scores for the German-to-DSGS direction.) Importantly, while the ranking of systems on Ocelot is based on these automatic scores, the final ranking of the shared task is based on human evaluation instead.
One week after the system submission deadline, each participant has to submit a one-page abstract of the system description. It may be a full system description paper or only a draft that can later be modified for the final (4-page) system description paper (step 5).
Furthermore, submissions should use the WMT XML format. We provide the test sources in this XML format; see the data tab. The exact submission format we require depends on the translation direction:
For DSGS-to-German, system outputs are German text. These outputs must be combined with the XML file containing the test sources; see here for an example.
For German-to-DSGS, system outputs must be one mp4 video for each line of German input text. Participants are free to submit any content they deem suitable as a translation. Examples: avatar animations, videos featuring photo-realistic signers, or pose estimation videos.
Output videos should be wrapped in XML as well, but instead of text, the <segment> elements in the XML should contain public links to the videos.
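To illustrate how system outputs might be combined with the test source XML, here is a rough sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`doc`, `src`, `hyp`, `p`, `seg`, `system`, `language`), the language label `dsgs`, and the example URLs are all assumptions made for illustration; the test source files and the example submission provided by the organizers define the actual schema. The same function works for either direction, since each output is simply a string: German text for DSGS-to-German, or a public video link for German-to-DSGS.

```python
import xml.etree.ElementTree as ET


def add_hypotheses(source_xml, outputs, system, lang, seg_tag="seg"):
    """Attach one output string per source segment and return the new XML.

    Assumes every <doc> has a <src> child whose segments use `seg_tag`.
    """
    root = ET.fromstring(source_xml)
    docs = list(root.iter("doc"))
    n_src = sum(1 for d in docs for _ in d.find("src").iter(seg_tag))
    if n_src != len(outputs):
        raise ValueError(f"{n_src} source segments but {len(outputs)} outputs")

    remaining = iter(outputs)
    for doc in docs:
        hyp = ET.SubElement(doc, "hyp", {"system": system, "language": lang})
        paragraph = ET.SubElement(hyp, "p")
        for src_seg in doc.find("src").iter(seg_tag):
            # Reuse the source segment id so outputs stay aligned with inputs
            seg = ET.SubElement(paragraph, seg_tag, {"id": src_seg.get("id", "")})
            seg.text = next(remaining)
    return ET.tostring(root, encoding="unicode")


# Hypothetical test source and video links (German-to-DSGS direction)
SOURCE = """<dataset id="example">
  <doc id="doc1">
    <src lang="de"><p><seg id="1">Guten Tag.</seg><seg id="2">Danke.</seg></p></src>
  </doc>
</dataset>"""
LINKS = ["https://example.org/videos/1.mp4", "https://example.org/videos/2.mp4"]

wrapped = add_hypotheses(SOURCE, LINKS, system="my-system", lang="dsgs")
```

The length check is there because each line of input needs exactly one output; a mismatch raises an error instead of silently misaligning segments.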