Last year's data can be found here.
Licensing and attribution
The datasets can be used for non-commercial research. Please note that distributing the datasets or making them accessible to third parties is not permitted, either in their original or edited form.
In addition, we ask that you cite the shared task overview paper once it can be referenced (a BibTeX entry does not exist yet).
Overview of datasets
We provide separate training and test data. The training data is available right away. The test data will be released in two stages, starting with a release of the test sources only.
The training data comprises the Signsuisse lexicon and the SRF corpus; see below for a more detailed description. Signsuisse contains lexical items in Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), and Italian Sign Language of Switzerland (LIS-CH), represented as videos and glosses. SRF contains parallel data between Swiss German Sign Language and German in the general news domain. Both datasets are distributed through SwissUbase.
The test data will consist of 50% Signsuisse and 50% SRF examples. All signers in the test data are known signers. Our evaluation will not test generalization across signers. If participants want to test this, one way would be to test on last year’s FocusNews data.
Participants may find the LSF parallel corpus useful as an additional resource.
Accessing the data
Direct download links:
Training corpus 1: Signsuisse Lexicon
(This text currently describes the following release version: 2.0)
We collected 18,221 lexical items from the Signsuisse website; 17,221 of them are released as training data, while 1,000 are reserved for testing and therefore excluded from the training release. The lexicon covers three languages:
DSGS (9044 items, 500 reserved),
LSF-CH (6423 items, 250 reserved),
LIS-CH (2754 items, 250 reserved).
The lexical items are represented as videos and glosses, which enable sign-by-sign translation from spoken languages to signed languages, as illustrated by our baseline system (demo). For each lexical item, there is also one signed example sentence represented in a video as well as the corresponding spoken language translation (there are 16 exceptions where an example is missing). The example videos and sentences can be viewed as parallel data between signed and spoken languages.
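The lexicon-based, sign-by-sign approach can be sketched as a simple dictionary lookup. The entries below are invented toy data; the real Signsuisse lexicon maps spoken-language words to gloss videos in DSGS, LSF-CH, or LIS-CH, and the actual baseline is more involved:

```python
# Toy sketch of sign-by-sign lookup translation with a gloss lexicon.
# The entries below are invented; the real lexicon maps words to videos.
lexicon = {"haus": "HAUS", "rot": "ROT"}

def translate(words):
    # Look up each spoken-language word; words without an entry are skipped.
    return [lexicon[w] for w in words if w in lexicon]

print(translate(["das", "haus", "ist", "rot"]))  # prints ['HAUS', 'ROT']
```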
The videos were recorded with different framerates, either 24, 25, or 30 fps, and the video resolution is 640 x 480.
Training corpus 2: SRF
(This text currently describes the following release version: 1.0)
These are daily national news and weather forecast episodes broadcast by the Swiss National TV (Schweizerisches Radio und Fernsehen, SRF). The episodes are narrated in Standard German of Switzerland (different from Standard German of Germany, and different from Swiss German dialects) and interpreted into Swiss German Sign Language (DSGS). The interpreters are hearing individuals, some of them children of Deaf adults (CODAs).
The subtitles are partly preproduced, and partly created live via respeaking based on automatic speech recognition.
While both the subtitles and the signing are based on the original speech (audio), due to the live subtitling and live interpreting scenario, a temporal offset between audio and subtitles as well as audio and signing is inevitable. This is visualized in the figure below:
Unlike last year's edition of the shared task, the offset between the signing and the subtitles was not manually corrected for the current edition. In exchange, the training data is much larger than last year's.
The parallel data comprises 779 episodes of approximately 30 minutes each with the sign language videos (without audio track, only signers visible) and the corresponding subtitles.
We selected episodes from 2014 to 2021 interpreted by 4 known hearing interpreters who consented to having their likeness used for this shared task. We reserved one episode from each year for testing, leaving 771 episodes in the training data.
The videos have a framerate of 25 fps and a resolution of 1280 x 720.
Earlier releases of a similar dataset
A small subset of this data has been published as part of the Content4All project (EU Horizon 2020, grant agreement no. 762021).
Test data
(This text currently describes the following release version: 2.0. Currently, only the test sources are released.)
We distribute separate test data for our 4 translation directions.
French-to-LSF-CH / Italian-to-LIS-CH
We previously reserved 250 Signsuisse entries per translation direction for test purposes. Each entry contains one signed example sentence represented in a video, as well as the corresponding spoken language translation. This data serves as our test data for French-to-LSF-CH and Italian-to-LIS-CH translation.
German-to-DSGS / DSGS-to-German
This subset of the test data has two distinct parts:
Previously reserved Signsuisse entries as described above.
One additional, undisclosed SRF episode that is manually aligned by a deaf signer for each translation direction (in contrast to the original interpretation where signers are hearing interpreters).
For each data set described above we provide videos and corresponding subtitles. In addition, we include pose estimates (location of body keypoints in each frame) as a convenience for participants.
For SRF we are not distributing the original videos, but preprocessed versions that show only the part of each frame where the signer is located (cropping), with the background replaced by a monochrome color (signer masking):
Signer cropping
We identify a rectangle (bounding box) where the signer is located in each frame, then crop the video to this region.
Signer segmentation and masking
To the cropped video we apply an instance segmentation model, SOLOv2 (Wang et al., 2020), to separate the background from the signer. This produces a mask that can be superimposed on the cropped video to replace each background pixel in a frame with a grey color.
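The two preprocessing steps can be sketched on a single frame as follows. The bounding box and mask region here are hypothetical placeholders; the real pipeline derives them per frame from signer detection and SOLOv2 segmentation:

```python
import numpy as np

# A dummy 720p frame standing in for one video frame (HxWx3, uint8).
frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)

# 1) Crop the frame to the signer's bounding box (hypothetical coordinates).
x0, y0, x1, y1 = 400, 60, 900, 700
cropped = frame[y0:y1, x0:x1]

# 2) A boolean mask: True where the signer is, False for background
#    (here a hypothetical rectangle; really a SOLOv2 segmentation mask).
mask = np.zeros(cropped.shape[:2], dtype=bool)
mask[50:600, 100:400] = True

# 3) Replace every background pixel with a grey color.
GREY = np.array([128, 128, 128], dtype=np.uint8)
masked = np.where(mask[..., None], cropped, GREY)
```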
Subtitles are automatically segmented into sentences, and this process also adjusts timecodes in a heuristic manner where needed. For instance, if automatic sentence segmentation detects that a well-formed sentence stops in the middle of a subtitle, a new end time is computed, proportional to the position of the sentence's last character relative to the entire length of the subtitle. See Example 2 below for an illustration of this case.
Example 1, original subtitles:

00:05:22,607 --> 00:05:24,687
Die Jury war beeindruckt

00:05:24,687 --> 00:05:28,127
und begeistert von dieser gehörlosen Frau.

Example 1, after automatic segmentation:

00:05:22,607 --> 00:05:28,127
Die Jury war beeindruckt und begeistert von dieser gehörlosen Frau.

Example 2, original subtitles:

00:00:24,708 --> 00:00:27,268
Die Invalidenversicherung Region Bern startete

00:00:27,268 --> 00:00:29,860
dieses Pilotprojekt und will herausfinden, ob man es

00:00:29,860 --> 00:00:33,460
zukünftig umsetzen kann. Es geht um die Umsetzung

Example 2, after automatic segmentation:

00:00:24,708 --> 00:00:31,720
Die Invalidenversicherung Region Bern startete dieses Pilotprojekt und will herausfinden, ob man es zukünftig umsetzen kann.
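The proportional end-time heuristic can be sketched as follows. The function name and the millisecond arithmetic are our own illustration; the actual segmentation script may differ in detail:

```python
def adjusted_end_ms(sub_start_ms, sub_end_ms, subtitle_text, sentence_end_idx):
    """Estimate a new end time for a sentence that stops mid-subtitle.

    The new end time is proportional to the position of the sentence's
    last character within the subtitle text (hypothetical helper)."""
    duration = sub_end_ms - sub_start_ms
    fraction = sentence_end_idx / len(subtitle_text)
    return sub_start_ms + round(duration * fraction)

# A sentence ending halfway through a 4-second subtitle gets the midpoint:
print(adjusted_end_ms(0, 4000, "abcdefgh", 4))  # prints 2000
```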
"Poses" are an estimate of the location of body keypoints in video frames. The exact set of keypoints depends on the pose estimation system; well-known ones are OpenPose and MediaPipe Holistic. Usually such a system provides 2D or 3D coordinates of keypoints in each frame, plus a confidence value for each keypoint.
The input for pose processing is the cropped and masked videos (see above). As with any machine learning system, pose prediction does not have perfect accuracy; it is expected to fail in some instances.
Binary .pose format
Unlike last year’s JSON format data, we use a binary .pose format to deliver this year’s pose data, for more compact storage and faster I/O. To read the .pose format into a NumPy array, use the pose_format library:
from pose_format import Pose

path = '<path_to_the_binary_file>.pose'
pose = Pose.read(open(path, "rb").read())
pose_data = pose.body.data  # frames, people, points, dimensions
pose_confidence = pose.body.confidence
This library is a complete toolkit for working with poses, including some normalization, augmentation, and visualization utilities. See this README for more examples.
To convert last year’s JSON format to .pose, use this script.
We used the OpenPose 137-keypoint model (the default) for the Signsuisse data and the OpenPose 135-keypoint model for the SRF data. Both models are widely used; the 137 model has two additional keypoints because it duplicates the wrists. Each pose file we release starts with a header describing the list of keypoints it includes, in order.
OpenPose often detects several people in our videos, even though only a single person is present. We distribute the original predictions, which contain all people that OpenPose detected.
As an alternative, we also predict poses with the MediaPipe Holistic system developed by Google. Unlike our OpenPose model, it is a regression model and outputs 3D (X, Y, Z) coordinates. For the SRF data, values from Holistic are normalized between 0 and 1 instead of referring to actual video coordinates. The actual values can be restored from the size of the original video using this script.
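A minimal sketch of this restoration, assuming the Holistic x and y values are normalized by video width and height respectively (z carries no pixel unit and is left untouched; the released script may differ in detail):

```python
import numpy as np

width, height = 1280, 720  # SRF video resolution

# Two hypothetical keypoints as (x, y, z), normalized to [0, 1]:
normalized = np.array([[0.50, 0.50, 0.10],
                       [0.25, 0.75, -0.05]])

restored = normalized.copy()
restored[:, 0] *= width   # x back to pixel coordinates
restored[:, 1] *= height  # y back to pixel coordinates
# z stays normalized: depth has no pixel unit
```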
The poses of the Signsuisse data contain the full set of 543 keypoints, while the poses of the SRF data contain a reduced set of 203 keypoints, where the dense FACE_LANDMARKS are reduced to the contour keypoints defined by the variable mediapipe.solutions.holistic.FACEMESH_CONTOURS. The Signsuisse poses can be reduced to a format similar to SRF using this script.
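The reduction keeps exactly the face landmarks that FACEMESH_CONTOURS mentions. That variable is a set of (start, end) landmark-index pairs describing the face contours; the sketch below uses a tiny invented edge set in its place, so it runs without MediaPipe installed:

```python
# FACEMESH_CONTOURS (from mediapipe.solutions.holistic) is a set of
# (start, end) landmark-index pairs. The kept keypoints are precisely
# the indices that appear in those pairs. A tiny invented edge set
# stands in for the real variable here:
edges = {(0, 1), (1, 2), (2, 0), (10, 11)}
kept = sorted({i for edge in edges for i in edge})
print(kept)  # prints [0, 1, 2, 10, 11]
```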