Proposed LR taxonomy
Central to the model is the LR taxonomy, which allows us to organize the resources in a structured way, taking into consideration the specificities of each type.
The proposed LR taxonomy constitutes an integral part of the metadata model, whereby the types of LRs (attributes and values) belong to the element set. The basic element used to categorize LRs into types that lead to coherent sets of descriptions is resourceType, which takes the following values:
- corpus (including written/text, oral/spoken, multimodal/multimedia corpora)
- lexical / conceptual resource (including terminological resources, word lists, semantic lexica, ontologies etc.)
- language description (including grammars)
- tool / service (including basic processing tools, applications, web services etc. required for processing data resources).
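As an illustration only, the four resourceType values listed above could be modelled as a closed enumeration. This is a minimal sketch and not part of the model itself: the class name `ResourceType`, the member names and the camelCase string values are assumptions made for the example, since the text gives only the human-readable labels.

```python
from enum import Enum


class ResourceType(Enum):
    """The four LR types named in the taxonomy.

    The string values are assumed spellings chosen for this sketch;
    the text lists only the human-readable labels.
    """

    CORPUS = "corpus"
    LEXICAL_CONCEPTUAL_RESOURCE = "lexicalConceptualResource"
    LANGUAGE_DESCRIPTION = "languageDescription"
    TOOL_SERVICE = "toolService"


def parse_resource_type(value: str) -> ResourceType:
    """Return the matching ResourceType, or raise ValueError for unknown values."""
    try:
        return ResourceType(value)
    except ValueError:
        allowed = ", ".join(t.value for t in ResourceType)
        raise ValueError(
            f"unknown resourceType {value!r}; expected one of: {allowed}"
        ) from None


if __name__ == "__main__":
    print(parse_resource_type("corpus"))
```

Using a closed enumeration means that a description carrying an unrecognized resourceType is rejected early, instead of silently producing an incoherent set of descriptions.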
Also central to the description of LRs in the META-SHARE context is the mediaType element, which specifies the form/physical medium of the resource. The notion of medium is preferred over the written/spoken/multimodal distinction, as it has clearer semantics and allows us to view LRs as a set of media representations, each of which can be described through a distinctive set of features. Thus, the following mediaType values are foreseen:
- text (+textNumerical, textNgram)
A resource may consist of parts belonging to different types of media: for instance, a multimodal corpus includes a video part (moving image), an audio part (dialogues) and a text part (subtitles and/or transcription of the dialogues); a multimedia lexicon includes the text part, but may also include a video and/or an audio part; a sign language resource is also a resource with various media types (video, image, text). Similarly, tools can be applied to resources of different media types: e.g. a tool can be used both for video and for audio files. Thus, for each part of the resource, the respective feature set (components and elements) should be used: e.g. for a spoken corpus and its transcriptions, the audio feature set will be used for the audio part and the text feature set for the transcribed part.
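The idea that each media part of a resource is described with its own feature set can be sketched as follows. This is a hypothetical illustration, not the schema itself: the names `MediaPart`, `Resource` and `FEATURE_SETS`, and the feature-set labels, are assumptions introduced for the example, and the media types beyond text are taken from the examples in the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical mapping from media type to the feature set used to describe it.
# The labels on the right are placeholders, not element names from the model.
FEATURE_SETS = {
    "text": "textFeatureSet",
    "audio": "audioFeatureSet",
    "video": "videoFeatureSet",
    "image": "imageFeatureSet",
}


@dataclass
class MediaPart:
    """One part of a resource, realized in a single medium."""

    media_type: str
    description: str


@dataclass
class Resource:
    """A language resource viewed as a set of media representations."""

    name: str
    resource_type: str
    parts: List[MediaPart] = field(default_factory=list)

    def feature_sets(self) -> List[str]:
        """Return the feature set applicable to each part, in order."""
        return [FEATURE_SETS[part.media_type] for part in self.parts]


# The spoken-corpus example from the text: an audio part plus its transcription.
spoken_corpus = Resource(
    name="example spoken corpus",
    resource_type="corpus",
    parts=[
        MediaPart("audio", "recorded dialogues"),
        MediaPart("text", "transcriptions of the dialogues"),
    ],
)

print(spoken_corpus.feature_sets())  # ['audioFeatureSet', 'textFeatureSet']
```

Keeping the medium on the part, and not on the resource as a whole, lets a multimodal corpus or a multimedia lexicon be described by adding parts without introducing a separate resource type for every combination of media.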