Media Resource Control Protocol (MRCP)
- What is this document?
- What is the Media Resource Control Protocol (MRCP)?
- How do MRCP server and client communicate?
- Specification of the Recogniser Resource in MRCPv2
|Author:||Ivan A. Uemlianin|
- update to MRCPv2.12 annotated; write script to annotate
This set of pages outlines my current understanding of the Media Resource Control Protocol (MRCP). It includes documentation, an annotated specification, and some python scripts. My aim, and my primary interest in MRCP, is to develop an MRCP speech recognition resource as part of the trefnydd speech recognition toolkit. Consequently this document does not cover other resource types such as speech synthesis or recording.
MRCP is a network protocol that provides a common interface to a range of speech engine types, including speech recognition (ASR), speech synthesis (aka TTS), speaker or voice recognition (aka speaker verification), and simple voice recording. An MRCP resource server (also known simply as an MRCP server) provides access to one or more of these resources.
A remote client, which may be a desktop computer, a server or some other handheld or embedded device, contacts the server over the internet and establishes a session. Once the session has been set up, client and server exchange messages requesting and providing various speech processing services. The term 'MRCP' covers both (a) the setting up of the session, and (b) the format of the messages exchanged during the session. See How do MRCP server and client communicate? for details.
To date [13/05/07] there are two versions of MRCP. Also of interest is the requirements document (IETF RFC 4313), which discusses the need to go beyond MRCP version 1 and lays the foundations for version 2. This document will focus on MRCP version 2.
As of 10th March 2007, the latest draft of MRCP version 2 is 2.12 (dated March 5, 2007).
Throughout this document I refer to MRCPv2.10, i.e., the 10th draft of the MRCP version 2 specification. This document actually refers to a local, annotated version of the 10th draft. The annotation consists only of some extra internal hyperlinks, making the document a bit easier to navigate.
A concise summary of the differences between the MRCP versions is given in (Burke, 2007:6-7). Here are some of the main points:
- MRCPv1 uses the Real Time Streaming Protocol (RTSP) for the base-layer communications between parties; MRCPv2 uses SIP and RTP. See How do MRCP server and client communicate? for further discussion.
- MRCPv2 adds support for speaker verification and simple speech recording.
- the development of MRCPv2 seems to have been based on wider input and a more principled process. Consequently, MRCPv2 is more vendor-neutral and more transparently standards-based.
|?!||I read Burke (2007) to mean that MRCPv1 is now obsolete, or at least deprecated.|
A client requiring speech processing resources (e.g., a phone calling an automated call centre) contacts the MRCP server over TCP and uses the Session Initiation Protocol (SIP) and the Session Description Protocol (SDP) to set up a session (see Establishing a session). Once the session has been set up, two separate channels are opened: a media channel for audio data streams, which uses the Real-time Transport Protocol (RTP); and a control channel, running on TCP, in which messages in MRCP format are exchanged (see The control channel: MRCP message format).
MRCPv2 employs a session establishment and management protocol such as SIP in conjunction with SDP. The client finds and reaches an MRCPv2 server using conventional INVITE and other SIP transactions for establishing, maintaining, and terminating SIP dialogs. The SDP offer/answer exchange model over SIP is used to establish a resource control channel for each resource. The SDP offer/answer exchange is also used to establish media sessions between the server and the source or sink of audio.
SIP messages are text messages sent over the regular internet (typically over UDP or TCP). As with most text-based internet message formats, a SIP message consists of a first line, a header and an optional body. For MRCP session setup, the body of the SIP message is in SDP format.
|?!||Are SIP bodies always SDP?|
|?!||SDP is called a protocol, but it seems to be really a message (body) format.|
Essentially the SIP part of the message (i.e., the first line and the header) describes the participants in a suggested communication session, while the SDP part (i.e., the message body) describes what kind of session(s) is/are required (by the client), or are available (from the server). SIP/SDP are not used only for MRCP: see the external references in the Glossary for further discussion.
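To make the offer/answer exchange concrete, here is a skeleton SDP offer for a recogniser session, loosely based on the examples in the MRCPv2 specification. The addresses, ports and origin field are illustrative only:

```
v=0
o=client 2890844526 2890844527 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:new
a=resource:speechrecog
a=cmid:1
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=sendonly
a=mid:1
```

The m=application line requests an MRCPv2 control channel for a speechrecog resource; the m=audio line describes the RTP media stream; and the a=cmid/a=mid attributes tie the control channel to its media stream.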
An MRCP session will often consist of two separate channels: one for audio data streams, known as the media channel; and another, known as the control channel, for requests, results and other negotiation pertaining to the speech processing. The media channel is a binary data channel in RTP format (over UDP); the control channel is a text-based channel sending MRCP-format messages over TCP.
n.b.: An MRCP session may consist of only a control channel, if the server is providing only an 'interpretation' resource. This kind of MRCP resource carries out natural language processing on textual data.
|?!||So we have a SIP session which morphs into an RTP session and an MRCP session. In programming terms, does that mean three different servers, or should an MRCP server handle SIP, RTP and MRCP? The latter sounds more sensible to me.|
|?!||Nonetheless, the MRCP server would have to read/write messages in all three formats. That means, e.g., if I'm using Twisted, I'll have to implement RTP and MRCP protocols (Twisted seems already to have SIP).|
The MRCP message itself is a text string. Header names are restricted to US-ASCII; everything else is in UTF-8 by default. All lines end in CRLF. [MRCPv2 Section 5]
generic-message =  start-line
                   message-header
                   CRLF
                   [ message-body ]

start-line      =  request-line / response-line / event-line

message-header  =  1*(generic-header / resource-header)

resource-header =  recognizer-header
                /  synthesizer-header
                /  recorder-header
                /  verifier-header
The start-line identifies the message as either a Request for services (client to server), a Response (server to client) to the request, or an Event (server to client) informing the client of a change in state in the server.
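A rough Python sketch of how a client might tell the three start-line types apart. The layouts follow Section 5 of the specification (request: version, length, method-name, request-id; response: version, length, request-id, status-code, request-state; event: version, length, event-name, request-id, request-state); the event-name set here is limited to the two recogniser events and is illustrative:

```python
# Recogniser event names; a fuller client would list every event
# defined by the resources it supports.
EVENT_NAMES = {"START-OF-INPUT", "RECOGNITION-COMPLETE"}

def classify_start_line(line):
    """Return 'request', 'response' or 'event' for an MRCPv2 start-line.

    In all three layouts the first two tokens are the version and the
    message length; the third token is a method name (request), a
    numeric request-id (response), or an event name (event).
    """
    parts = line.strip().split()
    third = parts[2]
    if third.isdigit():
        return "response"
    if third in EVENT_NAMES:
        return "event"
    return "request"
```

This relies on request-ids being purely numeric, which is what distinguishes a response's third token from a method or event name.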
# headers ...
Message headers are in the form name:value. The headers set or report variables pertaining to the current request or to the session. Section 6.2 describes generic methods and headers; Section 9.4 describes recogniser resource headers.
The optional message body "contains resource-specific and message-specific data carried as a MIME entity" [5.1]. n.b.: the message body always contains textual data only; binary (i.e., audio) data is sent over the separate media channel. MRCP 'leverages' the W3C voice browser formats extensively: recognition-related message bodies contain SRGS (or JSGF) for recognition grammars, or NLSML for results. Section 9.5 describes recogniser resource message bodies.
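As a sketch of how start-line, headers and body fit together, the following Python assembles a RECOGNIZE request. This is an illustration of the message layout, not a reference implementation; the fiddly part is that the message-length field in the start-line counts the whole message, including the digits of the length field itself, so it is computed iteratively:

```python
def build_mrcp_request(method, request_id, headers, body=""):
    """Assemble an MRCPv2 request: start-line, headers, CRLF, body."""
    CRLF = "\r\n"
    hdrs = dict(headers)
    if body:
        # The body is carried as a MIME entity; Content-Length gives
        # its size in bytes.
        hdrs["Content-Length"] = str(len(body.encode("utf-8")))

    def render(length):
        lines = ["MRCP/2.0 %d %s %s" % (length, method, request_id)]
        lines += ["%s: %s" % (k, v) for k, v in hdrs.items()]
        return CRLF.join(lines) + CRLF + CRLF + body

    # message-length counts every byte of the message, including its
    # own digits, so iterate until the value stabilises.
    length = 0
    while True:
        candidate = render(length)
        actual = len(candidate.encode("utf-8"))
        if actual == length:
            return candidate
        length = actual
```

A usage example: build_mrcp_request("RECOGNIZE", "543257", {"Channel-Identifier": "32AECB23433801@speechrecog", "Content-Type": "application/srgs+xml"}, grammar_xml). The channel identifier value is made up for illustration.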
Speech recognition resources are given the IANA-registered typenames 'speechrecog' (for full speech recognition) or 'dtmfrecog' (for DTMF recognition). MRCP defines other resource types, but we shall limit our concern to these two.
The Speech Recognizer Resource is described in Section 9 of the MRCPv2 specification.
Two types of recogniser are defined: dtmfrecog and speechrecog. The dtmfrecog resource is not strictly a speech recogniser: it recognises, and acts on, only DTMF input, ignoring speech. The speechrecog resource does recognise speech, and can optionally include the dtmfrecog functionality. Whether a particular speechrecog resource recognises DTMF input depends on the grammar(s) activated (see below).
Four modes are identified:

Normal Mode Recognition
Full recognition of the input against the grammar: the recogniser tries to match all of the input against the given grammar, and returns a 'no-match' status on failure.

Hotword Mode Recognition
Looks for a match within the input, ignoring input that does not match.

Voice Enrolled Grammars
Also known as speaker-dependent recognition. Speaker recognition is performed on the input. This is known as 'enrollment', and acts like username/password login. On enrollment, functions are enabled which depend on the given speaker, for example training voice commands or maintaining a list of contacts. The specification says that it is optional for a recogniser resource to support voice enrolled grammars (Section 9.2).

Interpretation
No recognition is performed. In this mode, 'the resource takes text as input and produces an "interpretation" of the input according to the supplied grammar'. This interpretation might for example convert a written enquiry into an SQL database query.
This document will cover Normal Mode and Hotword Mode recognition. This document will not cover DTMF recognition, Voice Enrolled Grammars, or Interpretation.
This diagram of the recogniser state machine is based on the diagram at MRCPv2 Section 9.1 (with slight alterations for ease of viewing):
Idle                      Recognizing                  Recognized
State                     State                        State
  |                           |                            |
  |---------RECOGNIZE-------->|---RECOGNITION-COMPLETE---->|
  |                           |                            |
  |<--------STOP--------------|<-----RECOGNIZE-------------|
  |                           |                            |
  |                  /--------|               /------------|
  |   START-OF-INPUT |        |    GET-RESULT |            |
  |                  \------->|               \----------->|
  |                           |                            |
  |------------\              |----------\                 |
  | DEFINE-GRAMMAR            | START-INPUT-TIMERS         |
  |<-----------/              |<---------/                 |
  |                           |                            |
  |------\                    |-------\                    |
  | RECOGNIZE                 | STOP                       |
  |<-----/                    |<------/                    |
  |                           |                            |
  |<-------------------STOP-------------------------------|
  |                           |                            |
  |<-------------------DEFINE-GRAMMAR---------------------|
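The machine can also be sketched as a Python transition table. This covers only the main transitions from Section 9.1 of the specification; error cases (e.g., a RECOGNIZE that fails to start) leave the state unchanged and are omitted here:

```python
# (current state, method or event) -> next state.
# Self-loops such as GET-RESULT and START-OF-INPUT leave the
# state unchanged.
TRANSITIONS = {
    ("Idle", "RECOGNIZE"): "Recognizing",
    ("Idle", "DEFINE-GRAMMAR"): "Idle",
    ("Recognizing", "START-OF-INPUT"): "Recognizing",
    ("Recognizing", "START-INPUT-TIMERS"): "Recognizing",
    ("Recognizing", "RECOGNITION-COMPLETE"): "Recognized",
    ("Recognizing", "STOP"): "Idle",
    ("Recognized", "GET-RESULT"): "Recognized",
    ("Recognized", "RECOGNIZE"): "Recognizing",
    ("Recognized", "STOP"): "Idle",
    ("Recognized", "DEFINE-GRAMMAR"): "Idle",
}

def step(state, message):
    """Apply one method or event; unknown pairs raise KeyError."""
    return TRANSITIONS[(state, message)]
```

A server-side resource could use such a table to reject methods that are invalid in the current state before doing any real work.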
For details on each method or event, please see either the MRCPv2 Specification, or trefnydd/MRCP.
See References/Software for links to MRCP resources available on the web.
See trefnydd/MRCP for notes on MRCP support in trefnydd.
DTMF: see Dual-tone multi-frequency
IANA: see Internet Assigned Numbers Authority
NLSML: see Natural Language Semantics Markup Language
SRGS: see Speech Recognition Grammar Specification
MRCP Resource types:
See MRCPv2.10 4.2. Managing Resource Control Channels for details.
Resource Type   Resource Description    Described in
-------------   --------------------    ------------
speechrecog     Speech Recognizer       Section 9
dtmfrecog       DTMF Recognizer         Section 9
speechsynth     Speech Synthesizer      Section 8
basicsynth      Basic Synthesizer       Section 8
speakverify     Speaker Verification    Section 11
recorder        Speech Recorder         Section 10
RTP: see Real-time Transport Protocol
RTSP: see Real Time Streaming Protocol
SDP: see Session Description Protocol
SIP: see Session Initiation Protocol
UDP: see User Datagram Protocol
- Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources (2005):
- MRCP Version 1 (2006):
the Intel MRCP client library (links to User's guide).
LIVE555 Streaming media:
From the website: "This [LGPL] code forms a set of C++ libraries for multimedia streaming, using open standard protocols (RTP/RTCP, RTSP, SIP) ... They can easily be extended to support additional (audio and/or video) codecs, and can also be used to build basic RTSP or SIP clients and servers."
The site doesn't mention MRCP explicitly, but I came across this interesting exchange on their mailing list (via Google).
A set of MRCP-related resources (including a speech server) written in 'the Java programming language'.