Media Resource Control Protocol (MRCP)

Author:	Ivan A. Uemlianin
Contact:	ivan@llaisdy.com

TODO

update to MRCPv2.12 annotated; write script to annotate

What is this document?

This set of pages outlines my current understanding of the Media Resource Control Protocol (MRCP). It includes documentation, an annotated specification, and some Python scripts. My aim, and my primary interest in MRCP, is to develop an MRCP speech recognition resource as part of the trefnydd speech recognition toolkit. Consequently, this document does not cover other resource types such as speech synthesis or recording.

What is the Media Resource Control Protocol (MRCP)?

Brief definition

MRCP is a network protocol that implements a common interface to a range of speech engine types, including speech recognition (ASR), speech synthesis (aka TTS), speaker or voice recognition (aka speaker verification), and simple voice recording. An MRCP resource server (also known simply as an MRCP server) will provide access to one or more of these resources.

A remote client, which may be a desktop computer, a server or some other handheld or embedded device, contacts the server over the internet and establishes a session. Once the session has been set up, client and server exchange messages requesting and providing various speech processing services. The term 'MRCP' covers both (a) the setting up of the session, and (b) the format of the messages exchanged during the session. See How do MRCP server and client communicate? for details.

Versions

To date [13/05/07] there are two versions of MRCP. Also of interest is the requirements document (IETF RFC 4313), which discusses the need to go beyond MRCP version 1 and lays the foundations for version 2. This document focuses on MRCP version 2.

Version 1

See A Media Resource Control Protocol (MRCP) Developed by Cisco, Nuance, and Speechworks (2006).

Requirements specification

See Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources (2005).

Version 2

As of 10th March 2007, the latest draft of MRCP version 2 is 2.12 (dated March 5, 2007).

Throughout this document I refer to MRCPv2.10, i.e., the 10th draft of the MRCP version 2 specification. This document actually refers to a local, annotated version of the 10th draft. The annotation consists only of some extra internal hyperlinks, making the document a bit easier to navigate.

Differences between versions 1 & 2

A concise summary of the differences between the MRCP versions is given in (Burke, 2007:6-7). Here are some of the main points:

MRCPv1 uses the Real Time Streaming Protocol (RTSP) for the base-layer communications between parties; MRCPv2 uses SIP and RTP. See How do MRCP server and client communicate? for further discussion.
MRCPv2 adds support for speaker verification and simple speech recording.
The development of MRCPv2 seems to have been based on wider input and a more principled process. Consequently, MRCPv2 is more vendor-neutral and more transparently standards-based.

?!	I read Burke (2007) to mean that MRCPv1 is now obsolete, or at least deprecated.

How do MRCP server and client communicate?

A client requiring speech processing resources (e.g., a phone calling an automated call centre) contacts the MRCP server over TCP and uses Session Initiation Protocol (SIP) and Session Description Protocol (SDP) to set up a session (see Establishing a session). Once the session has been set up, two separate channels are opened: a media channel for audio data streams, which uses the Real-time Transport Protocol (RTP); and a control channel, running on TCP, in which messages in MRCP format are exchanged (see The control channel: MRCP message format).

Establishing a session

From MRCPv2.10 4.1. Connecting to the Server:

MRCPv2 employs a session establishment and management protocol such
as SIP in conjunction with SDP.  The client finds and reaches a
MRCPv2 server using conventional INVITE and other SIP transactions
for establishing, maintaining, and terminating SIP dialogs.  The SDP
offer/answer exchange model over SIP is used to establish a resource
control channel for each resource.  The SDP offer/answer exchange is
also used to establish media sessions between the server and the
source or sink of audio.

SIP messages are text messages sent over the regular internet (typically over UDP or TCP). As with most text-based internet message formats, SIP messages consist of a first line, a header and an optional body. The body of a SIP message is in SDP format.

?!	Are SIP bodies always SDP?
?!	SDP is called a protocol, but it seems to be really a message (body) format.

Essentially the SIP part of the message (i.e., the first line and the header) describes the participants in a suggested communication session, while the SDP part (i.e., the message body) describes what kind of session(s) is/are required (by the client), or are available (from the server). SIP/SDP are not used only for MRCP: see the external references in the Glossary for further discussion.
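
The split into first line, header and body can be sketched in a few lines of Python. This is only an illustration of the message layout described above, not a SIP implementation: the function name parse_text_message, the addresses and the header values are all invented for the example, and it ignores details such as repeated or folded header lines.

```python
def parse_text_message(raw: str):
    """Split a SIP-style text message into (first_line, headers, body).

    Simplification: a repeated header name overwrites the earlier value.
    """
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    first_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return first_line, headers, body


# Illustrative message only: every address and value here is made up.
EXAMPLE_BODY = "v=0\r\ns=-\r\nt=0 0\r\n"
EXAMPLE_INVITE = (
    "INVITE sip:mresources@server.example.com SIP/2.0\r\n"
    "To: <sip:mresources@server.example.com>\r\n"
    "From: <sip:client@client.example.com>\r\n"
    "Content-Type: application/sdp\r\n"
    f"Content-Length: {len(EXAMPLE_BODY)}\r\n"
    "\r\n" + EXAMPLE_BODY
)

first_line, headers, body = parse_text_message(EXAMPLE_INVITE)
print(first_line)               # the SIP part: who is talking to whom
print(headers["Content-Type"])  # tells us the body is SDP
print(body)                     # the SDP part: what kind of session
```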

MRCPv2.10 Sections 4.2 (Managing Resource Control Channels) and 14.1 (Examples: Message Flow) give details and examples of the client-server dialogue.

An MRCP session will often consist of two separate channels, one for audio datastreams, known as the media channel; and another, known as the control channel, for requests, results and other negotiation pertaining to the speech processing. The media channel is a binary data channel in RTP format (over UDP); the control channel is a text-based channel sending MRCP format messages over TCP.
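
To make the two-channel idea concrete, here is a small Python sketch that reads the media description lines (m=) of an SDP body and labels each one as a control channel or a media channel. The sample SDP is my own, modelled loosely on the shape of the examples in the specification; in particular I am assuming that the control channel is announced with a protocol token containing "MRCPv2" and the audio stream with an "RTP/..." profile, so check this against the draft you are using. The function name classify_sdp_media is hypothetical.

```python
def classify_sdp_media(sdp: str) -> list:
    """Return one dict per SDP 'm=' line, labelled control/media/other."""
    channels = []
    for line in sdp.splitlines():
        if not line.startswith("m="):
            continue
        # m=<media> <port> <proto> <fmt> ...
        media, port, proto, *_formats = line[2:].split()
        if proto.startswith("RTP/"):
            kind = "media"      # binary audio over RTP (UDP)
        elif "MRCPv2" in proto:
            kind = "control"    # text MRCP messages over TCP
        else:
            kind = "other"
        channels.append({
            "kind": kind,
            "media": media,
            "port": int(port.split("/")[0]),
            "proto": proto,
        })
    return channels


# Illustrative offer: addresses, ports and ids are invented.
EXAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=client 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=application 9 TCP/MRCPv2 1",
    "a=resource:speechrecog",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 pcmu/8000",
    "",
])

for channel in classify_sdp_media(EXAMPLE_SDP):
    print(channel["kind"], channel["media"], channel["port"], channel["proto"])
```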

n.b.: An MRCP session may consist of only a control channel, if the server is providing only an 'interpretation' resource. This kind of MRCP resource carries out natural language processing on textual data.

?!	So we have an SIP session which morphs into an RTP session and an MRCP session. In programming terms, does that mean three different servers, or should an MRCP server handle SIP, RTP and MRCP? The latter sounds more sensible to me.
?!	Nonetheless, the MRCP server would have to read/write messages in all three formats. That means, e.g., if I'm using Twisted, I'll have to implement RTP and MRCP protocols (Twisted seems already to have SIP).

The media channel: RTP messages

See the external references in the Glossary for RTP for details on the media channel.

The control channel: MRCP message format

The MRCP message itself is a text string. The protocol header and header names are in US-ASCII; everything else is in UTF-8 by default. All lines end in CRLF. [MRCPv2 Section 5]

generic-message  =    start-line
                      message-header
                      CRLF
                      [ message-body ]

start-line       =    request-line / response-line / event-line

message-header   =    1*(generic-header / resource-header)

resource-header  =    recognizer-header
                 /    synthesizer-header
                 /    recorder-header
                 /    verifier-header

The start-line identifies the message as either a Request for services (client to server), a Response (server to client) to the request, or an Event (server to client) informing the client of a change in state in the server.
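
As a sketch of how a receiver might tell the three kinds of start-line apart, the Python below classifies a line by its shape. The field orders used here (version, length, method, request id for a request; version, length, request id, status code, request state for a response; version, length, event name, request id, request state for an event) are my reading of the specification rather than something spelled out in this document, and may differ between drafts. The function name and the example lines, including their length values, are made up.

```python
def classify_start_line(line: str) -> dict:
    """Classify an MRCPv2 start-line as request, response or event."""
    tokens = line.strip().split(" ")
    if len(tokens) == 4:
        version, length, method, request_id = tokens
        return {"kind": "request", "version": version,
                "length": int(length), "name": method,
                "request_id": request_id}
    if len(tokens) == 5 and tokens[2].isdigit():
        # Third field is a numeric request id, so this is a response.
        version, length, request_id, status, state = tokens
        return {"kind": "response", "version": version,
                "length": int(length), "request_id": request_id,
                "status": int(status), "state": state}
    if len(tokens) == 5:
        version, length, event, request_id, state = tokens
        return {"kind": "event", "version": version,
                "length": int(length), "name": event,
                "request_id": request_id, "state": state}
    raise ValueError(f"not a recognisable start-line: {line!r}")


# Illustrative lines only; the lengths are arbitrary.
for example in (
    "MRCP/2.0 903 RECOGNIZE 543257",
    "MRCP/2.0 79 543257 200 IN-PROGRESS",
    "MRCP/2.0 466 RECOGNITION-COMPLETE 543257 COMPLETE",
):
    print(classify_start_line(example)["kind"])
```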

# headers ...

Message headers are in the form name:value. Headers set or report variables pertaining to the current request or the session. Section 6.2 describes generic methods and headers; Section 9.4 describes recogniser resource headers.

The optional message body "contains resource-specific and message-specific data carried as a MIME entity" [5.1]. n.b.: the message body always contains textual data only - binary (i.e., audio) data is sent over the separate media channel. MRCP 'leverages' the W3C voice browser formats extensively, and recognition-related message bodies contain SRGS (or JSGF) for recognition grammars, or NLSML for results. Section 9.5 describes recogniser resource message bodies.
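
Putting start-line, headers and body together, the following Python sketch assembles a complete request as bytes. It assumes (my reading of the specification, not stated above) that the length field in the start-line counts every octet of the message, including the start-line itself, so the length has to be found by iteration. The helper name build_request, the Channel-Identifier value and the tiny SRGS grammar are all illustrative, and the header set is not claimed to be complete for a real DEFINE-GRAMMAR request.

```python
def build_request(method: str, request_id: int, headers: dict, body: str = "") -> bytes:
    """Assemble start-line + headers + CRLF + body with a self-consistent length."""
    body_bytes = body.encode("utf-8")
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    if body_bytes:
        header_lines.append(f"Content-Length: {len(body_bytes)}")
    header_block = "".join(h + "\r\n" for h in header_lines).encode("utf-8")

    def assemble(length: int) -> bytes:
        start = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        return start + header_block + b"\r\n" + body_bytes

    # The length field is part of the message it measures, so repeat
    # until the stated length equals the actual length.
    length = len(assemble(0))
    while len(assemble(length)) != length:
        length = len(assemble(length))
    return assemble(length)


# Illustrative values only.
EXAMPLE_GRAMMAR = (
    '<?xml version="1.0"?>\n'
    '<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-GB"\n'
    '         version="1.0" root="yesno">\n'
    '  <rule id="yesno"><one-of><item>yes</item><item>no</item></one-of></rule>\n'
    '</grammar>\n'
)
EXAMPLE_HEADERS = {
    "Channel-Identifier": "32AECB23433801@speechrecog",
    "Content-Type": "application/srgs+xml",
}

message = build_request("DEFINE-GRAMMAR", 543257, EXAMPLE_HEADERS, EXAMPLE_GRAMMAR)
print(message.decode("utf-8"))
```

The body here is text (an SRGS grammar), in keeping with the point that audio never travels on the control channel.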

Specification of the Recogniser Resource in MRCPv2

Speech recognition resources are given the IANA-registered typenames 'speechrecog' (for full speech recognition) or 'dtmfrecog' (for DTMF recognition). MRCP defines other resource types, but we shall limit our concern to these two.

Resource overview

The Speech Recognizer Resource is described in Section 9 of the MRCPv2 specification.

Types and modes of recogniser

Two types of recogniser are defined: dtmfrecog and speechrecog. The dtmfrecog resource isn't a speech recogniser as it recognises, and acts on, only DTMF input (and ignores speech). The speechrecog resource does recognise speech, and can optionally include the dtmfrecog functionality. Whether a particular speechrecog resource recognises DTMF input depends on the grammar(s) activated (see below).

Four modes are identified:

Normal Mode

Full recognition of the input against the grammar. Tries to match all of the input against the given grammar. Returns a 'no-match' status on failure.

Hotword Mode

Looks for a match within the input, ignoring input that does not match.

Voice Enrolled Grammars

Also known as speaker-dependent recognition. Speaker recognition is performed on the input. This is known as 'enrollment', and acts like username/password login. On enrollment, functions are enabled which depend on the given speaker, for example training voice commands or maintaining a list of contacts. The specification says that it is optional for a recogniser resource to support voice enrolled grammars (Section 9.2).

Interpretation

No recognition is performed. In this mode, "the resource takes text as input and produces an 'interpretation' of the input according to the supplied grammar." This interpretation might for example convert a written enquiry into an SQL database query.

This document will cover Normal Mode and Hotword Mode recognition. This document will not cover DTMF recognition, Voice Enrolled Grammars, or Interpretation.
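
The difference between Normal Mode and Hotword Mode can be illustrated with a toy matcher in Python. This is a deliberately crude sketch: it works on already-transcribed words rather than audio, the 'grammar' is just a set of allowed phrases, and the function name recognise and the use of None to stand for 'no-match' are my own choices, not anything defined by MRCP.

```python
def recognise(words, grammar, mode="normal"):
    """Toy matcher: grammar is a set of phrases, each a tuple of words.

    Returns the matched phrase, or None to stand for 'no-match'.
    """
    words = tuple(words)
    if mode == "normal":
        # All of the input must match a phrase in the grammar.
        return words if words in grammar else None
    if mode == "hotword":
        # Look for a phrase anywhere in the input, ignoring the rest.
        # Longer phrases are tried first.
        for phrase in sorted(grammar, key=len, reverse=True):
            n = len(phrase)
            for start in range(len(words) - n + 1):
                if words[start:start + n] == phrase:
                    return phrase
        return None
    raise ValueError(f"unknown mode: {mode!r}")


GRAMMAR = {("call", "home"), ("hang", "up")}
utterance = "please call home now".split()

print(recognise(utterance, GRAMMAR, mode="normal"))   # whole input must match
print(recognise(utterance, GRAMMAR, mode="hotword"))  # match found inside input
```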

The recogniser state machine

This diagram of the recogniser state machine is based on the diagram at MRCPv2 Section 9.1 (with slight alterations for ease of viewing):

Idle                    Recognizing                Recognized
State                   State                      State
  |                       |                          |
  |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
  |                       |                          |
  |<--------STOP----------|<-----RECOGNIZE-----------|
  |                       |                          |
  |                       |              /-----------|
  |              /--------|       GET-RESULT         |
  |       START-OF-INPUT  |              \---------->|
  |              \------->|                          |
  |------------\          |                          |
  |            |          |----------\               |
  |      DEFINE-GRAMMAR   |   START-INPUT-TIMERS     |
  |<-----------/          |<---------/               |
  |                       |                          |
  |                       |------\                   |
  |-------\               |   RECOGNIZE              |
  |      STOP             |<-----/                   |
  |<------/                                          |
  |                                                  |
  |<-------------------STOP--------------------------|
  |                                                  |
  |<-------------------DEFINE-GRAMMAR----------------|

For details on each method or event, please see either the MRCPv2 Specification, or trefnydd/MRCP.
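
The diagram translates fairly directly into a transition table. The Python below is a minimal sketch of that idea: the transitions are copied from the diagram above, while the lower-case state names, the function names step and run, and the choice to raise ValueError for anything the diagram does not show are my own assumptions. A real resource would also have to produce responses and events, which this ignores.

```python
# (current state, method or event) -> next state, as read off the diagram.
TRANSITIONS = {
    ("idle", "RECOGNIZE"): "recognizing",
    ("idle", "DEFINE-GRAMMAR"): "idle",
    ("idle", "STOP"): "idle",
    ("recognizing", "RECOGNITION-COMPLETE"): "recognized",
    ("recognizing", "STOP"): "idle",
    ("recognizing", "START-OF-INPUT"): "recognizing",
    ("recognizing", "START-INPUT-TIMERS"): "recognizing",
    ("recognizing", "RECOGNIZE"): "recognizing",
    ("recognized", "RECOGNIZE"): "recognizing",
    ("recognized", "GET-RESULT"): "recognized",
    ("recognized", "STOP"): "idle",
    ("recognized", "DEFINE-GRAMMAR"): "idle",
}


def step(state: str, message: str) -> str:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message} is not shown for state {state!r}") from None


def run(messages, state: str = "idle") -> str:
    """Feed a sequence of methods/events through the table."""
    for message in messages:
        state = step(state, message)
    return state


final = run(["DEFINE-GRAMMAR", "RECOGNIZE", "START-OF-INPUT",
             "RECOGNITION-COMPLETE", "GET-RESULT"])
print(final)
```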

Implementations

See References/Software for links to MRCP resources available on the web.

See trefnydd/MRCP for notes on MRCP support in trefnydd.

Glossary

DTMF: see Dual-tone multi-frequency

IANA: see Internet Assigned Numbers Authority

JSGF: see Java Speech Grammar Format or Java Speech API

NLSML: see Natural Language Semantics Markup Language

SRGS: see Speech Recognition Grammar Specification

MRCP Resource types:

See MRCPv2.10 4.2. Managing Resource Control Channels for details.

Resource Type	Resource Description	Described in
speechrecog	Speech Recognizer	Section 9
dtmfrecog	DTMF Recognizer	Section 9
speechsynth	Speech Synthesizer	Section 8
basicsynth	Basic Synthesizer	Section 8
speakverify	Speaker Verification	Section 11
recorder	Speech Recorder	Section 10

RTP: see Real-time Transport Protocol

RTSP: see Real Time Streaming Protocol

SDP: see Session Description Protocol

SIP: see Session Initiation Protocol

UDP: see User Datagram Protocol

References

MRCP specifications

Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources (2005):

http://tools.ietf.org/html/rfc4313

MRCP Version 1 (2006):

http://tools.ietf.org/html/rfc4463

MRCPv2.10

at ietf.org: http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-10
local annotated version

MRCPv2.12

http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-12

Other relevant W3C specifications

Speech Interface Framework
VoiceXML
SISR: Semantic Interpretation for Speech Recognition
SSML: Speech Synthesis Markup Language

Software

the Aculab MRCP Client Library
the Intel MRCP client library (links to User's guide).
Voxpilot offer a free evaluation copy of their Open Media Platform, an MRCP-based VoiceXML media server.
LIVE555 Streaming media:

From the website: "This [LGPL] code forms a set of C++ libraries for multimedia streaming, using open standard protocols (RTP/RTCP, RTSP, SIP) ... They can easily be extended to support additional (audio and/or video) codecs, and can also be used to build basic RTSP or SIP clients and servers."

The site doesn't mention MRCP explicitly, but I came across this interesting exchange on their mailing list (via Google).
SpeechForge:

A set of MRCP-related resources (including a speech server) written in 'the Java programming language'.