
©MPEG-4 Industry Forum

Last modified: Wed Feb 06 11:35:43 EST 2002

Levels for MPEG-4 Visual Profiles

by Fernando Pereira and Paulo Nunes (Instituto Superior Técnico, Lisboa - Portugal)

The MPEG-4 Visual standard defines (as of October 2001) 18 visual object types and 19 visual profiles. Nine visual profiles are defined in MPEG-4 Visual Version 1 [MPEG4-2]: Simple, Simple Scalable, Core, Main, N-bit, Scalable Texture, Simple Face Animation, Basic Animated Texture, and Hybrid.

Six additional visual profiles are defined in MPEG-4 Visual Version 2 [MPEG4-2]: Core Scalable, Advanced Core, Advanced Coding Efficiency, Advanced Real Time Simple, Advanced Scalable Texture, and Simple FBA.

In addition, two profiles are defined in the 1st Extension to the 2nd Edition of the MPEG-4 Visual standard [MPEG01a], Simple Studio and Core Studio, and two more in the 2nd Extension to the 2nd Edition [MPEG01b], Advanced Simple and Fine Granularity Scalability.

The following sections present the mechanism specified to define video levels, the Video Buffering Verifier, as well as the visual levels defined for all visual profiles.



A.1 Video Buffering Verifier Mechanism

The idea of using a Video Buffering Verifier mechanism to bound the decoding complexity of a given set of bitstreams is not new, and was already adopted in previous MPEG video coding standards, MPEG-1 [MPEG1-2] and MPEG-2 [MPEG2-2]. In these standards, the major purpose of the Video Buffering Verifier mechanism was to set some restrictions on the maximum variability of the number of bits per picture, especially in the case of constant bitrate operation, and thus on the complexity of the encoded video streams.

In general, the complexity of the encoded video is directly related to the encoded bitrate and to the decoded video data rate that the decoder generates, measured for example in macroblocks per second (MB/s). For frame-based video coding, e.g. MPEG-1 and MPEG-2, the decoded video data rate is typically constant since the frames have fixed dimensions and are usually encoded at fixed frame rates. This is not generally the case for object-based video coding, as in MPEG-4, since the video objects composing a scene may vary in size over time and may be encoded at different VOP rates. Therefore, the amount and type[1] of MB/s that an object-based video decoder has to process may vary considerably over time in comparison with frame-based coding solutions [Nunes].
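As a rough illustration of the MB/s measure for the frame-based case, the sketch below computes the decoded macroblock rate of a fixed-size frame sequence. It assumes 16x16 macroblocks and the usual QCIF (176x144) and CIF (352x288) frame sizes; the function name is ours and does not come from the standard.

```python
import math


def macroblocks_per_second(width, height, frame_rate, mb_size=16):
    """Decoded MB/s for rectangular frames of fixed size and fixed rate."""
    # Partial macroblocks at the right and bottom edges still count as whole ones.
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    return mbs_per_frame * frame_rate


qcif_rate = macroblocks_per_second(176, 144, 15)  # 11 x 9 = 99 MBs per frame
cif_rate = macroblocks_per_second(352, 288, 30)   # 22 x 18 = 396 MBs per frame
print(qcif_rate, cif_rate)
```

For QCIF at 15 Hz this gives 1485 MB/s and for CIF at 30 Hz it gives 11880 MB/s, which are the VCV decoder rates listed in Table A.1 for Simple L1 and Simple L3. For object-based content no such single figure exists, since the macroblock count per VOP and the VOP rate can both change over time.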

In the MPEG-4 context, to limit the decoding complexity of a set of bitstreams corresponding to a video scene, it is therefore necessary to set limits on the variability of the number of decoded MB/s and their complexity, and also on the picture memory required to store the decoded data. This constitutes the major novelty of the MPEG-4 Video Buffering Verifier mechanism relative to the previous MPEG standards: it bounds not only the bitstream buffer memory but also the MB decoding capacity and the MB picture memory.

The MPEG-4 Video Buffering Verifier mechanism [MPEG4-2; Annex D] consists of three normative models, see Figure A.1, each one defining a set of rules and limits to verify if the amount required for a specific type of decoding resource is within the values allowed by the corresponding profile and level specification, see Table A.1:

  1. Video Rate Buffer Verifier (VBV): This model is used to verify that the bitstream memory required at the decoder(s) does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VBV buffer sizes for all the VOLs corresponding to the objects building the scene. Each VBV buffer size corresponds to the maximum number of bits that the decoder can store in the bitstream memory for the corresponding VOL; there is, however, also a limit on the sum of the VOL VBV buffer sizes. The bitstream memory is the memory where the decoder puts the bits received for a VOL while they wait to be decoded.
  2. Video Complexity Verifier (VCV): This model is used to verify that the computational power (processing speed), defined in terms of MB/s, required at the decoder does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VCV MB/s decoding rate and VCV buffer size and is applied to all MBs in the scene. If arbitrarily shaped VOs exist in the scene, an additional VCV buffer and VCV decoding rate are also defined, to be applied only to the boundary MBs.
  3. Video Reference Memory Verifier (VMV): This model is used to verify that the picture memory required at the decoder for the decoding of a given scene does not exceed the values specified for the corresponding profile and level. The model is defined in terms of the VMV buffer size, which is the maximum number of decoded MBs that the decoder can store during the decoding process of all VOLs corresponding to the scene.


Figure A.1 Video buffering verifier model [MPEG4-2]

The Video Presentation Model (VPM) is not a normative part of the MPEG-4 Visual specification [MPEG4-2]. It is an algorithm for checking that the set of bitstreams corresponding to a scene does not require more presentation memory than a given amount expressed in units of MB. It is also used to constrain the speed of the compositor in terms of the maximum number of MB/s. The Video Presentation Verifier (VPV) operates in the same way as the VCV in terms of occupancy dynamics [MPEG4-2].

For the set of visual elementary streams corresponding to a given scene to be considered compliant with a given profile and level, the encoder must guarantee that none of the above-mentioned buffers overflows and, additionally, that in certain circumstances the VBV buffer never underflows.

A.1.1 Video Rate Buffer Verifier Definition

The MPEG-4 VBV model defines a set of rules and limits for examining a video elementary bitstream with a delivery rate function, R(t). The model simulates the occupancy of the decoder bitstream buffer in order to control the amount of bitstream memory required at the decoder. Its purpose is to verify that the decoder bitstream buffer occupancy never exceeds the buffer size specified for the relevant profile@level. For visual scenes composed of multiple VOs, each with one or more VOLs, the MPEG-4 Visual standard specifies that the video rate buffer model shall be applied independently to each VOL (using a particular buffer size and rate function for each VOL). Additionally, the maximum total bitstream buffer size (defined as the sum of all VOL bitstream buffer sizes) for the given profile and level shall not be exceeded, see Table A.1. Note that the allocation of bitrate and buffer size among the VOs and, for each VO, among its VOLs is a non-normative issue, although it can significantly affect the performance of object-based video encoders and thus deserves careful attention.
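The two size limits described above (one per VOL, one on the sum over all VOLs) can be checked with a few lines of code. The following is a minimal sketch, not part of the standard: the function name, the dictionary layout, and the example scene are hypothetical, and only the two limit values are taken from Table A.1 (Core L1).

```python
def check_vbv_allocation(vol_buffer_sizes, max_vol_size, max_total_size):
    """Return a list of violation messages; the list is empty when the allocation fits.

    All sizes are expressed in units of 16384 bits, as in Table A.1.
    """
    problems = []
    for vol_id, size in vol_buffer_sizes.items():
        if size > max_vol_size:
            problems.append(f"{vol_id}: VOL buffer {size} > {max_vol_size}")
    total = sum(vol_buffer_sizes.values())
    if total > max_total_size:
        problems.append(f"total buffer {total} > {max_total_size}")
    return problems


# Hypothetical scene with three VOLs, checked against the Core L1 limits
# (max. VOL VBV buffer size 16, max. total VBV buffer size 16).
allocation = {"vol_a": 8, "vol_b": 6, "vol_c": 4}
print(check_vbv_allocation(allocation, max_vol_size=16, max_total_size=16))
```

In this example every VOL is individually within its limit, but the sum (18 units) exceeds the total limit, so an encoder would have to shrink at least one of the VOL buffers.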

The VBV applies to video data encoded as a combination of I-, P-, B-, and S-VOPs, using several coding tools organized in terms of video object types. Face animation, still texture, and mesh objects are not constrained by the VBV model. The coded video bitstreams shall be constrained to comply with the requirements of the VBV specified in the following sections.

A.1.1.1 VBV Model Parameters

The VBV model for a given elementary stream (ES) is defined by the following three parameters: vbv_buffer_size, vbv_occupancy, and bit_rate. These parameters have to be defined for all the ESs corresponding to the various objects in a scene. They can be specified at video level, that is, through the video ES, or by means of systems level configuration information [MPEG4-1]. In the first case, the VBV model parameters are specified in the VOL header when the one-bit flag vbv_parameters is set to ‘1’. In the second case, the VBV model parameters are conveyed to the video decoder through the Object Description Information, more precisely through the DecoderConfigDescriptor field of the ES_Descriptor associated with the ES in question.

When the vbv_buffer_size and vbv_occupancy parameters are specified by systems level configuration information, the bitstream shall be constrained according to the specified values, and these values shall not be part of the video ES. It may happen, however, that these parameters are not explicitly specified; in this case, it is assumed that the ES is constrained according to the default values of the corresponding profile and level combination[2].

A.1.1.2 VBV Occupancy Dynamics

The VBV occupancy dynamics specifies when the bitstream bits enter the VBV buffer and when they are removed from it to be decoded, i.e. the process by which the VBV buffer is filled and drained. This process is mainly driven by the time instants at which the VOP bits are removed from the VBV.
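The fill-and-drain process can be illustrated with a small simulation. This is a simplified sketch under assumptions that go beyond the text above: bits are delivered at a constant rate, all bits of a VOP are removed instantaneously at its decoding time, and the buffer starts at a given initial occupancy. The function and parameter names are ours; the normative rules are those of [MPEG4-2; Annex D].

```python
def simulate_vbv(vops, bit_rate, buffer_size_bits, initial_occupancy_bits=0):
    """Simulate one VOL's VBV buffer.

    vops: list of (decoding_time_s, coded_size_bits), sorted by decoding time.
    Returns a list of (time, occupancy_before_removal, occupancy_after_removal).
    Raises ValueError on overflow or underflow.
    """
    occupancy = initial_occupancy_bits
    clock = 0.0
    trace = []
    for decode_time, size in vops:
        # Bits delivered at the constant rate since the previous event.
        occupancy += bit_rate * (decode_time - clock)
        clock = decode_time
        # With a constant fill rate the peak is reached just before a removal,
        # so checking at the removal instants is sufficient.
        if occupancy > buffer_size_bits:
            raise ValueError(f"VBV overflow at t={decode_time}")
        if size > occupancy:
            raise ValueError(f"VBV underflow at t={decode_time}")
        trace.append((decode_time, occupancy, occupancy - size))
        occupancy -= size
    return trace


# Hypothetical stream: 64 kbit/s into a buffer of 10 x 16384 bits.
vops = [(1.0, 40000), (1.5, 30000), (2.0, 30000)]
for event in simulate_vbv(vops, bit_rate=64000, buffer_size_bits=10 * 16384):
    print(event)
```

An encoder's rate control would typically run a model of this kind while encoding and adjust the number of bits spent on the next VOP whenever the simulated occupancy approaches either bound.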

A.1.1.3 VBV Model Constraints

This section applies to all the cases considered in the VBV model except for basic sprites, which have a special treatment. The first I-VOP of a sprite VO is divided into N sections of 396 MBs and each section is treated as a different VOP. The remaining S-VOPs are treated as any other VOP.

A.1.2 Video Complexity Verifier Definition

The MPEG-4 VCV model defines a set of rules and limits for examining a set of ESs building a visual scene to check whether the required amount of decoder processing power is less than the maximum complexity specified for the given profile and level, both measured in MBs per second, see Table A.1. This model is applied to all MBs of all ESs of the scene together.

The VCV applies to video objects encoded as a combination of I-, P-, B- and S-VOPs[6]. A separate VCV model applies to still texture objects [MPEG4-2]. Face animation and mesh objects are not constrained by this model.

The coded video bitstreams for a certain scene shall be constrained to globally comply with the requirements of the VCV defined in the following sections.

A.1.2.1 VCV Model Parameters

The VCV model consists of two virtual buffers accumulating the number of MBs in the encoded data:

  1. The VCV Buffer accumulates all MBs of all VOLs for the scene.
  2. The Boundary MB VCV Buffer (B-VCV)[7] accumulates only boundary MBs.

Notice that boundary MBs (i.e. MBs including shape information which is not totally transparent or totally opaque) are included in both the VCV and the B-VCV buffers.

The VCV model is defined by the size of the buffers mentioned above, the corresponding draining rates (i.e. the VCV and B-VCV decoding rates), and the latency of the VCV model (which depends on the VCV buffer size and VCV decoding rate).

A.1.2.2 VCV Occupancy Dynamics

The VCV dynamics simulates the VOP decoding process. At the VOP decoding times, the VOP encoded data is added to the VCV buffers and is removed from these buffers as the decoding process progresses. The time instant at which a given VOP is completely decoded depends on the amount and type of MBs to be decoded, the occupancy of the VCV buffers at the VOP decoding time, and the maximum decoding speed specified through the VCV decoding rates for the profile@level in question.
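A simplified simulation of these dynamics is sketched below. It is an illustration under stated assumptions rather than the normative model: only the main VCV buffer is modelled (no B-VCV buffer), all MBs of a VOP enter the buffer at its decoding time, the buffer drains at the constant VCV decoding rate, and VOPs are decoded in first-in, first-out order. Exact fractions are used for the times so that the overflow test is not disturbed by rounding. All names are ours.

```python
from fractions import Fraction


def simulate_vcv(vops, decoder_rate, buffer_size):
    """Simulate the main VCV buffer.

    vops: list of (decoding_time_s, macroblock_count), sorted by decoding time.
    Returns a list of (decoding_time, completion_time) pairs.
    Raises ValueError if the occupancy would exceed buffer_size.
    """
    rate = Fraction(decoder_rate)
    occupancy = Fraction(0)
    clock = Fraction(0)
    result = []
    for decode_time, mbs in vops:
        t = Fraction(decode_time)
        # Drain at the decoding rate since the previous event, never below empty.
        occupancy = max(Fraction(0), occupancy - rate * (t - clock))
        clock = t
        occupancy += mbs
        if occupancy > buffer_size:
            raise ValueError(f"VCV overflow at t={t}")
        # FIFO draining: this VOP is complete once everything queued so far has drained.
        result.append((t, t + occupancy / rate))
    return result


# QCIF VOPs (99 MBs each) at 15 Hz against the Simple L1 values of Table A.1.
frame_period = Fraction(1, 15)
vops = [(i * frame_period, 99) for i in range(3)]
for start, done in simulate_vcv(vops, decoder_rate=1485, buffer_size=99):
    print(start, done)
```

With these numbers the buffer is filled completely by each VOP and drains in 99/1485 = 1/15 s, so each VOP finishes exactly when the next one arrives. A single extra MB per VOP, or a higher VOP rate, would make the sketch report an overflow.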

A.1.2.3 VCV Model Constraints

Compliance with the VCV model can only be guaranteed if the set of ESs building a scene fulfills the constraints imposed by the VCV model with respect to the occupancy of the VCV buffers and the VOP decoding duration, defined as follows:

A.1.3 Video Reference Memory Verifier Definition

The MPEG-4 VMV model defines a set of rules and limits for examining the set of ESs building a visual scene to check whether the required amount of decoder picture memory, measured in MB units, is less than the maximum memory specified for the chosen profile and level, see Table A.1. The VMV models the memory requirements of all VOLs of all VOs in the scene (this model assumes a common memory space, shared by all VOLs of all VOs).

The VMV applies to video objects encoded as a combination of I-, P-, B-, S-VOPs, and still texture objects. Face animation, mesh objects, and I-VOPs in basic sprite sequences are not constrained by this model.

The coded video bitstreams shall be constrained to comply with the requirements of the VMV defined in the following sections.

A.1.3.1 VMV Model Parameters

The VMV model consists of an MB buffer that accumulates all the decoded MBs of all VOPs and stores them until they are no longer needed for the prediction of other VOPs. The VMV model is defined by the size of this buffer, the vmv_buffer_size, which is the maximum number of decoded MBs that the decoder can store at any time instant, see Table A.1.

A.1.3.2 VMV Occupancy Dynamics

The VMV dynamics simulates the decoded VOP memory allocation and de-allocation process. As each VOP is being processed, the decoder needs to allocate memory to store the decoded data. This data remains in the decoder memory until it is no longer needed, e.g. for prediction. At this point in time, the memory allocated to store this data is instantaneously released and can be used again.

A.1.3.3 VMV Model Constraints

A given set of visual ESs building a scene conforms with a given profile@level, with respect to the VMV model, if it never overflows the VMV model buffer.
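The overflow test can be sketched as a running sum over allocation and release events. The code below is a simplification with assumptions of our own: the memory of a VOP is allocated in a single step rather than progressively during decoding, and when a release and an allocation fall on the same instant the release is applied first. The function names and the event format are hypothetical.

```python
def peak_vmv_occupancy(events):
    """events: list of (time_s, delta_mbs); positive = allocate, negative = release."""
    occupancy = 0
    peak = 0
    # Sort by time; at equal times negative deltas (releases) come first.
    for _, delta in sorted(events, key=lambda e: (e[0], e[1])):
        occupancy += delta
        if occupancy < 0:
            raise ValueError("more memory released than allocated")
        peak = max(peak, occupancy)
    return peak


def conforms_to_vmv(events, vmv_buffer_size):
    """True when the simulated picture memory never exceeds vmv_buffer_size."""
    return peak_vmv_occupancy(events) <= vmv_buffer_size


# Hypothetical QCIF sequence (99 MBs per VOP): each reference VOP is kept
# until the one after the next has been allocated.
events = [(0.0, +99), (0.1, +99), (0.2, -99), (0.2, +99)]
print(peak_vmv_occupancy(events), conforms_to_vmv(events, vmv_buffer_size=198))
```

The peak of 198 MBs in this example equals the VMV buffer size given for Simple L1 in Table A.1, so the sequence just fits; keeping one more QCIF VOP in memory would not.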

A.1.4 Interaction between the VBV, VCV, and VMV Models

A given set of ESs building a visual scene is considered compliant with a given profile and level if it fulfills all the constraints defined by the Video Buffering Verifier models. Bitstream compliance with a given profile@level guarantees that the resources required at the decoder do not exceed the pre-defined amount corresponding to that profile@level. Moreover, it defines strict timing for the completion of decoding and composition of VOPs, as explained in the following:

  1. The VBV model defines the time at which the coded bits for each VOP are available for decoding and the time at which they should be removed from the VBV buffer - the coded bits for each VOP should be removed from the VBV buffer at the VOP decoding times, ti, computed from the composition time information in the video ES or conveyed by systems decoding time stamps.
  2. The VCV model defines the decoding speed of the MB data, and, thus, the time at which each VOP is available for composition - a given VOP should be available for composition, at most, at the VOP composition time plus the VCV latency, i.e. at the time it is supposed to be available to the compositor.
  3. The VMV model defines the amount of picture memory allocated at each time instant and the time it should be released - a given VOP should be removed from the VMV buffer at its composition time plus the VCV latency (B-VOP) or at the composition time plus the VCV latency of the next P or I VOP (I or P-VOPs).

The various models are independent but interact with each other in the following way:

In order to avoid these situations, the Video Buffering Verifier mechanism imposes strict times for starting and ending any VOP decoding - constraint imposed by the VCV model.

The Video Buffering Verifier models provide the mechanism that allows any encoder to produce bitstreams that will be decodable by any decoder compliant with the selected profile@level. This mechanism makes it possible to limit the amount of decoding resources needed at the receiving terminals and, at the same time, to ensure the timely reconstruction of the encoded information.

It is important to highlight that simulating each of the Video Buffering Verifier models is a major task of the encoder if it is to produce bitstreams compliant with the intended profile and level. If any of these models is about to be violated, the encoder has to take appropriate countermeasures to avoid it. Although the Video Buffering Verifier is defined for decoders, it is in fact a major module of any encoder generating compliant sets of bitstreams.

A.2 Definition of Levels for Video Profiles

Table A.1 describes the MPEG-4 Visual levels for the Version 1 and Version 2 profiles that include only natural visual (or video) data, that is, the so-called MPEG-4 video profiles. Note that Level 0 for the Simple profile has been defined in the 2nd Extension to the 2nd Edition of the MPEG-4 Visual standard [MPEG01b].

Table A.1 Levels for the MPEG-4 video profiles

| Visual profile | Level | Typical visual session size | Max. number of objects (note 1) | Max. number of objects per type | Max. unique quant. tables | Max. VMV buffer size (MB units) (note 2) | Max. VCV buffer size (MB) (note 8) | VCV decoder rate (MB/s) (note 4) | VCV boundary MB decoder rate (MB/s) (note 9) | Max. total VBV buffer size (units of 16384 bits) (note 5) | Max. VOL VBV buffer size (units of 16384 bits) | Max. video packet length (bits) (note 6) | Max. sprite size (MB units) | Wavelet restrictions | Max. bitrate (kbit/s) | Max. enhancement layers per object |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Simple (note 10) | L0 | QCIF | 1 | 1 x Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 2048 | N.A. | N.A. | 64 | N.A. |
| Simple | L1 | QCIF | 4 | 4 x Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 2048 | N.A. | N.A. | 64 | N.A. |
| Simple | L2 | CIF | 4 | 4 x Simple | 1 | 792 | 396 | 5940 | N.A. | 40 | 40 | 4096 | N.A. | N.A. | 128 | N.A. |
| Simple | L3 | CIF | 4 | 4 x Simple | 1 | 792 | 396 | 11880 | N.A. | 40 | 40 | 8192 | N.A. | N.A. | 384 | N.A. |
| Advanced Real Time Simple | L1 | QCIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 8192 | N.A. | N.A. | 64 | N.A. |
| Advanced Real Time Simple | L2 | CIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 5940 | N.A. | 40 | 40 | 16384 | N.A. | N.A. | 128 | N.A. |
| Advanced Real Time Simple | L3 | CIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 11880 | N.A. | 40 | 40 | 16384 | N.A. | N.A. | 384 | N.A. |
| Advanced Real Time Simple | L4 | CIF | 16 | 16 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 11880 | N.A. | 80 | 80 | 16384 | N.A. | N.A. | 2000 | N.A. |
| Simple Scalable | L1 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 1782 | 495 | 7425 | N.A. | 40 | 40 | 2048 | N.A. | N.A. | 128 | 1 spatial or temporal enhancement layer |
| Simple Scalable (note 3) | L2 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 3168 | 792 | 23760 | N.A. | 40 | 40 | 4096 | N.A. | N.A. | 256 | 1 spatial or temporal enhancement layer |
| Core | L1 | QCIF | 4 | 4 x Core or Simple | 4 | 594 | 198 | 5940 | 2970 | 16 | 16 | 4096 | N.A. | N.A. | 384 | 1 |
| Core | L2 | CIF | 16 | 16 x Core or Simple | 4 | 2376 | 792 | 23760 | 11880 | 80 | 80 | 8192 | N.A. | N.A. | 2000 | 1 |
| Advanced Core | L1 | QCIF | 4 | 4 x Core or Simple or Adv. Scalable Texture | 4 | 594 | 198 | 5940 | 2970 | 16 | 8 | 4096 | N.A. | see Table A.5 | 384 | 1 |

Advanced Core

L2

CIF