©MPEG-4 Industry Forum. Last modified: Wed Feb 06 11:35:43 EST 2002
Levels for MPEG-4 Visual Profiles
by Fernando Pereira and Paulo Nunes (Instituto Superior Técnico, Lisboa - Portugal)
The MPEG-4 Visual standard defines (as of October 2001) 18 visual object types and 19 visual profiles. Nine visual profiles have been defined in MPEG-4 Visual Version 1 [MPEG4-2]: Simple, Simple Scalable, Core, Main, N-bit, Scalable Texture, Simple Face Animation, Basic Animated Texture, and Hybrid. Six additional visual profiles have been defined in MPEG-4 Visual Version 2 [MPEG4-2]: Core Scalable, Advanced Core, Advanced Coding Efficiency, Advanced Real Time Simple, Advanced Scalable Texture, and Simple FBA. Moreover, two additional profiles have been defined in the 1st Extension to the 2nd Edition of the MPEG-4 Visual standard [MPEG01a]: Simple Studio and Core Studio. Two further profiles have been defined in the 2nd Extension to the 2nd Edition of the MPEG-4 Visual standard [MPEG01b]: Advanced Simple and Fine Granularity Scalability. In the following, the mechanism specified to define video levels (the Video Buffering Verifier) is presented, together with the visual levels defined for all visual profiles.
The idea of using a Video Buffering Verifier mechanism to bound the decoding complexity of a given set of bitstreams is not new, and was already adopted in previous MPEG video coding standards, MPEG-1 [MPEG1-2] and MPEG-2 [MPEG2-2]. In these standards, the major purpose of the Video Buffering Verifier mechanism was to set some restrictions on the maximum variability of the number of bits per picture, especially in the case of constant bitrate operation, and thus on the complexity of the encoded video streams.
Generically, the complexity of the encoded video is directly related to the encoded bitrate and to the decoded video data rate that the decoder generates, e.g. measured in terms of the number of MB/s. For frame-based video coding, e.g. MPEG-1 and MPEG-2, the decoded video data rate is typically constant since the frames have fixed dimensions and are usually encoded at fixed frame rates. This is not the general case for object-based video coding, as in MPEG-4, since the several video objects composing a scene may vary in size along time and may be encoded at different VOP rates. Therefore, the amount and type[1] of MB/s that a given object-based video decoder has to process may largely vary over time in comparison with frame-based coding solutions [Nunes].
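As a rough illustration of the MB/s measure for the frame-based case, the sketch below computes the decoded data rate of a fixed-size, fixed-rate sequence. The function names are hypothetical; the sketch assumes 16x16 macroblocks and the usual QCIF (176x144) and CIF (352x288) frame dimensions.

```python
import math

MB_SIZE = 16  # assumption: a macroblock covers 16x16 luminance samples


def macroblocks_per_frame(width: int, height: int) -> int:
    """Number of MBs needed to cover a width x height frame."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def macroblock_rate(width: int, height: int, frame_rate: int) -> int:
    """Decoded data rate in MB/s for a fixed-size, fixed-rate sequence."""
    return macroblocks_per_frame(width, height) * frame_rate


if __name__ == "__main__":
    print(macroblocks_per_frame(176, 144))  # QCIF frame
    print(macroblock_rate(176, 144, 15))    # QCIF at 15 frames/s
    print(macroblock_rate(352, 288, 30))    # CIF at 30 frames/s
```

For an object-based scene the same quantity has to be summed over all video objects, whose sizes and VOP rates may change over time, which is why a constant bound of this kind is no longer sufficient.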
In the MPEG-4 context, limiting the decoding complexity of a set of bitstreams corresponding to a video scene therefore requires limits on the variability of the number of decoded MB/s, on their complexity, and on the picture memory required to store the decoded data. This is the major novelty of the MPEG-4 Video Buffering Verifier mechanism relative to the previous MPEG standards: it bounds not only the bitstream buffer memory but also the MB decoding capacity and the MB picture memory.
The MPEG-4 Video Buffering Verifier mechanism [MPEG4-2; Annex D] consists of three normative models, see Figure A.1. Each one defines a set of rules and limits to verify that the amount required of a specific type of decoding resource is within the values allowed by the corresponding profile and level specification, see Table A.1: the VBV, which bounds the bitstream buffer memory; the VCV, which bounds the MB decoding capacity; and the VMV, which bounds the MB picture memory.

Figure A.1 Video buffering verifier model [MPEG4-2]
The Video Presentation Model (VPM) is not a normative part of the MPEG-4 Visual specification [MPEG4-2]. It is an algorithm for checking that the set of bitstreams corresponding to a scene does not require more presentation memory than a given amount expressed in units of MB. It is also used to constrain the speed of the compositor in terms of the maximum number of MB/s. The Video Presentation Verifier (VPV) operates in the same way as the VCV in terms of occupancy dynamics [MPEG4-2].
In order that the set of visual elementary streams corresponding to a given scene may be considered compliant with a given profile and level, the encoder must guarantee that none of the above mentioned buffers overflows and, additionally, it must also guarantee that, in certain circumstances, the VBV buffer never underflows.
The MPEG-4 VBV model defines a set of rules and limits for examining a video elementary bitstream with a delivery rate function, R(t). This model simulates the occupancy of the decoder bitstream buffer in order to control the amount of bitstream memory required at the decoder. Its purpose is to guarantee that the bitstream memory required is less than the specified buffer size, i.e. to verify that the decoder bitstream buffer occupancy never goes beyond the limits of the specified buffer size for the relevant profile@level. In the case of visual scenes composed of multiple VOs, each with one or more VOLs, the MPEG-4 Visual standard specifies that the video rate buffer model shall be applied independently to each VOL (using a particular buffer size and rate function for each VOL). Additionally, the maximum total bitstream buffer size (defined as the sum of all VOL bitstream buffer sizes) for the given profile and level shall not be exceeded, see Table A.1. Notice that the allocation of bitrate and buffer size among the several VOs and, for each VO, among the several VOLs, is a non-normative issue; since it can significantly affect the performance of object-based video encoders, it deserves careful attention.
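The two buffer-size constraints described above (a per-VOL limit and a limit on the sum over all VOLs) can be sketched as a simple check. This is only an illustration: the function name and the VOL identifiers are hypothetical, and the sizes are assumed to be expressed in the same units as the corresponding columns of Table A.1.

```python
def check_vbv_allocation(vol_buffer_sizes, max_vol_size, max_total_size):
    """Return a list of violation messages; an empty list means the allocation fits.

    vol_buffer_sizes maps a (hypothetical) VOL identifier to its VBV buffer size.
    """
    problems = []
    for vol_id, size in vol_buffer_sizes.items():
        if size > max_vol_size:
            problems.append(
                f"{vol_id}: buffer size {size} exceeds per-VOL limit {max_vol_size}"
            )
    total = sum(vol_buffer_sizes.values())
    if total > max_total_size:
        problems.append(f"total buffer size {total} exceeds limit {max_total_size}")
    return problems


if __name__ == "__main__":
    # Limits taken from the Core profile, Level 2 row of Table A.1 (80 and 80).
    allocation = {"vo1_vol0": 40, "vo2_vol0": 30}
    print(check_vbv_allocation(allocation, max_vol_size=80, max_total_size=80))
```

How the total is split among the VOLs is left to the encoder; the check only verifies that a chosen split stays within the level limits.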
The VBV applies to video data encoded as a combination of I-, P-, B-, and S-VOPs, using several coding tools organized in terms of video object types. Face animation, still texture, and mesh objects are not constrained by the VBV model. The coded video bitstreams shall be constrained to comply with the requirements of the VBV specified in the following sections.
The VBV model for a given elementary stream (ES) is defined by the following three parameters: vbv_buffer_size, vbv_occupancy, and bit_rate. These parameters have to be defined for all the ESs corresponding to the various objects in a scene. They can be specified at video level, i.e. through the video ES, or by means of systems level configuration information [MPEG4-1]. In the first case, the VBV model parameters are specified in the VOL header, when the one-bit flag vbv_parameters is set to ‘1’. In the second case, the VBV model parameters are conveyed to the video decoder through the Object Description Information, more precisely through the DecoderConfigDescriptor field of the ES_Descriptor associated with the ES in question.
When the vbv_buffer_size and vbv_occupancy parameters are specified by systems level configuration information, the bitstream shall be constrained according to the specified values, and these values shall not be part of the video ES. It may happen, however, that these parameters are not explicitly specified; in this case, it is assumed that the ES is constrained according to the default values of the corresponding profile and level combination[2].
The VBV occupancy dynamics specifies when the bitstream bits enter the VBV buffer and when they are removed from it to be decoded, i.e. the process by which the VBV buffer is filled and drained. This process is mainly driven by the time instants at which the VOP bits are removed from the VBV.

This section applies to all the cases considered in the VBV model except for basic sprites, which have a special treatment. The first I-VOP of a sprite VO is divided into N sections of 396 MBs and each section is treated as a different VOP. The remaining S-VOPs are treated as any other VOP.
(A.1)
The MPEG-4 VCV model defines a set of rules and limits for examining a set of ESs building a visual scene to control if the required amount of decoder processing power is less than the maximum complexity specified for the given profile and level, both measured in MBs per second, see Table A.1. This model is applied to all MBs of all ESs of the scene together.
The VCV applies to video objects encoded as a combination of I-, P-, B- and S-VOPs[6]. A separate VCV model applies to still texture objects [MPEG4-2]. Face animation and mesh objects are not constrained by this model.
The coded video bitstreams for a certain scene shall be constrained to globally comply with the requirements of the VCV defined in the following sections.
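The VCV model consists of two virtual buffers accumulating the number of MBs in the encoded data: the VCV buffer, which accumulates all the MBs, and the B-VCV buffer, which accumulates only the boundary MBs.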
Notice that boundary MBs (i.e. MBs including shape information which is not totally transparent or totally opaque) are included in both the VCV and the B-VCV buffers.
The VCV model is defined by the size of the buffers mentioned above, the corresponding draining rates (i.e. the VCV and B-VCV decoding rates), and the latency of the VCV model (which depends on the VCV buffer size and VCV decoding rate).
The size of each VCV buffer, respectively vcv_buffer_size and boundary_vcv_buffer_size, defines the maximum number of MBs that a given decoder can instantaneously have in the decoding queue to process, i.e. the maximum occupancy of the VCV buffers in MB units. In the current MPEG-4 Visual specification [MPEG4-2], the two buffers always have the same maximum dimension for all profile@level combinations.
These MBs are consumed by the decoder, from each buffer, at a given VCV decoding rate, in MB/s, as specified for each profile@level. The VCV decoding rate, H, specifies the draining rate of the VCV buffer, while the B-VCV decoding rate, HB, specifies the draining rate of the B-VCV buffer. Together they define the maximum speed of the decoding process. As can be seen in Table A.1, the B-VCV decoding rate, HB, is typically half the VCV decoding rate, H.
For each profile@level combination, MPEG-4 Visual defines the maximum VCV buffer size (the same for the VCV and B-VCV buffers) and the draining rates for the VCV and B-VCV buffers.
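The latency of the VCV model, L, is the time needed to decode a full VCV buffer at the VCV decoding rate:
$$L = \frac{\text{vcv\_buffer\_size}}{H} \qquad \text{(A.2)}$$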
This parameter imposes a minimum latency in the decoding process, as explained in section A.1.1. Notice that, by definition, the latency of the VCV model is imposed by the VCV buffer not by the B-VCV buffer. Since the B-VCV decoding rate, HB, is typically half the VCV decoding rate, H, this means that it is not possible to decode a full B-VCV during a time interval of L since the two buffers have the same size. This implies that at full decoding rate, the amount of boundary MBs in the scene cannot exceed 50 % of the total number of MBs.
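The 50 % bound can be checked numerically. The sketch below assumes that the latency L is the time needed to drain a full VCV buffer at rate H (i.e. the buffer size divided by H), and uses the Core profile, Level 1 values from Table A.1; the function names are hypothetical.

```python
from fractions import Fraction


def vcv_latency(vcv_buffer_size: int, vcv_rate: int) -> Fraction:
    """Assumed definition of L: seconds needed to drain a full VCV buffer at rate H."""
    return Fraction(vcv_buffer_size, vcv_rate)


def boundary_mbs_decodable(boundary_rate: int, latency: Fraction) -> Fraction:
    """Number of boundary MBs the B-VCV buffer can drain during one latency interval."""
    return boundary_rate * latency


if __name__ == "__main__":
    # Core profile, Level 1: buffer size 198 MB, H = 5940 MB/s, HB = 2970 MB/s.
    size, h, hb = 198, 5940, 2970
    latency = vcv_latency(size, h)
    print(latency)                                      # latency in seconds
    print(boundary_mbs_decodable(hb, latency) / size)   # fraction of a full buffer
```

With HB equal to half of H, only half of a full buffer of boundary MBs can be drained within L, which is the origin of the 50 % limit stated above.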
The VCV dynamics simulates the VOP decoding process. At the VOP decoding times, the VOP encoded data is added to the VCV buffers and is removed from these buffers as the decoding process progresses. The time instant at which a given VOP is completely decoded depends on the amount and type of MBs to be decoded, the occupancy of the VCV buffers at the VOP decoding time, and the maximum decoding speed specified through the VCV decoding rates for the profile@level in question.
The VCV buffer is empty at the start of decoding and is filled instantaneously with encoded data at VOP decoding times as the decoding process advances. At the VOP decoding time, ti, Mi is added to the VCV buffer occupancy, vcv(t), and simultaneously MBi is added to the B-VCV buffer occupancy, b-vcv(t).
If the occupancy of the VCV buffers becomes zero, the VCV model decoder becomes idle and remains idle until tnext, as exemplified in Figure A.3.
Figure A.3 Dynamics of the VCV occupancy [Nunes]
(A.3)
where vcv(ti) is the VCV occupancy before the MBs representing VOP i, Mi, are added to vcv(t), H is the VCV decoding rate, b-vcv(ti) is the B-VCV occupancy before the boundary MBs of VOP i, MBi, are added to b-vcv(t), and HB is the B-VCV decoding rate.
Compliance regarding the VCV model can only be guaranteed if the set of ESs building a scene fulfills the constraints imposed by the VCV model relatively to the occupancy of the VCV buffers and the VOP decoding duration defined as follows:
When the VCV buffers become empty, the decoder simply remains idle and the VCV buffer occupancies, vcv(t) and b-vcv(t), remain unchanged during the idle period; this is illustrated in Figure A.3, which shows the occupancy of a VCV buffer, vcv(t), as a function of time.
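The occupancy dynamics described in this section can be approximated with a short simulation. This is a deliberately simplified sketch, not the normative model: it assumes that both buffers drain linearly and independently at rates H and HB between VOP decoding times, that an empty buffer simply stays at zero (the idle period), and that compliance only requires each occupancy to stay within the buffer size. The function name and the input tuple layout are hypothetical.

```python
def simulate_vcv(vops, buffer_size, rate_h, rate_hb):
    """Simulate simplified VCV and B-VCV occupancies.

    vops: list of (t_i, M_i, MB_i) tuples sorted by decoding time t_i (seconds),
          where M_i is the number of MBs and MB_i the number of boundary MBs.
    Returns a list of (t_i, vcv, b_vcv, within_limits) after each VOP is added.
    """
    vcv = b_vcv = 0.0
    last_t = None
    trace = []
    for t, m, mb in vops:
        if last_t is not None:
            dt = t - last_t
            # Drain since the previous VOP; occupancy cannot go below zero (idle).
            vcv = max(0.0, vcv - rate_h * dt)
            b_vcv = max(0.0, b_vcv - rate_hb * dt)
        # MBs are added instantaneously at the VOP decoding time.
        vcv += m
        b_vcv += mb
        trace.append((t, vcv, b_vcv, vcv <= buffer_size and b_vcv <= buffer_size))
        last_t = t
    return trace


if __name__ == "__main__":
    # Core profile, Level 1 limits from Table A.1: 198 MB, 5940 MB/s, 2970 MB/s.
    vops = [(0.0, 150, 0), (0.01, 150, 0)]  # second VOP arrives too early
    for step in simulate_vcv(vops, 198, 5940, 2970):
        print(step)
```

An encoder could run a check of this kind before emitting each VOP and, if the occupancy would exceed the buffer size, delay the VOP or reduce the number of MBs it contains.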
The MPEG-4 VMV model defines a set of rules and limits for examining the set of ESs building a visual scene to control if the required amount of decoder picture memory, measured in MB units, is less than the maximum memory specified for the chosen profile and level, see Table A.1. The VMV models the memory requirements of all VOLs of all VOs in the scene (this model assumes a common memory space, shared by all VOLs of all VOs).
The VMV applies to video objects encoded as a combination of I-, P-, B-, S-VOPs, and still texture objects. Face animation, mesh objects, and I-VOPs in basic sprite sequences are not constrained by this model.
The coded video bitstreams shall be constrained to comply with the requirements of the VMV defined in the following sections.
The VMV model consists of a MB buffer that accumulates all the decoded MBs of all VOPs and stores them until they are no longer needed for the prediction of other VOPs. The VMV model is defined by the size of this buffer, the vmv_buffer_size, defining the maximum amount of decoded MBs that the decoder can store at any time instant, see Table A.1.
The VMV dynamics simulates the decoded VOP memory allocation and de-allocation process. As each VOP is being processed, the decoder needs to allocate memory to store the decoded data. This data remains in the decoder memory until it is no longer needed, e.g. for prediction. At this point in time, the memory allocated to store this data is instantaneously released and can be used again.
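The allocation and release process can be sketched as a sweep over allocation intervals. This is an illustrative simplification: each VOP is assumed to occupy a fixed number of MBs from a known allocation time until a known release time, and memory released at a given instant is assumed to be reusable by an allocation at that same instant. The function name and input layout are hypothetical.

```python
def peak_vmv_occupancy(allocations):
    """Return the peak number of MBs held in the shared picture memory.

    allocations: list of (alloc_time, release_time, n_mbs) tuples, one per VOP.
    """
    events = []
    for start, end, mbs in allocations:
        events.append((start, 1, mbs))   # allocation
        events.append((end, 0, -mbs))    # release
    # At equal times, releases (flag 0) sort before allocations (flag 1).
    events.sort()
    occupancy = peak = 0
    for _, _, delta in events:
        occupancy += delta
        peak = max(peak, occupancy)
    return peak


if __name__ == "__main__":
    # Three QCIF-sized VOPs (99 MBs each), each kept for two time units.
    vops = [(0, 2, 99), (1, 3, 99), (2, 4, 99)]
    vmv_buffer_size = 198  # Simple profile, Level 1 value from Table A.1
    print(peak_vmv_occupancy(vops) <= vmv_buffer_size)
```

If a VOP had to be kept longer, for example the first one until time 3, three VOPs would overlap and the peak would exceed the 198 MB limit.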
For S-VOPs, the amount of picture memory required for the decoding of the VOP is defined as the number of MBs in the reconstructed VOP. The memory used for storing the sprite is not constrained by the VMV model.
The decoding duration of VOP i, Ti, is identical in the VCV and VMV models and starts at si and ends at ei , as defined in section A.1.2.
Figure A.4 Dynamics of the VMV occupancy [Nunes]
A given set of ESs building a visual scene is considered compliant with a given profile and level if it fulfills all the constraints defined by the several Video Buffering Verifier models. Bitstream compliance with a given profile@level guarantees that the resources required at the decoder do not exceed a certain pre-defined amount corresponding to the relevant profile@level. Moreover it defines strict timing for completion of decoding and composition of VOPs as explained in the following:
The various models are independent but interact with each other in the following way:
In order to avoid these situations, the Video Buffering Verifier mechanism imposes strict times for starting and ending any VOP decoding - constraint imposed by the VCV model.
The Video Buffering Verifier models provide the mechanism that allows any encoder to produce bitstreams decodable by any decoder compliant with the selected profile@level. This mechanism makes it possible both to limit the amount of decoding resources needed at the receiving terminals and to ensure the timely reconstruction of the encoded information.
It is important to highlight that it is a major task of the encoder to simulate each of the Video Buffering Verifier models in order to produce bitstreams compliant with the intended profile and level. If any of these models tends to be violated, the encoder has to take appropriate countermeasures to avoid it. Although the Video Buffering Verifier is defined for the decoders, it is in fact a major module of any encoder generating compliant sets of bitstreams.
Table A.1 describes the MPEG-4 Visual levels for the Version 1 and Version 2 profiles that include only natural visual (or video) data, i.e. the so-called MPEG-4 video profiles. Note that Level 0 for the Simple profile has been defined in the 2nd Extension to the 2nd Edition of the MPEG-4 Visual standard [MPEG01b].
| Visual profile | Level | Typical visual session size | Max. number of objects^1 | Max. number of objects per type | Max. unique quant. tables | Max. VMV buffer size | Max. VCV buffer size (MB)^8 | VCV decoder rate (MB/s)^4 | VCV boundary MB decoder rate (MB/s) | Max. total VBV buffer size | Max. VOL VBV buffer size | Max. video packet length (bits)^6 | Max. sprite size (MB units) | Wavelet restrictions | Max. bitrate (kbit/s) | Max. enhancement layers |
| Simple^10 | L0 | QCIF | 1 | 1 x Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 2048 | N.A. | N.A. | 64 | N.A. |
| Simple | L1 | QCIF | 4 | 4 x Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 2048 | N.A. | N.A. | 64 | N.A. |
| Simple | L2 | CIF | 4 | 4 x Simple | 1 | 792 | 396 | 5940 | N.A. | 40 | 40 | 4096 | N.A. | N.A. | 128 | N.A. |
| Simple | L3 | CIF | 4 | 4 x Simple | 1 | 792 | 396 | 11880 | N.A. | 40 | 40 | 8192 | N.A. | N.A. | 384 | N.A. |
| Advanced Real Time Simple | L1 | QCIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 198 | 99 | 1485 | N.A. | 10 | 10 | 8192 | N.A. | N.A. | 64 | N.A. |
| Advanced Real Time Simple | L2 | CIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 5940 | N.A. | 40 | 40 | 16384 | N.A. | N.A. | 128 | N.A. |
| Advanced Real Time Simple | L3 | CIF | 4 | 4 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 11880 | N.A. | 40 | 40 | 16384 | N.A. | N.A. | 384 | N.A. |
| Advanced Real Time Simple | L4 | CIF | 16 | 16 x Simple or Adv. Real Time Simple | 1 | 792 | 396 | 11880 | N.A. | 80 | 80 | 16384 | N.A. | N.A. | 2000 | N.A. |
| Simple Scalable | L1 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 1782 | 495 | 7425 | N.A. | 40 | 40 | 2048 | N.A. | N.A. | 128 | 1 spatial or temporal enhancement layer |
| Simple Scalable^3 | L2 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 3168 | 792 | 23760 | N.A. | 40 | 40 | 4096 | N.A. | N.A. | 256 | 1 spatial or temporal enhancement layer |
| Core | L1 | QCIF | 4 | 4 x Core or Simple | 4 | 594 | 198 | 5940 | 2970 | 16 | 16 | 4096 | N.A. | N.A. | 384 | 1 |
| Core | L2 | CIF | 16 | 16 x Core or Simple | 4 | 2376 | 792 | 23760 | 11880 | 80 | 80 | 8192 | N.A. | N.A. | 2000 | 1 |
| Advanced Core | L1 | QCIF | 4 | 4 x Core or Simple or Adv. Scalable Texture | 4 | 594 | 198 | 5940 | 2970 | 16 | 8 | 4096 | N.A. | see Table A.5 | 384 | 1 |
|
Advanced
Core |
L2 |
CIF |
|