SIGGRAPH 2010: Deferred Lighting / Deferred Shading

August 5th, 2010 Wolfgang Engel 6 comments

There was a presentation given at the Beyond Programmable Shading day on Deferred Shading. I believe the presenter wanted to compare Deferred Lighting with Deferred Shading. I couldn’t attend the Beyond Programmable Shading day this year, so I only looked at the slides. The interesting part was that Deferred Shading was implemented with the Compute Shader and performance figures were given for ATI’s 5870 and an NVIDIA GTX 480 for up to 1000 lights.
You can find the talk here:

http://bps10.idav.ucdavis.edu/

Having helped to ship games with Deferred Shading and later with Deferred Lighting gives me a good rough estimate of how those two compare with each other.

The presentation shows that the highest end graphics cards seem to max out with 1000 lights in the given scenario with the help of compute shader support. Most of my tests four years ago were done on an XBOX 360 or PS3 and later on lower end graphics cards. From what I remember, in an artificial scenario 256 – 512 lights on an XBOX 360 / PS3 were possible in a setting similar to the one described in the talk.
On low-end PC graphics cards like the 9600M GT of my MacBook Pro, we can run 8000 small point lights while a whole game level and colliding particles are rendered at more than 40 fps.
Comparing Deferred Shading with Deferred Lighting, I believe Deferred Lighting should be faster in all scenarios because you fetch fewer render targets and you do not resolve the lighting equation for each light.

Because the presenter used high-end NVIDIA and ATI cards I thought it would be cool to use an integrated INTEL GPU to show off Deferred Lighting so that everyone could enjoy it. The drivers for those GPUs are really good now and our system only requires DX10, so we don’t use the compute shader. So I thought I would give it a try on my two-and-a-half-year-old Lenovo X301 (http://www.notebookcheck.net/Lenovo-ThinkPad-X301.16099.0.html) with an Intel Graphics Media Accelerator (GMA) 4500MHD. This chipset is obviously not INTEL’s latest but to my surprise it ran our demo quite well. Wikipedia says this GPU has a theoretical memory bandwidth of 12.8 GB/s. The GPUs used in the SIGGRAPH presentation have a theoretical memory bandwidth of 153.6 GB/s (ATI RADEON 5870) and 177.4 GB/s (GeForce GTX 480) if they are standard GPUs; some vendors sell those GPUs with higher memory clock rates.
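To put those bandwidth numbers into perspective, here is a tiny worked example. It only uses the three theoretical figures quoted above; the dictionary and function names are made up for this sketch.

```python
# Theoretical memory bandwidth in GB/s, as quoted in the text above.
BANDWIDTH_GB_PER_S = {
    "GMA 4500MHD": 12.8,
    "Radeon HD 5870": 153.6,
    "GeForce GTX 480": 177.4,
}

def bandwidth_ratio(gpu, baseline="GMA 4500MHD"):
    """How many times more theoretical bandwidth `gpu` has than `baseline`."""
    return BANDWIDTH_GB_PER_S[gpu] / BANDWIDTH_GB_PER_S[baseline]

for name in ("Radeon HD 5870", "GeForce GTX 480"):
    print(f"{name}: {bandwidth_ratio(name):.1f}x the GMA 4500MHD")
```

So the integrated chipset works with roughly a twelfth to a fourteenth of the theoretical bandwidth of the cards used in the talk.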
I chose to visualize 1000 small point lights without specular and the resolution of the notebook is set to 1024×768, so much smaller than what was used at SIGGRAPH. The particles to which the point lights are attached also collide with the environment and bounce off. Nevertheless it was running during my tests with roughly 11 to 22 fps :-)
Here are two screenshots and a shot of the laptop:

I think it would be cool to see more stuff running on INTEL integrated chip sets, after all it is fun to get things going on low-end GPUs :-) … raise your hand if you’re with me :-)

Categories: Uncategorized Tags:

Massive Point Light Soft Shadows

June 30th, 2010 Wolfgang Engel 10 comments

Deferred lighting offers the ability to render thousands of lights. The next frontier in game development is to attach an equal number of shadows to those lights. This short note describes a new algorithm that can be used to render a large number of shadows cast by point lights with perceptually correct penumbra.
This algorithm is based on Randy Fernando’s “Percentage-closer soft shadows” and Jesus Gumbau et al.’s “Screen-Space Soft Shadows” in GPU Pro. Jesus Gumbau co-authored this technique.
The algorithm can be split up into the following steps:

  1. Calculate the Cube Shadow Map
  2. Generate a coarse or dilated version from the Cube Shadow Map above by rendering into a smaller Cube Map. Each pixel of the coarse shadow map will approximate a block of pixels of the standard Cube Shadow Map.
  3. Based on each coarse cube shadow map, calculate the penumbra size and blend the result into a screen-space texture.
  4. Blend the shadow data of all cube maps into a screen-space shadow map.
  5. Apply an anisotropic Gaussian filter kernel to the screen-space shadow map while adjusting the kernel size based on the penumbra size stored in the screen-space texture above.

Coarse Cube Shadow Map

Fernando [Fernando 2005] came up with the idea of doing a blocker search to find the average depth of the blockers. In this approach the blocker search is replaced by generating a minimum z map (min-z map) [Gumbau et al. 2010]. The min-z map approximates the distance between the light and the blockers.
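As a rough CPU-side illustration of step 2, the following sketch builds a min-z map for one square cube map face by taking the minimum depth of every block of texels. The function name, the list-of-lists layout and the block size are assumptions for this example; on the GPU this would be a downsampling pass into the smaller cube map.

```python
def build_min_z_map(shadow_face, block):
    """Downsample one square shadow-map face: each coarse texel holds the
    minimum depth (closest blocker) of a block x block group of texels."""
    size = len(shadow_face)
    assert size % block == 0, "face size must be a multiple of the block size"
    coarse_size = size // block
    coarse = []
    for by in range(coarse_size):
        row = []
        for bx in range(coarse_size):
            texels = [shadow_face[by * block + y][bx * block + x]
                      for y in range(block) for x in range(block)]
            row.append(min(texels))
        coarse.append(row)
    return coarse

# A 4x4 face reduced to 2x2; depths are distances from the light.
face = [
    [0.9, 0.8, 1.0, 1.0],
    [0.7, 0.9, 1.0, 0.6],
    [0.5, 0.5, 0.4, 0.9],
    [0.5, 0.3, 0.8, 0.8],
]
coarse = build_min_z_map(face, 2)
```

Taking the minimum rather than the average is what makes the coarse map conservative: it always reports the blocker closest to the light within the block.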

Calculating the Penumbra Size

Fernando [Fernando 2005] showed that, assuming the light source, blocker and receiver can be treated as parallel, the similar-triangles approach can be used to estimate the penumbra size as shown in the following equation:
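For reference, the similar-triangles estimate from the PCSS sketch is usually written as below. The symbol names are chosen here for readability and are not taken from the original figure: w_light is the size of the light source, and d_receiver and d_blocker are the distances from the light to the receiver and to the blocker.

```latex
w_{penumbra} = \frac{(d_{receiver} - d_{blocker}) \cdot w_{light}}{d_{blocker}}
```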

Following Gumbau [Gumbau et al. 2010] this work extends this idea by adding an additional parameter to this equation:

d_observer describes the distance to the observer and will be used to scale the screen-space filter kernel. d_blocker represents the content of the coarse screen-space shadow data.
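A minimal sketch of both estimates follows. The first function is the PCSS similar-triangles estimate. The second assumes that the extension simply divides the result by the distance to the observer, which is my reading of Gumbau et al. and should be checked against the article. Function and parameter names are made up for this example.

```python
def penumbra_width(d_receiver, d_blocker, w_light):
    """PCSS estimate: penumbra grows with the receiver/blocker gap and the light size."""
    if d_blocker <= 0.0:
        raise ValueError("blocker distance must be positive")
    return (d_receiver - d_blocker) * w_light / d_blocker

def screen_space_penumbra_width(d_receiver, d_blocker, w_light, d_observer):
    """Assumed extension: scale the estimate by 1 / d_observer so the
    screen-space kernel shrinks for pixels that are further from the camera."""
    if d_observer <= 0.0:
        raise ValueError("observer distance must be positive")
    return penumbra_width(d_receiver, d_blocker, w_light) / d_observer

print(penumbra_width(4.0, 2.0, 1.0))
print(screen_space_penumbra_width(4.0, 2.0, 1.0, 2.0))
```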

Anisotropic Screen-Space Gauss Filter Kernel

Using a separable Gaussian filter kernel allows filtering the shadow data with fewer texture fetches than the commonly used PCF based filter method. Compared to PCF filtering, the Gaussian filter kernel performs with O(n+n) instead of O(n^2) fetches per pixel. To determine the shape and the orientation of the filter kernel, the normal of the current pixel is fetched from the normal buffer and projected into eye space. To perform the anisotropic filtering in an efficient way, Geusebroek’s [Geusebroek et al. 2003] approach is applied.
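To make the O(n+n) versus O(n^2) argument concrete, here is a small sketch that blurs an image once with two 1D passes and once with the full 2D kernel and shows that the results match. It covers only the plain isotropic case with clamp-to-edge addressing; the anisotropic orientation from Geusebroek et al. is not shown, and all function names are made up for this example.

```python
import math

def gaussian_weights(radius, sigma):
    """Normalized 1D Gaussian weights for taps -radius..radius."""
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def clamp(i, n):
    return max(0, min(n - 1, i))

def blur_horizontal(img, weights):
    r = len(weights) // 2
    width = len(img[0])
    return [[sum(weights[k + r] * row[clamp(x + k, width)] for k in range(-r, r + 1))
             for x in range(width)] for row in img]

def blur_vertical(img, weights):
    r = len(weights) // 2
    height, width = len(img), len(img[0])
    return [[sum(weights[k + r] * img[clamp(y + k, height)][x] for k in range(-r, r + 1))
             for x in range(width)] for y in range(height)]

def blur_separable(img, weights):
    """Two 1D passes: n + n taps per pixel."""
    return blur_vertical(blur_horizontal(img, weights), weights)

def blur_full_2d(img, weights):
    """Single 2D pass with the outer-product kernel: n * n taps per pixel."""
    r = len(weights) // 2
    height, width = len(img), len(img[0])
    return [[sum(weights[i + r] * weights[j + r]
                 * img[clamp(y + i, height)][clamp(x + j, width)]
                 for i in range(-r, r + 1) for j in range(-r, r + 1))
             for x in range(width)] for y in range(height)]

def taps_per_pixel(radius):
    n = 2 * radius + 1
    return {"separable": n + n, "full_2d": n * n}
```

For a radius of 2 that is 10 fetches per pixel instead of 25, and the gap widens quickly as the penumbra (and therefore the kernel) gets larger.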

Data Layout

This new algorithm is compliant with the concept of Deferred Lighting. The shadow and the coarse min-z data are stored in regular cube depth render targets. While the G-Buffer holds color, normal and depth data of the scene, there is a light buffer for the light data, a main back buffer and a separate floating point buffer (at least 16-bit) that holds the penumbra size values in screen-space. The main back buffer and the buffer that holds the screen-space penumbra size values covered in step 3 can be filled together if they are treated as a Multiple-Render Target (MRT).

In that case the first three channels of the main buffer are filled with the lit but un-shadowed scene while at the same time the screen-space shadow data mentioned in step 4 can occupy the fourth channel of this render target.

More info at the Korean Game Developer Conference and in an article in GPU Pro 2 …

References

Randy Fernando, “Percentage-Closer Soft Shadows”, SIGGRAPH 2005 Sketch

Jan-Mark Geusebroek, Arnold W. M. Smeulders, J. van de Weijer, “Fast Anisotropic Gauss Filtering”, IEEE Transactions on Image Processing, Volume 12 (8), pages 938-943, 2003

Jesus Gumbau, Miguel Chover and Mateu Sbert, “Screen Space Soft Shadows”, to appear in GPU Pro, AK Peters, 2010

Categories: D3D10, Shadow Maps Tags:

Screen-Space Global Illumination

June 5th, 2010 Wolfgang Engel No comments

I just realized that I posted the last time on June 29th, 2009 on Screen-Space Ambient Occlusion. Compared to other screen-space effects that are part of a whole PostFX it is quite expensive and doesn’t really add that much. One solution to this quality / performance problem is to re-use some of the work necessary to do a one bounce global illumination effect. This way you pay some more but overall get two effects instead of only one. Here is a screenshot of a one bounce diffuse indirect lighting effect running on a mobile phone:

You can find more screenshots of the mobile work we do at the Confetti Special Effects fan page. More info will be available soon on Qualcomm’s developer website or in the Paris Master Class :-)

Paris Master Class

May 24th, 2010 Wolfgang Engel 1 comment

I am going to teach a Paris Master Class this year on June 24th and June 25th.
The topics covered will be
- Deferred Lighting, Z Pre-Pass Renderer, Deferred Shading
- Programmable MSAA
- Post-Effect Pipeline
- Shadows (Cascaded Shadow Maps, Soft Shadows, Point Light Soft Shadows)
- GPU Particle System
- Global Illumination (Screen-Space, Reflective Shadow Maps)
- Order-Independent Transparency

I hope to see some of you there.

GPU Debugging / Profiling

May 23rd, 2010 Wolfgang Engel No comments

In the last week, I started working with the OpenGL run-time of our rendering framework, which still utilizes OpenGL 2.0. We can target MacOS easily with this (although they hopefully upgrade to 3.2 soon) and have an easy migration path to OpenGL ES 2.0 on the iPhone and Android.
One of the tools I discovered in the process was gDEBugger. This tool allows us to debug on Windows, Mac and the iPhone (soon on the device) with the same interface. I like that it also provides a bunch of performance data. Checking out things like the number of draw calls, a list of the gl calls used most often and also the batching size of our draw calls are cool features. Additionally, being able to break at any gl call or glError was very useful already … and then there is the usual stuff you expect like previews of VBOs, textures, render buffers and source code. You can also figure out performance bottlenecks by switching off draw calls, raster operations, fixed function pipeline calls, texture data fetches, or geometry and fragment shader execution. You can analyze the amount of GPU memory occupied and you have all kinds of statistics that you can call up from everywhere.
We tried the demo version throughout the last week on Windows and the iPhone and Confetti now has licenses for both. If you want to try for yourself, you can get a demo version on their website.

Oolong Engine, Confetti, Light Pre-Pass on SPU, GPU Pro, Order-Independent Transparency

May 10th, 2010 Wolfgang Engel 1 comment

A lot of smaller things happened in the last months. Oolong Engine got an iPad update and many bugfixes:

Oolong Engine

Confetti Special Effects posted a couple of screenshots of several projects we are working on. So far we are showing off particle collisions, a next-gen dynamic skydome system with god rays, terrain self-shadowing, volumetric clouds, atmospheric scattering that is correct even when watched from space, and a fully-dynamic global illumination system. Please check out our Facebook page:

Confetti Special Effects

In other news there are cool slides of a SPU-based implementation of the Light Pre-Pass renderer here:

A Bizarre Way to do Real-Time Lighting

There is a GPU Pro article that describes many of the implementation details. BTW: GPU Pro is on its way to the book sellers and looks fantastic. Color makes such a big difference for a book like this.

Kaori Kubota sent me a link to an example implementation of Order-Independent Transparency where he changed the DirectX OIT example to use AMD’s linked list implementation. On my machine the framerate went from 10 to more than 1400 fps. He also wrote documentation:

OIT

Cascaded Shadow Maps

May 5th, 2010 Wolfgang Engel 6 comments

Some people were asking me about Cascaded Shadow Maps in the last few weeks. Here is a short high-level view of how I see the development in that area over the last five years.
More than five years ago I wrote the article “Cascaded Shadow Maps” in ShaderX5. At the time I was trying to visualize an idea that was based on a talk by John Carmack in 2004 where he described “Cascaded Shadow Maps” for the first time. In the article I think I generated a few additional buzzwords like “Deferred Shadows” and “Shadow Collector”. The “Shadow Collector” was later also called shadow mask.
The overall experience in implementing Cascaded Shadow Maps can be summarized with the words “I now know 1000 ways how not to implement them” and the actual implementation I ended up with is more straightforward than sophisticated. I want to cover some of the challenges now.

How to Create a Light Frustum
After splitting up the view frustum into slices, constructing a tight light view frustum can be a challenge. There are several ways to do this.
Just using the eight points of the slice of the view frustum and constructing the light view frustum around them works, but the quality of the shadow then changes depending on how the camera is facing the sun. That sometimes gave very good results and sometimes bad ones. Having the same error level distributed over the time of day in a game is an important requirement. So I built a sphere around the frustum slice that intersects the eight points of this slice. The orthographic light view frustum is then built around this sphere. This follows Michal Valient’s article in ShaderX6. The orthographic light view frustum rotates around that sphere similar to a joint. The quality is evenly distributed independent of the angle of the camera and the light source. You can even use this in a flight simulator and fly loops and it will work appropriately.
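Here is a small sketch of the sphere construction for a symmetric perspective frustum, in view space with the camera looking down -Z. For such a frustum the sphere through all eight corners has its center on the view axis, so it can be computed in closed form. The function names and the choice of a cube-shaped orthographic volume around the sphere are assumptions for this example, not the code from the article.

```python
import math

def frustum_slice_corners(fov_y, aspect, near, far):
    """Eight view-space corners of a frustum slice (camera looks down -Z)."""
    corners = []
    for d in (near, far):
        half_h = d * math.tan(fov_y * 0.5)
        half_w = half_h * aspect
        for sx in (-1.0, 1.0):
            for sy in (-1.0, 1.0):
                corners.append((sx * half_w, sy * half_h, -d))
    return corners

def slice_bounding_sphere(fov_y, aspect, near, far):
    """Sphere through all eight corners of a symmetric frustum slice."""
    # Squared distance of a corner from the view axis is d^2 * k2 at depth d.
    k2 = math.tan(fov_y * 0.5) ** 2 * (1.0 + aspect * aspect)
    # Solve |center - near corner| == |center - far corner| for the depth c.
    c = 0.5 * (near + far) * (1.0 + k2)
    radius = math.sqrt((c - near) ** 2 + near * near * k2)
    return (0.0, 0.0, -c), radius

def light_ortho_extents(radius):
    """Orthographic bounds (left, right, bottom, top, near, far) in light
    space, centered on the sphere. They depend only on the radius."""
    return (-radius, radius, -radius, radius, -radius, radius)

center, radius = slice_bounding_sphere(math.radians(60.0), 16.0 / 9.0, 1.0, 20.0)
```

Because the extents depend only on the radius, rotating the camera or the light changes the orientation of the light frustum but never its size, which is where the stable quality comes from.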

Shadow Collector / Mask / Deferred Shadowing
After rendering into the shadow maps one intermediate step before the scene will be rendered can be to store all the shadow data in a screen-space texture. This can happen along with Z-buffer and / or Normal buffer rendering.
The main reason to do this is that you can’t fetch the texture data easily while rendering the main scene. Let’s say we have four cascades and therefore four shadow maps each 640×640. Fetching four maps of this size along with all the other maps that are necessary while rendering the scene will degrade performance substantially. So this needs to be done before the scene is rendered.
Having this screen-space texture also offers a good way to blur the data again. The reason why I called this screen-space texture shadow collector is that it can be used to collect all the shadows of the scene. There might be cloud shadows (just projected down), character shadows (for high-res self-shadowing), point and spot light shadows and all kinds of special cases. All this data can be alpha blended into the shadow collector and then applied to the scene by fetching this texture. Obviously there might be other ways to store the data when you are using a Deferred Lighting approach.

How to Pick the Right Texture map
When you render into the shadow collector you need to have a good way to find out which texture map to pick. Michal Valient’s article has a great solution for this. I just do a sphere-pixel distance check for all four maps in the pixel shader. This is the sphere that you used to construct the light view frustum around. What I like about this is that the break between different maps is in circle form and not a line. This is easier to hide. There are probably smarter ways to do this.
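The sphere-pixel distance check can be sketched like this. The cascades are assumed to be ordered from nearest to farthest and each one is described by the center and radius of the sphere its light frustum was built around; the function name and data layout are made up for this example.

```python
def pick_cascade(position, spheres):
    """Return the index of the first (nearest) cascade whose bounding sphere
    contains the position, or None if the position is outside all of them."""
    for index, (center, radius) in enumerate(spheres):
        dist_sq = sum((p - c) ** 2 for p, c in zip(position, center))
        if dist_sq <= radius * radius:
            return index
    return None

# Two cascades along the view direction (-Z), given as (center, radius).
spheres = [((0.0, 0.0, -5.0), 6.0), ((0.0, 0.0, -20.0), 18.0)]
print(pick_cascade((0.0, 0.0, -3.0), spheres))
print(pick_cascade((0.0, 0.0, -15.0), spheres))
```

Comparing squared distances avoids a square root per cascade, which matters when this runs for every pixel.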

How to Store the Shadow Map data
Storing the initial shadow data is best done in a texture atlas. There are numerous challenges in fetching data from a texture atlas; they were described in a ShaderX3 article by Matthias Wloka. Nevertheless it is the best option aside from a texture array, which is only available on more recent hardware.
Jonathan Blow has a blog entry on how to store this data in the most efficient way.
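As a simple illustration of the atlas idea (not the layout from the blog entry mentioned above), the following sketch places four 640×640 cascades in a 2×2 atlas and remaps per-cascade texture coordinates into it. Clamping half a texel inside the tile is one common way to keep filtering from bleeding into the neighboring tile; names and defaults are assumptions for this example.

```python
def atlas_uv(cascade, u, v, tiles_per_row=2, tile_size=640):
    """Map (u, v) in [0, 1] of one cascade into a square atlas of
    tiles_per_row x tiles_per_row tiles."""
    half_texel = 0.5 / tile_size
    # Stay half a texel away from the tile border to avoid fetching neighbors.
    u = min(max(u, half_texel), 1.0 - half_texel)
    v = min(max(v, half_texel), 1.0 - half_texel)
    tile_x = cascade % tiles_per_row
    tile_y = cascade // tiles_per_row
    return ((tile_x + u) / tiles_per_row, (tile_y + v) / tiles_per_row)

print(atlas_uv(0, 0.5, 0.5))
print(atlas_uv(3, 0.5, 0.5))
```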

There is an interesting discussion following his blog entry. Some people argue that the usage of a texture atlas is not necessary as long as you render the data immediately into the shadow collector. This way you render four times. On most platforms I worked on this is not efficient. You only want to render all the data coming from the Cascaded Shadow Maps once.
In other words Jonathan’s idea is really interesting. One person mentions the following link in this context:

Packing Square Tiles into One Texture

How to soften the Penumbra
On most hardware platforms you can’t run any filter kernel bigger than 4 taps. There are tricks to detect the areas where you can run bigger filter kernels but most of them didn’t work out well for me.
The amount of research that goes into this area is astonishing. It astonishes even more that nearly all solutions don’t work in any real-world game environment with a 24 hour time of day cycle.
The one solution I found working is Exponential Shadow Maps. They only require a one channel depth buffer and you can render with the double-speed writes if you use the native depth buffer format. Unfortunately they suffer from light bleeding artifacts.
I therefore used a four-tap screen-space filter kernel that rotates and dithers the data, and that worked well.
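For readers who haven't seen Exponential Shadow Maps, here is a minimal sketch of the usual visibility test and of where the light bleeding comes from. The constant c and the function name are assumptions for this example; in practice the occluder term is pre-filtered in the shadow map.

```python
import math

def esm_visibility(occluder_depth, receiver_depth, c=80.0):
    """Exponential shadow test: 1.0 means fully lit, values near 0 mean shadowed.
    Depths are in light space, larger means further from the light."""
    return min(1.0, math.exp(c * (occluder_depth - receiver_depth)))

# Receiver in front of or at the occluder: fully lit.
print(esm_visibility(0.5, 0.4))
# Receiver far behind the occluder: essentially black.
print(esm_visibility(0.2, 0.9))
# Receiver just behind the occluder: partially lit, i.e. light bleeding.
print(esm_visibility(0.5, 0.51))
```

A larger c reduces the bleeding but runs into floating point range problems sooner, which is the trade-off that makes the technique tricky.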

Future Development
There is hope that we can use eight-tap filter kernels on most platforms soon :-) … but what will offer new opportunities is the ability of DX10 to render out to several maps in one render pass. Because Cascaded Shadow Maps is a shadow level-of-detail system that distributes shadow data quite evenly along the view frustum, I expect people to come up with better LOD schemes. In other words, more frusta: multi-frustum shadow maps, as already described by Tom Forsyth in 2004, are the future. ShaderX had interesting articles in this area that were called Virtual Shadow Maps.

Categories: D3D10, Shadow Maps Tags:

Order-Independent Transparency II

April 22nd, 2010 Wolfgang Engel 3 comments

Holger Gruen and Nicolas Thibieroz describe the usage of per-pixel linked lists with DirectX 11 in their GDC presentation “OIT and Indirect Illumination using DX11 Linked Lists”

Per-Pixel linked lists represent one pixel in the viewport / framebuffer with a list of structures consisting of the actual data and a link that links this list entry to the next one.  This list of elements is stored in a read/write structured buffer. A structured buffer is a buffer that contains elements of equal sizes. Use a structure with one or more member types to define an element. The buffer in the power-point slides is called “Fragment and Link Buffer”:

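struct FragmentAndLinkBuffer_STRUCT
{
  FragmentData_STRUCT FragmentData; // per-fragment payload (color, depth, ...)
  uint uNext;                       // offset of the previously stored entry for this pixel
};
RWStructuredBuffer<FragmentAndLinkBuffer_STRUCT> FLBuffer;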

In addition to indexing, a structured buffer supports accessing a single member like this:

// assumes FragmentData_STRUCT has a float4 member named Color
float4 myColor = FLBuffer[27].FragmentData.Color; // cache challenges here?

This buffer can now hold any number of list entries that represent one pixel. Each list entry refers to the previous list entry in the variable uNext. To make this work, a so called “Start Offset Buffer” is used. This buffer stores the offset into the “Fragment and Link Buffer” representing a pixel in the framebuffer. In case one pixel is “overwritten” by another pixel, this offset value is read from the “Start Offset Buffer”, written into the uNext variable of the new list entry and the offset value of the new list entry is then written into the “Start Offset Buffer” and so on. The slides show the following images (there is a nice animation in the slide show that is worth checking out):

The viewport is represented by the 2D grid on the left. For each pixel a value is written into the “Start Offset Buffer”. For the first pixel it is 0, then 1 and so on. This value represents the offset into the “Fragment and Link Buffer”. The first image on the left shows how the first value is written into the “Start Offset Buffer” and how this value represents the offset into the “Fragment and Link Buffer”. The second image shows how two more pixels are stored in one entry of a linked list. The third image on the right side shows how another list entry is added to the first entry in the “Fragment and Link Buffer”. The fourth entry’s uNext variable now holds 0 as the value.
Traversing the linked list is easy. Just following the offset value in the “Start Offset Buffer” of the pixel that needs to be rendered and then following the uNext entry leads to the fourth and then the first entry in the “Fragment and Link Buffer” as shown in the fourth image.
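The bookkeeping described above can be simulated on the CPU in a few lines. This is only a sketch of the data structure: the class name, the END_OF_LIST sentinel and the use of the list length as the allocation counter are assumptions for this example, and the atomic operations a real pixel shader needs are left out.

```python
END_OF_LIST = 0xFFFFFFFF  # hypothetical "no next entry" marker

class PerPixelLinkedLists:
    def __init__(self, width, height):
        self.width = width
        # "Start Offset Buffer": one head offset per pixel.
        self.start_offset = [END_OF_LIST] * (width * height)
        # "Fragment and Link Buffer": entries of (fragment_data, uNext).
        self.fragments = []

    def add_fragment(self, x, y, data):
        pixel = y * self.width + x
        new_offset = len(self.fragments)  # stands in for the buffer counter
        # The new entry links to whatever was the head of this pixel's list ...
        self.fragments.append((data, self.start_offset[pixel]))
        # ... and then becomes the new head.
        self.start_offset[pixel] = new_offset

    def fragments_at(self, x, y):
        """Walk the list from the head, newest entry first."""
        offset = self.start_offset[y * self.width + x]
        result = []
        while offset != END_OF_LIST:
            data, offset = self.fragments[offset]
            result.append(data)
        return result

lists = PerPixelLinkedLists(3, 3)
lists.add_fragment(1, 1, "a")  # entry 0
lists.add_fragment(0, 0, "b")  # entry 1
lists.add_fragment(2, 0, "c")  # entry 2
lists.add_fragment(1, 1, "d")  # entry 3, links back to entry 0
```

This mirrors the situation in the slides: the fourth entry stores 0 in uNext, and traversal for that pixel visits the fourth and then the first entry.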

To implement Order-Independent Transparency with this technique, only transparent pixels need to be stored in a Per-Pixel Linked List. In the rendering phase, those list entries are sorted in back-to-front order and blended in a pixel shader. The blend mode can then be unique per pixel.
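The resolve step for one pixel might look like the following sketch. It assumes each stored fragment is a (depth, rgb, alpha) tuple with larger depth meaning further away, and it uses plain "over" alpha blending for every fragment; names and layout are made up for this example.

```python
def resolve_pixel(background, fragments):
    """Sort one pixel's fragments back to front and alpha blend them
    over the background color."""
    color = background
    for _depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(rgb, color))
    return color

fragments = [
    (0.2, (1.0, 0.0, 0.0), 0.5),  # near, red
    (0.8, (0.0, 0.0, 1.0), 0.5),  # far, blue
]
print(resolve_pixel((0.0, 0.0, 0.0), fragments))
```

The result does not depend on the order in which the fragments were submitted, which is the whole point of the technique.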

Categories: Order-Independent Transparency Tags:

DigiPen

April 17th, 2010 Wolfgang Engel 8 comments

Yesterday I gave a talk on “Introduction to Real-Time Global Illumination” at DigiPen. I was invited by the graphics club.

It astonished me how much knowledge the people in the auditorium had. They asked me the right questions and it was fun just going on tangents on several topics not related to GI.

The students also showed me demos they did and those were impressive. They actually have to write their “engines” from scratch in C/C++ (no middleware allowed, e.g. no sound or physics middleware).

Categories: Global Illumination, Leadership Tags:

Confetti Special Effects: Facebook Page

April 6th, 2010 Wolfgang Engel No comments

Confetti created a facebook page to show some of the work-in-progress shots of what we are working on. You can find it here:

http://www.facebook.com/pages/Confetti-Special-Effects-Inc/159613387880?v=wall

We will also have a better website soon … very soon :-)

Categories: Confetti, General GPU Programming Tags:

Edge Detection Trick

March 20th, 2010 Wolfgang Engel 13 comments

Benualdo posted in the Light Pre-Pass Thread a cool trick on how to detect edges to run a per-sample shader for MSAA (just in case centroid sampling doesn’t work for you). Here it is:
———-
another stupid trick for edge detection pass on platforms that support sampling the MSAA surface with linear sampling: sample the normal buffer twice, once with POINT sampling and once with LINEAR sampling. Use clip(-abs(L-P)+eps). The linear sampled value should be used to compute the lighting of “non-MSAA” texels in the same shader to avoid an extra pass.
———-
eps is a small threshold value to bias the texkill test so that when the multisampled normals are only a little different then we could use the averaged value to perform the lighting at non-MSAA resolution during the first pass as an optimization.
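Expressed outside of a shader, the test boils down to the following sketch. P is the point-sampled normal and L the linearly sampled (averaged) normal; a pixel is treated as an edge when any component differs by more than eps, which is the case in which clip() would kill the pixel in the non-MSAA pass. The function name and the default eps are assumptions for this example.

```python
def needs_per_sample_shading(point_normal, linear_normal, eps=0.01):
    """True when the averaged normal differs from the point-sampled one,
    i.e. when clip(-abs(L - P) + eps) would discard the pixel."""
    return any(abs(l - p) > eps for l, p in zip(linear_normal, point_normal))

def average(samples):
    n = len(samples)
    return tuple(sum(s[i] for s in samples) / n for i in range(3))

interior = [(0.0, 0.0, 1.0)] * 4
edge = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(needs_per_sample_shading(interior[0], average(interior)))
print(needs_per_sample_shading(edge[0], average(edge)))
```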

Categories: Deferred Lighting, Edge Detection, MSAA Tags:

GPU Pro

February 27th, 2010 Wolfgang Engel 7 comments

There is a blog concerning the upcoming book GPU Pro at http://gpupro.blogspot.com/.
I posted the Table of Contents for GPU Pro. You can pre-order it on Amazon here.
There is another blog for GPU Pro 2 with a call for authors, in case you want to see your name written in golden letters in a book :-)

Categories: GPU Pro Tags:

Hardware Tessellation

January 31st, 2010 Wolfgang Engel 7 comments

I was thinking about the advantages of Hardware Tessellation. I can see mainly three:
- Compression
Reduces on-disk storage, system, video memory usage ->only the coarse mesh is stored
Animation data is only stored for the coarse mesh
- Memory bandwidth
GPU fetches only vertex data of coarse mesh through PCI-E bus -> higher vertex cache and fetch performance
- Scalability
Subdivision is recursive -> offers auto-LOD with adaptive metrics

With the DirectX 11 implementation it might also reduce the workload of the vertex shader because the shader transforms or animates only the coarse mesh. But if we add up the additional workload of the hull and domain shader it might be a wash.

For console developers, being able to store more world geometry on disc and in memory would be a great advantage. The reduction of the read memory bandwidth (while reading the data from memory) would also increase the efficiency.
The main question is if tessellating the geometry puts such a huge workload on the GPU that it is not feasible. I would love to have some real-world data here …

Categories: Hardware Tesselation Tags:

Direct3D 11 Overview

December 31st, 2009 Wolfgang Engel 5 comments

Here is a first draft for the data flow in the DirectX 11 rendering pipeline:

And here is the DirectCompute overview:

I would consider those now beta.

Categories: D3D11 Tags:

Direct3D 10 Overview

December 30th, 2009 Wolfgang Engel No comments

I started working on a Direct3D 10 overview that only covers one page. Here is the latest version.

Please note that this overview has nothing to do with the way the hardware works. It is just a diagram that shows the data flow and the usage of the Direct3D 10 API to stream the data through several logical stages that might be represented in hardware by one unit. If you are interested in the actual hardware design, I would recommend reading

A Closer Look at GPUs

Categories: D3D10 Tags:

New Links

December 24th, 2009 Wolfgang Engel No comments

I updated my list of links on the right side with some of the websites I keep an eye on.
I never met Brian Karis but he has a few very forward thinking posts on his blog. The same is true for Pierre Terdiman. He covers many non-graphics related tasks and I believe I have been reading his blog and former website for 7 years (?). Aurelio Reis has some cool procedural stuff on his blog. Simon Green worked on some of the coolest stuff that you can find in the NVIDIA SDK. His blog has some interesting entries on how the GPUs nowadays can render CG movie content in real-time while CPUs still need a lot more time to do the same. Then I also added Mike Acton’s blog. I wonder how I could have forgotten this, given how often Mike and I met in the last few months. He is certainly one of the SPU and Multi-core programming authorities in the industry. I especially like his opinion regarding C++ and data-centric design. Lots of people repeated this mantra in the last two years but I heard it from him before.

Categories: General GPU Programming Tags:

CSE 190 GPU Programming UCSD

November 29th, 2009 Wolfgang Engel 4 comments

I am going to teach GPU Programming in the upcoming quarter at UCSD. Look out for course CSE 190. Here is the announcement:

Course Objectives:
This course will cover techniques on how to implement 3D graphics
techniques in an efficient way on the Graphics Processing Unit (GPU).

Course Description:
This course focuses on algorithms and approaches for programming a
GPU, including vertex, hull, tessellator, domain, geometry, pixel and
compute shaders. After an introduction into each of the algorithms,
the students will learn step-by-step on how to implement those
algorithms on the GPU. Particular subjects may include geometry
manipulations, lighting, shadowing, real-time global illumination,
image space effects and 3D Engine design.

Example Textbook(s):
A list of reading assignments will be given out each week.

Laboratory work:
Programming assignments.

Very exciting :-)

Categories: General GPU Programming Tags:

Order-Independent Transparency

November 29th, 2009 Wolfgang Engel 1 comment

Transparent objects that require alpha blending cannot be rendered on top in a G-Buffer. Blending two or more normals, depth or position values leads to wrong results.
In other words deferred lighting of objects that need to be visible through each other is not easily possible because the data for the object that is visible through another object is lost in a G-Buffer that can only store one layer of data for normals, depth and position.
The traditional way to work around this is to have a separate rendering path that deals with rendering and lighting of transparent objects that need to be alpha blended. In essence that means there is a second lighting system that can be forward rendered and usually has a lower quality than the deferred lights.
This system breaks down as soon as you have light numbers that are higher than a few dozen lights because forward rendering can’t render so many lights. In that case it would be an advantage to use the same deferred lighting system that is used on opaque objects on transparent objects that would require alpha blending.
The simple case is for example windows where you can look through one window and maybe two more windows behind each other and see what is behind them. For example you look through the window from the outside into a house and then in the house is another glass wall through which you can look and then behind that glass wall is a freshwater tank that is lit … etc. you got the idea.
This would be the “light” case to solve. Much harder are scenarios in which the number of transparent objects that can be behind each other is much higher … like with particles or a room of transparent teapots :-) .

On DirectX 9 and DirectX 10 class hardware, one of the solutions that is mentioned to solve the problem of order-independent transparency is called Depth Peeling. It seems this technique was first described by Abraham Mammen (“Transparency and antialiasing algorithms Implemented with the virtual pixel maps technique”, IEEE Computer Graphics and Applications, vol. 9, no. 4, pp. 43-55, July/Aug. 1989) and Paul Diefenbach (“Pipeline rendering: Interaction and realism through hardware-based multi-pass rendering”, Ph.D., University of Pennsylvania, 1996, 152 pages) (I don’t have access to those papers). A description of the implementation was given by Cass Everitt here. The idea is to extract each unique depth in a scene into layers. Those layers are then composited in depth-sorted order to produce the correct blended image.
In other words: the standard depth test gives us the nearest fragment/pixel. The next pass over the scene gives us the second nearest fragment/pixel; the pass after that gives us the third nearest fragment/pixel. Each pass after the first one uses the depth buffer computed in the previous pass to “peel away” depth values that are less than or equal to the values in that depth buffer. All the values that are not “peeled away” are stored in another depth buffer. Pseudo code might look like this:

const float bias = 0.0000001;

// peel away pixels from previous layers
// use a small bias to avoid precision issues.
// clip() discards the pixel when its argument is negative
clip(In.pos.z - PreviousPassDepth - bias);

By using the depth values from the previous pass for the following pass, multiple layers of depth can be stored. As soon as all the depth layers are generated, for each of the layers the G-Buffer data needs to be generated. This might be the color and normal render targets. In case we want to store three layers of depth, color and normal data also need to be stored for those three depth layers.
In a scene where many transparent objects overlap each other, the number of layers increases substantially and so does the memory consumption.
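For a single pixel the peeling loop can be simulated as follows. Each iteration plays the role of one geometry pass: fragments at or in front of the previous layer fail the clip test, and the regular depth test keeps the nearest survivor. The function name and the list of depths are assumptions for this sketch.

```python
def peel_layers(fragment_depths, max_layers, bias=1e-7):
    """Extract up to max_layers depth layers, nearest first, for one pixel."""
    layers = []
    previous = float("-inf")  # nothing has been peeled before the first pass
    for _ in range(max_layers):
        # Same test as clip(z - PreviousPassDepth - bias): keep if not negative.
        survivors = [z for z in fragment_depths if z - previous - bias >= 0.0]
        if not survivors:
            break
        previous = min(survivors)  # the depth test keeps the nearest survivor
        layers.append(previous)
    return layers

print(peel_layers([0.7, 0.2, 0.5, 0.2], 3))
```

Note that every extracted layer costs one full pass over the transparent geometry plus storage for its G-Buffer data.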

A more advanced depth peeling technique was named Dual Depth Peeling and described by Louis Bavoil et al. here. The main advantage of this technique is that it peels a layer from the front and a layer from the back at the same time. This way four layers can be peeled away in two geometry passes.
On hardware that doesn’t support independent blending equations in MRTs, the two layers per pass are generated by using MAX blending and writing out each component of a float2(-depth, depth) variable into a dedicated render target that is part of a MRT.
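The MAX blending trick is easy to verify on paper: writing (-depth, depth) with MAX blending leaves (-nearest, farthest) in the target, so the nearest and the farthest depth of a pixel come out of the same pass. The sketch below simulates this for one pixel; it covers only this min/max step, not the full Dual Depth Peeling algorithm, and the function name is made up.

```python
def max_blend_min_max(depths):
    """Simulate MAX blending of float2(-depth, depth) for one pixel and
    return (nearest, farthest)."""
    target = (float("-inf"), float("-inf"))  # cleared render target
    for z in depths:
        src = (-z, z)
        target = (max(target[0], src[0]), max(target[1], src[1]))
    return -target[0], target[1]

print(max_blend_min_max([0.7, 0.2, 0.5]))
```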

Nicolas Thibieroz describes in “Robust Order-Independent Transparency via Reverse Depth Peeling in DirectX 10” in ShaderX6 a technique called Reverse Depth Peeling. While depth peeling extracts layers in a front-to-back order and stores them for later usage, his technique peels the layers in back-to-front order and can blend with the backbuffer immediately. Unlike depth peeling, there is no need to store all the layers. Especially on console platforms this is a huge advantage.
The order of operations is:

1. Determine furthest layer
2. Fill-up depth buffer texture
3. Fill-up normal and color buffer
4. Do lighting & shadowing
5. Blend in backbuffer
6. Go to 1 for the next layer
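The loop above can be sketched for a single pixel like this. Fragments are assumed to be (depth, rgb, alpha) tuples with larger depth meaning further away; steps 2 to 4 (filling the depth, normal and color buffers and doing lighting and shadowing) are collapsed into using the fragment's color directly. All names are made up for this example.

```python
def reverse_depth_peel(background, fragments, max_layers):
    """Peel layers back to front and blend each one into the backbuffer
    immediately, so no layer has to be kept around."""
    color = background
    limit = float("inf")  # everything is still unpeeled
    for _ in range(max_layers):
        # Step 1: determine the furthest layer that has not been peeled yet.
        candidates = [f for f in fragments if f[0] < limit]
        if not candidates:
            break
        depth, rgb, alpha = max(candidates, key=lambda f: f[0])
        # Steps 2-4 would fill the buffers and light this layer here.
        # Step 5: blend the layer into the backbuffer right away.
        color = tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(rgb, color))
        # Step 6: continue with the next layer in front of this one.
        limit = depth
    return color

layers = [(0.2, (1.0, 0.0, 0.0), 0.5), (0.8, (0.0, 0.0, 1.0), 0.5)]
print(reverse_depth_peel((0.0, 0.0, 0.0), layers, 4))
```

Only one layer's worth of buffers is alive at any time, which is the memory advantage over front-to-back peeling.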

Another technique is giving up MSAA and using the samples to store up to eight layers of data. Kevin Myers et al. use the stencil buffer in the article “Stencil Routed A-Buffer” to do sub-pixel routing of fragments. This way eight layers can be written in one pass. Because the layers are not ordered by depth they need to be sorted afterwards. The drawbacks are that the algorithm is limited to eight layers, allocates lots of memory (depending on the underlying implementation, 8xMSAA can be an 8x screen-size render target), requires hardware that supports 8xMSAA and the bitonic sort might be expensive. Giving up MSAA, the “light” case described above would be easily possible with this technique with satisfying performance but it won’t work on scenes where many objects are visible behind several other objects.

Another technique extends Dual Depth Peeling by attaching a sorted bucket list. The article “Efficient Depth Peeling via Bucket Sort” by Fang Liu et al. describes an adaptive scheme that requires two geometry passes to store depth value ranges in a bucket list, sorted with the help of a depth histogram. An implementation will be described in the upcoming book GPU Pro. The following image from this article shows the required passes.

The Initial Pass is similar to Dual Depth Peeling. Similar to other techniques that utilize eight render targets, 32:32:32:32 each, the technique has huge memory requirements.

To my knowledge those are the widely known techniques for order-independent transparency on DirectX 10 today. Do you know of any newer techniques suitable for DirectX 10 or DirectX 11 hardware?

Categories: Order-Independent Transparency Tags:

You want to become a Graphics Programmer …

November 15th, 2009 Wolfgang Engel 1 comment

I regularly receive e-mails with the question what kind of books I recommend if someone wants to become a graphics programmer. Here is my current list (maybe some of you guys can add to this list?):
First of all math is required:
- Vector Calculus
- Vector Calculus, Linear Algebra, and Differential Forms (I have the 1999 version of this book)
- Computer Graphics Mathematical First Steps
- Mathematics for Computer Graphics

For a general knowledge in programming the CPU:
- Write Great Code Volume 1: Understanding the Machine

For a better knowledge on how to program the GPU:
- DirectX documentation
- NVIDIA GPU Programming Guide
- ATI GPU Programming Guide

To learn about how to program certain effects in an efficient way:
- ShaderX – ShaderX7
- GPU Gems – GPU Gems 3
- GPU Pro and GPU Pro Blog

To start learning DirectX 10 API + Shader Programming:
- Introduction to 3D Programming with DirectX 10
- Programming Vertex, Geometry and Pixel Shaders

To start learning OpenGL & OpenGL ES:
- Khronos group

For general overview:
- Real-Time Rendering
- Fundamentals of Computer Graphics (this one also belongs in the math section)

To get started with C:
- C Programming Language

To learn C++:
- C++ for Game Developers
- C++ Cookbook
- there is a long list of more advanced C++ books …

Categories: General GPU Programming Tags:

River of Lights II

October 23rd, 2009 Wolfgang Engel 1 comment

More work-in-progress shots.

Categories: Deferred Lighting Tags:

BitMasks / Packing Data into fp Render Targets

October 15th, 2009 Wolfgang Engel No comments

Recently I had the need to pack bit fields into 32-bit channels of a 32:32:32:32 fp render target.
First of all we can assume that all registers in the pixel shader operate in 32-bit precision and output data is written into a 32-bit fp render target. The 32-bit (or single-precision) floating point format uses 1 sign bit, 8 bits of exponent, and 23 bits of mantissa, following the IEEE 754 standard.

To maintain maximum precision during floating-point computations, most computations use normalized values. Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision in a computation. If several higher-order bits of the mantissa are all zero, the mantissa has that many fewer bits of precision available for computation. Therefore a floating-point computation will be more accurate if it involves only normalized values whose higher-order mantissa bit contains one.

The IEEE 754 32-bit floating-point format specifies special cases for exponent bits that are all zeros or all ones. If all exponent bits are set, then the number represents either +/- infinity or a NaN (not-a-number), depending on the mantissa value. If all exponent bits are zero, then the number is zero or denormalized, and denormalized values automatically get flushed to zero as specified in the Direct3D 10 single-precision floating-point specifications (see Nicolas Thibieroz, “Packing Arbitrary Bit Fields into 16-bit Floating-Point Render Targets in DirectX10”, ShaderX7).

When packing bit values, those cases need to be avoided.

// Pack three positive normalized numbers between 0.0 and 1.0 into a 32-bit fp
// channel of a render target
float Pack3PNForFP32(float3 channel)
{
// layout of a 32-bit fp register
// SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
// 1 sign bit; 8 bits for the exponent and 23 bits for the mantissa
uint uValue;
// pack x
uValue = ((uint)(channel.x * 65535.0 + 0.5)); // goes from bit 0 to 15
// pack y in EMMMMMMM (bits 16 to 23)
uValue |= ((uint)(channel.y * 255.0 + 0.5)) << 16;

// pack z in the upper seven exponent bits (bits 24 to 30); the sign bit stays 0
// z is stored as 1..126, so those seven bits are never all 0 and never all 1
// this keeps the exponent away from 0 (denormal) and 255 (infinity / NaN),
// whatever the lowest exponent bit, which belongs to y, contains
// (an 8-bit value that spills into the sign bit could still leave all
// seven bits at 0 or 1, e.g. for 128 or 127)
uValue |= ((uint)(channel.z * 125.0 + 1.5)) << 24;

return asfloat(uValue);
}
// unpack three positive normalized values from a 32-bit float
float3 Unpack3PNFromFP32(float fFloatFromFP32)
{
float a, b, c;
uint uInputFloat = asuint(fFloatFromFP32);
// unpack a
// mask out all the stuff above 16-bit with 0xFFFF
a = ((uInputFloat) & 0xFFFF) / 65535.0;
b = ((uInputFloat >> 16) & 0xFF) / 255.0;
// extract the 1..126 value range and subtract 1
// ending up with 0..125
c = (((uInputFloat >> 24) & 0x7F) - 1.0) / 125.0;
return float3(a, b, c);
}
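A CPU-side C sketch of this packing makes the exponent constraint easy to check outside of a shader. It rests on one assumption that differs from using a full 8 bits for z: z is limited to seven bits (1..126) stored below the sign bit, so the upper seven exponent bits can never be all zeros or all ones, no matter what the top bit of y contributes. All function names are hypothetical; memcpy stands in for the HLSL asfloat() / asuint() intrinsics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit reinterpretation, comparable to HLSL asfloat() / asuint(). */
float bits_as_float(uint32_t bits) { float f; memcpy(&f, &bits, sizeof f); return f; }
uint32_t float_as_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }

/* Packs x into bits 0..15, y into bits 16..23 and z (as 1..126) into bits 24..30. */
uint32_t pack3(float x, float y, float z)
{
    assert(x >= 0.0f && x <= 1.0f);
    assert(y >= 0.0f && y <= 1.0f);
    assert(z >= 0.0f && z <= 1.0f);
    uint32_t bits = (uint32_t)(x * 65535.0f + 0.5f);
    bits |= (uint32_t)(y * 255.0f + 0.5f) << 16;
    bits |= (uint32_t)(z * 125.0f + 1.5f) << 24;
    return bits;
}

/* True if the exponent is neither all zeros (denormal) nor all ones (inf / NaN). */
int exponent_is_safe(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    return exponent != 0u && exponent != 0xFFu;
}

/* Inverse of pack3; out receives x, y, z in that order. */
void unpack3(uint32_t bits, float out[3])
{
    out[0] = (float)(bits & 0xFFFFu) / 65535.0f;
    out[1] = (float)((bits >> 16) & 0xFFu) / 255.0f;
    out[2] = ((float)((bits >> 24) & 0x7Fu) - 1.0f) / 125.0f;
}
```

A quick way to use it is to loop over all 256 values of y and all 126 values of z and confirm that exponent_is_safe() holds for every combination before trusting the layout on the GPU.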
Categories: General GPU Programming Tags:

River of Lights

September 30th, 2009 Wolfgang Engel No comments

Work in progress shot here. More than 8000 lights attached to particles in this hallway. Resolution is 1280×720 and the GPU still runs at 158 frames per second. The whole level has about 16k lights.

Categories: Deferred Lighting Tags:

SIGGRAPH 2009 Impressions: Inferred Lighting

August 11th, 2009 Wolfgang Engel 1 comment

There is a new lighting approach that extends the Light Pre-Pass idea. It is called Inferred Lighting and it was presented by Scott Kircher and Alan Lawrence from Volition. Here is the link

http://graphics.cs.uiuc.edu/~kircher/publications.html

They assume a Light Pre-Pass concept as covered here on this blog with three passes: the geometry pass where they fill up the G-Buffer, the lighting pass where light properties are rendered into a light buffer, and a material pass in which the whole scene is rendered again, this time reconstructing different materials.
Their approach adds several new techniques to the toolset used to do deferred lighting / Light Pre-Pass.

1. They use a much smaller G-Buffer and Light buffer with a size of 800×540 on the XBOX 360. This way their memory bandwidth usage and pixel shading cost should be greatly reduced.

2. To upscale the final light buffer, they use Discontinuity Sensitive Filtering. During the geometry pass, one 16-bit channel of the DSF buffer is filled with the linear depth of the pixel, the other 16-bit channel is filled with an ID value that semi-uniquely identifies continuous regions. The upper 8 bits are an object ID, assigned per-object (renderable instance) in the scene. Since 8 bits only allow 256 unique object IDs, scenes with more than this number of objects will have some objects sharing the same ID.
The lower 8 bits of the channel contain a normal-group ID. This ID is pre-computed and assigned to each face of the mesh. Anywhere the mesh has continuous normals, the ID is also continuous. A normal is continuous across an edge if and only if the two triangles share the same normal at both vertices of the edge.
By comparing normal-group IDs the discontinuity sensitive filter can detect normal discontinuities without actually having to reconstruct and compare normals. Both the object ID and normal-group ID must exactly match the material pass polygon being rendered before the light buffer sample can be used (depth must also match within an adjustable threshold).
During the material pass, the pixel shader computes the locations of the four light buffer texels that would normally be accessed if regular bilinear filtering were used. These four locations are point sampled from the DSF buffer. The depth and ID values retrieved from the DSF buffer are compared against the depth and ID of the object being rendered. The results of this comparison are used to bias the usual bilinear filtering weights so as to discard samples that do not belong to the surface currently being rendered. These biased weights are then used in custom bilinear filtering of the light buffer. Since the filter only uses the light buffer samples that belong to the object being rendered, the resulting lighting gives the illusion of being at full resolution. This same method works even when the framebuffer is multisampled (hardware MSAA); however, sub-pixel artifacts can occur, due to the pixel shader only being run once per pixel, rather than once per sample.
The authors report that such sub-pixel artifacts are typically not noticeable.
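The weight biasing can be illustrated with a small C sketch. It is not the authors' shader: the struct, the function name and the fallback parameter are hypothetical, and the biasing is simplified to zeroing the bilinear weight of every mismatching sample and renormalizing the rest.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical DSF buffer texel: linear depth plus the 16-bit region ID. */
typedef struct {
    float depth;
    uint16_t id; /* upper 8 bits: object ID, lower 8 bits: normal-group ID */
} DsfSample;

/* Filters four light buffer texels, using only those that belong to the surface. */
float dsf_filter(const float light[4], const DsfSample dsf[4],
                 const float bilinear_weight[4], DsfSample surface,
                 float depth_threshold, float fallback)
{
    float weight_sum = 0.0f;
    float result = 0.0f;
    for (int i = 0; i < 4; ++i) {
        /* IDs must match exactly, depth only within the threshold. */
        int matches = dsf[i].id == surface.id &&
                      fabsf(dsf[i].depth - surface.depth) <= depth_threshold;
        float w = matches ? bilinear_weight[i] : 0.0f;
        result += light[i] * w;
        weight_sum += w;
    }
    /* No usable sample: the caller decides what to do (assumed fallback value). */
    if (weight_sum <= 0.0f) {
        return fallback;
    }
    return result / weight_sum;
}
```

Because the surviving weights are renormalized, a pixel next to a silhouette still receives a full-strength lighting value instead of a mix with the lighting of whatever lies behind it.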

3. The authors of this paper also implemented a technique that allows rendering alpha polygons with Light Pre-Pass / Deferred Lighting. It is based on stippling and the use of DSF.
During the geometry pass the alpha polygons are rendered using a stipple pattern, so that their G-Buffer samples are interleaved with opaque polygon samples.
In the material pass the DSF for opaque polygons will automatically reject stippled alpha pixels, and alpha polygons are handled by finding the four closest light buffer samples in the same stipple pattern, again using DSF to make sure the samples were not overwritten by some other geometry.
Since the stipple pattern is a 2×2 regular pattern, the effect is that the alpha polygon gets lit at half the resolution of opaque objects. Opaque objects covered by one layer of alpha have a slightly reduced lighting resolution (one out of every four samples cannot be used).
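The 2×2 stipple bookkeeping is simple enough to sketch in C. The slot numbering and function names are assumptions for illustration; the sketch only covers which pixels an alpha polygon writes in the geometry pass and where the material pass starts looking for light buffer samples of the same pattern.

```c
#include <assert.h>

/* Slot of a pixel inside its 2x2 stipple cell: 0..3 in row-major order. */
int stipple_slot(int x, int y)
{
    assert(x >= 0 && y >= 0);
    return (y & 1) * 2 + (x & 1);
}

/* Geometry pass: an alpha polygon only writes the pixels of its own slot. */
int alpha_writes_pixel(int x, int y, int alpha_slot)
{
    assert(alpha_slot >= 0 && alpha_slot <= 3);
    return stipple_slot(x, y) == alpha_slot;
}

/* Material pass: largest coordinate <= coord with the requested parity.
   May return -1 at the border; clamping is left to the caller. */
int stipple_floor(int coord, int parity)
{
    assert(coord >= 0 && (parity == 0 || parity == 1));
    return ((coord & 1) == parity) ? coord : coord - 1;
}
```

For a pixel (x, y) and a slot s, the four closest samples of the same pattern would be at x0 = stipple_floor(x, s & 1) and y0 = stipple_floor(y, s >> 1), plus the neighbors at x0 + 2 and y0 + 2. Since only one pixel out of every 2×2 cell carries alpha data, the alpha polygon ends up lit at half the resolution in each direction.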

Categories: SIGGRAPH Tags:

SIGGRAPH 2009

July 28th, 2009 Wolfgang Engel No comments

SIGGRAPH is next week and I am still preparing my talk. If you are around please come by and say hi. My talk’s title is “Light Pre-Pass Renderer Mark III” and it is part of the “Advances in Real-Time Rendering in 3D Graphics and Games” day on Monday next week:

http://www.siggraph.org/s2009/sessions/courses/details/?id=12

I collected all the new development in this area, and added a few new things I found out while working on DirectX 10 / 11 implementations and will post a link to the slides here. Especially on the PS3 there is lots of new and interesting development (Judging from the number of games that will ship with this approach I want to believe that it is the most popular way to apply lots of lights in games now). I received a first draft of an article for ShaderX8 / GPU Pro from Steven Tovey about how they implemented the Light Pre-Pass in the upcoming game Blur on the PS3. They based their approach on work done by Matt Swoboda. The results look very cool. You can check out the screenshots on their website.

There is lots of progress happening with the Oolong Engine for the iPhone / iPod Touch. Check out the change list on

http://code.google.com/p/oolongengine

We got OpenGL ES 2.0 running and there is a new tutorial series that looks really cool.

In other news somehow my name was mentioned on “The Escapist”. Here is the link for your entertainment:

http://www.escapistmagazine.com/articles/view/columns/publishers-note/6250-Publishers-Note-Made-By-People.2

Categories: SIGGRAPH Tags:

MSAA on the PS3 with Light Pre-Pass on the SPU

July 3rd, 2009 Wolfgang Engel No comments

In the previous “MSAA on the PS3” thread Matt Swoboda jumped in and mentioned that they implemented MSAA on the SPU in the Phyre Engine. I knew that they implemented the Light Pre-Pass on the SPU but I completely forgot that they also had a solution to do MSAA on the SPU.
You can find the presentation “Deferred Lighting and Post Processing on PLAYSTATION®” here.
Because it is possible to read and write per sample with the SPU, they can achieve functionality similar to the per-sample frequency of DirectX 10.1-class graphics hardware, where each sample can be treated separately. So they can calculate the lighting for each of the sample values and write the results into each of the samples in the light buffer.

Categories: Deferred Lighting, MSAA, PS3 Tags:

Ambient Occlusion in Screen-Space

June 29th, 2009 Wolfgang Engel No comments

Screen-Space Ambient Occlusion (SSAO) is quite popular at the moment. ShaderX7 had several articles and there are lots of approaches to gradually improve the effect.
A good way to look at SSAO or any similar approach is to consider it part of a whole pipeline of effects that can share resources and extend the idea to include one diffuse (and specular) indirect bounce of light by re-using resources.
The overall issues with SSAO are:
1. quite expensive for the image quality improvement. Using the astonishingly high amount of frame time for other effects is an intriguing idea. In other words the performance / quality-improvement ratio is not very good compared to e.g. PostFX where a bunch of effects consumes a similar amount of time.
2. a typical problem is that lighting is ignored by SSAO. Using the classical SSAO implementation under varying illumination introduces objectionable artifacts because the ambient term is darkened equally (obviously you can apply SSAO to the diffuse and specular term like a shadow term … but then it isn’t ambient anymore). If you have a “global ambient” light term like skylights, SSAO will diminish the effect. It also leads to problems with dynamic shadows.

Overall I believe a fundamental shift to a more generic method is necessary to solve those issues. This is one of the things I am looking into … so expect an update at some point in the future.

Categories: Global Illumination Tags:

MSAA on the PS3 with Deferred Lighting / Shading / Light Pre-Pass

June 17th, 2009 Wolfgang Engel No comments

The Killzone 2 team came up with an interesting way to use MSAA on the PS3. You can find it on page 39 of the following slides:

http://www.dimension3.sk/mambo/Articles/Deferred-Rendering-In-Killzone/View-category.php

What they do is read both samples in the multisampled render target, do the lighting calculations for both of them and then average the result and write it into the multi-sampled (… I assume it has to be multi-sampled because the depth buffer is multisampled) accumulation buffer. That somewhat decreases the effectiveness of MSAA because the pixel averages all samples regardless of whether they actually pass the depth-stencil test. The multisampled accumulation buffer may therefore contain different values per sample when it was supposed to contain a unique value representing the average of all samples. On the other hand, they might only store a value in one of the samples and resolve afterwards … which would mean the pixel shader runs only once.
This is also called “on-the-fly resolves”.

It is better to write a dedicated value into each sample by using the sampling mask, but then, in the case of 2xMSAA, you run your pixel shader twice … DirectX10.1+ has the ability to run the pixel shader per sample. That doesn’t mean it fully runs per sample. The MSAA unit seems to replicate the color value accordingly. That’s faster but not possible on the PS3. I can’t remember if the XBOX 360 has the ability to run the pixel shader per-sample but this is possible.

Categories: Deferred Lighting, Edge Detection, MSAA, PS3 Tags:

Multisample Anti-Aliasing

June 13th, 2009 Wolfgang Engel 2 comments

Utilizing the Multisample Anti-Aliasing (MSAA) functionality of graphics hardware for deferred lighting can be challenging. Nicolas Thibieroz wrote an excellent article about MSAA published in ShaderX7 with the title “Deferred Shading with Multisampling Anti-Aliasing in DirectX10”.
The following figure from the ShaderX7 article shows how MSAA works:

The pixel represented by a square has two triangles (blue and yellow) crossing some of its sample points. The black dot represents the pixel sample location (pixel center); this is where the pixel shader is executed. The cross symbol corresponds to the location of the multisamples where the depth tests are performed. Samples passing the depth test receive the output of the pixel shader. Those samples are replicated by the MSAA back-end into a multisampled render target that represents each pixel with -in that case- four samples. That means the render target size for an intended resolution of 1280×720 would be 2560×1440 representing each pixel with four samples but the pixel shader only writes 1280×720 times (assuming there is no overdraw) while the MSAA back-end replicates for each pixel four samples into the multisampled render target.
With deferred lighting there can be several of those multi-sampled render targets as part of a Multiple-Render-Target (MRT). In the so-called Geometry stage, data is written into this MRT, which is therefore called the G-Buffer. In case of 4xMSAA each of the render targets of the G-Buffer would be 2560×1440 in size.
In case of Deferred Lighting / Light Pre-Pass the G-Buffer holds normal and depth data. This data can never be resolved because resolving it would lead to incorrect results as shown by Nicolas in his article.
After the Geometry phase comes the Lighting or Shading phase in a Deferred Lighting/Light Pre-Pass/Deferred Shading renderer. In an ideal world you could blit each sample (not pixel) into the multisampled render target -that holds the result of the Shading phase- by reading the G-Buffer sample and performing all the calculations necessary on it.
In other words to achieve the best possible MSAA quality with those renderer designs, lighting equations would need to be applied on a per-sample basis into a multisampled render target and then later resolved.
This is possible with DirectX 10.1 graphics hardware (AMD’s 10.1 capable cards; I didn’t try whether S3 cards that support 10.1 can do this as well), which allows executing a pixel shader at sample frequency.
To make this a viable option, this operation needs to be restricted to samples that belong to pixel edges. There are two passes necessary to make this work. One pass will use the pixel shader that runs operations performed on samples and in a second pass the pixel shader is run that performs operations per-pixel, which means the result of the pixel shader calculation is output to all samples passing the depth-stencil test.
To restrict the pixel shader that performs operations per-sample, a stencil test is used.
One interesting idea covered in the article is to detect edges with centroid sampling (available already on DirectX9 class graphics hardware). During the G-Buffer phase the vertex shader writes a variable unique to every pixel (e.g. pixel position data) into two outputs, while the associated pixel shader declares two inputs: one without and one with centroid sampling enabled. The pixel shader then compares the centroid-enabled input with the one without it. Differing values mean that samples were only partially covered by the triangle, indicating an edge pixel. A “centroid value” of 1.0 is then written out to a selected area of the G-Buffer (previously cleared to 0.0) to indicate that the covered samples belong to an edge pixel. Those values are then averaged while being resolved to find out the value per pixel. If the result is not exactly 0, then the current pixel is an edge pixel. This is shown in the following image from the article.
On the left the pixel shader input will always be evaluated at the center of the pixel regardless of whether it is covered by the triangle. On the right with centroid sampling, the two rightmost depth samples are covered by the triangle. The comparison of the values in the pixel shader will lead to the result that the samples were only partially covered by the triangle, indicating an edge pixel.
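The edge test boils down to "the resolved centroid value is not exactly 0". Below is a simplified C sketch of that logic for a single triangle touching a pixel. It assumes that the centroid-enabled input differs from the pixel-center input exactly when the triangle covers some but not all samples; the coverage mask and function names are hypothetical.

```c
#include <assert.h>

/* Emulates the G-Buffer write plus resolve for one pixel and one triangle.
   coverage_mask has one bit per sample; the target was cleared to 0.0. */
float resolved_centroid_value(unsigned coverage_mask, unsigned sample_count)
{
    assert(sample_count >= 1u && sample_count <= 31u);
    unsigned full_mask = (1u << sample_count) - 1u;
    coverage_mask &= full_mask;
    /* Partial coverage: the two shader inputs differ, so 1.0 is written out. */
    int partially_covered = coverage_mask != 0u && coverage_mask != full_mask;
    float written = partially_covered ? 1.0f : 0.0f;
    float sum = 0.0f;
    for (unsigned i = 0; i < sample_count; ++i) {
        /* Only covered samples receive the shader output; others keep 0.0. */
        if (coverage_mask & (1u << i)) {
            sum += written;
        }
    }
    return sum / (float)sample_count; /* the resolve averages the samples */
}

/* A pixel is an edge pixel if the resolved value is not exactly 0. */
int is_edge_pixel(float resolved_value)
{
    return resolved_value != 0.0f;
}
```

With 4xMSAA a triangle covering two samples resolves to 0.5, while full coverage resolves to 0.0, so a simple comparison against zero in the lighting phase is enough to build the stencil mask for the per-sample pass.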
Because DirectX10 capable graphics hardware does not support the pixel shader running at sample frequency, a different solution needs to be developed here.
The best MSAA quality in that case is achieved by running the pixel shader multiple times per pixel, only enabling output to a single sample each pass. This can be achieved with the sample mask parameter of the OMSetBlendState() API. The results of this method would be identical to the DirectX 10.1 method, but it is obviously more expensive due to the increased number of rendering passes and slightly reduced texture cache effectiveness.

Categories: D3D10, Deferred Lighting, MSAA Tags:

Deferred Lighting / Particle System

May 23rd, 2009 Wolfgang Engel No comments

Here is a shot of a GPU based particle system with lights attached to each particle. I used Emil Persson’s example Deferred Shading program as a basis to implement a Light Pre-Pass renderer with 4k lights and 4k particles. It runs fairly well on a GeForce 9600 GT here:

Categories: Deferred Lighting, Particle System Tags:

Light Pre-Pass: Knee-Deep

May 18th, 2009 Wolfgang Engel 2 comments

Several companies adopted the Light Pre-Pass idea, modified it or came up with similar ideas:

  • Crytek: they call it Deferred lighting as opposed to Deferred shading. The technique is mentioned in the new Cry Engine 3 presentation here
  • Garagegames in their new Torque 3D engine currently in beta. Read the article from Pat Wilson in ShaderX7 and the garagegames website
  • Insomniac came up with a Pre-lighting approach that is similar to this. See Mark Lee’s presentation from GDC 2009 here
  • DICE has been using it for a long time already
  • I believe EA used it in Dead Space :-)
  • Carsten Dachsbacher described a similar idea in his article “Splatting of Indirect Illumination” here and in ShaderX5
One of the interesting areas in this context is the ability to implement a one-bounce global illumination effect with the data in the G-Buffer and the light buffer …
Categories: Deferred Lighting Tags:

3D Supershape

May 1st, 2009 Wolfgang Engel No comments

Over the last few years I was looking into the 3D Supershape formula described by Paul Bourke here and originally developed by Johan Gielis. I love the shape of the objects that are a result of those and therefore I always wanted to use it to create my own demos after I saw the one from Jetro Lauha (http://jet.ro/creations). Here is my first attempt to generate C source out of the equations:

Suitable C pseudo code could be:

float r = pow(pow(fabs(cos(m * o / 4)) / a, n2) + pow(fabs(sin(m * o / 4)) / b, n3), 1 / n1);

The result of this calculation is in polar coordinates. Please note the difference between the equation and the C code. The equation has a negative power value, the C code doesn’t. To extend this result into 3D, the spherical product of several superformulas is used. For example, the 3D parametric surface is obtained by multiplying two superformulas S1 and S2. The coordinates are defined by the relations:

The sphere mapping code uses two r values:

point->x = (float)(cosf(t) * cosf(p) / r1 / r2);
point->y = (float)(sinf(t) * cosf(p) / r1 / r2);
point->z = (float)(sinf(p) / r2);

Because r1 and r2 had a positive power value in the C code above we have to divide by those variables here. Here is a Mathematica render of this code:
Categories: Geometry Manipulations Tags:

Rockstar Games

April 30th, 2009 Wolfgang Engel No comments

GTA IV launched a year ago today, and today is also my last day as an employee of Rockstar Games. After more than four fantastic years I felt like I should take a break to go back to some research topics and see my kids growing for a while :-) , so I gave my notice two weeks ago.

Categories: Rockstar Games Tags:

Beagle Board

April 30th, 2009 Wolfgang Engel No comments

I got the whole development environment going and wrote a few small little graphics demos for it. All the PowerVR demos I tried ran on it nicely. Very cool!

If you are interested in a next-gen mobile development platform I would definitely recommend looking into this at http://beagleboard.org/

Any further development has now moved to lowest priority … maybe at some point I will play around more with Angstroem. There is an online image builder

http://amethyst.openembedded.net/~koen/narcissus/

Categories: Handheld Development Tags:

BeagleBoard.org Ubuntu 8.04

April 21st, 2009 Wolfgang Engel 1 comment

In the last few days I set up a development environment for a BeagleBoard (see beagleboard.org). I wanted to hold the next-gen environment for future phones and the OpenPandora in my hands today. Overall the size of the board is astonishingly small and you can power it with the USB port. The board runs Angstroem -a Linux OS- and has the OMAP3530 processor on it. It has a dedicated video decode DSP, the PowerVR SGX chipset, a sound chip and a few other things that I haven’t used so far. You can even plug in a keyboard and a mouse and you have a full-blown computer with 256 MB RAM and 256 MB SDRAM.
To get this going I had to install a Linux OS on one of my PCs; Ubuntu 8.04. To relieve the pain of having to google all the Linux commands again and again I try to write down a few notes for myself here:
- minicom is not installed by default. You have to install it yourself. To do this you have to open up Applications -> Add/Remove and refresh the package list (you need an internet connection for this) and then install the build essentials first and then minicom by typing into a terminal:
sudo apt-get install build-essential
sudo apt-get install minicom
- to look for the RS232 serial device you can use
dmesg | grep tty
I found adding environment variables to the PATH statement different on Ubuntu 8.04. You can set an environment variable by using
export VARNAME=some_string
e.g.
export PATH=$PATH:some/other/path
To check if it is set you can use
echo $PATH
For the PLATFORM you set it by typing
export PLATFORM=LinuxOMAP3
you use
echo $PLATFORM
to check if it is correct.
Similarly, for library paths you type
export LIBDIR=$PWD
from the directory where the lib files are. To check that this works you can use
echo $LIBDIR
To make all those variable values persistent you can copy those statements to the end of the .bashrc file. Some other things I found convenient were:
gksudo gedit
start the editor with sudo.
Copying a file from one directory to another can be done by using the cp command like this
$ cp -i goulash recipes/hungarian
cp: overwrite recipes/hungarian/goulash (y/n)?

You can copy a directory path in the terminal by dragging the file from the file browser into the terminal command line.

Categories: Handheld Development Tags:

ShaderX7 on Sale

March 21st, 2009 Wolfgang Engel No comments

ShaderX7 has more than 800 pages. I like the following screenshot from Amazon.com:
ShaderX8 is already announced. Proposals are due by May 19th, 2009. Please send them to wolf at shaderx.com. An example proposal, writing guidelines and a FAQ can be downloaded from www.shaderx6.com/ShaderX6.zip. The schedule is available on http://www.shaderx8.com/.

Thanks to Eric Haines for reminding me to add this to this page :-)

Categories: ShaderX Tags:

Mathematica

March 19th, 2009 Wolfgang Engel No comments

I switched from Maple to Mathematica last week. One of my small little projects is to store all the graphics algorithms I liked to visualize in the last few years in one file. A kind of condensed memory of the things I worked on. Here is an example for a simple Depth of Field effect (as already covered in my GDC 2007 talk):

Distance runs on the axis called Z value. So 0 is close to the camera and 1.0 is far away. You can see how the near and far blur planes fade in and out as the value called Range increases. The equation to plot this in Mathematica is rather simple. In practice it is a quite efficient approach to achieve the effect.

Plot3D[R*Abs[0.5 - z], {z, 1, 0}, {R, 0, 1},
PlotStyle -> Directive[Pink, Specularity[White, 50], Opacity[0.8]],
PlotLabel -> "Depth of Field", AxesLabel -> {"Z value", "Range"}]
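The plotted expression R*Abs[0.5 - z] translates directly into code. Here is a minimal C sketch; the function and constant names are hypothetical, and the focus plane is fixed at z = 0.5 exactly as in the plot.

```c
#include <assert.h>
#include <math.h>

/* Focus plane used in the plot above: z = 0.5 is perfectly sharp. */
#define FOCUS_Z 0.5f

/* Blur amount for a normalized depth z in [0, 1] and a blur range in [0, 1]. */
float blur_amount(float z, float range)
{
    assert(z >= 0.0f && z <= 1.0f);
    assert(range >= 0.0f && range <= 1.0f);
    return range * fabsf(FOCUS_Z - z);
}
```

The blur grows linearly towards the near plane (z = 0) and the far plane (z = 1) and is scaled by Range, so a pixel at the far plane with Range = 1 gets a blur amount of 0.5.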

My plan is to develop a few new algorithms and show the results here. It will be an exercise in thinking about new things for me. If you have any suggestions on what I should cover, please do not hesitate to post them in the comment line.

Categories: Mathematics Tags:

Team Leadership in the Game Industry

February 23rd, 2009 Wolfgang Engel No comments

A few of my friends contributed to the book “Team Leadership in the Game Industry” by Seth Spaulding II. So I was curious what you can write about leaders in this industry. Having spent most of my professional life outside of the game industry I believe I developed a different frame of reference than many of my colleagues.

First of all: the book is great and definitely worth a read. It is written in a very informative, instructive and entertaining way (… if you know the guys that contributed to it you know that it is worth it :-) ).

With that being said, let’s start with the review by looking at the Table of Contents. I know that I usually spend more time than other people reading the TOC. This is the best way for me to figure out what a book has to offer. A good TOC shows you the big picture of a book and allows you to see the pattern that the author chose on how to approach the topic. In most cases it even allows you to check the underlying logic.
The book consists of 9 chapters. Each chapter consists of an analysis of facts by the author followed by an interview of a game industry veteran. The topics span from “How We got here” over “Anatomy of a Game-Dev Company”, “How Leaders are Chosen …”, “A Litmus Test for Leads”, “Leadership Types and Traits …” and then they go into more detail with the “The Project Team Leader …”, “The Department Leader …”, “Difficult Employees …”, “The Effect of Great Team Leadership” followed by a “Sample Skill Ladder” for artists in the appendix.

You might feel the need to discuss some of the details covered in each chapter but it is clear that this is the right formal approach to slice up the delicate topic of leadership in our industry.

When I first skimmed through the book I wanted to figure out what kind of values the author has. After all a good leader makes it clear what kind of values he/she follows. I found it in the introduction. Here is the quote: “As will be seen, a major cause of people leaving a company is the perceived poor quality of their supervisors and senior management. The game business is a talent-based industry - the stronger and deeper your talent is, the better chances are of creating a great game. It is very difficult, in any hiring environment, to build the right mix of cross-disciplinary talent who function as a team at a high level; indeed, most companies never manage it. Once you get talented individuals on board, it’s critical not to lose them. Finding and nurturing competent leaders who have the trust of the team will generate more retention than any addition of pool tables, movie nights, or verbal commitments to the value of “quality of life”.”
You might think this is the most obvious thing to say in the game industry.

Obviously the book wants to cover the process of setting up a creative and great environment for all humans involved in the process of creating great games. Creating a great working environment starts with picking the right leaders that enable people by helping them to give their best. A great leader serves his/her people. He/she sees the best in everyone and has the ability to expose this talent. Many interviewees in the book also mention that humor is a leadership skill. I trained junior managers for BMW, Daimler, ABB and other companies back in Germany for two years on weekends and I always thought this is a strong skill. Making people laugh starts a lot of processes in the body that make people more relaxed and in general brighten up their day. Whoever can do this can certainly improve the morale and therefore efficiency of a team in seconds … priceless.

Managing a creative team is a completely different story than -for example- a sales team. The human factor in the relationship between people plays an important role. They have to create something together. While a sales person is on his own out in the field, comes back with a number and relies on a relationship with a potential customer that only lasts a few hours of face-to-face time, a creative team stays together for years and has to overcome all the things that come up when humans have to live in a small space together. There is a complex social network in place that defines the relationships between those humans and it is important to keep the team running with all the constantly changing love/hate -and in-between- relationships on board. People on the team might even deal with difficult personal relationships and you end up with a mixture of chaos and randomness typical for family or close friends scenarios. In that context it was interesting to see what the interviewees thought about the question of whether leaders are born and / or can be trained to be successful in the game industry. Obviously someone who was active as a boy-scout leader, speaker/president of the students association at his university or volunteered to work with other people in general, already showed some level of social commitment that is a good starting point for a leadership role in our industry.

So defining and following the right values is a fundamental requirement for a book on leadership. Obviously after having set the values comes the part where those values need to be applied and used and this is where the book shines. It is hands-on, and even if you do not agree with the author in every detail, the fact that he wrote all this down earns the highest respect.

So now that I made it obvious that I am excited about this book, let’s think about how it might be improved in the future. A potential improvement I could see is to start the book with a target description. Not that the author fails to describe a target, but I would appreciate it if the book went into more detail in this area.
What is the company you would want to work for? What is the environment you want to offer to make people as productive as possible? Obviously it is a chicken-and-egg problem. Good people want to work in good teams and good teams consist of good people … there are social -soft skills- and knowledge -hard skills- attached to each person of that team.
A good team starts with a good leader who sets values and standards and hires the right people.

Assuming you are the leader of this future team, how would you create the environment for your dream team? How do you want people to feel when they are part of this team? What should they take home every night when they are exhausted? What do you want them to tell their wives / better halves about how it is to work with you as their leader?
A happy employee -fully enforced to be creative :-) – should tell his wife/girlfriend that he works very hard but is treated fairly and enjoys the family-related benefits of the company.
He should tell his friends that he is working in a team where information is shared and where his potential is not only used as much as possible but also amplified. He needs to feel like he is growing with the team and the tasks.
He should tell his colleagues that he enjoys working with them and the team and that he enjoys coming into work every day and that he is excited about the project he is working on …

So if we make that into a list of items we could describe how an employee should feel about working in a company with good leaders. That might be a great starting point for discussing core leadership abilities.

Categories: Leadership Tags:

Larrabee on GDC

February 3rd, 2009 Wolfgang Engel No comments

I am really looking forward to Mike Abrash’s and Tom Forsyth’s talks at GDC about Larrabee:

Talking about the Larrabee instruction set will be super cool … can’t wait to see this.
Categories: General GPU Programming Tags:

ShaderX7 Update

February 2nd, 2009 Wolfgang Engel No comments

I updated the ShaderX7 website at

http://www.shaderx7.com/

There is now the first draft of the cover and the Table of Contents. Enjoy! :-)

As before I will rest for a second when the new book comes out and think about what happened since I founded the series now eight years ago … my perception of time slows down for this second :-) and I hear myself saying: “Chewbacca, start the hyperdrive, let’s go to the next planet, I need to play cards, drink alcohol and find some entertainment … how about Tatooine?”

Categories: ShaderX Tags:

iP* programming tip #9

January 25th, 2009 Wolfgang Engel No comments

This issue of the iPhone / iPod Touch programming tips series focuses on some aspects of VFP assembly programming. My friend Noel Llopis brought an oversight in the VFP math library to my attention that I still need to fix. So I start with the description of the problem here and promise to fix it soon in the VFP library :-)
First let’s start with the references. My friend Aaron Leiby has a blog entry on how to start programming the VFP unit here:

A typical inline assembly template might look like this:

asm ( assembler template
    : output operands                  /* optional */
    : input operands                   /* optional */
    : list of clobbered registers      /* optional */
    );

The last three lines of this template hold the output operands, the input operands and the so-called clobbers, which inform the compiler about which registers are used.
Here is a simple GCC assembly example -that doesn’t use VFP assembly- that shows how the input and output operands are specified:

asm("mov %0, %1, ror #1" : "=r" (result) : "r" (value));

The idea is that "=r" holds the result and "r" is the input. %0 refers to "=r" and %1 refers to "r".
Each operand is referenced by a number. The first output operand is numbered 0, and the numbering continues in increasing order through the input operands. There is a maximum number of operands … I don’t know what the maximum is for the iPhone platform.

Some instructions clobber some hardware registers. We have to list those registers in the clobber list, i.e. the field after the third ':' in the asm statement, so that GCC will not assume that the values it loaded into these registers are still valid.
In other words, a clobber list tells the compiler which registers were used but not passed as operands. If a register is used as a scratch register, it needs to be mentioned there. Here is an example:

asm volatile("ands    r3, %1, #3"     "\n\t"
             "eor     %0, %0, r3"     "\n\t"
             "addne   %0, #4"
             : "=r" (len)
             : "0" (len)
             : "cc", "r3"
            );

This snippet rounds len up to the next multiple of four. r3 is used as a scratch register here, so it is listed as a clobber. The cc pseudo register tells the compiler that the code modifies the condition code flags (ands sets them and addne reads them). If the asm code changes memory, the "memory" pseudo register informs the compiler about this.

asm volatile("ldr     %0, [%1]"         "\n\t"
             "str     %2, [%1, #4]"     "\n\t"
             : "=&r" (rdv)
             : "r" (&table), "r" (wdv)
             : "memory"
            );

This special clobber informs the compiler that the assembler code may modify any memory location. Btw. the volatile attribute instructs the compiler not to remove or move the asm statement while optimizing, even if its outputs look unused.
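To try the operand syntax outside of an iPhone project, here is a minimal sketch that wraps the rotate example from above in a function. The function name rotate_right_1 and the portable fallback are my additions for illustration; the fallback is only there so the code also builds on machines that are not 32-bit ARM.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by one bit.
   On 32-bit ARM (GCC or Clang, ARM mode) this uses the inline assembly
   from the tip; everywhere else it falls back to plain C. */
static uint32_t rotate_right_1(uint32_t value)
{
    uint32_t result;
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    __asm__("mov %0, %1, ror #1"
            : "=r" (result)    /* %0: output operand */
            : "r" (value));    /* %1: input operand, no clobbers needed */
#else
    result = (value >> 1) | (value << 31);
#endif
    return result;
}
```

No clobber list is needed here because the instruction only touches its two operands and does not change the condition flags.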

If you want to add something to this tip … please do not hesitate to write it in the comments. I will then add it with your name.

Partial Derivative Normal Maps

January 10th, 2009 Wolfgang Engel No comments

To make my collection of normal map techniques more complete on this blog I also have to mention a special normal mapping technique that Insomniac’s Mike Acton brought to my attention a long time ago (I wasn’t sure if I am allowed to publish it … but now they have slides on their website).

The idea is to store the partial derivatives of the normal in two channels of the map like this
dx = (-nx/nz);
dy = (-ny/nz);
Then you can reconstruct the normal like this:
nx = -dx;
ny = -dy;
nz = 1;
normalize(n);
The advantage is that you do not have to reconstruct Z, so you can skip one instruction in each pixel shader that uses normal maps.
This is especially cool on the PS3 while on the XBOX 360 you can also create a custom texture format to let the texture fetch unit do the scale and bias and save a cycle there.
More details can be found at
Look for Partial Derivative Normal Maps.

Handling Scene Geometry

January 4th, 2009 Wolfgang Engel No comments

I recently bumped into a post by Roderic Vicaire on the www.gamedev.net forums. It is here.
Obviously there is no generic solution to handle all scene geometry in the same way but depending on the game his naming conventions make a lot of sense (read “Scenegraphs say no” in Tom Forsyth’s blog).
- SpatialGraph: used for finding out what is visible and should be drawn. Should make culling fast
- SceneTree: used for hierarchical animations, e.g. skeletal animation or a sword held in a character’s hand
- RenderQueue: is filled by the SpatialGraph. Renders visible stuff fast. It sorts sub arrays per key, each key holding data such as depth, shaderID etc. (see Christer Ericson’s blog entry “Sort based-draw call bucketing” for this)
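
The RenderQueue idea of sorting by key can be sketched in a few lines of C. This is only an illustration under my own assumptions: the key layout (shader id in the upper 16 bits, quantized depth in the lower 16 bits) and all names are hypothetical and not taken from the forum post.

```c
#include <stdint.h>
#include <stdlib.h>

/* One entry of the render queue: a packed sort key plus the draw call it refers to. */
typedef struct {
    uint32_t key;
    int      draw_id;
} RenderItem;

/* Hypothetical key layout: shader id first, so state changes are minimized,
   then quantized depth, so draws with the same shader go front to back. */
static uint32_t make_key(uint16_t shader_id, uint16_t depth)
{
    return ((uint32_t)shader_id << 16) | depth;
}

static int compare_items(const void *a, const void *b)
{
    uint32_t ka = ((const RenderItem *)a)->key;
    uint32_t kb = ((const RenderItem *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort the visible items that the SpatialGraph pushed into the queue. */
static void sort_render_queue(RenderItem *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_items);
}
```

Which fields go into the key, and in which order, depends on the game; putting depth in the high bits instead would favor early Z rejection over fewer shader switches.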

Categories: Geometry Manipulations Tags:

Major Oolong Update

December 29th, 2008 Wolfgang Engel No comments

Two days ago I committed a major Oolong update. Please check out the Oolong Engine blog at

http://www.oolongengine.com

I updated the memory manager, the math library, upgraded to the latest POWERVR POD format and added VBO support to each example. Please also note that in previous updates a new memory manager was added, the VFP math library was added and a bunch of smaller changes were done as well.
The things on my list are: looking into the sound manager … it seems like the current version allocates memory during the frame … and adding the DOOM III level format as a game format. Obviously zip support would be nice as well … let’s see how far I get.

Programming Vertex, Geometry and Pixel Shaders

December 25th, 2008 Wolfgang Engel 1 comment

A Christmas present: we just went public with “Programming Vertex, Geometry and Pixel Shaders”. I am a co-author of this book and we published it for free on www.gamedev.net at

http://wiki.gamedev.net/index.php/D3DBook:Book_Cover

If you have any suggestions, comments or additions to this book, please let me know or write them into the book comment pages.

Categories: General GPU Programming Tags:

Good Middleware

December 24th, 2008 Wolfgang Engel No comments
Kyle Wilson wrote up a summary about how good middleware should be:

http://gamearchitect.net/2008/09/19/good-middleware/

An interesting read.
Categories: General GPU Programming Tags:

Quake III Arena for the iPhone

December 24th, 2008 Wolfgang Engel No comments

Just realized that one of the projects I contributed some code to went public in the meantime. You can get the source code at

http://code.google.com/p/quake3-iphone/

There is a list of issues. If you have more spare time than me, maybe you can help out.

iP* programming tip #8

December 23rd, 2008 Wolfgang Engel No comments

This is the Christmas issue of the iPhone / iPod touch programming tips. This time we deal with the touch interface. The main challenge I found with the touch screen support is that it is hard to use it to track, for example, forward / backward / left / right and fire at the same time. Let’s say the user presses fire and then he presses forward, what happens when he accidentally slides his finger a bit?
The problem is that each event is defined by the region of the screen it happens in. When the user slides his finger, he is leaving this region. In other words, if you treat a touch inside the region as on and a lifted finger inside the region as off, then a finger that is moved out of the region and lifted there leaves the event switched on.
The workaround is that if the user slides away with his finger, the previous location of this finger is used to check whether the touch came from the event region. If the current location is no longer in the region, the event is switched off.
Touch-screen support for a typical shooter might work like this:
In touchesBegan, touchesMoved and touchesEnded there is a function call like this:

// Enumerates through all touch objects
for (UITouch *touch in touches)
{
    [self _handleTouch:touch];
    touchCount++;
}

_handleTouch might look like this:

- (void)_handleTouch:(UITouch *)touch
{
    CGPoint location = [touch locationInView:self];
    CGPoint previousLocation;

    // in a touchesMoved phase test the previous location against the region,
    // then check below whether the current location is still in there
    if (touch.phase == UITouchPhaseMoved)
        previousLocation = [touch previousLocationInView:self];
    else
        previousLocation = location;

    // fire event
    // lower right corner .. box is 40 x 40
    if (EVENTREGIONFIRE(previousLocation))
    {
        if (touch.phase == UITouchPhaseBegan)
        {
            // only trigger once: queue the event only if the fire bit is not set yet
            if (!(_bitMask & Q3Event_Fire))
            {
                [self _queueEventWithType:Q3Event_Fire value1:K_MOUSE1 value2:1];

                _bitMask |= Q3Event_Fire;
            }
        }
        else if (touch.phase == UITouchPhaseEnded)
        {
            if (_bitMask & Q3Event_Fire)
            {
                [self _queueEventWithType:Q3Event_Fire value1:K_MOUSE1 value2:0];

                _bitMask &= ~Q3Event_Fire;
            }
        }
        else if (touch.phase == UITouchPhaseMoved)
        {
            // the finger slid out of the fire region: switch the event off
            if (!(EVENTREGIONFIRE(location)))
            {
                if (_bitMask & Q3Event_Fire)
                {
                    [self _queueEventWithType:Q3Event_Fire value1:K_MOUSE1 value2:0];

                    _bitMask &= ~Q3Event_Fire;
                }
            }
        }
    }
}

Tracking whether the switch is on or off can be done with a bit mask. The event is sent off to the game with a separate _queueEventWithType method.
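
The bit mask logic is independent of UIKit, so it can be sketched and tested in plain C. Everything in this sketch is hypothetical: the event bits, the TouchPhase enum, and the fire region, which I assume to be a 40 x 40 box in the lower right corner of a 320 x 480 screen.

```c
#include <stdbool.h>

enum { EVENT_FIRE = 1 << 0, EVENT_FORWARD = 1 << 1 };   /* hypothetical event bits */
typedef enum { TOUCH_BEGAN, TOUCH_MOVED, TOUCH_ENDED } TouchPhase;
typedef struct { float x, y; } Point;

/* Assumed fire region: 40 x 40 box in the lower right corner of a 320 x 480 screen. */
static bool in_fire_region(Point p)
{
    return p.x >= 280.0f && p.x < 320.0f && p.y >= 440.0f && p.y < 480.0f;
}

/* Updates the fire bit in *mask.
   Returns +1 if fire was switched on, -1 if it was switched off, 0 if nothing changed. */
static int handle_fire_touch(unsigned *mask, TouchPhase phase, Point current, Point previous)
{
    /* While moving, the previous location decides which region the touch belongs to. */
    Point origin = (phase == TOUCH_MOVED) ? previous : current;
    if (!in_fire_region(origin))
        return 0;

    bool on = (*mask & EVENT_FIRE) != 0;
    if (phase == TOUCH_BEGAN && !on) {
        *mask |= EVENT_FIRE;                 /* only trigger once */
        return 1;
    }

    /* Lifting the finger or sliding out of the region both switch the event off. */
    bool released = phase == TOUCH_ENDED ||
                    (phase == TOUCH_MOVED && !in_fire_region(current));
    if (released && on) {
        *mask &= ~(unsigned)EVENT_FIRE;
        return -1;
    }
    return 0;
}
```

The return value stands in for the queued event: +1 corresponds to sending the key-down and -1 to sending the key-up.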

iP* programming tip #7

December 15th, 2008 Wolfgang Engel No comments

This time I will cover Point Sprites in the iPhone / iPod touch programming tip. The idea is that a set of points -as the simplest primitive in OpenGL ES rendering- describes the positions of Point Sprites, and their appearance comes from the current texture map. This way, Point Sprites are screen-aligned sprites that offer a reduced geometry footprint and transform cost because they are represented by one point == vertex. This is useful for particle systems, lens flare, light glow and other 2-D effects.

  • glEnable(GL_POINT_SPRITE_OES) – this is the global switch that turns point sprites on. Once enabled, all points will be drawn as point sprites.
  • glTexEnvi(GL_POINT_SPRITE_OES, GL_COORD_REPLACE_OES, GL_TRUE) – this enables [0..1] texture coordinate generation for the four corners of the point sprite. It can be set per-texture unit. If disabled, all corners of the quad have the same texture coordinate.
  • glPointParameterfv(GLenum pname, const GLfloat * params) – this is used to set the point attenuation as described below.

The point size of a point sprite can be derived with the formula:
derived_size = impl_clamp(user_clamp(size * sqrt(1 / (a + b * d + c * d^2))))
Here size is the point size set with glPointSize() and d is the eye-space distance to the point. user_clamp represents the GL_POINT_SIZE_MIN and GL_POINT_SIZE_MAX settings of glPointParameterfv(). impl_clamp represents an implementation-dependent point size range.
GL_POINT_DISTANCE_ATTENUATION is used to pass in params as an array containing the distance attenuation coefficients a, b, and c, in that order.
In case multisampling is used (not officially supported), the point size is clamped to have a minimum threshold, and the alpha value of the point is modulated by the following equation:
alpha = alpha * (derived_size / threshold)^2
GL_POINT_FADE_THRESHOLD_SIZE specifies the point alpha fade threshold.
Check out the Oolong engine example Particle System for an implementation. It uses 600 point sprites with nearly 60 fps. Increasing the number of point sprites to 3000 lets the framerate drop to around 20 fps.

Free ShaderX Books

December 12th, 2008 Wolfgang Engel No comments

Eric Haines provided a home for the three ShaderX books that are now available for free. Thanks so much for this! Here is the URL

http://tog.acm.org/resources/shaderx/

Categories: ShaderX Tags:

iP* programming tip #6

December 12th, 2008 Wolfgang Engel No comments

This time we are covering another fixed-function technique from DirectX 7/8 times: Matrix Palette support is an extension of OpenGL ES 1.1 that is supported on the iPhone.
It allows the usage of a set of matrices to transform the vertices and the normals. Each vertex has a set of n indices into the palette, and a corresponding set of n weights.
The vertex is transformed by the modelview matrices specified by the vertex’s respective indices. These results are subsequently scaled by the weights of the respective units and then summed to create the eye-space vertex.

A similar procedure is followed for normals. They are transformed by the inverse transpose of the modelview matrix.

The main OpenGL ES functions that support Matrix Palette are

  • glMatrixMode(GL_MATRIX_PALETTE_OES) – sets the matrix mode to palette
  • glCurrentPaletteMatrixOES(n) – selects palette matrix n as the current matrix, so that the following matrix calls load or modify it; repeat this for each matrix in the palette
  • To enable vertex arrays
    glEnableClientState(GL_MATRIX_INDEX_ARRAY_OES)
    glEnableClientState(GL_WEIGHT_ARRAY_OES)
  • To load the index and weight per-vertex data
    glWeightPointerOES()
    glMatrixIndexPointerOES()

On the iPhone up to nine bones per sub-mesh are supported (check GL_MAX_PALETTE_MATRICES_OES). Check out the Oolong example MatrixPalette for an implementation.
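
What the hardware does per vertex can be written down as a small CPU-side sketch in C. This is not the Oolong code; Mat4, Vec3 and blend_vertex are hypothetical names, and I assume column-major matrices as OpenGL uses them.

```c
typedef struct { float m[16]; } Mat4;       /* column-major, as in OpenGL */
typedef struct { float x, y, z; } Vec3;

/* Transform a point (w = 1) by a 4x4 matrix. */
static Vec3 transform_point(const Mat4 *mat, Vec3 v)
{
    const float *m = mat->m;
    Vec3 r = {
        m[0] * v.x + m[4] * v.y + m[8]  * v.z + m[12],
        m[1] * v.x + m[5] * v.y + m[9]  * v.z + m[13],
        m[2] * v.x + m[6] * v.y + m[10] * v.z + m[14]
    };
    return r;
}

/* Blend one vertex: transform it by each referenced palette matrix,
   scale each result by its weight and sum everything up. */
static Vec3 blend_vertex(const Mat4 *palette, const unsigned char *indices,
                         const float *weights, int count, Vec3 v)
{
    Vec3 out = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < count; ++i) {
        Vec3 t = transform_point(&palette[indices[i]], v);
        out.x += weights[i] * t.x;
        out.y += weights[i] * t.y;
        out.z += weights[i] * t.z;
    }
    return out;
}
```

The weights of one vertex are expected to sum to one; normals would be handled the same way, but with the inverse transpose of each matrix.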

GDC Talk

December 11th, 2008 Wolfgang Engel No comments

My GDC talk was accepted. I am happy … yeaaahhh :-)

Categories: GDC Tags:

Cached Shadow Maps

December 9th, 2008 Wolfgang Engel No comments

A friend just asked me about how to design a shadow map system for many lights with shadows. A quite good explanation was given in the following post already in 2003:

http://www.gamedev.net/community/forums/viewreply.asp?ID=741199

Yann Lombard explains how to first pick the light sources that should cast a shadow. He uses distance, intensity, influence and other parameters to pick light sources.

He has a cache of shadow maps that can have different resolutions. His cache solution is pretty generic. I would build a more dedicated cache just for shadow maps.
After having picked the light sources that should cast shadows, I would only update those shadow maps in the cache that change. Whether a shadow map changes depends on whether there is an object with a dynamic flag in the shadow view frustum.
Think about how this plays out when you approach a scene with lights that cast shadows:
1. the lights are picked that are close enough and appropriate to cast shadows -> shadow maps are updated
2. then while we move on, for the lights in 1. we only update shadow maps if there is an object in shadow view that is moving / dynamic; we then start with the next bunch of shadows while the shadows in 1. are still in view
3. and so on.
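
The update rule from the steps above can be sketched in C like this. The struct and function names are hypothetical, and the actual rendering of a shadow map is only marked with a comment; the sketch just shows the decision which cache entries are touched in a frame.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool valid;                /* shadow map was rendered at least once */
    bool dynamic_in_frustum;   /* a dynamic object is inside the shadow view frustum this frame */
} CachedShadowMap;

/* A cached map is re-rendered when it is new or when something moves inside its frustum. */
static bool needs_update(const CachedShadowMap *entry)
{
    return !entry->valid || entry->dynamic_in_frustum;
}

/* Walks the cache of picked lights and returns how many shadow maps were re-rendered. */
static size_t update_shadow_cache(CachedShadowMap *cache, size_t count)
{
    size_t rendered = 0;
    for (size_t i = 0; i < count; ++i) {
        if (needs_update(&cache[i])) {
            /* the shadow map for light i would be rendered here */
            cache[i].valid = true;
            ++rendered;
        }
    }
    return rendered;
}
```

In a static scene the cost drops to zero after the first frame in which a light was picked, which is the whole point of the cache.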

Categories: Shadow Maps Tags:

Dual-Paraboloid Shadow Maps

December 7th, 2008 Wolfgang Engel No comments

Here is an interesting post on Dual-Paraboloid Shadow maps. Pat Wilson describes a single pass approach here

http://www.gamedev.net/community/forums/topic.asp?topic_id=517022

This is pretty cool. Culling stuff into the two hemispheres is unnecessary here. Other than this the usual comparison between cube maps and dual-paraboloid maps applies:

  • the number of draw calls is the same … so you do not save on this front
  • you lose memory bandwidth with cube maps because in the worst case you render everything into six maps that are probably bigger than 256×256 … in reality you won’t render six times and therefore have fewer draw calls than dual-paraboloid maps
  • the quality is much better for cube maps
  • the speed difference is not that huge because dual-paraboloid maps use things like texkill or alpha test to pick the right map and therefore rendering is pretty slow without Hierarchical Z.

I think both techniques are equivalent for environment maps … for shadows you might prefer cube maps; if you want to save memory, dual-paraboloid maps are the only way to go.

Update: just saw this article on dual-paraboloid shadow maps:

http://osman.brian.googlepages.com/dpsm.pdf

The basic idea is that you do the WorldSpace -> Paraboloid transformation in the pixel shader during your lighting pass. That avoids having the paraboloid co-ordinates interpolated incorrectly.

Categories: Shadow Maps Tags:

iP* programming tip #5

December 7th, 2008 Wolfgang Engel No comments

Let’s look today at the “pixel shader” level of the hardware functionality. The iPhone Application programming guide says that the application should not use more than 24 MB for textures and surfaces. It seems like those 24 MB are not in video card memory. I assume that all of the data is stored in system memory and the graphics card memory is not used.
Overall the iP* platform has the following texture and surface limits:

  • The maximum texture size is 1024×1024
  • 2D textures are supported; other texture types are not
  • Stencil buffers aren’t available

As far as I know stencil buffer support is available in hardware but not exposed. That means the Light Pre-Pass renderer can only be implemented with the help of the scissor (hopefully available). As a side note: one of the other things that do not seem to be exposed is MSAA rendering. With the unofficial SDK it seems like you can use MSAA.
Texture filtering is described on page 99 of the iPhone Application programming guide. An extension for anisotropic filtering is also supported, but I haven’t tried it.

The pixel shader of the iP* platform is programmed via texture combiners. There is an overview on all OpenGL ES 1.1 calls at

http://www.khronos.org/opengles/sdk/1.1/docs/man/

The texture combiners are described in the page on glTexEnv. Per-Pixel Lighting is a popular example:

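// Texture unit 0: N.L
glActiveTexture(GL_TEXTURE0);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB, GL_DOT3_RGB); // Blend0 = N.L

glTexEnvi(GL_TEXTURE_ENV, GL_SRC0_RGB, GL_TEXTURE); // normal map
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_RGB, GL_SRC_COLOR);
glTexEnvi(GL_TEXTURE_ENV, GL_SRC1_RGB, GL_PRIMARY_COLOR); // light vec
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_RGB, GL_SRC_COLOR);

// Texture unit 1: N.L * color map
glActiveTexture(GL_TEXTURE1);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB, GL_MODULATE); // N.L * color map

glTexEnvi(GL_TEXTURE_ENV, GL_SRC0_RGB, GL_PREVIOUS); // previous result: N.L
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND0_RGB, GL_SRC_COLOR);
glTexEnvi(GL_TEXTURE_ENV, GL_SRC1_RGB, GL_TEXTURE); // color map
glTexEnvi(GL_TEXTURE_ENV, GL_OPERAND1_RGB, GL_SRC_COLOR);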
Check out the Oolong example “Per-Pixel Lighting” in the folder Examples/Renderer for a full implementation.
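
To see what the two combiner stages compute, here is a minimal CPU-side sketch in C of the GL_DOT3_RGB and GL_MODULATE math. The function names are hypothetical. The important detail is that the normal and the light vector are stored as colors in [0, 1], so the combiner subtracts 0.5 from each component and scales the dot product by 4.

```c
static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Packs one component of a unit vector from [-1, 1] into a color channel in [0, 1]. */
static float pack_component(float c)
{
    return 0.5f * c + 0.5f;
}

/* Stage 0, GL_DOT3_RGB: 4 * sum((a_i - 0.5) * (b_i - 0.5)), clamped to [0, 1].
   a is the normal map texel, b is the light vector passed in as primary color. */
static float combine_dot3(const float a[3], const float b[3])
{
    float d = 0.0f;
    for (int i = 0; i < 3; ++i)
        d += (a[i] - 0.5f) * (b[i] - 0.5f);
    return clamp01(4.0f * d);
}

/* Stage 1, GL_MODULATE: previous result (N.L) times one channel of the color map. */
static float combine_modulate(float n_dot_l, float color)
{
    return n_dot_l * color;
}
```

A normal and a light vector that both point along +z are stored as (0.5, 0.5, 1.0), and the dot3 stage returns 1 for them, i.e. full lighting.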

iP* programming tip #4

December 5th, 2008 Wolfgang Engel No comments

All of the source code presented in this series is based on the Oolong engine. I will refer to the examples when it is appropriate so that everyone can look the code up or try it on their own. This tip covers the very simple basics of an iP* app. Here is the most basic piece of code to start a game:

// “View” for games in applicationDidFinishLaunching
// get screen rectangle
CGRect rect = [[UIScreen mainScreen] bounds];

// create one full-screen window
_window = [[UIWindow alloc] initWithFrame:rect];

// create OpenGL view
_glView = [[EAGLView alloc] initWithFrame: rect pixelFormat:GL_RGB565_OES depthFormat:GL_DEPTH_COMPONENT16_OES preserveBackBuffer:NO];

// attach the view to the window
[_window addSubview:_glView];

// show the window
[_window makeKeyAndVisible];

The screen dimensions are retrieved from a screen object. Erica Sadun compares the UIWindow functionality to a TV set and the UIView to actors in a TV show. I think this is a good way to memorize the functionality. In our case EAGLView, that comes with the Apple SDK, inherits from UIView and adds all the OpenGL ES functionality to it. We then attach this view to the window and make everything visible.
Oolong assumes a full-screen window that does not rotate. It is always in widescreen view. The reason for this is that otherwise the accelerometer usage -to drive a camera with the accelerometer for example- wouldn’t be possible.
There is a corresponding dealloc method to this code that frees all the allocated resources again.
The anatomy of an Oolong engine example uses mainly two files: a file with “delegate” in the name and the main application file. The main application file has the following methods:
- InitApplication()
- QuitApplication()
- UpdateScene()
- RenderScene()
The first pair of methods does one-time device-dependent resource allocation and deallocation, while UpdateScene() prepares scene rendering and the last method actually does what the name says. If you would like to extend this framework to handle orientation changes, you would add a pair of methods with names like InitView() and ReleaseView() and handle all orientation-dependent code in there. Those methods would be called once at the start of the application and once each time the orientation changes.

One other basic topic is the usage of C++. In Apple speak this is called Objective-C++. Cocoa Touch wants to be addressed with Obj-C, so an application written purely in native C or C++ is not possible. For game developers there is lots of existing C/C++ code to be re-used and its usage makes games easier to port to several platforms (it is quite common to launch an IP on several platforms at once). The best solution to this dilemma is to use Objective-C where necessary and then wrap to C/C++.
If a file has the postfix *.mm, the compiler can handle Objective-C, C and C++ code pieces at the same time to a certain degree. If you look in Oolong for files with such a postfix you will find many of them. There are whitepapers and tutorials available for Objective-C++ that describe the limitations of the approach. Because garbage collection is not used on the iP* device I want to believe that the challenges to make this work on this platform are smaller. Here are a few examples of how the bridge between Objective-C and C/C++ is built in Oolong. In our main application class in every Oolong example we bridge from the Objective-C code used in the “delegate” file to the main application file like this:

// in Application.h
class CShell
{
..
bool UpdateScene();

// in Application.mm
bool CShell::UpdateScene()
..

// in Delegate.mm
static CShell *shell = NULL;

if(!shell->UpdateScene()) printf("Update error\n");

An example on how to call an Objective-C method from C++ can look like this (C wrapper):

// in PolarCamera.mm -> C wrapper
void UpdatePolarCamera()
{
[idFrame UpdateCamera];
}
-(void) UpdateCamera
{
..
// in Application.mm
bool CShell::UpdateScene()
{
UpdatePolarCamera();
..

The idea is to retrieve the id for a class and then use this id to address a function in the class from the outside.
If you want to see all this in action, open up the skeleton example in the Oolong Engine source code. You can find it at
Examples/Renderer/Skeleton
Now that we are at the end of this tip I would like to refer to a blog that my friend Canis wrote. He talks about memory management here. This blog entry applies to the iP* platforms quite well:

http://www.wooji-juice.com/blog/cocoa-6-memory.html

iP* programming tip #3

December 3rd, 2008 Wolfgang Engel No comments

Today I will cover the necessary files of an iP* application and the folders that potentially hold data on the device from your application.

  • .app folder holds everything without required hierarchy
  • .lproj language support
  • Executable
  • Info.plist – XML property list that holds the product identifier > allows the app to communicate with other apps and register with Springboard
  • Icon.png (57×57) set UIPrerenderedIcon to true in Info.plist to not receive the gloss / shiny effect
  • Default.png … should match game background; no “Please wait” sign … smooth fade
  • XIB (NIB) files precooked addressable user interface classes > remove NSMainNibFile key from Info.plist if you do not use it
  • Your files; for example in demoq3/quake3.pak

If the game boots very fast, a good mobile phone experience can be achieved by taking a screenshot when the user ends the app and then showing that screenshot while the game boots up and restores the state it was in before.
Every iP* app is sandboxed. That means that only certain folders, network resources and hardware can be accessed. Here is a list of folders that might be affected by your application:

  • Preferences files are in var/mobile/Library/Preferences based on the product identifier (e.g. com.engel.Quake.plist); updated when you use something like NSUserDefaults to add persistence to game data like save and load
  • App plug-in /System/Library (not available)
  • Documents in /Documents
  • Each app has a tmp folder
  • Sandbox spec e.g. in /usr/share/sandbox > don’t touch 

The sandbox paradigm is also responsible for a mechanism that stops your game if it eats up too many resources of the iPhone. I wonder under which conditions this is going to happen.

HLSL 5.0 OOP / Dynamic Shader Linking

December 2nd, 2008 Wolfgang Engel No comments

I just happened to bump into a few slides on the new HLSL 5.0 syntax. The slides are at

http://www.microsoft.com/downloads/details.aspx?FamilyId=32906B12-2021-4502-9D7E-AAD82C00D1AD&displaylang=en
I thought I would comment on those slides because I do not get the main idea. The slides mention a combinatorial explosion for shaders. On slide 19 they show three arrows that go in three different directions. One is called Number of Lights, another one Environmental Effects and the third one is called Number of Materials.
Regarding the first one: even if one has never worked on a game, everyone knows the words Deferred Lighting. If you want many lights you want to do the lighting in a way that the same shader is used for each light type. Assuming that we have a directional, point and spot light this brings me to three shaders (I actually use currently three but I might increase this to six).
One arrow talks about Environmental Effects. Most environmental effects nowadays are part of PostFX or a dedicated sky dome system. That adds two more shaders.
The last arrow says Number of Materials. Usually we have up to 20 different shaders for different materials.
This brings me to -let’s say 30 – 40- different shaders in a game. I can’t consider this a combinatorial explosion so far.
On slide 27 it is mentioned that the major driving point for introducing OOP is the dynamic shader linkage. It seems like there is a need for dynamic shader linkage because of the combinatorial explosion of the shaders.
So in essence the language design of HLSL is driven by the fact that we have too many shaders and someone assumes that we can’t cope with the sheer quantity. To fix this we need dynamic shader linkage and to make this happen we need OOP in HLSL.
It is hard for me to follow this logic. It looks to me like we are taking a huge step back here: not focusing on the real needs and adding code bloat.
Dynamic shader linkers have long been proven useless in game development; the previous attempts in this area were buried with the DirectX 9 SDKs. The reason for this is that they do not allow you to hand-optimize code, which is a very important thing to do to make your title competitive. As soon as you change one of the shader fragments, this has an impact on the performance of other shaders. Depending on whether you hit a performance sweet spot or not, you can get very different performance out of graphics cards.
Because the performance of your code base becomes less predictable, you do not want to use a dynamic shader linker if you want to create competitive games in the AAA segment.
Game developers need more control over the performance of the underlying hardware. We are already forced to use NV API and other native APIs to ship games on the PC platform with acceptable feature set and performance (especially SLI configs) because DirectX does not expose the functionality. For the DirectX 9 platform we look into Cuda and Cal support for PostFX.
This probably does not have much impact on the HLSL syntax but in general I would prefer having more abilities to squeeze out more performance from graphics cards over any OOP extension that does not sound like it increases performance. At the end of the day the language is a tool to squeeze out as much performance as possible from the hardware. What else do you want to do with it?
Categories: General GPU Programming Tags:

iP* programming tip #2

December 2nd, 2008 Wolfgang Engel No comments

Today’s tip will deal with the setup of your development environment. As a Mac newbie I had a hard time getting used to the environment more than a year ago, when I started Mac development, and I still suffer from windowitis. I know that Apple does not want to copy MS’s Visual Studio but most people who are used to working with Visual Studio would put that on their holiday wishlist :-)
Here are a few starting points to get used to the environment:

  • To work in one window only, use the “All-in-One” mode if you miss Visual Studio (http://developer.apple.com/tools/xcode/newinxcode23.html)
    You have to load Xcode, but not load any projects. Go straight to Preferences/General Tab, and you’ll see “Layout: Default”. Switch that to “Layout: All-In-One”. Click OK. Then, you can load your projects.
  • Apple+tilde – cycle between windows in the foreground
  • Apple+w – closes the front window in most apps
  • Apple+tab – cycle through windows

Please note that Apple did a revolutionary thing on the new MacBook Pros (probably also the new MacBooks) … there is no Apple key anymore. It is now called the command key.

For everyone who prefers hotkeys to start applications you might check out Quicksilver. Automatically hiding and showing the Dock gives you more workspace. If you are giving presentations about your work, check out Stage Hand for the iPod touch / iPhone.

For reference you should have POWERVR SDK for Linux downloaded. It is a very helpful reference regarding the MBX chip in your target platforms.

Not very game or graphics programming related but very helpful is Erica Sadun’s book “The iPhone Developer’s Cookbook”. She does not waste your time with details you are not interested in and comes straight to the point. Just reading the first section of the book is already pretty cool.
You want to have this book if you want to dive into any form of Cocoa interface programming.
The last book I want to recommend is Andrew M. Duncan’s “Objective-C Pocket Reference”. I usually have it lying on my desk in case I stumble over Objective-C syntax. If you are a C/C++ programmer you probably do not need more than this. There are also Objective-C tutorials on the iPhone developer website and on the general Apple website.

If you have any other tip that I can add to the website I would mention it with your name.

Update: PpluX sent me the following link:
He describes here how he disables deep sleep mode and modifies the usage of spaces.

The next iP* programming tip will be more programming related … I promise :-)

iP* programming tip #1

December 1st, 2008 Wolfgang Engel No comments

This is the first of a series of iPhone / iPod programming tips.
Starting iPhone development first requires knowledge of the underlying hardware and what it can do for you. Here are the latest hardware specs I am aware of (a rumour was talking about iPods that run the CPU at 532 MHz … I haven’t found any evidence for this).

  • GPU: PowerVR MBXLite with VGPLite with 103 MHz
  • ~DX8 hardware with vs_1_1 and ps_1_1 functionality
  • Vertex shader is not exposed
  • Pixel shader is programmed with texture combiners
  • 16 MB VRAM – not mentioned anywhere
  • CPU: ARM 1176 with 412 MHz (can do 600 MHz)
  • VFP unit 128-bit Multimedia unit ~= SIMD unit
  • 128 MB RAM; only 24 MB for apps allowed
  • 320×480 px at 163 ppi screen
  • LIS302DL, a 3-axis accelerometer with 412 MHz (?) update rate
  • Multi-Touch: up to five fingers
  • PVRTC texture compression: color map 2-bit per pixel and normal map 4-bit per-pixel

The interesting part is that the CPU can do up to 600 MHz, so it would be possible to increase the performance here in the future.
I wonder how the 16 MB of VRAM are handled. I assume that this is where the VBOs and textures are stored. Regarding the maximum app size of 24 MB: I wonder what happens if an application generates geometry and textures dynamically … when does the sandbox of the iPhone / iPod touch stop the application? I did not find any evidence for this.

WARP – Running DX10 and DX11 Games on CPUs

December 1st, 2008 Wolfgang Engel No comments

As an MVP I was involved in testing the new Windows Advanced Rasterization Platform (WARP). They just published the first numbers:

http://msdn.microsoft.com/en-us/library/dd285359.aspx

Running Crysis on an 8-core CPU at a resolution of 800×600 with 7.2 fps is an achievement. If WARP is hand-optimized very well, it would be a great target to write code for. 4 – 8 cores will be a common target platform in the next two years. Because WARP can be switched off if there is a GPU, this is a perfect target for game developers. What this means is that you can write a game with the DirectX 10 API and target not only all the GPUs out there but also machines without a GPU … this is one of the best developments for the PC market in a long time. I am excited!

The other interesting consequence of this development is: if INTEL’s “Bread & Butter” chips run games with the most important game API, it would be a good idea for INTEL to put a bunch of engineers behind this and optimize WARP (in case they haven’t already done so). We are talking here about the big game market consisting of games like “The Sims”, “World of Warcraft” and similar titles. The high-end PC gaming market is much smaller.

Categories: D3D10, D3D11 Tags:

iPhone ARM VFP code

November 6th, 2008 Wolfgang Engel No comments

The iPhone has a kind of SIMD unit. It is called the VFP unit and it is pretty hard to figure out how to program it. Here is a place where you will soon find lots of VFP asm code.

With help from Matthias Grundmann I wrote my first piece of VFP code. Here it is:

void MatrixMultiplyF(
MATRIXf &mOut,
const MATRIXf &mA,
const MATRIXf &mB)
{
#if 0
MATRIXf mRet;

/* Perform calculation on a dummy matrix (mRet) */
mRet.f[ 0] = mA.f[ 0]*mB.f[ 0] + mA.f[ 1]*mB.f[ 4] + mA.f[ 2]*mB.f[ 8] + mA.f[ 3]*mB.f[12];
mRet.f[ 1] = mA.f[ 0]*mB.f[ 1] + mA.f[ 1]*mB.f[ 5] + mA.f[ 2]*mB.f[ 9] + mA.f[ 3]*mB.f[13];
mRet.f[ 2] = mA.f[ 0]*mB.f[ 2] + mA.f[ 1]*mB.f[ 6] + mA.f[ 2]*mB.f[10] + mA.f[ 3]*mB.f[14];
mRet.f[ 3] = mA.f[ 0]*mB.f[ 3] + mA.f[ 1]*mB.f[ 7] + mA.f[ 2]*mB.f[11] + mA.f[ 3]*mB.f[15];

mRet.f[ 4] = mA.f[ 4]*mB.f[ 0] + mA.f[ 5]*mB.f[ 4] + mA.f[ 6]*mB.f[ 8] + mA.f[ 7]*mB.f[12];
mRet.f[ 5] = mA.f[ 4]*mB.f[ 1] + mA.f[ 5]*mB.f[ 5] + mA.f[ 6]*mB.f[ 9] + mA.f[ 7]*mB.f[13];
mRet.f[ 6] = mA.f[ 4]*mB.f[ 2] + mA.f[ 5]*mB.f[ 6] + mA.f[ 6]*mB.f[10] + mA.f[ 7]*mB.f[14];
mRet.f[ 7] = mA.f[ 4]*mB.f[ 3] + mA.f[ 5]*mB.f[ 7] + mA.f[ 6]*mB.f[11] + mA.f[ 7]*mB.f[15];

mRet.f[ 8] = mA.f[ 8]*mB.f[ 0] + mA.f[ 9]*mB.f[ 4] + mA.f[10]*mB.f[ 8] + mA.f[11]*mB.f[12];
mRet.f[ 9] = mA.f[ 8]*mB.f[ 1] + mA.f[ 9]*mB.f[ 5] + mA.f[10]*mB.f[ 9] + mA.f[11]*mB.f[13];
mRet.f[10] = mA.f[ 8]*mB.f[ 2] + mA.f[ 9]*mB.f[ 6] + mA.f[10]*mB.f[10] + mA.f[11]*mB.f[14];
mRet.f[11] = mA.f[ 8]*mB.f[ 3] + mA.f[ 9]*mB.f[ 7] + mA.f[10]*mB.f[11] + mA.f[11]*mB.f[15];

mRet.f[12] = mA.f[12]*mB.f[ 0] + mA.f[13]*mB.f[ 4] + mA.f[14]*mB.f[ 8] + mA.f[15]*mB.f[12];
mRet.f[13] = mA.f[12]*mB.f[ 1] + mA.f[13]*mB.f[ 5] + mA.f[14]*mB.f[ 9] + mA.f[15]*mB.f[13];
mRet.f[14] = mA.f[12]*mB.f[ 2] + mA.f[13]*mB.f[ 6] + mA.f[14]*mB.f[10] + mA.f[15]*mB.f[14];
mRet.f[15] = mA.f[12]*mB.f[ 3] + mA.f[13]*mB.f[ 7] + mA.f[14]*mB.f[11] + mA.f[15]*mB.f[15];

/* Copy result in pResultMatrix */
mOut = mRet;
#else
#if (TARGET_CPU_ARM)
const float* src_ptr1 = &mA.f[0];
const float* src_ptr2 = &mB.f[0];
float* dst_ptr = &mOut.f[0];

asm volatile(
// switch on ARM mode
// involves unconditional jump and mode switch (opcode bx)
// the lowest bit in the address signals whether ARM (bit cleared)
// or Thumb should be selected (bit set)
".align 4 \n\t"
"mov r0, pc \n\t"
"bx r0 \n\t"
".arm \n\t"

// set vector length to 4
// example fadds s8, s8, s16 means that the content s8 - s11
// is added to s16 - s19 and stored in s8 - s11
"fmrx r0, fpscr \n\t" // loads fpscr status reg to r0
"bic r0, r0, #0x00370000 \n\t" // bit clear stride and length
"orr r0, r0, #0x00030000 \n\t" // set length to 4 (11)
"fmxr fpscr, r0 \n\t" // upload r0 to fpscr
// Note: this stalls the FPU

// result[0][1][2][3] = mA.f[0][0][0][0] * mB.f[0][1][2][3]
// result[0][1][2][3] = result + mA.f[1][1][1][1] * mB.f[4][5][6][7]
// result[0][1][2][3] = result + mA.f[2][2][2][2] * mB.f[8][9][10][11]
// result[0][1][2][3] = result + mA.f[3][3][3][3] * mB.f[12][13][14][15]
// s0 - s31
// if Fd == s0 - s7 -> treated as scalar all the other treated like vector
// load the whole matrix into registers - transposed -> second operand first
"fldmias %2, {s8-s23} \n\t"
// load first column to scalar bank
"fldmias %1!, {s0 - s3} \n\t"
// first column times matrix
"fmuls s24, s8, s0 \n\t"
"fmacs s24, s12, s1 \n\t"
"fmacs s24, s16, s2 \n\t"
"fmacs s24, s20, s3 \n\t"
// save first column
"fstmias %0!, {s24-s27} \n\t"

// load second column to scalar bank
"fldmias %1!, {s4-s7} \n\t"
// second column times matrix
"fmuls s28, s8, s4 \n\t"
"fmacs s28, s12, s5 \n\t"
"fmacs s28, s16, s6 \n\t"
"fmacs s28, s20, s7 \n\t"
// save second column
"fstmias %0!, {s28-s31} \n\t"

// load third column to scalar bank
"fldmias %1!, {s0-s3} \n\t"
// third column times matrix
"fmuls s24, s8, s0 \n\t"
"fmacs s24, s12, s1 \n\t"
"fmacs s24, s16, s2 \n\t"
"fmacs s24, s20, s3 \n\t"
// save third column
"fstmias %0!, {s24-s27} \n\t"

// load fourth column to scalar bank
"fldmias %1!, {s4-s7} \n\t"
// fourth column times matrix
"fmuls s28, s8, s4 \n\t"
"fmacs s28, s12, s5 \n\t"
"fmacs s28, s16, s6 \n\t"
"fmacs s28, s20, s7 \n\t"
// save fourth column
"fstmias %0!, {s28-s31} \n\t"

// reset vector length to 1
"fmrx r0, fpscr \n\t" // loads fpscr status reg to r0
"bic r0, r0, #0x00370000 \n\t" // bit clear stride and length
"fmxr fpscr, r0 \n\t" // upload r0 to fpscr

// switch to Thumb mode
// lower bit of destination is set to 1
"add r0, pc, #1 \n\t"
"bx r0 \n\t"
".thumb \n\t"

// binds variables to registers
: "=r" (dst_ptr), "=r" (src_ptr1), "=r" (src_ptr2)
: "0" (dst_ptr), "1" (src_ptr1), "2" (src_ptr2)
// r0 is used as scratch; the block also writes memory through dst_ptr
: "r0", "memory"
);
#endif
#endif
}
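
To check the output of the VFP path on a desktop machine, a plain C++ version of the same accumulation pattern can help. The following is only a sketch: Matrix4f and MatrixMultiplyReference are hypothetical names standing in for MATRIXf and the function above, and the loop mirrors the fmuls / fmacs sequence (one multiply followed by three multiply-accumulates per group of four results).

```cpp
#include <cstddef>

// Hypothetical stand-in for the MATRIXf type used above: 16 floats.
struct Matrix4f {
    float f[16];
};

// out = a * b, using the same element order as the scalar path above.
// Each result starts with one product (like fmuls) and then accumulates
// the remaining three products (like fmacs).
inline void MatrixMultiplyReference(Matrix4f& out, const Matrix4f& a, const Matrix4f& b)
{
    Matrix4f tmp; // temporary, so that out may alias a or b
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float acc = a.f[4 * i] * b.f[j];
            for (std::size_t k = 1; k < 4; ++k) {
                acc += a.f[4 * i + k] * b.f[4 * k + j];
            }
            tmp.f[4 * i + j] = acc;
        }
    }
    out = tmp;
}
```

Comparing the 16 floats written by the assembly block against this function (with a small tolerance) is a quick way to catch register or ordering mistakes.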


Midnight Club: Los Angeles

October 20th, 2008 Wolfgang Engel No comments

Tomorrow is the day. Midnight Club Los Angeles will launch tomorrow. This is the third game I worked on for Rockstar. If you are into racing games you need to check it out :-)

Categories: Rockstar Games Tags:

Hardware GPU / SPU / CPU

October 16th, 2008 Wolfgang Engel 1 comment

I follow all the discussions about the future of game hardware with talks about Larrabee and GPUs and the death of 3D APIs and -depending on the view point- different hardware designs.

The thing I figure is that all this is quite interesting and inspiring but our cycles of change in computer graphics and graphics programming are pretty long. Most of the stuff we do is based on research papers that were released more than 30 years ago and written on typewriters.
Why should any new piece of hardware change all this in a very short amount of time?
There is a game market out there that grows by double-digit percentages on all kinds of hardware. How much of this market and its growth would be influenced by any new hardware?
Some of the most widely distributed game hardware is pretty old and, by most standards, underpowered. Nevertheless it offers entertainment that people enjoy.
So how important is it whether we program a CPU/SPU/GPU or whatever we call the next thing? Give me a washing machine with a display and I will make an entertainment machine with robo rumble out of it.
Categories: General GPU Programming Tags:

S3 Graphics Chrome 440 GTX

October 3rd, 2008 Wolfgang Engel No comments

I bought a new S3 Chrome 440 GTX in the S3 online store. I wanted to know how this card is doing, especially because it is DirectX 10.1 compatible. The other reason why I bought it was that it has an HDMI output. Just putting it into my desktop machine was interesting. I removed an 8800 GTS, which was really heavy, and then put in this card, which is so small that it doesn’t even need an extra power supply. It looks like some of my graphics cards from the late 1990s, when they started to put fans on the cards. Given how small the fan is, it should be possible to cool that card passively.

I just went through the DirectX 10 SDK examples. Motion Blur is about 5.8 fps and NBodyGravity is about 1.8 fps. The instancing example runs with 11.90 fps. I use the VISTA 64-bit beta drivers 7.15.12.0217-18.05.03. The other examples run fast enough. The CPU does not seem to become overly busy.
Just saw that there is a newer driver. The latest driver which is WHQL’ed has the version number 248. The motion blur example runs with 6.3 fps with some artefacts (the beta driver had that as well), Instancing ran with 11.77 fps and the NBodyGravity example with 1.83 fps … probably not an accurate way to measure this stuff at all but at least it gives a rough idea.

The integrated INTEL chip 4500 MHD in my notebook is slower than this but then it supports at least DX10 and the notebook is super light :-) … for development it just depends for me on the feature support (Most of the time I prototype effects on PCs).
While playing around with the two chipsets I just found out that the mobile INTEL chip also runs the new DirectX 10.1 SDK example Depth of Field with more than 20 fps. This is quite impressive. The Chrome 440 GTX runs this example with more than 100 fps. The new Raycast Terrain example runs with 19.6 fps on the Chrome and with less than 7.6 fps on the mobile INTEL chipset. The example that does not run on the mobile INTEL chip is the ProceduralMaterial example. It runs with less than 1 fps on the Chrome 440 GTX.
Nevertheless it seems like both companies did their homework with the DirectX SDK.
So I just ran a bunch of ShaderX7 example programs against the cards. While the INTEL Mobile chip shows errors in some of the DirectX9 examples and crashes in some of the DirectX 10 stuff, the Chrome seems to even take the DirectX 10.1 examples that I have, that usually only run on ATI hardware … nice!
One thing that I haven’t thought of is GLSL support. I thought that only ATI and NVIDIA have GLSL support but S3 seems to have it as well. INTEL’s mobile chip does not have it so …

I will try out the 3D Futuremark Vantage Benchmark. It seems a Chrome 400 Series is in there with a score of 222. Probably not too bad considering the fact that they probably do not pay Futuremark for being a member of their program.
Update October 4th: the S3 Chrome 440 GTX did 340 as the Graphics score in the trial version of the 3D Mark Vantage.

Categories: General GPU Programming Tags:

Old Interview

October 1st, 2008 Wolfgang Engel No comments

Just bumped into an old interview I gave to Gamedev.net. I still think everything in there is valid.

While reading it I thought it was kind of boring. Many of my answers are so obvious … maybe this is just my perception. How can you make it into the game industry? Probably the same way you can make it into any industry: lots of education, or luck, or just being in the right place at the right time, and then being creative, a good thinker etc. There is no magic trick I think … it all comes with lots of sweat.
Categories: General GPU Programming Tags:

64-bit VISTA Tricks

October 1st, 2008 Wolfgang Engel No comments

I got a new notebook today with 64-bit VISTA pre-installed. It will replace a desktop that had 64-bit VISTA on it. My friend Andy Firth provided me with the following tricks to make my life easier (it has a 64 GB solid-state drive in there, so no hard-drive optimizations):

Switch Off User Account Control
This gets rid of the on-going “are you sure” questions.
Go to Control Panel. Click on User Account and switch it off.
Disable Superfetch
Press Windows key + R. Start services.msc and scroll down until you find Superfetch. Double click on it and change the startup type to Disabled.
Categories: General GPU Programming Tags:

Light Pre-Pass: More Blood

September 28th, 2008 Wolfgang Engel No comments

I spent some more time with the Light Pre-Pass renderer. Here are my assumptions:

N.H^n = (N.L * N.H^n * Att) / (N.L * Att)

This division happens in the forward rendering path. The light source has its own shininess value in there, the power n value. With the specular component extracted, I can apply the material shininess value mn like this:

(N.H^n)^mn

Then I can re-construct the Blinn-Phong lighting equation. The data stored in the Light Buffer is treated like one light source. As a reminder, the first three channels of the light buffer hold:

N.L * Att * DiffuseColor

Color = Ambient + (LightBuffer.rgb * MatDiffInt) + MatSpecInt * (N.H^n)^mn * N.L * Att

So how could I do this :-)

N.H^n = (N.L * N.H^n * Att) / (N.L * Att)

N.L * Att is not in any channel of the Light buffer. How can I get this? The trick here is to convert the first three channels of the Light Buffer to luminance. The value should be pretty close to N.L * Att.
This also opens up a bunch of ideas for different materials. Every time you need the N.L * Att term you replace it with luminance. This should give you a wide range of materials.
The results I get are very exciting. Here is a list of advantages over a Deferred Renderer:
- less cost per light (you calculate much less in the Light pass)
- easier MSAA
- more material variety
- less read memory bandwidth -> fetches only two instead of the four textures it takes in a Deferred Renderer
- runs on hardware without ps_3_0 and MRT -> runs on DX8.1 hardware
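
The luminance trick described above can be sketched on the CPU as follows. This is a minimal illustration, not the shader from the renderer: LightBufferTexel, Luminance and SpecularTerm are hypothetical names, the Rec. 709 luma weights are an assumption (any luminance conversion would do), and the small epsilon that guards the division in unlit areas is also my addition.

```cpp
#include <algorithm>
#include <cmath>

// One texel of the light buffer as described in the post.
struct LightBufferTexel {
    float r, g, b; // N.L * Att * DiffuseColor
    float a;       // N.H^n * N.L * Att
};

// Rec. 709 luma weights; the choice of weights is an assumption.
inline float Luminance(float r, float g, float b)
{
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

// Approximates MatSpecInt * (N.H^n)^mn * N.L * Att, using the luminance
// of the first three channels in place of the missing N.L * Att term.
inline float SpecularTerm(const LightBufferTexel& lb, float matSpecInt, float mn)
{
    const float kEpsilon = 1e-5f; // guards the division in unlit areas
    const float nDotLAtt = Luminance(lb.r, lb.g, lb.b);
    const float nDotHPowN = lb.a / std::max(nDotLAtt, kEpsilon);
    return matSpecInt * std::pow(nDotHPowN, mn) * nDotLAtt;
}
```

For a white light the luminance equals N.L * Att exactly; for strongly colored lights it is only an approximation, which matches the "pretty close" wording above.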

Categories: Deferred Lighting Tags:

Shader Workflow – Why Shader Generators are Bad

September 21st, 2008 Wolfgang Engel No comments

[quote]As far as I can tell from this discussion, no one has really proposed an alternative to shader permutations, merely they’ve been proposing ways of managing those permutations.[/quote]

If you define shader permutations as having lots of small differences but using the same code, then you have to live with the fact that whatever is sent to the hardware is a full-blown shader, even if you have exactly the same skinning code in every other shader.
So the end result is always the same … whatever you do on the level above that.
What I describe is a practical approach to handle shaders with a high amount of material variety and a good workflow.
Shaders are some of the most expensive assets in terms of production value and the time spent by the programming team. They need to be the most highly optimized pieces of code we have, because it is much harder to squeeze performance out of a GPU than out of a CPU.
Shader generators or material editors (… or whatever you call them) are not an appropriate way to generate or handle shaders because they are hard to maintain, do not offer enough material variety and are not very efficient, because it is hard to hand-optimize code that is generated on the fly.
This is why developers do not use them and do not want to use them. It is possible that they play a role in indie or non-profit development, because those teams are money and time constrained and do not have to compete in the AAA sector.
In general, the basic mistake of people who think that ueber-shaders, material editors or shader generators make sense is that they do not understand how to program a graphics card. They assume it is similar to programming a CPU and therefore think they can generate code for those cards.
It would make more sense to generate code on the fly for CPUs (… which also happens in the graphics card drivers) and in other places (real-time assemblers) than for GPUs, because GPUs do not have anything close to linear performance behaviour. The difference between hitting a performance hotspot and doing something wrong can be 1:1000 in time (following a presentation from Matthias Wloka). You hand-optimize shaders to hit those hotspots, and you do it by analyzing the results provided by PIX and other tools to find out where the performance hotspot of the shader is.

Categories: General GPU Programming Tags:

ARM VFP ASM development

September 18th, 2008 Wolfgang Engel No comments

Following Matthias Grundmann’s invitation to join forces I set up a Google code repository for this:

here

The idea is to have a math library that is optimized for the VFP unit of an ARM processor. This should be useful on the iPhone / iPod touch.

More Mobile Development

September 12th, 2008 Wolfgang Engel No comments

Now that I had so much fun with the iPhone I am thinking about new challenges in the mobile phone development area. The Touch HD looks like a cool target. It has a DX8-class ATI graphics card in there. Probably on par with the iPhone graphics card and you can program it in C/C++ which is important for the performance.
Depending on how easy it will be to get Oolong running on this I will extend Oolong to support this platform as well.

Categories: Handheld Development Tags:

Shader Workflow

September 10th, 2008 Wolfgang Engel No comments

I just posted a forum message about what I consider an ideal shader workflow in a team. I thought I would share it here:

Setting up a good shader workflow is easy. You just set up a folder called shaderlib and a folder called shader. In shaderlib there are files like lighting.fxh, utility.fxh, normals.fxh, skinning.fxh etc. and in the directory shader there are files like metal.fx, skin.fx, stone.fx, eyelashes.fx, eyes.fx. In each of those *.fx files there is a technique for whatever special state you need. You might have techniques in there like lit, depthwrite etc.
All the “intelligence” is in the shaderlib directory in the *.fxh files. The fx files just stitch together function calls. The HLSL compiler resolves those function calls by inlining the code.
So it is easy to just send someone the shaderlib directory with all the files in there and share your shader code this way.
In the lighting.fxh include file you will have all kinds of lighting models like Ashikhmin-Shirley, Cook-Torrance or Oren-Nayar and obviously Blinn-Phong, or just a different BRDF that can mimic a certain material especially well. In normals.fxh you have routines that can fetch normals in different ways and unpack them. Obviously all the DXT5 and DXT1 tricks are in there but also routines that let you fetch height data to generate normals from it. In utility.fxh you have support for different color spaces, special optimizations for different platforms, like special texture fetches etc. In skinning.fxh you have all code related to skinning and animation … etc.
If you give this library to a graphics programmer he obviously has to put together the shader on his own but he can start looking at what is requested and use different approaches to see what fits best for the job. He does not have to come up with ways on how to generate a normal from height or color data or how to deal with different color spaces.
For a good, efficient and high quality workflow in a game team, this is what you want.

Categories: General GPU Programming Tags:

Calculating Screen-Space Texture Coordinates for the 2D Projection of a Volume

September 9th, 2008 Wolfgang Engel No comments

Calculating screen space texture coordinates for the 2D projection of a volume is more complicated than for an already transformed full-screen quad. Here is a step-by-step approach on how to achieve this:

1. Transforming the position into projection space is done in the vertex shader by multiplying it with the concatenated World-View-Projection matrix.

2. The Direct3D run-time then divides those values by Z, which is stored in the W component. The resulting position is considered to be in clipping space, where the x and y values are clipped to the [-1.0, 1.0] range.

xclipspace = xproj / wproj
yclipspace = yproj / wproj

3. Then the Direct3D run-time transforms the position into viewport space, from the value range [-1.0, 1.0] to the ranges [0.0, ScreenWidth] and [0.0, ScreenHeight].

xviewport = xclipspace * ScreenWidth / 2 + ScreenWidth / 2
yviewport = -yclipspace * ScreenHeight / 2 + ScreenHeight / 2

This can be simplified to:

xviewport = (xclipspace + 1.0) * ScreenWidth / 2
yviewport = (1.0 - yclipspace) * ScreenHeight / 2

The result represents the position on the screen. The y component needs to be inverted because in world / view / projection space it increases in the opposite direction to screen coordinates.

4. Because the result should be in texture space and not in screen space, the coordinates need to be transformed from clipping space to texture space, in other words from the range [-1.0, 1.0] to the range [0.0, 1.0].

u = (xclipspace + 1.0) * 1 / 2
v = (1.0 - yclipspace) * 1 / 2

5. Due to the texture sampling rules used by Direct3D 9, we need to offset the texture coordinates by half a texel. One texel is 1 / TargetWidth wide and 1 / TargetHeight high in texture space:

u = (xclipspace + 1.0) * ½ + ½ / TargetWidth
v = (1.0 - yclipspace) * ½ + ½ / TargetHeight

Plugging in the x and y clipspace coordinates from step 2 results in:

u = (xproj / wproj + 1.0) * ½ + ½ / TargetWidth
v = (1.0 - yproj / wproj) * ½ + ½ / TargetHeight

6. Because the bulk of this calculation should happen in the vertex shader, the results are sent down through the texture coordinate interpolator registers. Interpolating 1 / wproj is not the same as 1 / interpolated wproj. Therefore the term 1 / wproj needs to be extracted and applied in the pixel shader.

u = 1 / wproj * ((xproj + wproj) * ½ + ½ * wproj / TargetWidth)
v = 1 / wproj * ((wproj - yproj) * ½ + ½ * wproj / TargetHeight)

The vertex shader source code looks like this, with inScreenDim.xy holding (1 / TargetWidth, 1 / TargetHeight):

float4 vPos = float4(0.5 * (float2(p.x + p.w, p.w - p.y) + p.w * inScreenDim.xy), p.zw);

The equation without the half pixel offset would start at No. 4 like this:

u = (xclipspace + 1.0) * 1 / 2
v = (1.0 - yclipspace) * 1 / 2

Plugging in the x and y clipspace coordinates from step 2 results in:

u = (xproj / wproj + 1.0) * ½
v = (1.0 - yproj / wproj) * ½

Moving 1 / wproj to the front leads to:

u = 1 / wproj * ((xproj + wproj) * ½)
v = 1 / wproj * ((wproj - yproj) * ½)

Because the pixel shader is doing the 1 / wproj, this would lead to the following vertex shader code:

float4 vPos = float4(0.5 * (float2(p.x + p.w, p.w - p.y)), p.zw);
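
The split between vertex and pixel shader for the variant without the half pixel offset can be checked with a few lines of C++. This is only a sketch of the arithmetic: Vec4, Vec2 and the three function names are hypothetical, and the "pixel shader" here simply divides by w for a single point instead of working on interpolated values.

```cpp
struct Vec4 { float x, y, z, w; };
struct Vec2 { float u, v; };

// Vertex shader part: everything except the division by w.
inline Vec4 ScreenTexCoordVS(const Vec4& p)
{
    return { 0.5f * (p.x + p.w), 0.5f * (p.w - p.y), p.z, p.w };
}

// Pixel shader part: apply 1 / w to the values passed down.
inline Vec2 ScreenTexCoordPS(const Vec4& vPos)
{
    return { vPos.x / vPos.w, vPos.y / vPos.w };
}

// Reference: steps 2 and 4 done directly, dividing by w first.
inline Vec2 ScreenTexCoordDirect(const Vec4& p)
{
    const float xclip = p.x / p.w;
    const float yclip = p.y / p.w;
    return { (xclip + 1.0f) * 0.5f, (1.0f - yclip) * 0.5f };
}
```

For any position with w != 0, ScreenTexCoordPS(ScreenTexCoordVS(p)) should match ScreenTexCoordDirect(p) up to floating point rounding, which is the statement behind moving 1 / wproj to the front.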

All this is based on a response by mikaelc in the thread “Lighting in a Deferred Renderer” and a response by Frank Puig Placeres in the thread “Reconstructing Position from Depth Data”.

Gauss Filter Kernel

September 7th, 2008 Wolfgang Engel No comments

Just found a good tutorial on how to set up a Gauss filter kernel here:

OpenGL Bloom Tutorial

The interesting part is that he shows a way to generate the offset values and he also mentions a trick that I have been using for a long time. He reduces the filter kernel size by utilizing the hardware linear filtering, so he can go down from 5 to 3 taps. I usually use bilinear filtering to go down from 9 to 4 taps or 25 to 16 taps (with non-separable filter kernels) … you get the idea.
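
As I understand the trick, two neighboring taps are replaced by one linearly filtered fetch placed between the two texels, so that the hardware blends them in the right ratio. The sketch below shows that arithmetic for one side of a 1D kernel; Tap and MergeTapsForLinearFiltering are hypothetical names, the weights are assumed to be positive, and offsets are measured in texels from the kernel center.

```cpp
#include <cstddef>
#include <vector>

struct Tap {
    float offset; // distance from the kernel center in texels
    float weight; // weight applied to the (linearly filtered) fetch
};

// sideWeights[i] is the weight of the texel at offset i + 1 on one side
// of the center tap. Neighboring pairs are merged into a single tap whose
// offset is the weighted average of the two texel offsets.
inline std::vector<Tap> MergeTapsForLinearFiltering(const std::vector<float>& sideWeights)
{
    std::vector<Tap> merged;
    const std::size_t count = sideWeights.size();
    for (std::size_t i = 0; i + 1 < count; i += 2) {
        const float w0 = sideWeights[i];
        const float w1 = sideWeights[i + 1];
        const float w = w0 + w1;
        const float offset = (float(i + 1) * w0 + float(i + 2) * w1) / w;
        merged.push_back({offset, w});
    }
    if (count % 2 == 1) {
        // An unpaired outermost tap is kept as it is.
        merged.push_back({float(count), sideWeights[count - 1]});
    }
    return merged;
}
```

With the center tap kept separately, a 5-tap kernel has two weights per side, which merge into one tap per side: 3 fetches in total, the reduction mentioned above.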

Eric Haines just reminded me of the fact that this is also described in ShaderX2 – Tips and Tricks on page 451. You can find the -now free- book at

http://www.gamedev.net/reference/programming/features/shaderx2/Tips_and_Tricks_with_DirectX_9.pdf

BTW: Eric Haines contacted all the authors of this book to get permission to make it “open source”. I would like to thank him for this.
Check out his blog at

http://www.realtimerendering.com/blog/

Beyond Programmable Shading

August 18th, 2008 Wolfgang Engel No comments

I was at SIGGRAPH to attend the “Beyond Programmable Shading” day. I spent the whole morning there and left during the last talk of the morning.
Here is the URL for the Larrabee day:

http://s08.idav.ucdavis.edu/

The talks are quite inspiring. I was hoping to see actual Larrabee hardware in action but they did not have any.
I liked Chas Boyd’s DirectX 11 talk because he made it clear that there are different software designs for different applications. Having looked into DirectX 11 for a while now, it seems like there is a great API coming up soon that solves some of the outstanding issues we had with DirectX 9 (DirectX 10 will probably be skipped by many in the industry).

The other thing that impressed me is AMD’s CAL. The source code looks very elegant for the amount of performance you can unlock with it. Together with Brook+ it lets you control a huge number of cards. It seems like CUDA will soon be able to handle many GPUs at once more easily too. PostFX are a good candidate for those APIs. CAL and CUDA can live in harmony with DirectX 9/10, and DirectX 11 will even have a compute shader model that is the equivalent of CAL and CUDA. Compute shaders are written in HLSL … so a consistent environment.

Categories: SIGGRAPH Tags:

ARM Assembly

July 31st, 2008 Wolfgang Engel No comments

So I decided to deepen my relationship with iPhone programming a bit and bought an ARM assembly book to learn how to program ARM assembly. The goal is to figure out how to program the MMX-like instruction set that comes with the processor. Then I would create a vectorized math library … let’s see how this goes.

PostFX – The Nx-Gen Approach

July 29th, 2008 Wolfgang Engel No comments

More than three years ago I wrote a PostFX pipeline (with a large number of effects) that I constantly improved up until the beginning of last year (www.coretechniques.info .. look for the outline of algorithms in the PostFX talk from 2007). Now it shipped in a couple of games. So what is nx-gen here?
On my main target platforms (360 and PS3) it will be hard to squeeze out more performance. There is probably lots of room in everything related to HDR but overall I wouldn’t expect any fundamental changes. The main challenge with the pipeline was not on a technical level, but to explain to the artists how they can use it. Especially the tone mapping functionality was hard to explain and it was also hard to give them a starting point where they can work from.
So I am thinking about making it easier for the artists to use this pipeline. The main idea is to follow the camera paradigm. Most of the effects (HDR, Depth of Field, Motion Blur, color filters) of the pipeline are expected to mimic a real-world camera, so why not make it work like a real-world camera?
The idea is to only expose functionality that is usually exposed by a camera and name all the sliders accordingly. Furthermore there will be different camera models with different basic properties as a starting point for the artists. It should also be possible to just switch between those on the fly. So a whole group of properties changes on the flip of a switch. This should make it easier to use cameras for cut scenes etc.

Categories: General GPU Programming, PostFX Tags:

iPhone development – Oolong Engine

July 29th, 2008 Wolfgang Engel No comments

Just read that John Carmack likes the iPhone as a dev platform. That reminds me of how I started my iPhone engine, the Oolong Engine, in September 2007. Initially I wanted to do some development for the Dreamcast. I got a Dreamcast devkit, a CD burner and all the manuals from friends to start with this. My idea behind all this was to do graphics demos on this platform because I was looking for a new challenge. When I had all the pieces together to start my Dreamcast graphics demo career, a friend told me the specs of the iPhone … and it became obvious that this would be an even better target :-) … at the time everyone assumed that Apple would never allow anyone to program for this platform. This was exactly what I was looking for. What can be better than a restricted platform that can’t be used by everyone, that I can even take with me and show to the geekiest of my friends :-)
With some initial help from a friend (thank you Andrew :-) ) I wrote the initial version of the Oolong engine and had lots of fun figuring out what is possible on the platform and what is not. Then at some point Steve Jobs surprised us with the announcement that there would be an SDK, and judging from Apple’s history I believed that they probably wouldn’t allow games to be developed for the platform.
So now that we have an official SDK I am surprised how my initial small-scale geek project turned out :-) … suddenly I am the maintainer of a small engine that is used in several productions.

Light Pre-Pass – First Blood :-)

July 29th, 2008 Wolfgang Engel No comments

I was looking for a simple way to deal with different specular values coming from different materials. It seems that one of the most obvious ways is also the most efficient way to deal with this. If you are used to starting with a math equation first, as I am, it is not easy to see this solution.
To recap: what ends up in the four channels of the light buffer for a point light is the following:

Diffuse.r * N.L * Att | Diffuse.g * N.L * Att | Diffuse.b * N.L * Att | N.H^n * N.L * Att

So n represents the shininess value of the light source. My original idea for applying different specular values later in the forward rendering pass was to divide by N.L * Att like this:

(N.H^n * N.L * Att) / (N.L * Att)

This way I would have re-constructed the N.H^n term and I could easily do something like this:

(N.H^n)^mn

where mn represents the material specular. Unfortunately this requires storing the N.L * Att term in a separate render target channel. The more obvious way to deal with it is to just do this:

(N.H^n * N.L * Att)^mn

… maybe not quite right but it looks good enough for what I would want to achieve.
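
To get a feeling for how far the shortcut is from the exact term, the two variants can be compared numerically. The sketch below is an illustration with hypothetical function names, not code from the renderer; it assumes a single point light and values in the [0, 1] range.

```cpp
#include <cmath>

// Value stored in the fourth channel of the light buffer for one light.
inline float LightBufferSpecular(float nDotH, float n, float nDotL, float att)
{
    return std::pow(nDotH, n) * nDotL * att;
}

// Exact variant: needs N.L * Att as a separate value, (N.H^n)^mn.
inline float ExactMaterialSpecular(float storedSpec, float nDotLAtt, float mn)
{
    return std::pow(storedSpec / nDotLAtt, mn);
}

// Shortcut: raise the stored value itself, (N.H^n * N.L * Att)^mn.
inline float ApproxMaterialSpecular(float storedSpec, float mn)
{
    return std::pow(storedSpec, mn);
}
```

Both variants agree when N.L * Att is 1. For smaller values and mn > 1 the shortcut gives a darker result, so the highlight falls off faster towards the edge of the light.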

Categories: Deferred Lighting Tags:

Stable Cascaded Shadow Maps

June 13th, 2008 Wolfgang Engel No comments

I really like Michal Valient’s article “Stable Cascaded Shadow Maps”. It is a very practical approach to make Cascaded Shadow Maps more stable.
What I also like about it is the ShaderX idea. I wrote an article in ShaderX5 describing a first implementation (btw. I have re-written that three times since then); Michal picks up from there and brings it to the next level.
There will now be a ShaderX7 article in which I describe a slight improvement to Michal’s approach. Michal picks the right shadow map with a rather cool trick. Mine is a bit different but it might be more efficient. To pick the right map, I send down the sphere that is constructed for each light view frustum and check whether the pixel is inside it. If it is, I pick that shadow map; if it isn’t, I go on to the next sphere. If the pixel is in none of the spheres, I early out by returning white.
At first sight it does not look like a trick but if you think about the spheres lined up along the view frustum and the way they intersect, it is actually pretty efficient and fast.
On my target platforms, especially on the one that Michal likes a lot, this makes a difference.
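
The sphere test described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sphere list, its near-to-far ordering and the name pick_cascade are hypothetical and not taken from the ShaderX7 article.

```python
def pick_cascade(pixel_pos, spheres):
    """Return the index of the first sphere that contains pixel_pos.

    spheres is a list of (center, radius) pairs, assumed to be ordered
    from the nearest cascade to the farthest one. Returns None when the
    pixel lies outside all spheres, in which case the caller would
    early out and return white (unshadowed).
    """
    for index, (center, radius) in enumerate(spheres):
        dx = pixel_pos[0] - center[0]
        dy = pixel_pos[1] - center[1]
        dz = pixel_pos[2] - center[2]
        # Compare squared distances to avoid a square root per test
        if dx * dx + dy * dy + dz * dz <= radius * radius:
            return index
    return None


# Hypothetical spheres lined up along the view direction (+z)
cascades = [((0.0, 0.0, 5.0), 6.0),
            ((0.0, 0.0, 20.0), 15.0),
            ((0.0, 0.0, 60.0), 40.0)]
print(pick_cascade((0.0, 0.0, 12.0), cascades))
```

Because the spheres overlap along the view frustum, testing them in order means a pixel in an overlap region gets the nearer, higher-resolution map.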

Categories: Shadow Maps Tags:

Screen-Space Global Illumination

June 13th, 2008 Wolfgang Engel No comments

I am thinking a lot about Crytek’s Screen-Space Ambient Occlusion (SSAO) and the idea of extending this into a global illumination term.
When combined with a Light Pre-Pass renderer, several inputs are already available: the light buffer with all the N.L * Att values that can be used as intensity, the end result of the opaque rendering pass, and a normal map. Doing the light bounce along the normal and using the N.L * Att entry in the light buffer as intensity should do the trick. The way the values are fetched would be similar to SSAO.

Categories: Global Illumination Tags:

UCSD Talk on Light Pre-Pass Renderer

May 28th, 2008 Wolfgang Engel No comments

So the Light Pre-Pass renderer had its first public performance :-) … I talked yesterday at UCSD about this new renderer design. There will be a ShaderX7 article as well.

Pat Wilson from Garagegames is sharing his findings with me. He came up with an interesting way to store LUV colors.
Renaldas Zioma told me that a similar idea was used in Battlezone 2.

This is exciting :-)

The link to the slides is at the end of the March 16th post.

Categories: Deferred Lighting Tags:

DX 10 Graphics demo skeleton

May 15th, 2008 Wolfgang Engel No comments

I set up a Google Code website for one of my small side projects that I worked on more than a year ago. To compete in graphics demo competitions you need a very small exe. I wanted to figure out how to do this with DX10 and this is the result :-) … follow the link

http://code.google.com/p/graphicsdemoskeleton/

What is it: it is just a minimum skeleton to start creating your own small-size apps with DX10. At some point I had a particle system running in 1.5kb this way (that was with DX9). If you think about the concept of small exes there is one interesting thing I figured out. When I use DX9 and I compile HLSL shader code to a header file and include it to use it, it is smaller than the equivalent C code. So what I was thinking was: hey let’s write a math library in HLSL and use the CPU only with the stub code to launch everything and let it run on the GPU :-)

Categories: D3D10 Tags:

Today is the day: GTA IV is released

April 29th, 2008 Wolfgang Engel No comments

I am really excited about this. This is the second game I worked on for Rockstar and it is finally coming out …

Categories: Rockstar Games Tags:

RGB -> XYZ conversion

April 21st, 2008 Wolfgang Engel No comments

Here is the official way to do it:

http://www.w3.org/Graphics/Color/sRGB

They use

// 0.4124 0.3576 0.1805
// 0.2126 0.7152 0.0722
// 0.0193 0.1192 0.9505

to convert from RGB to XYZ and

// 3.2410 -1.5374 -0.4986
// -0.9692 1.8760 0.0416
// 0.0556 -0.2040 1.0570


to convert back.

Here is how I do it:
const float3x3 RGB2XYZ = { 0.5141364, 0.3238786, 0.16036376,
                           0.265068, 0.67023428, 0.06409157,
                           0.0241188, 0.1228178, 0.84442666 };

Here is how I convert back:
const float3x3 XYZ2RGB = { 2.5651, -1.1665, -0.3986,
                           -1.0217, 1.9777, 0.0439,
                           0.0753, -0.2543, 1.1892 };

You should definitely try out different ways to do this :-)
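
When trying out different matrices, a quick sanity check is to convert a color to XYZ and back and look at the error. Here is a minimal Python sketch that does this with the two matrices from my HLSL snippet above; the helper names mul and round_trip_error are made up for this example.

```python
RGB2XYZ = (
    (0.5141364, 0.3238786, 0.16036376),
    (0.265068, 0.67023428, 0.06409157),
    (0.0241188, 0.1228178, 0.84442666),
)

XYZ2RGB = (
    (2.5651, -1.1665, -0.3986),
    (-1.0217, 1.9777, 0.0439),
    (0.0753, -0.2543, 1.1892),
)


def mul(matrix, vector):
    """Multiply a 3x3 matrix (tuple of rows) with a 3-component vector."""
    return tuple(sum(row[i] * vector[i] for i in range(3)) for row in matrix)


def round_trip_error(rgb, to_xyz, to_rgb):
    """Largest per-channel difference after RGB -> XYZ -> RGB."""
    back = mul(to_rgb, mul(to_xyz, rgb))
    return max(abs(a - b) for a, b in zip(rgb, back))


for color in ((1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.2, 0.5, 0.8)):
    print(color, round_trip_error(color, RGB2XYZ, XYZ2RGB))
```

The error is not exactly zero because the inverse matrix is only given to four decimal places, but it should stay well below what an 8-bit channel can resolve.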

Categories: General GPU Programming Tags:

Ported my iPhone Engine to OS 2.0

April 14th, 2008 Wolfgang Engel No comments

I spent three days last week porting the Oolong engine over to the latest iPhone / iPod touch OS.

http://www.oolongengine.com

My main development device is still an iPod touch because I am worried about not being able to make phone calls anymore.

Accepted for the iPhone Developer Program

April 8th, 2008 Wolfgang Engel No comments

Whooo I am finally accepted! I have access to the iPhone developer program. Now I can start to port my Oolong Engine over :-)

Some Great Links

March 26th, 2008 Wolfgang Engel No comments

I just came across some cool links today while looking for material that shows multi-core programming and how to generate an indexed triangle list from a triangle soup.
I did not know that you can setup a virtual Cell chip on your PC. This course looks interesting:

http://www.cc.gatech.edu/~bader/CellProgramming.html

John Ratcliff’s Code Suppository is a great place to find fantastic code snippets:

http://www.codesuppository.blogspot.com/

Here is a great paper to help with first steps in multi-core programming:

http://www.digra.org/dl/db/06278.34239.pdf

A general graphics programming course is available here:

http://users.ece.gatech.edu/~lanterma/mpg/

I will provide this URL to people who ask me about how to learn graphics programming.

Categories: General GPU Programming Tags:

Light Pre-Pass Renderer

March 17th, 2008 Wolfgang Engel 13 comments

In June last year I had an idea for a new rendering design. I call it light pre-pass renderer.
The idea is to fill up a Z buffer first and also store normals in a render target. This is like a G-Buffer with normals and Z values … so compared to a deferred renderer there is no diffuse color, specular color, material index or position data stored in this stage.
Next the light buffer is filled up with light properties. So the idea is to differentiate between light and material properties. If you look at a simplified light equation for one point light it looks like this:

Color = Ambient + Shadow * Att * (N.L * DiffColor * DiffIntensity * LightColor + R.V^n * SpecColor * SpecIntensity * LightColor)

The light properties are:
- N.L
- LightColor
- R.V^n
- Attenuation

So instead of rendering a whole lighting equation for each light into a render target, you render only the light properties into an 8:8:8:8 render target. You have four channels, so you can render:

LightColor.r * N.L * Att
LightColor.g * N.L * Att
LightColor.b * N.L * Att
R.V^n * N.L * Att

That means in this setup there is no dedicated specular color … which is on purpose (you can extend it easily).
Here is the source code for what I store in the light buffer.

half4 ps_main( PS_INPUT Input ) : COLOR
{
    // Fetch packed normal (xy) and depth (zw) from the G-Buffer
    half4 GBufferData = tex2D( G_Buffer, Input.texCoord );

    // Compute pixel position
    half Depth = UnpackFloat16( GBufferData.zw );
    float3 PixelPos = normalize(Input.EyeScreenRay.xyz) * Depth;

    // Compute normal: x and y are stored, z is reconstructed
    half3 Normal;
    Normal.xy = GBufferData.xy*2-1;
    Normal.z = -sqrt(1-dot(Normal.xy,Normal.xy));

    // Computes light attenuation and direction
    float3 LightDir = (Input.LightPos - PixelPos)*InvSqrLightRange;
    half Attenuation = saturate(1-dot(LightDir / LightAttenuation_0, LightDir / LightAttenuation_0));
    LightDir = normalize(LightDir);

    // R.V == Phong
    float specular = pow(saturate(dot(reflect(normalize(-float3(0.0, 1.0, 0.0)), Normal), LightDir)), SpecularPower_0);

    // N.L * Att
    float NL = dot(LightDir, Normal)*Attenuation;

    return float4(DiffuseLightColor_0.x*NL, DiffuseLightColor_0.y*NL, DiffuseLightColor_0.z*NL, specular * NL);
}

After all lights are alpha-blended into the light buffer, you switch to forward rendering and reconstruct the lighting equation. In its simplest form this might look like this:

float4 ps_main( PS_INPUT Input ) : COLOR0
{
    float4 Light = tex2D( Light_Buffer, Input.texCoord );
    float3 NLATTColor = float3(Light.x, Light.y, Light.z);
    float3 Lighting = NLATTColor + Light.www;

    return float4(Lighting, 1.0f);
}
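
To make the two stages easier to follow outside of a shader, here is a minimal CPU-side sketch in Python for a single texel. It is an illustration only: the function names, the per-material diffuse and specular colors and the sample light values are assumptions, and the clamping of an 8:8:8:8 target is not modeled.

```python
def add_point_light(light_buffer, light_color, n_dot_l, att, r_dot_v, n):
    """Additively blend one point light into a 4-channel light buffer texel."""
    nl_att = max(n_dot_l, 0.0) * att
    specular = max(r_dot_v, 0.0) ** n
    return (
        light_buffer[0] + light_color[0] * nl_att,  # LightColor.r * N.L * Att
        light_buffer[1] + light_color[1] * nl_att,  # LightColor.g * N.L * Att
        light_buffer[2] + light_color[2] * nl_att,  # LightColor.b * N.L * Att
        light_buffer[3] + specular * nl_att,        # R.V^n * N.L * Att
    )


def forward_pass(light_buffer, diffuse_color, specular_color):
    """Combine the accumulated light properties with material colors
    (the material colors are a hypothetical extension of the simplest form)."""
    return tuple(
        diffuse_color[i] * light_buffer[i] + specular_color[i] * light_buffer[3]
        for i in range(3)
    )


texel = (0.0, 0.0, 0.0, 0.0)
texel = add_point_light(texel, (1.0, 0.0, 0.0), n_dot_l=0.5, att=1.0, r_dot_v=1.0, n=8)
texel = add_point_light(texel, (0.0, 0.0, 1.0), n_dot_l=1.0, att=0.5, r_dot_v=0.0, n=8)
print(texel)
print(forward_pass(texel, (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)))
```

With white diffuse and specular colors, forward_pass reduces to the simplest form shown in the pixel shader: the RGB channels plus the replicated alpha channel.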

This is a direct competitor to the light indexed renderer idea described by Damian Trebilco in his paper.
I have a small example program that compares this approach to a deferred renderer but I have not compared it to Damian’s approach. I believe his approach might be more flexible regarding a material system than mine but the Light Pre-Pass renderer does not need to do the indexing. It should even run on a seven year old ATI RADEON 8500 because you only have to do a Z pre-pass and store the normals upfront.

The following screenshot shows four point-lights. There is no restriction in the number of light sources:
The following screenshot shows the same scene running with a deferred renderer. There should not be any visual differences to the Light Pre-Pass Renderer:

The design here is very flexible and scalable. So I expect people to start from here and end up with quite different results. One of the challenges with this approach is to setup a good material system. You can store different values in the light buffer or use the values above and construct interesting materials. For example a per-object specular highlight would be done by taking the value stored in the alpha channel and apply a power function to it or you store the power value in a different channel.
Obviously my initial approach is only scratching the surface of the possibilities.
P.S: to implement a material system for this you can do two things: you can handle it like in a deferred renderer by storing a material id with the normal map … maybe in the alpha channel, or you can reconstruct the diffuse and specular term in the forward rendering pass. The only thing you have to store to do this is N.L * Att in a separate channel. This way you can get back R.V^n by using the specular channel and dividing it by N.L * Att. So what you do is:

(R.V^n * N.L * Att) / (N.L * Att)

Those are actually values that represent all light sources.
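
Because both channels are sums over all lights, the division does not give back the R.V^n term of any single light. A minimal Python sketch, with a made-up function name and made-up sample values, shows what comes out instead:

```python
def reconstruct_specular(lights):
    """lights is a list of (r_dot_v_pow_n, n_dot_l_att) pairs, one per light.

    Mimics the two accumulated channels and divides them, as described above.
    """
    specular_channel = sum(spec * nl_att for spec, nl_att in lights)
    nl_att_channel = sum(nl_att for _, nl_att in lights)
    if nl_att_channel == 0.0:
        return 0.0  # no light reaches this pixel
    return specular_channel / nl_att_channel


# One light: the division returns exactly that light's R.V^n
print(reconstruct_specular([(0.25, 0.5)]))

# Two lights: the result is an average of the R.V^n terms,
# weighted by each light's N.L * Att
print(reconstruct_specular([(1.0, 0.5), (0.0, 0.5)]))
```

So with several overlapping lights, any material power applied afterwards acts on this weighted average rather than on each light's own highlight.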

Here is a link to the slides of my UCSD Renderer Design presentation. They provide more detail.

Categories: Deferred Lighting Tags:

iPhone SDK so far

March 7th, 2008 Wolfgang Engel No comments

Just set up a dev environment this morning with the iPhone SDK … overall it is quite disappointing for games :-) . OpenGL ES is not supported in the emulator but you can’t run apps on the iPhone without OS 2.0 … and this is not released so far. In other words, they have OpenGL ES examples but you can’t run them anywhere. I hope I get access to the 2.0 file system somehow. Other than this I now have the old and the new SDK set up on one machine and it works nicely.

Now I have to wait until I get access to the iPhone OS 2.0 … what a pain.

Predicting the Future in Game Graphics

March 3rd, 2008 Wolfgang Engel No comments

So I was thinking about the next 1 – 2 years of graphics programming in the game industry on the XBOX 360 and the PS3. I think we can see a few very strong trends that will sustain over the next few years.

HDR
Rendering with high-dynamic range is realized in two areas: in the renderer and in the source data == textures of objects
On current high-end platforms people run the lighting in the renderer in gamma 1.0 and they use the 10:10:10:2 format wherever available, or an 8:8:8:8 render target that holds a non-standard color format supporting a larger range of values (> 1) and more deltas. Typically these are the LogLuv or L16uv color formats.
There are big developments for the source art. id Software published an article on an 8-bit per pixel color format, stored in a DXT5 texture, that has much better quality than the 4-bit per pixel DXT1 format we usually use. Before that there were numerous attempts to use scale and bias values in a hacked DXT header to make better use of the available deltas in, e.g., rather dark textures. One of the challenges here was to make all this work with gamma 1.0.
At GDC 2007 I suggested during Chas. Boyd’s DirectX Future talk to extend DX to support an 8-bit HDR format that also supports gamma 1.0 better. It would be great if they could come up with a better compression scheme than DXT in the future, but until then we will try to make combinations of DXT1 + L16 or DXT5 hacks work :-) or utilize id Software’s solution.

Normal Map Data
Some of the most expensive data is normal map data. So far we are “mis-using” the DXT formats to compress vector data. If you generally store height data this opens up more options. Many future techniques like Parallax mapping or any “normal map blending” require height map data. So this is some area of practical interest :-) … check out the normal vector talk of the GDC 2008 tutorial day I organized at http://www.coretechniques.info/.

Lighting Models
Everyone is trying to find lighting models that can mimic a wider range of materials. The Strauss lighting model seems to be popular and some people come up with their own lighting models.

Renderer Design
To render opaque objects there are currently two renderer designs at opposite ends of the spectrum: the so-called deferred renderer and the so-called forward renderer. The deferred renderer design came up to allow a higher number of lights. This advantage has to be paid for with lower quality settings in other areas.
So people now start to research new renderer designs that have the advantages of both designs but none of the disadvantages. There is a Light indexed renderer and I am working on a Light pre-pass renderer. New renderer designs will allow more light sources … but what is a light source without shadow? …

Shadows
Lots of progress was made with shadows. Cascaded Shadow maps are now the favorite way to split up shadow data along the view frustum. Nevertheless there are more ways to distribute the shadow resolution. This is an interesting area of research.
The other big area is using probability functions to replace the binary depth comparison. Then the next big thing will be soft shadows that become softer when the distance between the occluder and the receiver becomes bigger.

Global Illumination
This is the area with the biggest growth potential currently in games :-) Like screen-space ambient occlusion that is super popular now because of Crysis, screen-space irradiance will offer lots of research opportunities.
To target more advanced hardware, Carsten Dachsbacher’s approach in ShaderX5 looks to me like a great starting point.

Categories: General GPU Programming Tags:

Android

February 14th, 2008 Wolfgang Engel No comments

Just read the FAQ for Android. Here is the most important part:
—————–
Can I write code for Android using C/C++?
No. Android applications are written using the Java programming language.
—————–
So no easy game ports for this platform. Additionally the language will eat up so many cycles that good-looking 3D games do not make much sense. No business case for games then … maybe they will start thinking about it :-) … using Java also looks quite unprofessional to me but I heard other phone companies are doing this as well to protect their margin and keep control over the device.

Categories: Handheld Development Tags:

gDEbugger for OpenGL ES

January 18th, 2008 Wolfgang Engel No comments

So I decided to test the gDEbugger from graphicREMEDY for OpenGL ES. I got an error message indicating a second hand exception and there was not much I could do about it. I posted my problem in their online forum, but have not gotten any response so far.
I guess with the decreasing market for OpenGL there is not much money in providing a debugger for this API. In games, fewer companies do PC games anymore and OpenGL is no longer used by any AAA title on the main PC platform, Windows.
I was hoping that they target the upcoming OpenGL ES market, but this might still be in its infancy. If anyone knows a tool to debug OpenGL ES similar to PIX or GcmReplay, I would appreciate a hint. To debug I would work on the PC platform … in other words I have a PC and an iPhone version of the game :-)
Update January 19th: graphicRemedy actually came back to me and asked me to send in source code … very nice.

Categories: Handheld Development Tags:

San Angeles Observation on the iPhone

January 8th, 2008 Wolfgang Engel No comments

I ported the San Angeles Observation demo from Jetro Lauha to the iPhone (www.oolongengine.com). You can find the original version here

http://iki.fi/jetro/

This demo throws about 60k polys onto the iPhone and runs quite slowly :-( . I will double check with Apple that this is not due to a lame OpenGL ES implementation on the phone.
I am thinking about porting other stuff now over to the phone or working on getting a more mature feature set for the engine … porting is nearly more fun, because you can show a result afterwards :-) … let’s see.

Oolong Engine

December 30th, 2007 Wolfgang Engel No comments

I renamed my iPhone / iPod touch engine to Oolong Engine and moved it to a new home. Its URL is now

www.oolongengine.com

I will add now a 3rd person camera model. This camera will be driven by the accelerometer and the touch screen.

Animating Normal (Maps)

December 26th, 2007 Wolfgang Engel No comments

There seems to be an on-going confusion on how to animate normal maps. The best answer to this is: you don’t :-) .
The obvious problem is to stream in two normal maps and then to modulate two normals. If you are on a console platform you just don’t want to do this. So what would be a good way to animate a normal? You modulate height fields. Where both height fields have peaks, the result should also have a peak. Where one of the height fields is zero, the result should also be zero, independent of the other height field.

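Usually, a normal map is formed by computing a relief of a height field (bump map) over a flat surface. If h(u, v) is the bump map’s height at the texture coordinates (u, v), the standard definition of the normal map is

N(u, v) = (-∂h/∂u, -∂h/∂v, 1) / sqrt((∂h/∂u)^2 + (∂h/∂v)^2 + 1)

The height fields h1 and h2 are multiplied to form a combined height field like this

h(u, v) = h1(u, v) * h2(u, v)

To determine the normal vector of this height field according to the first equation, one needs the partial derivatives of this function. This is a simple application of the product rule:

∂h/∂u = ∂h1/∂u * h2 + h1 * ∂h2/∂u

And similarly for the partial derivative with respect to v. Thus:

N = (-(∂h1/∂u * h2 + h1 * ∂h2/∂u), -(∂h1/∂v * h2 + h1 * ∂h2/∂v), 1) / sqrt((∂h1/∂u * h2 + h1 * ∂h2/∂u)^2 + (∂h1/∂v * h2 + h1 * ∂h2/∂v)^2 + 1)

BTW: to recover the height field’s partial derivatives from the normal map we can use:

∂h/∂u = -N.x / N.z
∂h/∂v = -N.y / N.z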
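
A minimal Python sketch of the whole chain, assuming the usual convention that a height field h(u, v) has the normal normalize(-∂h/∂u, -∂h/∂v, 1). The function names and the sample heights and slopes are made up for illustration.

```python
import math


def normal_from_slopes(dh_du, dh_dv):
    """Normal of a height field with the given partial derivatives."""
    length = math.sqrt(dh_du * dh_du + dh_dv * dh_dv + 1.0)
    return (-dh_du / length, -dh_dv / length, 1.0 / length)


def slopes_from_normal(normal):
    """Recover the height field's partial derivatives from a normal."""
    return (-normal[0] / normal[2], -normal[1] / normal[2])


def modulated_normal(h1, n1, h2, n2):
    """Normal of the product height field h1 * h2 at one texel.

    h1, h2 are the heights and n1, n2 the normals of the two inputs.
    """
    d1u, d1v = slopes_from_normal(n1)
    d2u, d2v = slopes_from_normal(n2)
    # Product rule for the combined height field
    dh_du = d1u * h2 + h1 * d2u
    dh_dv = d1v * h2 + h1 * d2v
    return normal_from_slopes(dh_du, dh_dv)


# Hypothetical texel: heights 2 and 3, slopes (1, 0) and (0, 0.5)
n1 = normal_from_slopes(1.0, 0.0)
n2 = normal_from_slopes(0.0, 0.5)
print(modulated_normal(2.0, n1, 3.0, n2))
```

Note that the heights themselves are needed in addition to the normals, which is one more reason to keep height data around.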

Categories: General GPU Programming Tags:

About Raytracing

December 26th, 2007 Wolfgang Engel No comments

My friend Dean Calver published an article about raytracing that is full of wisdom. The title says it all: “Real-Time Ray Tracing: Holy Grail or Fool’s Errand?”. This is straight to the point :-)

Categories: General GPU Programming Tags:

LogLuv HDR implementation in Heavenly Sword

December 26th, 2007 Wolfgang Engel No comments

Heavenly Sword stores HDR data in 8:8:8:8 render targets. I talked to Marco about this before and saw a nice description in Christer Ericson’s blog here

I came up with a similar idea that should be faster and a bit more hardware friendly with a new compression format that I call L16uv. The name more or less says it all :-)

Categories: General GPU Programming Tags:

Normal Map Data II

December 26th, 2007 Wolfgang Engel No comments

Here is one interesting normal data idea I missed. It is taken from Christer Ericson in his blog:

One clever thing they do (as mentioned on these two slides) is to encode their normal maps so that you can feed either a DXT1 or DXT5 encoded normal map to a shader, and the shader doesn’t have to know. This is neat because it cuts down on shader permutations for very little shader cost. Their trick is for DXT1 to encode X in R and Y in G, with alpha set to 1. For DXT5 they encode X in alpha, Y in G, and set R to 1. Then in the shader, regardless of texture encoding format, they reconstruct the normal as X = R * alpha, Y = G, Z = Sqrt(1 – X^2 – Y^2).

A DXT5-encoded normal map has much better quality than a DXT1-encoded one, because the alpha component of DXT5 is 8 bits whereas the red component of DXT1 is just 5 bits, but more so because the alpha component of a DXT5 texture is compressed independently from the RGB components (the three of which are compressed dependently for both DXT1 and DXT5) so with DXT5 we avoid co-compression artifacts. Of course, the cost is that the DXT5 texture takes twice the memory of a DXT1 texture (plus, on the PS3, DXT1 has some other benefits over DXT5 that I don’t think I can talk about).
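
The format-agnostic decode from the quote can be sketched in Python like this. The quote does not mention the usual remap from the [0, 1] texture range to [-1, 1], so the sketch adds it as an assumption; the function name and sample texel values are also made up.

```python
import math


def decode_normal(r, g, alpha):
    """Decode a normal from either encoding without knowing which one it is.

    DXT1 style: X in R, Y in G, alpha = 1.
    DXT5 style: X in alpha, Y in G, R = 1.
    """
    x = r * alpha * 2.0 - 1.0  # whichever channel holds 1 drops out
    y = g * 2.0 - 1.0
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)


# The same normal stored both ways (hypothetical texel values)
print(decode_normal(0.75, 0.5, 1.0))  # DXT1-style texel
print(decode_normal(1.0, 0.5, 0.75))  # DXT5-style texel
```

The max() guards against a slightly negative value under the square root, which can happen once compression has pushed X and Y outside the unit circle.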

Categories: General GPU Programming Tags:

Normal Data

December 25th, 2007 Wolfgang Engel No comments

Normal data is one of the more expensive assets of games. Creating normal data in ZBrush or Mudbox can easily cost a few million dollars.
Storing normal data in textures in a way that preserves the original data with the lowest error level is an art form that needs special attention.
I am now aware of three ways to destroy normal data by storing it in a texture:
1. Store the normal in a DXT1 compressed texture
2. Store the normal in a DXT5 compressed texture by storing the x value in alpha and the y value in the green channel …. and by storing some other color data in the red and blue channel.
3. Store the normal in its original form -as a height map- in one color channel of a DXT1 compressed texture with two other color channels.
They all have a common denominator: the DXT format was created to compress color data so that the resulting color is still perceived as similar. Perceiving 16 vectors as similar follows different rules than perceiving 16 colors as similar. Therefore the best solutions so far for storing normals are to
- not compress them at all
- store y in the green channel of a DXT5 compressed texture and x in the alpha channel and color the two empty channels black
- use the DXN format that consists of two DXT5 compressed alpha channels
- store a height map in an alpha channel of a DXT5 compressed texture and generate the normal out of the height map.
The DXT5 solutions and the DXN solution occupy 8-bit per normal. The height map solution occupies 4-bit per normal. It is probably not as good looking as the 8-bit per normal solutions.
There are lots of interesting areas regarding normals other than how they are stored. There are challenges when you want to scale, add, modulate, deform, blend or filter them. Then there is also anti-aliasing … :-) … food for thought.

Categories: General GPU Programming Tags:

Renderer Design

December 19th, 2007 Wolfgang Engel No comments

Renderer design is an interesting area. I recommend starting with the lighting equation, splitting it up into a material part and a light part, and then moving on from there. You can then think about what data you need to handle a huge number of direct lights and shadows (shadows are harder than lights) and how you do the global illumination part. Especially the integration of global illumination and many lights should get you thinking for a while.
Here is an example:
1. render shadow data from the cascaded shadow maps into a shadow collector that collects indoor, outdoor, cloud shadow data
2. render at the same time world-space normals in the other three channels of the render target
3. render all lights into a light buffer (only the light source properties not the material properties)
4. render all colors and the material properties into a render target while applying all the lights from the light buffer (here you stitch together the Blinn-Phong lighting model or whatever you use)
5. do global illumination with the normal map
6. do PostFX

Categories: General GPU Programming Tags: