Monday, September 16, 2013

109 – Who best shape? Circle best shape!

The reason I'm stretching out the features of Snapshot 10 to 3 snapshots is that I can give the final set of polish to some existing features that I want to consider complete and also enhance some features with some sorely missing functionality, while otherwise being in a semi feature freeze environment. So snapshot 10-12 will not really have any new features, but iterations on the existing ones.

One of the problematic features is the algorithm that keeps track of a grid of objects and manages their streaming. I want my engine to support very large terrains. While storage space concerns will keep me from having truly huge maps (more than 256 square kilometer map will eat up probably a blue-ray's worth of data), the algorithm should be able to handle any sized map. And currently it does, but the cost of the algorithm is dependent on the map size. This was a problem when I first implemented grass, because the grid size was so big that I lost about 30-50 FPS on a top-end PC just by the CPU work of managing grass visibility, without even streaming in anything. Then I did a quick hack, but now it is time for a proper solution.

Because we are going to rewrite a system from scratch, a task that must be done but doesn't really push the project ahead in way that screams "new feature" or "gameplay", we are going to be try and be as clever as possible when implementing the new feature. What I mean by that is that I'm going to implement and algorithm that tries to solve our problem (decouple the map size form the time it takes to manage the world in the vicinity of the character) and only that, but by the way we solve it, we will implicitly get a bunch of strong benefits that are almost implicit features on their own.

For this we'll consider an algorithm that works on a grid like structure, where each element in the grid has a property that tells it if it should be loaded and one that tells it if it is loaded or not (an other properties). And we'll consider 3 lists: one that keeps track of the currently loaded chunks (the loaded chunk list L), one that will store the chunks that are in range and are inside the view frustum (the priority chunk list P) and one that stores the chunks that are in range and are not in the view frustum (the secondary chunk list S). In order to determine if a chunk is in range, we'll use a bunch of circles that share a center, have increasing radii and are constructed in such a way that by overlapping all circles, all the chunks inside the radius of the largest circle are exactly in one and only one one of the circles' border. This way if we go from the radius of 1 chunk to V, we can iterate over all the chunks inside that radius, but in the order of the increasing circles and we only process each chunk once. So here is the algorithm.
  1. Determine the chunk you are standing on and let that one be the center off all the circles. Dismiss the priority list P and the secondary list S. These are computed each frame. The loaded chunk list L is persistent and it's content is updated when needed from frame to frame.
  2. Go over the list of currently loaded chunks and set "ShouldBeLoaded" to false.
  3. From v = 1 to V, where V is the current view radius, go over all the chunks that are inside the circle with the current given radius v. For each chunk set "ShouldBeLoaded" to true. If the chunk is within the view frustum, add it to priority chunk list P and if not add it to the secondary list S.
  4. Go ever the L list again, and if you find a chunk that is loaded but it should not be loaded, free the chunk, ideally adding the freed resources to the pool.
  5. If the current frame allows for a chunk load (I base that on time, but other mechanisms can be used), try to determine how many chunks you could load and keep your framerate. It is simplest to consider this 1 and thus load 1 chunk every DELTA milliseconds. Try to load the required number of chunks from the priority list. If you have gone over all the items in the priority list and have not reached your desired number, go over the secondary list.
  6. If the above steps have not updated the active list L, you can update it here in a final loop.

The above algorithm has the property that it does not depend on the world size, only on your view radius. If you have a world with 1 chunk or 1 million chunks, it will finish in approximately the same amount of time. The old algorithm that this one replaces did not have this property and the CPU time of the run increased very fast based on the world size. As said before, I lost 30-50 FPS on a large map. The new algorithm has no such problems, especially since it is not only constant time, but pretty fast. The cost of the algorithm is negligible to the cost of streaming in a single chunk.

So let's see what clever implicit behaviors we managed to squeeze out of this implementation:
  • It favors chunks in the view frustum. I don't start you off with a completely un-streamed map at a map load, so this is hard to notice. But if I were to drop you in an empty map and you'd have to wait for it to stream in from zero, you'd notice that the terrain in front of you get's streamed in first. If you were to turn around very fast, you might notice that there is nothing behind you. This is actually a good behavior and the desired one if you combine it with a generous pre-streaming so that only distant chunks in front of you get streamed in.
  • It gives implicit approximate sorted view-depth for chunks. Since the priority list is created based on circles with increasing radius, the closest chunks will be streamed in first. But not only that, but they will be one rendered first. This has generally a better Z (depth buffer) behavior. I can't test this under a setup with very high framerate. If I run the engine on a computer strong enough to go with 150+ FPS, it will fluctuate unpredictably between 150-170. So I tested in 720p under a weak computer that can't handle more than 30-FPS. Just by switching the algorithm and taking into consideration only the benefit from a better sorted depth buffer write, I go a minimum of 1-2 FPS bonus. Sometimes as much a 5, but this was when staring directly at a steep cliff side form close up. Hey, free FPS. What is there not to like?
  • It gives implicit approximate sorted view-depth for objects in the chunk pool. Since we are rendering based on chunk proximity, the chunks that are closer are rendered first and so are the items in the chunks, thus increasing the change of a larger closer object to hide smaller objects behind it because of the place where the draw call was made on the frame render time axis. 
  • The relationship between items in a chunk and having two lists of chunks based on the frustum intersections allows us to pre-cull a big portion of the item list without touching the item list.
  • The order we construct the lists are very cache friendly.
  • We use a bitmap to give us the circles so we can easily change the shape of the circles, or if needed replace it with rectangles or some other shape. Anything goes, including the shape of two overlapping owl if that is needed, but I'll go with circles:

This is the current list of benefits what are actively used when rendering. There is one more, a pretty big on I have not implemented yet:
  • It allows for single-cache approach when rendering the scene multiple times. If I were to cache all the CPU computations for the items that will be rendered, this allows for a render procedure that only reads from a cache, without doing any computations. This is very helpful if you are rendering the scene more than once for different effects, because you are doing the CPU matrix calculations and CPU lighting bounds calculations only once and you can render as many times as you wish with different shaders. I'll add this in the future.

Implementing this was not as easy as it seams based on the description. To test that it works correctly, I implemented a primordial debug map:

I should have added the map before developing the algorithm, not after. That would have made things a lot easier and faster :)).

The debug view is updated in real time, and color coded. Red are the problematic chunks, the ones that are technically leaks if they persist. It is normal for them to appear, but in a few frames they should disappear. If they hang around that is an alarm signal of a memory leak. Currently there are no know leaks and red dots will almost never show up, and if they do they last just one frame. Gray marks chunks that should be leaded, but the streamer did not have time yet to schedule that loading. You can't see it in that screenshot and in real life scenarios gray dots only appear at the distant margins of your view radius. Cyan are the chunks that are loaded. Dark cyan/gray signals chunks outside the view frustum, light cyan/gray signals chunks inside. This color coding greatly helps with navigation because you know which direction you are heading.

This debug map will eventually evolve into the mini-map for the game. The scale must be changed first. That mini-map seems deceptively small, like if the current map was very small. This is because the chunk size is pretty big and the map that you see there is 16 square kilometers big. For the mini-map to be useful, it shouldn't render the whole map, but more like an area between 4 and 9 chunks big, centered on the character. 

No comments:

Post a Comment