
Scaling in games & virtual worlds

Online games and virtual worlds have familiar scaling requirements, but don’t be fooled: everything you know is wrong.

Jim Waldo, Sun Microsystems Laboratories

I USED TO BE LIKE YOU.

I used to be a systems programmer, working on infrastructure used by banks, telecom companies, and other engineers. I worked on operating systems. I worked on distributed middleware. I worked on programming languages. I wrote tools. I did all of the things that hard-core systems programmers do.

And I knew the rules. I knew that throughput was the real test of scaling. I knew that data had to be kept consistent and durable, and that relational databases are the way to ensure atomicity, and that loss of information is never an option. I knew that clients were getting thinner as the layers of servers increased, and that the best client would be one that contained the least amount of state and allowed the important computations to go on inside the computing cloud. I knew that support for legacy code is vital to the adoption of any new technology, and that most legacy code has yet to be written.

But two years ago my world changed. I was asked to take on the technical architect position on Project Darkstar, a distributed infrastructure targeted to the massive-multiplayer online-game and virtual-world market. At first, it seemed like a familiar system. The goal was to scale flexibly by enabling the dynamic addition (or subtraction) of machines to match load. There was a persistence layer and a communication layer. We also wanted to make the programming model as simple as possible, while enabling the system to use all the power of the new generations of multicore chips that Sun (and others) were producing. These were all problems that I had encountered before, so how hard could these particular versions of the problems for this particular market be? I agreed to spend a couple of months on the project, cleaning up the architecture and making sure it was on the right track while I thought about new research topics that I might want to tackle.

The three months have turned into two years (and counting). I’ve found lots of new research challenges, but they all have to do with finding ways to make the environment for online games and virtual worlds scale. In the process, I have been introduced to a different world of computing, with different problems, different assumptions, and a different environment. At times I feel like an anthropologist who has discovered a new civilization. I’m still learning about the culture and practice of games, and it is a different world.

Everything You Know is Wrong

The first thing to realize in understanding this new world is that it is part of the entertainment industry. Because of this, the most important goal for a game or virtual world is that it be fun. Everything else is secondary to this prime directive. Being fun is not an objective measure, but the goal is to provide an immersive, all-consuming experience that rewards the player for playing well, is easy to learn but hard to master, and will keep the player coming back again and again.

Most online games center around a story and a world, and the richness of that story and world has much to do with the success of the game. Design of the game centers on the story and the gameplay. Design of the code that is used to implement the game comes quite a bit later (and is often considered much less interesting). A producer heads the team that builds the game or the virtual world. Members of the team include writers, artists, and musicians, as well as coders. The group with the least influence on the game consists of the coders; their job is to bring the vision of others to reality.

The computational environment for online games or virtual worlds is close to the exact inverse of that found in most markets serviced by the high-tech industry. The clients are anything but thin; game players will be using the highest-end computing platforms they can get, or game consoles that have been specially designed for the computational rigors of these games. These client machines will have as much memory as can be jammed into the box, the latest and fastest CPUs, and graphics subsystems that have supercomputing abilities on their own. These clients will also have considerable capacity for persistent storage, since one of the basic approaches to these games is to put as much information as possible on the client.

The need for a heavyweight client is, in part, an outcome of the evolution of these games. Online games have developed from stand-alone products, in which everything was done on the local machines. This is more than entropy in the industry, however; keeping as much as possible on the client allows the communication with the server to be minimized, both in the number of calls made to the server and in the amount of information conveyed in those calls. This communication minimization is required to meet the prime directive of fun, since it is part of the way in which latency is minimized in these games.

Latency is the enemy of fun—and therefore the enemy of online games and virtual worlds. This is especially interesting in the case of online games, where the latency of the connection between the client and the servers cannot be controlled. Therefore, the communication protocol needs to be as simple as possible, and the information transmitted from the client to the server must fit into a single packet whenever possible. Further, the server needs to be designed so that it is doing very little, ensuring that whatever it is doing can be done very quickly so a response can be sent back to the player. Some interesting tricks have been developed to mask unavoidable latency from the player. These include techniques such as showing prerecorded clips during the loading of a mission or showing a “best guess” immediately at the result of an action and then repairing any differences between that guess and the actual result when the server responds.
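
As a rough illustration of the “best guess” trick, the sketch below (in Java, with invented names rather than code from any actual game) applies the player’s input to the displayed state immediately and then blends toward the server’s authoritative result once it arrives:

    public class PredictedState {
        private double shownX, shownY;   // the position currently shown to the player

        /** Apply the player's input immediately so the game feels responsive. */
        public void applyLocalGuess(double dx, double dy) {
            shownX += dx;
            shownY += dy;
        }

        /** When the server's authoritative result arrives, repair any divergence. */
        public void repairFromServer(double serverX, double serverY) {
            // Blend toward the server's position rather than snapping, to hide the correction.
            shownX += (serverX - shownX) * 0.5;
            shownY += (serverY - shownY) * 0.5;
        }

        public double getShownX() { return shownX; }
        public double getShownY() { return shownY; }
    }

The blending factor here is arbitrary; a real game tunes how quickly corrections are applied so that the repair itself is not visible to the player.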

The role of the server is twofold. The most obvious is to allow players to interact with each other in the context of the game. This role is becoming more important and more complex as these games and worlds become increasingly elaborate. The original role of the server was to allow players to compete with each other in the game. Now games and virtual worlds are developing their own societies, where players may compete but may also cooperate or simply interact in various ways. Virtual worlds allow users to try out new personalities; games let players cooperate to do tasks that they would be unable to complete individually. In both, players are finding that a major draw of the technology is using it to connect to other people.

The second role of the server is to be the arbiter of truth between the clients. Whether the client is running on a console or on a personal computer, control rests in the hands of the player. This means that the player has access to the client program, and the competitive nature of the games gives the player motivation to alter the client in the player’s favor. Even in virtual worlds, where there is only social competition, the desire to “enhance the opportunity” of the individual player (also known as “cheating”) is common. This requires that the server, which is the one component that is not under the control of the players, be the arbiter of the true state of the game. The game server is used both to discourage cheating (by making it much more difficult) and to detect cheating (by seeing patterns of divergence between the game state reported by the client and the game state held by the server). Peer-to-peer technologies might seem a natural fit for the first role of the game server, but this second role means that few if any games or worlds trust their peers enough to avoid the server component.

Current Scaling Strategies

The use of the singular term server in the previous section represents a conceptual illusion of the system structure that can be maintained only by the clients of the game or world. In fact, any online game or virtual world will involve a large number of servers (or will have failed so miserably that no one either can or wants to remember the game or world). Using multiple servers is a basic mechanism for scaling the server component of a game to the levels that are being seen in the online world today. World of Warcraft has reported more than 5 million subscribers with hundreds of thousands active at any one time. Second Life reports usage within an order of magnitude of World of Warcraft, and there is some evidence that sites such as Webkinz or Club Penguin are even more popular. A single server is not able to handle such load, no matter how efficient the representation. Even if a single server could deal with this load, such a server would be far too expensive for the smaller loads that are encountered (sometimes by the same games or worlds) at times of low demand (or in parts of the product’s life cycle when demand has decreased).

Having multiple servers means that part of building the game is deciding how to partition the load over these servers. Two techniques are commonly used in both online games and virtual worlds. Sometimes only one of the techniques is used, sometimes both, depending on the nature of the game or world.

The first technique is to exploit the geography of the game or world, decomposing the game into different areas, each of which can be mapped to a hosting server. For example, an island in Second Life corresponds to a physical server running the code for the shared reality of the world. Similarly, different areas of the World of Warcraft universe are hosted on different physical machines. Anyone who is in the area will connect to the same server, and interactions among the players on that server can be localized (and optimized). Actions happening in a different part of the world are not likely to affect those in this part of the world, so the communication traffic between servers can be kept small.

The second technique is known as sharding. A shard is a copy of a part of the game or virtual world. Different shards reside on different servers, and players who are assigned to one shard can interact with the world and other players in the shard, but will not see (or be able to interact with) players or objects in other shards. Shards not only allow more players to be supported in the world, but also permit independent explorations into the world by different sets of players. Thus, when a new quest or mission is added to a game, it will often be replicated with multiple shards so that more than one player (or group of players) can experience the quest or mission in its original state.
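
The two partitioning schemes can be summarized in a few lines of code. The sketch below (Java, with hypothetical names and a deliberately naive shard-assignment policy, not taken from any real game) maps a geographic region to its hosting server and picks one copy of a shard for a player:

    import java.util.List;
    import java.util.Map;

    public class Partitioning {

        /** Geographic decomposition: each named region is pinned to one hosting server. */
        static String serverForRegion(String region, Map<String, String> regionToServer) {
            return regionToServer.get(region);
        }

        /** Sharding: the same content exists as several copies; pick one copy for a player. */
        static String shardForPlayer(long playerId, List<String> shardServers) {
            int index = (int) Math.floorMod(playerId, (long) shardServers.size());
            return shardServers.get(index);
        }

        public static void main(String[] args) {
            Map<String, String> regions = Map.of("island-7", "server-03", "island-8", "server-04");
            System.out.println(serverForRegion("island-7", regions));               // server-03
            System.out.println(shardForPlayer(42L, List.of("shard-a", "shard-b"))); // shard-a
        }
    }

Hashing a player onto a shard is the simplest possible policy; as the next paragraphs explain, real games also have to let groups of friends land in the same shard, which is exactly where the coordination difficulty comes from.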

Although sharding and geographic decomposition allow multiple servers to be used to handle the load on a single game or world, they do present the developer with significant challenges. By creating noninteracting copies of parts of a world, shards isolate the players in different shards from each other. This means that players who want to share their experience of the world or game need to become aware of the different shards that are being offered, and arrange to be placed in the same shard. As the number of players who want to be in the same shard increases (some guilds—groups of players who cooperatively play in a single game over an extended period of time—have hundreds of members), the difficulty of coordinating placement into shards increases and interferes with the experience of the world. While shards allow scale, they do so at the price of player interaction.

Geographic decomposition does not limit player interaction, but does require that the designers of the game be able to predict the size of a geographic area that will be the correct unit of decomposition. If one geographic area becomes very popular, play on that area will slow down as the server associated with the area is overloaded. If a geographic area is less popular than originally predicted, computer hardware (and money) will be wasted on that section because not enough players are there. Since the geographic decomposition is hardwired into the code of the game or world, changing the decomposition in response to observed user behavior requires rewriting part of the game or world itself. This takes time, can introduce bugs, and is very costly. While this is being done, gameplay can be adversely affected. In extreme cases, this can have a major financial impact. When World of Warcraft was introduced, the demand for the game so outstripped its capacity that subscriptions had to be closed off for months while the code that distributed the game was rewritten.

Changing Chip Architectures

Scaling over a set of machines is a distributed computing problem, and the game and virtual-world programming culture has had little experience with this set of problems. This is hardly the only place where scaling requires the game programmer to learn a new set of skills. A change in the trend of chip design also means that these programmers must learn skills they have never had to exercise before.

With the possible exception of the highest end of scientific computing, no other kind of software has ridden the advances of Moore’s law as aggressively as game or virtual-world programs. As chips have gotten faster, games and virtual worlds have become more realistic, more complex, and more immersive. Serious game players invest in the very best equipment that they can obtain, and then use techniques such as overclocking to push even more performance out of those systems.

Now, however, chip designers have decided to exploit Moore’s law in a different way. Rather than increasing the speed of a chip, they are adding multiple cores to a chip running at the same (or sometimes slower) clock speed. There are many good reasons for this, from simplified design to lower power consumption and heat production, but it means that the performance of a single program will not automatically increase when you run the program on a new chip. Overall performance of a group of programs may increase (since they can all run in parallel) but not the single program (unless it can be broken into multiple, cooperating threads). Games are written as single-threaded programs, however.

In fact, games and virtual worlds (and especially the server side of these programs) should be perfect vehicles to show the performance gains possible with multicore chips and groups of cooperating servers. Games and virtual worlds are embarrassingly parallel, in that most of what goes on in them is independent of the other things that are happening. Of the hundreds of thousands of players who are active in World of Warcraft at any one time, only a very small number will be interacting with any particular player. The same is true in Second Life and nearly all large-scale games or worlds.

The problem is that the culture that has grown up around games and virtual worlds is not one that understands or is overly familiar with the programming techniques that are required to exploit the parallelism inherent in these systems. These are people who grew up on a single (PC) machine, running a single thread. Asking them to master the intricacies of concurrent programming or distributed systems takes them away from their concentration on the game or world experience itself. Even when they have the desire, they don’t have the time or the experience to exploit these new technologies.

Project Darkstar

It is for these reasons that we started Project Darkstar (http://www.projectdarkstar.com), a research effort attempting to build a server-side infrastructure that will exploit the multithreaded, multicore chips being produced and scale over a large group of machines, while presenting the programmer with the illusion that he or she is developing in a single-threaded, single-machine environment. Hiding threading and distribution is, in the general case, probably not a good idea (see http://research.sun.com/techrep/1994/abstract-29.html for a full argument). Game and world servers tend to follow a very restricted programming model, however, in which we believe we can hide both concurrency and distribution.

The model is a simple event-based one in which input from the client is received by the server, which then sets off a task in response to that event. These tasks can change the state of the world (by moving a player, changing the state of an object, or the like) and initiate communication. The communication can be to a single client or to a group of clients that are all subscribed to the same communication channel.

We chose this model largely because this is the way most game and virtual-world servers are already structured. The challenge was then to keep this model and allow servers written in this style to be scaled over multiple cores (running multiple threads) and multiple servers. We were not trying to take existing code and allow it to run within our system. This would have made the task much more difficult and would not have corresponded to the realities of the game and virtual-world culture. Game and world servers are written from scratch for each game or world, perhaps reusing some libraries but rarely, once running, being rehosted into a different environment. Efforts to bring different platforms into the game are restricted to the client side, where new consoles bringing in new players may be worth the effort.

Darkstar provides a container in which the server runs. The container provides interfaces to a set of services that allow the game server to keep persistent state, establish connections with clients, and construct publish/subscribe channels with sets of clients. Multiple copies of the game server code can run in multiple instances of the Darkstar container. Each copy can be written as if it were the only one active (and, in fact, it may be the only one active for small-scale games or worlds). Each of the servers is structured as an event loop—the main loop listens on a session with a client that is established when the client logs in. When a message is delivered, the event loop is called. The loop can then decode the message and determine the game or world action that is the appropriate response. It then dispatches a task within the container.
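
The shape of that event loop is roughly the following. This is a hedged sketch in Java; the interfaces are invented for illustration and are not the actual Darkstar API:

    interface GameMessage {
        byte opCode();
        byte[] payload();
    }

    interface TaskScheduler {
        /** Hand a unit of work to the container; the container decides where and when it runs. */
        void schedule(Runnable task);
    }

    class SessionListener {
        private final TaskScheduler scheduler;

        SessionListener(TaskScheduler scheduler) {
            this.scheduler = scheduler;
        }

        /** Called by the container whenever the client sends a message on its session. */
        void receivedMessage(GameMessage msg) {
            switch (msg.opCode()) {
                case 1 -> scheduler.schedule(() -> handleMove(msg.payload()));
                case 2 -> scheduler.schedule(() -> handleChat(msg.payload()));
                default -> { /* unknown opcode: ignore */ }
            }
        }

        private void handleMove(byte[] payload) {
            // Decode the move and update the player's position through the data service.
        }

        private void handleChat(byte[] payload) {
            // Forward the text to a publish/subscribe channel shared by nearby players.
        }
    }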

Each of these tasks can read or change data in the world through the Darkstar data service, communicate with the client, or send messages to groups of other game or world participants via a channel. Under the covers, the task is wrapped in a transaction. The transaction is used to ensure that no conflicting concurrent access to the world data will occur. If a task tries to change data that is being changed by some other concurrent task, the data service will detect that conflict. In that case, one of the conflicting tasks will be aborted and rescheduled; the other task should run to completion. Thus, when the aborted task is retried, the conflict should have disappeared and the task should run to completion.
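
The abort-and-retry behavior can be pictured with a toy retry loop. This is a sketch only; the exception type and retry limit are invented, and Darkstar’s real scheduler and data service do considerably more:

    class ConflictException extends RuntimeException {
    }

    class RetryingExecutor {
        private static final int MAX_ATTEMPTS = 5;

        /** Run a task, rescheduling it if it aborts because of a conflicting data access. */
        static void runTask(Runnable task) {
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    task.run();      // the task reads and writes objects through the data service
                    return;          // the transaction committed, so we are done
                } catch (ConflictException conflict) {
                    // Another task touched the same object; by the time we retry,
                    // that task should have finished and the conflict should be gone.
                }
            }
            throw new IllegalStateException("task still conflicting after " + MAX_ATTEMPTS + " attempts");
        }
    }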

This mechanism for concurrency control does require that all tasks access all of their data through the Darkstar data service. This is a departure from the usual way of programming game or world servers, where data is kept in memory to decrease latency. By using results from the past 20 years of database research, we believe that we can keep the penalty for accessing through a data service small by caching data in intelligent ways. We also believe that by using the inherent parallelism in these games, we can increase the overall performance of the game as the number of players increases, even if there is a small penalty for individual data access. Our data store is not based on a standard SQL database since we don’t need the full functionality such a database provides. What we need is something that gives us fast access to persistently stored objects that can be identified in simple ways. Our current implementation uses the Berkeley Database for this, although we have abstracted our access to it to provide the opportunity to use other persistence layers if required.
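
What the data service has to offer the game code is something like the following minimal interface. The names are hypothetical, and the in-memory class is only a stand-in for the Berkeley Database-backed implementation the text describes:

    import java.io.Serializable;
    import java.util.concurrent.ConcurrentHashMap;

    interface ObjectStore {
        /** Bind a persistent object to a simple name. */
        void put(String name, Serializable object);

        /** Fetch the object bound to a name, or null if there is none. */
        Serializable get(String name);
    }

    /** In-memory stand-in used here only for illustration. */
    class InMemoryObjectStore implements ObjectStore {
        private final ConcurrentHashMap<String, Serializable> objects = new ConcurrentHashMap<>();

        @Override
        public void put(String name, Serializable object) {
            objects.put(name, object);
        }

        @Override
        public Serializable get(String name) {
            return objects.get(name);
        }
    }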

Concurrency control is not the only reason to require that all data be accessed through the data store. By backing the data in a persistent fashion rather than keeping it in main memory, we gain some inherent reliability that has not been exhibited by games or worlds in the past. Storing all of the data in memory means that a server crash can cause the loss of any change in the game or world since the last time the system was checkpointed. This can sometimes be hours of play, which can cause considerable consternation among the customers and expensive calls to the service lines. By keeping all data persistently, we believe we can ensure that no more than a few seconds of game or world interaction will be lost in the case of a server crash. In the best case, the players won’t even notice such a crash, as the tasks that were on the server will be transferred to another server in a fashion that is transparent to the player.

The biggest payoff for requiring that all data be kept in the data store is that it helps to make the tasks that are generated in response to events in the game portable. Since the data store can be accessed by any of a cluster of machines that are running the Darkstar stack and the game logic, there is no data that cannot be moved from machine to machine. We do the same with the communication mechanisms, ensuring that a session or channel that is connecting the game and some set of clients is abstracted through the Darkstar stack. This allows us to move the task using the session or channel to another machine without affecting the semantics of the task talking over the session or channel.

This task portability means we can dynamically balance the load on a set of machines running the game or virtual world. Rather than splitting the game up into regions or shards at compile time, virtual worlds or games based on the Darkstar stack can move load around the network of server machines at runtime. While the participant might see a short increase in latency during the move, the overall latency will be decreased after the move. By moving tasks, we not only can balance the load on the machines involved, but also try to collocate tasks that are accessing the same set of data or that are communicating with each other. All of these mechanisms allow us to determine, while the game is being played, which tasks (and which users) should be placed on the same server.

The project is in its early stages of development and deployment. It is based on an open-source licensing model and community, so we are relying on our users to educate us about the needs of the community that will build the games and worlds that will run on the infrastructure. The research is part computer science and part anthropology, but each of the cultures has an opportunity to learn much from the other.

Even at this early stage, it is clear that this is going to be a complex venture. While early experience with the code has shown that the programming model does relieve the game or world server programmer from thinking about threads and locking, it has also shown that there are places where they do have to understand something about the underlying concurrency of the system. The most obvious of these is in the design of the data structures. One of the earliest users of our code was getting terrible performance from the system. When we looked at the code, we discovered that a single object was written to on every task, updating a global piece of game state. By designing the server in this way, this user effectively serialized all of the tasks that were running in the system, making it impossible for the server to get any advantage from the inherent parallelism in the game. Some minor redesign, breaking the single object into many (much smaller) objects, removed this particular bottleneck, with resulting gains in overall performance. This experience also taught us that we need to educate users of the system in the design of independent data structures that can be accessed in parallel.
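
The pattern and its fix look roughly like this. The sketch uses ordinary Java concurrency classes rather than the Darkstar data service, and the names are invented, but the design point is the same: one object written by every task serializes the tasks, while many small per-player objects do not.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    /** Anti-pattern: every task writes this single object, so every task conflicts with the others. */
    class GlobalScoreboard {
        private long totalPoints = 0;

        synchronized void record(long points) {
            totalPoints += points;
        }

        synchronized long total() {
            return totalPoints;
        }
    }

    /** The fix: each player updates only their own counter, so tasks rarely touch the same object. */
    class PartitionedScoreboard {
        private final ConcurrentHashMap<String, LongAdder> pointsByPlayer = new ConcurrentHashMap<>();

        void record(String playerId, long points) {
            pointsByPlayer.computeIfAbsent(playerId, id -> new LongAdder()).add(points);
        }

        long total() {
            return pointsByPlayer.values().stream().mapToLong(LongAdder::sum).sum();
        }
    }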

Our own implementation has not been without some excitement. When we moved from a multithreaded server that ran on a single machine to an implementation that runs on multiple machines, we expected some degradation in the performance of the single-machine system. We were delighted to find that the degradation on the single-node system was not nearly as large as we thought it would be, but we found that additional machines lowered the capacity of the overall system. Once we looked at the measurements, this was not all that surprising to understand: the possibility for contention on multiple machines is greater than that on a single machine, and discovering and recovering from such contention takes longer. We are working on removing the choke points so that adding equipment actually adds capacity.

Measuring the performance of the system is made especially challenging by the lack of any clear notion of what the requirements of the target servers are. Game developers are notoriously secretive, and the notion of a characteristic load for a game or virtual world is not something that is well documented. We have some examples that have been written by the team or by people we know in the game world, but we cannot be sure that these are accurate reflections of what is being written by the industry. Our hope is that the open-source community that is beginning to form around the project will aid in the production of useful performance and stress tests.

Seen in a broader light, the project has been and continues to be an interesting experiment in building levels of abstraction for the world of multithreaded, distributed systems. The problems we are tackling are not new. Large Web-serving farms have many of the same problems with highly variable demand. Scientific grids have similar problems of scaling over multiple machines. Search grids have similar issues in dealing with large-scale environments solving embarrassingly, but not completely, parallel problems.

What makes online games and virtual worlds interestingly different are the very different requirements they bring to the table compared with these other domains. The interactive, low-latency environment is very different from grids, Web services, or search. The growth from the entertainment industry makes the engineering disciplines far different from those others, as well. Solving these problems in this new environment is challenging, and adds to our general knowledge of how to write software on the emerging class of multithreaded, multicore, distributed systems.

And best of all, it’s fun.

JIM WALDO is a Distinguished Engineer with Sun Microsystems Laboratories, where he conducts research on large-scale distributed systems. Prior to (re)joining Sun Labs, he was the lead architect for Jini, a distributed programming system based on Java. He spent eight years at Apollo Computer and Hewlett-Packard, where he led the design and development of the first object request broker and was instrumental in getting that technology incorporated into the first OMG CORBA specification. Waldo is an adjunct faculty member at Harvard University, where he teaches distributed computing in the department of computer science. He has a Ph.D. in philosophy, holds M.A. degrees in both linguistics and philosophy, and has never taken a real computer science course.


Originally published in Queue vol. 6, no. 7
