Monday 4 September 2006

What's wrong with WoW? [Lurks]

It's always been cause for concern that Blizzard, due to their publisher connection, chose to run World of Warcraft's servers out of France, a country not exactly renowned for its hardcore Internet expertise. Still, deal with it they did, but more or less as you'd expect. Every patch-time downtime was longer than in the US. Servers in general are less reliable than in the US. Latency to the servers is generally higher than that to the US servers (from folks in the US) and so on.
However at time of writing, things have taken a dramatic turn for the worse. The game is essentially unplayable in the context of 40-man raiding, which is basically most of what we're doing in WoW these days, since we've been playing it since it came out. It takes an enormous amount of effort to put together a raid. People have to sign up on our web raid calendar, then an officer makes selections. Then people show up, someone passes out potions from the raid bank, substitutions are made, player stuff and potions are crafted/brewed on the spot, groups are set up, tactics explained, everyone sorted on TeamSpeak and you're ready to go.
Only to have half the raid disconnect as soon as you pull. Meaning a wipe, which means everyone has to run back and buff up and you do it all over again. Blizzard are nothing if not a modern e-commerce firm in that they make it pretty much impossible to engage them in a one-on-one way unless, of course, it's a billing issue. So people have had to turn to the technical support forum. With that filling up with crap, they finally made an address available to email: WoW*ConcernsEU*blizzard.com (delete the first asterisk, replace the second with an @). So I sent them an email, largely in futility, but the nuts and bolts are:

At the moment you're chucking a party every night and everyone shows up to discover there's no beer. Tell us there's no beer and we'll go to another party instead.

Scarily we have about 130 accounts in our guild which is well over a grand (in proper money) a month they're getting from us. Nice work if you can get it and it would be, you'd think, worth a simple communication to tell us what the hell is up and when they might fix it. The only thing they do say is that it's a network issue but generally our latency is fine. We just get disconnected.
This leads me to speculate on what's going on in a purely non-productive and fun sort of way. Some of it I can deduce with evidence and some of it is just a guess based on a pretty rudimentary knowledge of how to do server stuff, but it's mostly guesswork. I don't often speculate so please take this with a large pinch of salt.
We know that they run the continents separately. Perhaps two game server processes on the same box. We also know that they have a separate instance server. It's not clear how much hardware that is, but it wouldn't need to be on the same box. Certainly the fact that they've implemented cross-realm PvP would seem to indicate that they use different boxes for instance servers, or at least PvP.
There are three general 'lag' issues you see in WoW, near as I can tell. First of all, your standard network latency. This is a bit daft since the in-game display is averaged over a very long time, so if your latency comes good, it still stays red for ages. Which means people disconnect and reconnect just to make it green, thinking they've fixed their latency when in fact all they did was clear the latency measuring buffer. They really ought to fix that.
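To see why a long averaging window misleads, here's a minimal sketch. It assumes the display is a plain moving average over the last N samples; the window sizes, the sample trace and the function name are all made up for illustration, and how the WoW client really computes its figure isn't known.

```python
from collections import deque

def displayed_latency(samples, window):
    """Return the moving-average reading shown after each new sample."""
    recent = deque(maxlen=window)  # oldest samples fall out automatically
    readings = []
    for ms in samples:
        recent.append(ms)
        readings.append(sum(recent) / len(recent))
    return readings

# Hypothetical trace: 60 bad samples at 800 ms, then the link recovers to 100 ms.
trace = [800] * 60 + [100] * 20

long_window = displayed_latency(trace, window=60)
short_window = displayed_latency(trace, window=5)

# Twenty good samples after recovery, the long window is still mostly old data.
print(round(long_window[-1]), round(short_window[-1]))

# "Reconnecting" amounts to starting again with an empty buffer.
after_reconnect = displayed_latency([100] * 5, window=60)
print(round(after_reconnect[-1]))
```

With these made-up numbers the long window still reads 567 ms when the real latency has been 100 ms for twenty samples, while a fresh buffer reads 100 ms straight away. A shorter window or an exponentially weighted average would track the recovery without the reconnect ritual.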
The second lag issue is what I'll call server lag. This is basically the game coming back and acknowledging an action you've performed. If you've got server lag, you'll see the start of a casting animation, for example, but nothing more. The action won't complete. This also manifests as chat lag. This lag comes and goes all the time, even when the servers are working okay-ish.
The third lag is what I'll call database lag. Basically if you do anything that results in a write to your or anyone else's character data, it's pretty obvious this requires a write to a database. The most obvious way this manifests is in 'trade lag'. You trade someone a simple item and it takes ages to complete after you've both hit accept. Warlocks also know this lag because unfortunately a db write is triggered when you soul drain just before killing a mob to get a soul shard. In fact the game often eventually gives up and you don't get your shard.
So, imagining a server topology, per 'server' like Runetotem, we have two realm server processes which act like the game servers we're familiar with from the likes of Quake. These just handle people walking around and talking. The processes cover the two big island continents in WoW. Presumably they will add a third for the expansion. We also have a big ass database which, I think we can reasonably assume, is common to everything on the server.
Note: Further rampant speculation. For cross-server PvP, I think each PvP server game instance box has read/write access to the databases for each game server. When you zone in, it takes a copy of your current character and when you zone out, it writes it back to your server DB. This wasn't working very well post the patch and I suspect hardware upgrades were associated with this somehow.
We also have instance server processes. We don't know if they're on the same box or not. Potentially this is a really obvious way of spreading the server load since instances are by their very nature quite bandwidth intensive compared to the regular old world server. So I suspect they are on a different box but we can't know that. On the realm's instance box, instances are, I guess, simply spawned as processes when people enter.
Which brings us to the issue we're having right now. We all zone into the instance and people disconnect. Why? I don't know. I think it can be shown that WoW disconnects people when the server is trying to send them a lot of data and they can't keep up, either because there's a network problem along the way or because of a client issue. There has to be a client component to that because some people are more susceptible than others, and some have influenced how likely they are to disconnect by removing mods. Which is quite strange.
But see, here's the thing that's not immediately obvious to people running WoW. When you run a mod, it inserts triggers based on events, and when the event happens, functions are run and lua code starts doing stuff. Any time it does stuff, what it tends to do is use some predefined functions to check the status of this or that, normally just to initialise variables to use later. I might not be explaining this very well. It's a bit like having a person who may be called upon to tell you how many red objects are in the room. The only thing is, this person is in a black box. He has to gather information about the scene one time only before we close the lid on the box. WoW mods are like that. They don't run all the time; many of them are actually triggered on zoning into an instance. You can see them all trigger in your combat/general logs scrolling in the little windows.
No problem there, you think? Well, actually there sort of might be. I don't think WoW actually pulls down every bit of data about everything, in fact I know it doesn't. If you trigger a function which checks on, say, the reputation of a given faction of the targeted player, deep within WoW's binary it has to inject a request for that data into the upstream communication with the server. Then it has to get a result back (stalling your lua script and often your level load/zone, incidentally) before the lua function can return the data your script is asking for. Why is this a problem?
Well, we run a LOT of mods. All of them are basically trying to memorise quite a lot of stuff from the instance and all of them end up injecting those requests into your WoW network communication to the server, and then sit back waiting for results from the server before the lua scripts can exit. This takes a while and it's quite a lot of data per client. At some point the WoW instance server process appears to become unhappy with this and disconnects you. Under exactly what conditions, it's impossible to tell.
So why is it worse now than it was before? I have no idea. However, if I had to guess I'd say that the instance server is running on the cross-realm PvP server hardware. It might be being loaded by players from several servers now and is hence busier than it was. That's the only real change as of the last patch and that's what would reasonably require a hardware upgrade, which was the change that immediately preceded the disaster we're currently seeing.
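The theory in the last few paragraphs can be put into a toy model. This is a sketch of the guess only, not of how the WoW client or its lua API actually works: `FakeClient`, `query_server`, the `ZONE_IN` event name and all the numbers are hypothetical.

```python
class FakeClient:
    """Toy client: every lookup a mod makes costs one server round trip."""

    def __init__(self, round_trip_ms):
        self.round_trip_ms = round_trip_ms
        self.handlers = {}       # event name -> list of callbacks
        self.requests_sent = 0
        self.stalled_ms = 0

    def register(self, event, callback):
        self.handlers.setdefault(event, []).append(callback)

    def query_server(self, key):
        # Blocking lookup: the script waits for the reply before continuing.
        self.requests_sent += 1
        self.stalled_ms += self.round_trip_ms
        return f"value-of-{key}"

    def fire(self, event):
        for callback in self.handlers.get(event, []):
            callback(self)


def make_mod(keys):
    """Build a mod whose handler initialises its variables from the server."""
    def on_zone_in(client):
        return [client.query_server(k) for k in keys]
    return on_zone_in


client = FakeClient(round_trip_ms=150)
for n in range(20):  # 20 installed mods, 10 lookups each (made-up numbers)
    client.register("ZONE_IN", make_mod([f"mod{n}-field{i}" for i in range(10)]))

client.fire("ZONE_IN")
print(client.requests_sent, client.stalled_ms)
```

With these made-up figures a single zone-in fires 200 requests and, if each one really blocks, stalls for 30 seconds in total. Multiply by 40 raiders hitting the same instance server at once and the load is at least plausible.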
An instance server ends up facing the same sort of challenges that a busy web server does. Lots of people connecting, wanting stuff. The more people that do it, the longer it takes to serve a connection, and at some point you end up failing miserably to cope: the queue of current connections, all slowed to a crawl, becomes so long that you have no choice but to time them out or run out of memory. Also you want to time some out, to free up the process to try to serve someone else. On a web server you get a "too many users" message, on WoW you disconnect.
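A rough simulation of that overload behaviour, assuming the server shares a fixed capacity equally between everyone still zoning in and drops whoever is still waiting at a timeout. The sharing policy, the units and every number here are assumptions for illustration; nothing is known about how Blizzard's servers schedule work.

```python
def simulate_zone_in(work_per_client, capacity_per_tick, timeout_ticks):
    """Share capacity equally; drop anyone still waiting at the timeout.

    Returns (served, dropped).
    """
    remaining = [float(w) for w in work_per_client]
    served = dropped = 0
    tick = 0
    while remaining:
        tick += 1
        share = capacity_per_tick / len(remaining)
        remaining = [r - share for r in remaining]
        served += sum(1 for r in remaining if r <= 1e-9)
        remaining = [r for r in remaining if r > 1e-9]
        if tick >= timeout_ticks:
            dropped += len(remaining)  # server gives up on the stragglers
            remaining = []
    return served, dropped


# Hypothetical raid: 20 players with few mods (50 units of data each)
# and 20 with lots of mods (150 units each).
mixed_raid = [50] * 20 + [150] * 20
print(simulate_zone_in(mixed_raid, capacity_per_tick=200, timeout_ticks=15))

# Same raid after everyone strips their mods down.
lean_raid = [50] * 40
print(simulate_zone_in(lean_raid, capacity_per_tick=200, timeout_ticks=15))
```

In this toy setup the mixed raid loses exactly the 20 mod-heavy players while the lean raid gets everyone in, which is at least consistent with half the raid dropping on the pull and with mod removal helping some people.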
You can, I think, improve things a bit by removing mods because you're asking for less data from your level load for the reasons I mentioned above, but really it's a straight capacity issue and a software design flaw.
Where it gets really strange, and which subverts the above theory to some extent, is that the disconnection issue does tend to resolve itself when you've reduced the amount of live mob in the instance, such as in the famous Blackwing Lair 'suppression rooms' which are instantly crammed with many hundreds of monsters. I say subverts because your mods don't have any data to query regarding mob in your view other than one you have targeted. For the quantity of mob to be an issue, we start getting onto an issue which I do know a little bit about.
Network code in action multiplayer games. Normally games tend to try to cull things you can't see if they are beyond a certain distance as the crow flies. They do tend to send you data on entities which you can't see but which are within a certain distance, because you might hear them and it cuts down on the computational cost of working out if you can see something or not.
I think it can be empirically proven that WoW sends the basic location and movement vectors of every mob in an instance regardless of whether you can see them or not or how far away you are. (Well okay, I think there's probably a maximum visibility distance but it's a long way.) You can often break WoW a little and target something through a wall from a mile away. AQ40 is a good example: you can target something much later in the instance by looking left around the left point of that rock outcrop. BWL is a particularly bad instance in this regard because it's built a bit like a cube. As the crow flies, you're very close to a huge quantity of mob right as you zone in from the front door.
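The usual distance cull is about as simple as it sounds. A minimal sketch, with made-up entity names, positions and radius:

```python
import math

def relevant_entities(player_pos, entities, radius):
    """Names of entities within `radius` of the player, in a straight line.

    No line-of-sight test: walls are ignored, which is cheap and also
    covers things you can hear but not see.
    """
    return [name for name, pos in entities.items()
            if math.dist(player_pos, pos) <= radius]


# Hypothetical scene, coordinates in metres.
entities = {
    "guard_behind_wall": (10.0, 0.0, 0.0),
    "boss_far_away": (500.0, 0.0, 0.0),
    "whelp_on_floor_below": (0.0, 0.0, -30.0),
}

print(relevant_entities((0.0, 0.0, 0.0), entities, radius=100.0))
```

Note that the whelp on the floor below counts as close even though there's a floor in the way. That's the crow-flies problem: stack rooms on top of each other and a plain radius check pulls in the lot.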
Does it subvert my point or does it back it up? So now not only do we have your scripts and your general WoW client querying a load of variables and issuing server requests through your netcode connection, but also the damn instance server is trying to send you updates of every single mob in the instance at the same time. It's worse than that. Even the most basic netcode, and we'll assume WoW is exactly that, doesn't send 3D coordinates and facing direction for every mob in every update. Rather it's somewhat compressed and, while there are various schemes, basically it says 'mob x moved left 3 meters' or, if it's smarter, 'mob x is moving left at 3 meters per second'. The client then takes the previous known position and makes those changes.
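As a sketch of the 'mob x moved left 3 meters' idea, here's one possible delta scheme using Python's `struct` module. The packet layouts (a 16-bit mob id, four floats for a full update, three signed bytes in tenths of a metre for a delta) are invented for illustration and have nothing to do with WoW's real wire format.

```python
import struct

FULL = struct.Struct("<Hffff")  # mob id, x, y, z, facing
DELTA = struct.Struct("<Hbbb")  # mob id, dx, dy, dz in tenths of a metre

def encode_full(mob_id, x, y, z, facing):
    return FULL.pack(mob_id, x, y, z, facing)

def encode_delta(mob_id, old, new):
    # Each step must fit in a signed byte, so at most 12.7 m per update.
    steps = [round((n - o) * 10) for o, n in zip(old, new)]
    return DELTA.pack(mob_id, *steps)

def apply_delta(packet, positions):
    """Client side: take the previous known position and apply the change."""
    mob_id, dx, dy, dz = DELTA.unpack(packet)
    x, y, z = positions[mob_id]
    positions[mob_id] = (x + dx / 10, y + dy / 10, z + dz / 10)


positions = {7: (100.0, 50.0, 0.0)}              # what the client knows
packet = encode_delta(7, positions[7], (97.0, 50.0, 0.0))
apply_delta(packet, positions)

print(FULL.size, len(packet), positions[7])
```

In this made-up layout a delta is 5 bytes against 18 for a full update, but it's useless unless the client already holds the previous position.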
There is one vital exception to this. When you zone. Your client has no idea where the mob are so it has to get full 3D positions, model ids and modification attributes, alignments, animation stage positions and so on all when you zone in. Only when it's done that can it send you those deltas in the compact form later.
So what we have is you're zoning in and the instance server is telling you about every mob in the suppression rooms and probably fecking Nefarian as well, and your scripts are screaming out for updates at the same time. The absolute data that was sent to you was right only for a bit and if mob is moving around (and it is in the suppression rooms) then you've got a bunch of stacking up delta data as well which you need to get before you're properly 'zoned in'.
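Some back-of-envelope arithmetic for that zone-in burst. Every figure below (mob count, bytes per record, update rate, load time) is a made-up placeholder, so only the shape of the result matters, not the totals.

```python
def zone_in_bytes(mobs, moving_mobs, full_bytes, delta_bytes,
                  updates_per_sec, load_secs):
    """Bytes owed to one client: full snapshot plus deltas queued while loading."""
    snapshot = mobs * full_bytes
    backlog = moving_mobs * updates_per_sec * load_secs * delta_bytes
    return snapshot, backlog


snapshot, backlog = zone_in_bytes(
    mobs=300,           # hypothetical mob count within range of the door
    moving_mobs=150,    # hypothetical: half of them are wandering about
    full_bytes=64,      # position, model id, attributes, animation state
    delta_bytes=8,
    updates_per_sec=5,
    load_secs=20,
)
raid_total = 40 * (snapshot + backlog)

print(snapshot, backlog, raid_total)
```

With these placeholders the delta backlog (120,000 bytes) ends up bigger than the snapshot itself (19,200 bytes), and a 40-man raid zoning in together asks the server for about 5.5 MB in one go, before any mod traffic is counted.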
This comes back to my original conclusion of bad or at least incomplete software design. Much of this issue has been solved by people making multiplayer computer games in the past, and Blizzard kind of made things worse for themselves by having this event triggering system for interface add-ons. I shudder to think, with the number of people playing, what impact a modern network code strategy would have on the combined bandwidth. Significant, I think, highly fucking significant.
The strange thing is, Blizzard are no strangers to optimisation. You may not recall how bad IF was at one point. How long it took to load in to IF in particular. They changed the code to basically zone you in and kind of stream the entity updates to you after you'd loaded, so players popped into the world. They do not, however, appear to have applied that strategy to instances and mob, and now we're paying for it. There might be a reason, I guess, such as that you could zone in and get past some mob near the door because it hasn't popped into your world yet. I would say the easiest fix would be to apply the incremental entity loading and just fix the player's position until it's complete and the world has caught up.
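The suggested fix, sketched out: stream the entities in small batches and keep the player rooted until the last batch lands. The class and function names and the batch size are hypothetical; this is one way the idea could look, not Blizzard's code.

```python
def stream_entities(entities, batch_size):
    """Yield entities in small batches instead of one big zone-in dump."""
    for start in range(0, len(entities), batch_size):
        yield entities[start:start + batch_size]


class ZoningClient:
    def __init__(self, expected_count):
        self.expected_count = expected_count
        self.world = []
        self.can_move = False  # position stays fixed until the world catches up

    def receive(self, batch):
        self.world.extend(batch)
        if len(self.world) >= self.expected_count:
            self.can_move = True


mobs = [f"mob{i}" for i in range(25)]
client = ZoningClient(expected_count=len(mobs))

states = []
for batch in stream_entities(mobs, batch_size=10):
    client.receive(batch)
    states.append(client.can_move)

print(states)
```

The player is only released on the third and final batch, so nobody gets to sneak past mob that hasn't popped in yet, and the server never has to push the whole instance down the pipe in one lump.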
At any rate, I hope the above serves to illustrate that I actually believe the problems we're facing are less about levels of hardware and network performance issues than software design. That being the case you start to understand why they're not saying anything about the time frame. Or indeed anything at all.

10 comments:

  1. I reckon you're pretty spot on, but I think there may be some clustering and sharing bollocks going on. I doubt it's easy to equate metal boxes with realms like you suggest, and I think there's a bunch of sharing going on. The way they've bunched up cross-realm PvP suggests this too.
    Also, they use Oracle, and that costs a fortune per server, so it'll probably be a big database cluster rather than a db per server.
    As to the issues following the patch, one other thing they added, which is a real big fucking deal is the new raid channel. In the past, some of us joined channels, some of us didn't. Some mods communicated with each other, some didn't. Now everyone is in a raid channel, and those mods are chatting away right away. Might have some bearing?
    As to the mob location thing, fuck knows how they do that. Outside there must be some clever proximity shit going on, as the zones are just too big to get all the entity data. Perhaps instances work differently?

  2. You must be right about the databases. And as we discussed later, we both think that the famous Tuesday Lag is a backup prior to the patch. The backup has to be taken as of a specific point in time to avoid duplication of items etc., which means it has to do some mad stuff to do with journaling all of the current updates so that what's written out is a snapshot backup. I'm having a hard time explaining why the 'server lag' (see description of the three forms of lag) is so high when they're running the backup, unless it's a case of the local network bandwidth getting hammered? Hmm.
    As for outside, yeah, as I said there's a visible distance limit beyond which it doesn't bother giving you data and I assume that applies inside as well. You can see that flying around on a griff, but it's a pretty big distance and it's not really appropriate for the scales of distance and mob density in an instance. Actually, thinking about this more, they do cull entities more aggressively in an instance. Think of the top of the Sartura cave in AQ40. You can't see the bugs down the tunnels at the bottom. If it was outside, you would be able to. Course with BWL, the bloody suppression rooms are right nearby. Is it coincidence that all newer instances are much more spread out rather than being stacked like BWL? I doubt it.
    So how could they fix it? The smart thing would be to have the suppression rooms pop once you had killed Vael. That should be almost far enough that the game has basically distance culled all of the entities until you get a bit closer, like when you've cleared through the trash.

  3. You're right mate on the lag lessening slightly as an instance progresses. Last night's Molten Core was exactly the same: crap lag early on, loads of people DC'ing etc. Once we got past halfway, for me at least, the lag disappeared... Now I'm sure this is partly (mostly?) to do with the fact that we had cleared half of the mobs out of the instance, but it wasn't a gradual subsidence for me. It was literally lag fest, kill boss, ZERO lag??
    I'm a little like yourself Cal, I have a rudimentary understanding of how game boxes work and the like, but I can't fathom the miraculous lag disappearing act from one moment to the next. I then enjoyed a lag-free MC for the remainder of the instance. I could appreciate why the improvement happened if it had been gradual, but it wasn't, it was instant, for me at least.
    Not sure what they could do to rectify the lag problem tbh. I fear it's a problem that will never be solved.
    We (gnashy, elhomo, anu, vendetta (or twat if you like), and a few others) used to play Star Wars Galaxies fairly heavily, and it too suffered the same kind of lag problems. The reason for this lag was widely known to be partly the updates/patches they released and partly lacklustre servers/boxes.
    The servers/boxes side of things doesn't really need explaining, as it's an inherent problem in most online games: if your hardware sucks, so will the lag.
    Now the interesting side of the equation has been publicised by the SWGEmu teams and various coders who disassembled the Star Wars Galaxies installs/updates and found that Sony took an easy route to update their games. When something gets updated, changed, nerfed, tweaked or whatever, any changes at all... they don't ever remove the old code. All they do is fill in, patch over and add to, and leave any of the old redundant code in there, lurking. They claim it's useless code and that it won't cause any problems. But the main problem is that most of the alleged improvements to the code were to do with the networking and the brokers (auction houses) etc. The general opinion of the community's coder base is that Sony have basically decided to just paper over the cracks with sloppy patches, leaving more stuff broken or bugged by their inept patching, and along the way they have knackered the net code to a certain degree, as lag and pings continue to increase with every update they make live. Basically SWG is broken by the ineptness of not only the latest game designs (combat upgrades and radical game changes are never a good thing) but by the laziness of the software house themselves.
    One thing to bear in mind though is that SWG has a much bigger database to handle than WoW, and the playing areas are also much bigger, and the auction houses (not to mention the player houses and player vendors) hold literally thousands upon thousands of items. Dare I say it, in total across the galaxies and all the player houses and vendors (these are live, not instanced) there will be millions of items to track (including all player-held items, house items, vendor items and bank items), without even touching on stuff like mobs/NPCs etc. Maybe the game itself, in terms of scope and scale, is the downfall. Too ambitious? Anyway, food for thought.
    Now I'm not sure how Blizzard handle code replacement in patches, but I'm hoping they don't do a 'Sony' and leave all this useless broken code in there lurking. It may or may not be a contributory factor, but it isn't going to help things. I mention this mainly because not all the lag I get when things are going crappy is net lag. Although I run a reasonable PC and normally don't get any FPS lag, there are times when my FPS is floored for seemingly no reason, and this only happens in some of the raids. Strangely enough, I may get no FPS problems in BWL most of the time, but once in a while on a crappy night of lag, the FPS dwindles alarmingly. So what exactly is it that's doing all of this? I dunno.
    Enough rambling anyway, back to you guys. Bored you enough lol

  4. As you know, I have been playing over a satellite internet connection for almost all of my 60 levels of WoW, and my ping outside Europe is always, every day, between 900ms and 15000ms (values above 15k tend to result in a timeout disconnect). I find that:
    - The lag-o-meter is, as Lurks says or means to say, a lying sack of shit. It does not update in real time.
    - Mailboxes, the Bank, NPCs, PvP Trading and the AH are all largely unusable above ~5000ms because every user input event requires a client-server-client response. Roaming NPCs are bad at a better ping (Caretaker Alan in EPL!) because they walk off!
    - When flying on a griffon, at ~2000ms you may see some mobs down below you on the ground. At about 4000ms+ you will see no mobs on the ground, or they will appear and fade out almost immediately as they leave your range again (this isn't draw-distance radius, this is about how the game prioritises the mobs it downloads and how far away they are).
    - This might be most relevant to you: in a 5 or 10-man instance, I never had a situation where I can't see all the mobs. So this is very different to Ironforge, where I start with shadows, then my faceplate and my character appears (at first without guild tabard), then some characters with UNKNOWN nameplates stand on the shadows, then everyone arrives as intended. I have never had this behaviour in an instance, so no gradually populating cave, and no turning a corner on a lag spike and finding that the mobs are on holiday. When instancing, I see everything, and never anything else.

  5. If I track dragonkin in BWL, the minimap is filled with red dots from the suppression rooms when standing at the gate.

  6. Re: Tuesday night lag. I always assumed the process was something like:
    1. Checkpoint the DB.
    2. Back up the checkpoint.
    3. Bring down the servers for maintenance.
    If you are running heavy-duty DBs and have them set up properly, my understanding is that checkpointing should be quick and cause no issues. Backing up will hit your disk arrays hard, very hard. My belief is that this is what causes the lag, as the hardware cannot handle the I/O needed for a live database and the backup at the same time. It may be possible to throw some more money at the problem and fit a faster SAN, but my guess is that they are now at the point where they need to reduce the number of realms clustered on a single database and buy some more databases, which will not be cheap.
    Re: Lag clearing. I'm guessing here that this may have something to do with working sets/paging/thrashing. It could be that certain mods are checking all mobs in an instance? At the start this causes your server to thrash memory, but beyond a certain point the working set gets small enough that the server stops thrashing and everything runs smoothly. In theory this is easy to test, but I bet the devs aren't allowed near the production servers and probably don't have the testing facilities to simulate a BWL raid.
    These are just my musings, I could well be very wrong.

  7. Ah, but not being able to see bugs isn't the same as the entity data not being there. Not being able to see them will just be your client clipping the draw distance.
    If KL's saying the dragonkin show up on track from the gate, then the entity data is being sent as soon as you zone into BWL, multiplied by 40. Obviously the traffic goes down the more you kill.
    Spawning the dragonkin after you kill Vael is a good solution, yes.

  8. Mods aren't able to check mob in an instance. In fact mods can only check the status of a mob you have targeted. However the clients are being sent all the data - proven by KL mentioning the dots on the radar - which is basically the problem. I don't think the server is actually thrashy, it's a case of not being able to get all the requested data to the game client fast enough before something times out and results in a disconnect.

  9. I don't understand why the server is sending instance data to clients in the first place on zoning in.
    Should all that shit not be on your local drive, in the main, in the first place? Then it's only the vectors of the pre-defined mobs that need updating, rather than having to dump a whole slew of shit onto your client?

  10. The talk about the bandwidth not being able to handle the traffic between the realms in BG3 seems utter rubbish to me. First, Blizzard would see that happening damn quick and fix it. Secondly, that would mean all members of the raid would have the same problem.
    One observation I made, and this is NOT a flame war, is that neither Mustikka nor I have ever had a DC in BWL. I rarely have any lag at all in BWL apart from CT related lag (someone releases when dead). Both Mustikka and I use Macs. I think it's a client side protocol problem, most probably a time-out setting.
    And I am not sure the entity data about the distant mobs is being sent just because I can track them. In fact, if I track them, they might spawn. I do remember faintly that I have tracked mobs on radar and seen them on there before I have seen them spawn in view. That happens quite frequently with herbs: I see the yellow dot, go there, and it spawns in front of me with a nice glimmering effect.
    Also, the spawn of mobs is dependent on your viewing direction. If you track/look down from a griff looking forward, you can see quite a lot, but if you look backwards, you see almost nothing on the ground. No, I don't suggest we walk backwards into BWL, but it would be interesting to see if we get more/less/same DCs if we all face the portal.
