I will hold a Town Hall tomorrow to discuss the stability and performance issues raised in both Project Open Letter and additional questions posted in the feedback forum. In this post, I cover some of the questions raised in the letter, but first, two comments.

First, thank you for the effort you put into Project Open Letter. We are working hard to maintain effective channels for communication, and from a developer’s perspective, a clear list of specific problems passed along so politely is much appreciated.

Second, I am sorry for the problems you are experiencing. I understand the time and energy you are putting into Second Life and am making every effort to ensure that Second Life is what you need it to be.

Inventory loss

Inventory and other agent-specific data is distributed across multiple MySQL databases that we refer to as inventory servers. The name is historical, since a lot of data beyond inventory is stored there. This partitioning scheme provides scalability for much of the agent data but currently relies on manual rebalancing after periods of heavy Second Life growth. The scripts that drive rebalancing, while heavily tested and in use for over a year at this point, had a serious bug in their error handling. As a result, inventories could be damaged during a transfer. The good news is that the bug has been fixed and new transfer scripts have been deployed.
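To illustrate why error handling matters so much during rebalancing, here is a minimal sketch of a copy, verify, then cut over transfer. It is not our actual transfer script: the dictionaries standing in for inventory servers, the directory mapping, and the names transfer_agent, TransferError, and copy_rows are all hypothetical and exist only for this example.

```python
class TransferError(Exception):
    """Raised when a copied inventory cannot be verified."""


def transfer_agent(agent_id, source, dest, directory, dest_name, copy_rows=list):
    """Move one agent's rows between shards, verifying before cutover.

    source and dest are dicts of agent_id -> list of rows, and directory maps
    agent_id -> owning shard name. All are hypothetical stand-ins for databases.
    copy_rows exists only so that a faulty copy can be simulated.
    """
    rows = source[agent_id]
    copied = copy_rows(rows)
    if copied != rows:
        # Fail loudly and leave the source and the directory untouched.
        raise TransferError(f"copy of {agent_id} did not verify")
    dest[agent_id] = copied
    directory[agent_id] = dest_name  # repoint lookups only after verification
    del source[agent_id]             # remove the original last


source = {"agent-1": ["hat", "shoes", "script"]}
dest = {}
directory = {"agent-1": "inv-a"}
transfer_agent("agent-1", source, dest, directory, "inv-b")
print(directory["agent-1"], dest["agent-1"])
```

The point of the ordering is that a failed or partial copy raises before anything is repointed or deleted, so the worst case is a transfer that has to be retried, not a damaged inventory.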

There are other errors that could manifest themselves as inventory loss, such as dataserver stalls, failure to complete transactions correctly, failure to properly cache the inventory on the client, or failure of the asset system to deliver an asset pointed to by inventory. Two separate teams are examining these problems, and there are numerous fixes either recently rolled out or in QA. The first team is focused on fixing existing bugs, while the second team is developing a more robust, distributed transaction system. We are also reviewing the design with outside developers familiar with building transaction systems. As we get further into the design process on the next-generation transaction system, we will publish the spec as well.

Inventory limits

We do not have any projects underway to limit inventory sizes, although it has been discussed. We have been doing a large amount of data collection in order to make better decisions about where to set limits, to predict equipment purchases, and to understand inventory use. Also, note that inventory size is not a source of the problems you have been seeing, except in a second-order way: large inventories cause us to purchase inventory servers a bit faster. As a result, errors such as the previously discussed transfer script error will impact slightly more people when we rebalance after new equipment is installed.

Inventory backup/local storage

No current project on this either, although most of the groundwork has been laid to allow it, so we could bump the priority for Q3. The design issues here are generally non-technical in nature and are instead related to metadata preservation and permissions. Since backups and local storage are effectively making extra copies of assets, they are a real opportunity to leverage Creative Commons and other licensing schemes that support copying.

Friend lists

The in-world friend list was broken with the 1.15 release. Simulator-level caching of friend data was accidentally removed, resulting in a large increase in load on the backbone. This overloaded the backbone, and agent presence stopped functioning. While debugging the problem, we deactivated the web friend list in order to reduce some of the load. Although some of the backbone problems were fixed last week, a small API change in those fixes again broke the web friend list. That was finally fixed with the web push yesterday.

Despite all those fixes, it appears that a bug still remains whereby the viewer does not receive an accurate friend state snapshot on login. Since the presence information is cached and only updates on changes, this means that until friends log in or out, the information on the viewer could be out of date. We are chasing that bug and any additional information or repeatable test cases would greatly help, so please head over to the Public Jira if you have anything to add.
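The effect of a bad login snapshot on a change-driven cache can be sketched as follows. This is a simplified illustration, not the viewer's code: the FriendPresenceCache class and its method names are hypothetical.

```python
class FriendPresenceCache:
    """Hypothetical viewer-side cache: one snapshot at login, then changes only."""

    def __init__(self, login_snapshot):
        # The snapshot is trusted as-is; nothing re-checks it later.
        self._online = dict(login_snapshot)

    def apply_change(self, friend, is_online):
        # Called only when a friend actually logs in or out.
        self._online[friend] = is_online

    def is_online(self, friend):
        return self._online.get(friend, False)


# Alice is really online, but the login snapshot wrongly says she is not.
cache = FriendPresenceCache({"alice": False, "bob": True})
print(cache.is_online("alice"))     # False: stale, and nothing will correct it yet

cache.apply_change("alice", False)  # Alice logs out
cache.apply_change("alice", True)   # Alice logs back in
print(cache.is_online("alice"))     # True: corrected only by a real state change
```

Because the cache never polls, an inaccurate entry persists until that particular friend changes state, which matches the symptom described above.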

In the long run, presence is another project that we need to build out in a far more scalable manner. A core part of our next-generation design is presence, and like transactions, we are currently reviewing our designs with external experts who have built related systems.

Find

We have an intermittent crash in MySQL related to Find queries, which has caused irregular Find outages. We are chasing the problem but have not been able to solve it yet, since the query in question usually runs without crashing. We are also in the midst of an upgrade to MySQL 5.0 as part of overall infrastructure upgrades. Since 5.0 has superior debugging information, we expect that even if we haven’t found the problem before the upgrade, we’ll have more information about it afterwards.

Moreover, improving in-world search is a major Q2 project for us. We feel this is important not only because it will improve the user experience for everyone but also because search is currently hitting a central MySQL database, so as part of our broad improvements to search, we are also building out a more scalable solution. More details on search are coming soon.

Grid stability and performance

As an aside, problems like teleport failures and inventory issues are not related to either Havok or Mono. While both will bring improvements to individual sim nodes’ performance and stability, they have no appreciable impact on problems related to back-end systems. Havok 4 is in testing prior to hitting the Beta grid, and the Mono project has fixed the major blockers for us, so we are waiting for resources to free up from other projects there.

Teleport failures could be the result of many different problems, and are definitely exacerbated by problems in agent presence. We have a team currently investigating this problem. Again, additional data points and reproducible cases would help them a lot.

Build tools

Our studio focused on live grid problems is working on reproducing the build tool problems raised in the letter. The build tools are critical to content creation, so we hope to get these fixed quickly.

In terms of overall development effort, four of our internal studios are focused on issues in the letter. Currently, of the 54 people in development, program management, QA, and web development, 37 are directly working on these problems, or 69%. We have 5 developers hired who have not started yet, and all of them will go onto these bugs initially, raising it to 42 out of 59, or 71%. Since not all tasks are equally suited to all developers and additional projects need to move forward, I feel this is a very appropriate level of attention.
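For anyone who wants to check the headcount arithmetic, here is a quick calculation using only the figures above, rounded to one decimal place. The helper name share is just for illustration.

```python
def share(working, total):
    """Percentage of staff working on these problems, to one decimal place."""
    return round(100 * working / total, 1)


print(share(37, 54))          # current staff: 68.5
print(share(37 + 5, 54 + 5))  # after the 5 new developers start: 71.2
```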

I hope this helped answer at least some of your questions and I look forward to talking about them more tomorrow.