AO3 News

Published: 2012-07-16 16:09:58 UTC

We've been talking a lot recently about how much the AO3 has expanded over the last few months. One easy statistic for us to lay our hands on is the number of registered accounts, but this only represents a tiny portion of site activity. Our awesome sys-admin James_ has been doing some number crunching with our server logs to establish just how much we've grown, and provided us with the following stats (numbers for June not yet available). Thanks to hele for making them into pretty graphs!

Visitors to the AO3

Line graph showing the number of visitors to the AO3 per month, December 2010 to May 2012. The line progresses steadily upwards with a significant spike from 1,197,637 in April 2012 to 1,409,265 in May 2012.

The number of unique visitors to the site has increased almost every month since December 2010 (each unique IP address is counted as one visitor). There are a few points where the rate of increase gets more dramatic: there was a jump of 244,587 across December 2011 and January 2012, compared to one of 137,917 over the two months before that. This can probably be accounted for by the fact that during December and January, holiday challenges such as Yuletide bring more people to the site. This theory is borne out by the slight dip in the number of visitors during February 2012, indicating that some of the extra traffic in the previous two months came from 'drive-by' visitors who didn't stick around.

May 2012 saw a steep increase in the number of visitors: there were 211,628 more visitors to the site than there had been the month before! The rapid increase in visitors was not without its price: this was the month of many 502 errors!

Traffic to the AO3

Line graph showing AO3 traffic in GB per month, December 2010 to May 2012. The line progresses steadily upwards with a significant spike from 2,192 GB in April 2012 to 2,758 GB in May 2012.

The increase in the number of visitors to the site has also been accompanied by an increase in overall site traffic (how much data we're serving up). Again, there's a significant spike during December/January. Interestingly, there's no dip in traffic for February 2012, showing that even though there were some 'one time' visitors over the holiday period, there were also plenty of people who stayed and continued to enjoy fanworks on the site.

The increase in traffic to the site clearly accelerated in 2012. Between January and May 2011, traffic increased by just 159.92 GB; the same period in 2012 saw an increase of 1,870.26 GB! In fact, with an increase of 566 GB during May 2012, that month alone saw almost as big a jump in traffic as the whole of the previous year (595.63 GB)!

And the other stuff

With these kinds of numbers, it's not surprising that there've been a few bumps along the way. For information on how we're dealing with the growth in the site you can check out our posts on performance and growth and accounts and invitations.

Many thanks to our dedicated volunteers for their hard work dealing with the growth of the site, and to our fabulous users for their patience with our growing pains - and for creating the awesome fanworks so many people are flocking here to see!

Published: 2012-07-15 09:27:22 UTC

The demand for AO3 accounts has recently exploded! Unfortunately, the rapid increase in users has also created some site performance issues (you can read more about these and what we're doing about them in our post on Performance and Growth).

We use an invitation system in order to help manage the expansion of the site, and to help guard against spam accounts. Until recently, demand for invitations was low enough that the system didn't result in people waiting for a long time. However, because so many people signed up at the same time, the queue is now really long, and the waiting times are months rather than days or weeks. We know this really sucks for the people waiting for accounts (especially those who are concerned that their work may be deleted from other sites).

We really, really want to cut down on waiting times, but we also need to ensure the site remains stable. It's a bit difficult to tell exactly how much difference more registered users will make, because we know that many of the people waiting for an account are already using the site as logged-out users, so to some extent we're already dealing with their extra load. However, every time someone new creates an account, it adds a little more load on the servers, both because account holders have access to more features (personalisation, subscriptions, history, etc.) and because if they're posting works, they also tend to attract more new people to the site who want to access their fanworks! Logged-in users also have more personalised pages, which makes it harder for us to serve them pages from the cache (serving cached pages puts less load on the site). At the time of writing, there are 56,203 registered users on the site and 28,863 requests in the queue: this means we're looking at adding more than half as many users again. That's a pretty massive potential increase, so much as we'd love to, we can't issue invitations to everyone who wants one right away.

What's the plan for issuing more invitations?

Once we got our big performance issues under control, we cautiously increased the number of new invitations being issued every day from 100 to 150. We didn't make an announcement about this right away in case we needed to decrease it again (although lots of people noticed their wait time had decreased!). However, we've been keeping an eye on the site and it seems to be coping happily with the increase.

We need to install some more RAM (see our post on performance) which should be happening very shortly. Once we've done that we'll increase the numbers being issued to the queue again - by another 50 per day initially, and possibly more after that if we don't see any warning signs that it's causing problems.

Even if we up the number of invitations to 300 per day, some people will still have to wait up to three months. Unfortunately, with this much demand, there's a limit to how quickly we can bring the wait down. :(
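
For the curious, the arithmetic behind that estimate looks like this - a small Python sketch using the figures quoted in this post (it's an illustration, not Archive code):

```python
# Rough wait-time estimate: queue length divided by invitations per day.
def estimated_wait_days(queue_length, invites_per_day):
    # Round up, since a partial day still means waiting for the next batch.
    return -(-queue_length // invites_per_day)

queue_length = 28863  # requests in the queue at the time of writing

for rate in (150, 200, 300):
    days = estimated_wait_days(queue_length, rate)
    print("%d/day -> the back of the queue waits ~%d days" % (rate, days))

# At 300/day this works out to ~97 days, i.e. the 'up to three months' above.
```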

What about invitations for friends?

We used to allow users to request invitations to give out to their friends (just to clear up one source of confusion - user accounts never came with a 'friend invitation' by default, although we did give out some unsolicited ones on our first birthday). However, as the demand increased, we were getting more and more requests of this nature, and it became difficult to keep track of them in a fair way. We decided that in order to make sure we knew exactly how many invitations were being issued, and in order to make it as fair as possible, we'd restrict invitations to the queue only for now. This means that it's 'first come, first served', whether or not you know anyone on the AO3. We know people would really like to build their communities on the site, and we will re-enable the option in the future, but only when we're sure the performance situation allows it.

Can I pay to get an account quicker?

No, not at this time. The AO3 is funded by donations to our parent, the Organization for Transformative Works, but donating doesn't give you an account on the Archive.

When we started the AO3, we decided not to have paid accounts for a few reasons. First of all, we wanted to make it something that you could use whether or not you had any financial resources: we know many fans can't afford paid services or don't have a way of paying for them. Secondly, we wanted to add a layer of protection for fans' real life identities: if you pay for your fan name account with a real name credit card, there is an increased chance that the two identities could be linked (either by accident or via a legal demand for our records; we are committed to fighting for your privacy but can't guarantee that we'd win such a battle). Finally, adding paid features to the Archive itself would have increased the complexity of what our coders had to build, especially if some features were available only to paid accounts. For all these reasons, we decided to fund the AO3 indirectly via donations to the OTW, which couldn't be linked to your account on the Archive, allowed us to provide the site to everyone whether or not they could pay, and let us use existing payment systems rather than coding our own.

It's possible that in the future we may consider some form of paid service, if we develop other ways of dealing with the above concerns. However, it's not something we're considering right now - if it becomes a more realistic possibility in the future, we'll post about it and open a discussion.

I'm worried my works on another site are at risk! What can I do?

We've recently had a lot of direct requests for invitations from users who are worried about their works being deleted from Fanfiction.net. This creates a dilemma for us, because protecting at-risk fanworks is a fundamental part of what we do. However, right now the volume of those requests is so high that there's simply no fair way to prioritise them, which is why we're only issuing invitations via the queue. We're very sorry about this. If you're worried your work is at risk, we recommend you back it up to your own computer (Fanfiction.net has a 'download as HTML' option, or you may wish to use a tool such as Fanfiction Downloader or Flagfic), so that you can upload it to the AO3 or any other site at a later date. You may also wish to manually back up your reviews and comments - sadly these can't be transferred to the AO3 even if you have an account, but you may want to keep a record of this important part of your fan life for yourself.

We're pleased that so many people want to be here, but very sorry we can't accommodate everyone right away. Thank you for your patience and support.

Published: 2012-07-15 09:23:30 UTC

Everyone at the Archive of Our Own has been working hard dealing with the recent site expansion and performance problems. Now that we've been able to deal with the immediate issues, we wanted to give everyone a bit more detail on what's happening and what we're working on.

The basics

Our recent performance problems hit when a big influx of users put pressure on all the bits of the site that weren't optimised for heavy traffic. We were able to make some emergency fixes targeting the most problematic points, which resolved the performance problems for now. However, we know we need to do quite a bit more work to make sure the site is scalable. The good news is there are lots of things we know we can work on, and we have resources to help us do it.

Some users have been concerned that the recent performance problems mean that the site is in serious trouble. However, we've got lots of plans in place to tackle the growth of the site, and we're also currently comfortable about our financial prospects (we'll be posting about this separately). As long as we are careful and don't rush to increase the number of users too fast, the site should remain stable.

The tl;dr details

What level of growth are we experiencing?

The easiest aspect of site growth for us to measure is the number of user accounts. This has definitely grown significantly: since May 1 almost 12,000 new user accounts have been created, which means a 25% increase in user numbers in the past two months. However, the number of new accounts created tells only a small part of the story of the overall increase in traffic.

We know that lots more people are using the site without an account. There are currently almost 30,000 people waiting for an invitation, but even that is a very, very partial picture of how many people are actually visiting the site. In fact, we now have approximately one and a half million unique visitors per month. That's a lot of users (even if we assume that some of those visitors represent the same users accessing the site from different locations)!

A bit about scalability

The recent problems we've been experiencing were related to the increase in the number of people accessing the site. This is a problem of scalability: the requirements of a site serving a small number of users can be quite different to those of a site with a large userbase. When more users are accessing a site, any weak points in the code will also become more of a problem: something which is just a little bit slow when you have 20,000 users may grind to a halt entirely by the time you hit 60,000.

The slightly counterintuitive thing about scalability is that the difference between a happy site and an overwhelmed one can be one user. Problems tend to arise when the site hits a particular break point - for example, a database table getting one more record than it can handle - and so performance problems can appear suddenly and dramatically.

When coding and designing a site, you try to ensure it is scalable: that is, you set up the hardware so that it's easy to add more capacity, you design the code so it will work for more users than you have right now, etc. However, this is always a balancing act: you want to ensure the site can grow, but you also need to ensure there's not too much redundancy and you're not paying for more things than you need. Some solutions simply don't make any sense when you have a smaller number of users, even if you think you'll need them one day in the future. In addition, there are lots of factors which can result in code which isn't very scalable: sometimes it makes sense to implement code which works now and revise it when you see how people are using the site, sometimes things progress in unexpected ways (and testing for scalability can be tricky), sometimes you simply don't know enough to detect problem areas in the code. All of these factors have been at work for the AO3 at one time or another (as for most other sites).

Emergency fixes for scalability

When lots and lots of new users arrived at the Archive at once, all the bits of the site which were not very scalable began to creak. This happened more suddenly than we were anticipating, largely because changes at the biggest multifandom archive, Fanfiction.net, meant that lots of users from there were coming over to us en masse. So, we had to make some emergency fixes to make the site more able to cope with lots more users.

In our case, we already knew we had one bit of code that was extremely UNscalable - the tag filters used to browse lists of works. These were fine and dandy when we had a very small number of works on the Archive, but they had a big flaw - they were built on demand from the list of works returned when a user accessed a particular page. This made them up-to-the-minute and detailed, but it became a big problem once the lists of works returned for a given fandom numbered in the thousands - a problem we were working around, while we designed a new system, by limiting the number of returned works to 1,000. It was also a problem because building the filters on demand meant that our servers had to redo the work every time someone hit a page with filters on it. When thousands of people were hitting the site every minute, that put the servers under a lot of strain. Fortunately, the filters happen to be a bit of code that's relatively easy to disable without hitting anything else, so we were able to remove them as an emergency measure to deal with the performance problems. Because they were such a big part of the problem, doing this had a dramatic effect on the many 502s and slowdowns.
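
To make the problem concrete, here's a much-simplified Python sketch of what 'building filters on demand' means - the data structures are stand-ins, not our actual models:

```python
from collections import Counter

def build_filters_on_demand(works):
    """Old approach, simplified: recount every tag on every page view."""
    counts = Counter()
    for work in works:
        counts.update(work["tags"])  # one pass over every work, every time
    return counts

# With thousands of works per fandom and thousands of page views per
# minute, this recount is repeated from scratch on each request - which
# is exactly the work a precomputed index lets the servers skip.
```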

We also did some other work to help the site cope with more users: largely this involved implementing a lot more caching and tuning our servers so they manage their workload slightly differently. All these changes were enough to deal with the short-term issues, but we need to do more, and more sustained, work to ensure that the site can grow and meet the demands of its users.

Scalability work we're doing right now

We've got a bunch of plans for things which will help scalability and thus ensure good site performance. In the short term (approximate timescales included below) we are:

  • Installing more RAM - within the next week. This will allow us to run more server processes at once so we can serve more users at the same time. This is a priority right now because our servers are running out of memory: they're regularly going over 95% memory usage, which is not ideal! We have purchased new RAM and it will be installed as soon as we can book a maintenance slot with our server hosts.
  • Changing our version of MySQL to Percona - within the next week. Percona is an open-source version of MySQL with additional diagnostic facilities: it will give us more information about what our database server is doing, helping us identify problem spots in the site which we need to work on, and it should also be a bit faster. We've installed Percona on our Test Archive and have been checking that it doesn't cause any unexpected problems - we'll be putting it on the main site in the next week or so. We also hope to draw on the support of the company that produces it (also called Percona).
  • Completing the work on our new tag filters - within the next month. These will (we hope!) be much, much more scalable than the old ones. They'll use a system called Elasticsearch, which is built on the Lucene search library. Because the filters will run against the search index rather than the MySQL database, they'll cut down on a lot of database calls - see the sketch below for the general idea.
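
As a rough illustration of where we're headed, this is the general shape of a filtered query in Elasticsearch, using the official Python client; the index name and fields are hypothetical, not our actual schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a running Elasticsearch cluster on localhost

results = es.search(
    index="works",  # hypothetical index of work documents
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"fandom": "Example Fandom"}},
                    {"term": {"rating": "General Audiences"}},
                ]
            }
        },
        "size": 20,  # one page of results
    },
)

# The filtering happens inside the search index, so listing pages no
# longer need expensive counting queries against MySQL.
```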

Scalability stuff we're doing going forward

We want to continue working on scalability going forward. We've reached a point where the site is only going to get bigger, so we need to be ready to accommodate that. This involves some complex work, so there are a bunch of conversations ongoing. However, this will involve some of the following:

  • Analysis of our systems and code to identify problem spots. We've installed a system called New Relic which can be used to analyse what's going on in the site, how scalable it is, and where problems are occurring. Percona also provides more tools to help us analyse the site. In addition, Mark from Dreamwidth has kindly offered to work with us to take a look at our Systems setup - Mark runs the Systems side of things at Dreamwidth and has lots of experience in scalability issues, so having his fresh eyes on the site's performance will help us figure out the work we need to do.
  • Caching, caching and more caching. We've been working on implementing more caching for some time, and we added a lot more caching as part of our emergency fixes. However, there is still a LOT more caching we can do. Caching essentially saves a copy of a page and delivers it up to the next person who wants to see the page, instead of creating it fresh each time. Obviously, this is really helpful if you have a lot of page views: we now have over 16 million page views per week, so caching is essential. We'll be looking to implement three types:
    • Whole page caching. This is the type we implemented as an emergency fix during the recent performance issues. It uses something called Squid, and it's the best performance saver because it can just grab the whole page with no extra processing. Unfortunately, this can also cause some problems, since we have a lot of personalised pages on the site - for example, when we first implemented it, some people were getting cached pages with skins applied that they hadn't chosen to use. There are ways around this, however, which allow you to serve a cached page and then personalise it, so we'll be working on implementing those.
    • Partial page caching. This is something we already do a lot of - if there are bits that repeat a lot, you can cache them so that everything isn't generated fresh each time. For example, the 'work blurbs' (the information about individual works in a list of search results) are all cached. This uses a system called memcached. We'll be looking to do more, and better, partial caching - there's a small sketch of the idea after this list.
    • Database caching. This would mean using a secondary server to run complex queries and then putting the results on the primary server, so that all the primary server has to do is grab them.
  • Adding more servers. We’re definitely going to need more database servers to manage site growth, and we’re currently finalising some decisions on that. At the moment, it looks like the way we’re going to go is to add a new machine which would be dedicated to read requests (which is most of our traffic – people looking at works rather than posting them) while one of our older machines will be dedicated to write requests (posting, commenting, etc). Once we've confirmed the finer details (hopefully this week), we expect it to take about two months for the new server to be purchased and installed.
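
As promised above, here's a minimal sketch of the partial-page caching idea, in Python; the dict stands in for a memcached client, and the key scheme is an illustration rather than our real code:

```python
import hashlib

cache = {}  # stand-in for memcached's get/set

def cached_blurb(work, render):
    """Return the rendered blurb for a work, reusing a cached copy
    when the work hasn't changed since it was last rendered."""
    # Keying on the id plus last-updated time means editing a work
    # naturally invalidates its stale blurb.
    raw_key = "blurb:%s:%s" % (work["id"], work["updated_at"])
    key = hashlib.md5(raw_key.encode()).hexdigest()
    if key not in cache:
        cache[key] = render(work)  # the expensive part: templates, tag lookups
    return cache[key]
```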

Resources: finances

We'll be posting separately about the financial setup for the AO3, but the key thing to say is that we're currently in a healthy financial state. :D However, as the site gets bigger its financial needs will also get bigger, and we always welcome donations - if you want to donate and you can afford to do so, then donating to the OTW will help us stay on good financial footing. We really appreciate the immense generosity of the fannish community and the support you've already shown us. <3

Resources: people

A lot of supporting the site and dealing with scalability comes down to people. As we grow, we need to ensure we have the people and expertise to keep things running. We are a volunteer-run site, and as such our staff have varying levels of time, expertise, and so on. One important part of expanding slowly is ensuring that we don't get into crisis situations which not only suck for our users (like when the 502s were making the site inaccessible) but also cause massive stress for the people working to fix the problems. So, we're proceeding cautiously to try to avoid those situations.

We've been working hard over the last year or so to make it easier for people to get involved with coding and working on the site. We're happy to say this is definitely paying off: we've had eight new coders come on board during the last few months who have already started contributing code. Our code is public on GitHub, and we welcome 'drive by' code contributions: one thing we'd like to do is make that a bit easier by providing more extensive setup instructions so people who want to try running the code on their own machines can do so.

If you'd like to get more involved in our coding teams, then you can volunteer via our technical recruitment form. Please note that at the moment, we're only taking on fairly experienced people - normally we very much welcome absolute beginners as well, but we're taking a brief break while our established team get some of the performance problems under control so that we don't wind up taking on more people than we can support. We love helping people to acquire brand-new skills, but we want to be sure we can mentor and train them when they join us.

Lots of people have asked whether we'd consider having paid employees. It's unlikely that we'll have permanent employees in the foreseeable future, for a number of reasons (taxes, insurance, etc), but we are considering areas where we would benefit from paid expertise for particular tasks. Ideally, this would enable us to offer more training to our volunteers while targeting particularly sticky sections of code. Paying for help has a lot of implications (most obviously, it would add to our financial burden) and we want to think carefully about what makes sense for us. However, the OTW Board are discussing those options.

We're incredibly grateful to the hard-working volunteers who give their time and energy to all aspects of running the AO3. They are our most precious resource and we would like to take the opportunity to say thanks to all our volunteers, past, present and future. <3

Published: 2012-06-13 11:46:29 UTC

Yet another update from your tireless archive volunteers! James from our Systems Committee has been making adjustments behind the scenes to stabilize the servers and get the most out of our caching, and we've seen some good improvements there. At the same time, we've been working on improving or scaling back the areas of our code where changes will give us the biggest gains.

Filtering

In order to improve performance further, tag filtering on work listing pages is disabled for the time being, until we roll out our new system. You can read more about this change in our post on disabling filters. We know this is an inconvenience for many users, but the filters are really the 800-pound gorilla sitting on top of our database - the works pages are both the most popular and the slowest on the site, which is a bad combination. We've had plans to fix them for a while, and that's underway. However, we need a few more weeks to finish and deploy the upgrade, since it also affects our search engine and quite a lot of our code. Our top priority is to make sure works remain accessible to users, and that new works and feedback can be posted and accessed. Looking carefully at our code and our stats, we concluded that removing filtering was the best way to ensure these goals in the short-term.

You'll still be able to view all the works for a particular tag, view the works for a user or collection in a particular fandom, and use our search feature to refine your results. Our post on disabling filters includes some handy tips to help you find what you're looking for. We hope to have full functionality restored to you soon! As a bonus side effect of this change, we've been able to remove the 1000 work limit on lists of works. This is because without the filters we can rely on the pagination system to limit the amount that we retrieve from the database at one time. So, while you can't filter your results any more, you CAN go through and read every work posted in your fandom! We hope this will compensate a little for the inconvenience.

Work Stats Caching

We've also done more caching of work stats (all the counts of comments, bookmarks, hits, etc.), so you may notice that these update more slowly on index pages now. The information is all still being recorded; we're just waiting a little longer to go get the counts for each work to spread out the load.

People Listings

The alphabetical people listings on the People page weren't actually that useful for finding users, and they were another performance drain.

We've replaced the full alphabetical listing with a listing of 10 random users, and put more emphasis on the search. Note that you can use wildcards in the search, so if you're not sure of someone's name you can enter part of it followed by an asterisk to get similar names. For example, entering Steve* would match Steve_Rogers_lover, SteveMcGarrettsGirl, stevecarrellrocks, etc.
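
If you're wondering how the asterisk behaves, this toy Python snippet mimics it with fnmatch (the real matching happens server-side, so this is just an illustration):

```python
from fnmatch import fnmatch

users = ["Steve_Rogers_lover", "SteveMcGarrettsGirl",
         "stevecarrellrocks", "NatashaR"]

# Lowercasing both sides mimics the case-insensitive match.
matches = [name for name in users if fnmatch(name.lower(), "steve*")]
print(matches)  # all three Steve* names, but not NatashaR
```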

Invitation Requests

We've suspended user requests for additional invitations for now as well. If you need invitations urgently for a challenge or for an archive rescue project, please contact Support. We also fixed an issue that potentially allowed users to snoop on other people's email addresses in the invitation queue.

Thank you!

Thanks to everyone who has been working hard on these issues, especially James, who has put in lots of hours tweaking the servers, and Elz, who has been doing the heavy lifting on code changes. Thanks also to all of you for your patience and understanding while we work!

And finally...

The great news is that so far, this emergency measure does seem to be having a noticeable effect. Our server load has diminished dramatically since we deployed this change:

Graph showing server load, with a mark showing the time of the deploy. The load drops dramatically from this time onwards.

Published: 2012-06-11 12:12:17 UTC

Since last month, we've been experiencing frequent and worsening performance problems on the Archive of Our Own as the site has expanded suddenly and dramatically. The number of new users joining the site doubled between April and May, and we currently have over 17,000 users waiting for an invitation. We've been working hard to deal with the 502 errors and site slowdowns, and we've implemented a number of emergency fixes which have slightly alleviated the issues, but these haven't been as effective as we'd hoped. We're confident that we will be able to fix the problems, but unfortunately we expect the next round of fixes to take at least two weeks to implement.

We know that it's really frustrating for users when the site is inaccessible, and we're sorry that we're not able to fix the problems more quickly. We wanted to give you an update on what's going on and what we're doing to fix it: see below for some more details on the problems. While we work on these issues, you should get better performance (and alleviate the load on the servers) by browsing logged-out where possible (more details below).

Why so many problems?

As we mentioned in our previous post on performance issues, the biggest reason for the site slowdowns is that site usage has increased dramatically! We've almost doubled our traffic since January, and since the beginning of May the pace of expansion has accelerated rapidly. In the last month, more than 8,000 new user accounts were created, and more than 31,000 new works were posted. This is a massive increase: April saw just 4,000 new users and 19,000 new works. In addition to the growing number of registered users, we know we've had a LOT more people visiting the site: between 10 May and 9 June we had over 3,498.622 GB of traffic. In the past week, there were over 12.2 million page views - this number only includes the ones where the page loaded successfully, so it represents a lot of site usage!

This sudden and dramatic expansion has come about largely as a result of changes at Fanfiction.net, which has recently introduced more stringent enforcement of its policies on explicit fanworks, with the result that some fans are no longer able to host their works there. One of the primary reasons the AO3 was created was to provide a home for fanworks at risk of deletion elsewhere, so we're very keen to welcome these new users, but in the short term this does present us with some challenges!

We'd already been preparing for site expansion and identifying areas of the site which needed work in order to ensure that we could grow. This means some important performance work has been ongoing; however, we weren't expecting quite such a rapid increase, so we've had to implement some changes on an emergency basis. This has sometimes meant a few additional unexpected problems: we're sorry if you ran into bugs while our maintenance was in progress.

What we've done so far

Over the last week, our sys-admins and coders have implemented a number of changes designed to reduce the load on the site:

  • Implemented Squid caching for a number of the most performance-intensive places on the site, including work index pages. For the biggest impact, we focused on caching the pages which are delivered to logged-out users. This is because all logged-out users usually see the same things, whereas logged-in users might have set preferences (e.g. to hide warnings) which can't be respected by the cache. We initially implemented Squid caching for individual works, but this caused quite a few bugs, so we've suspended that for now while we figure out ways of making it work right. (You can read more about what Squid is and what it does in Release Notes 0.8.17.)
  • Redistributed and recalibrated our unicorns (the server processes which receive requests and deliver the data) to make sure they're focused on the areas where we need them most. This included setting priorities on posting actions (so that you're less likely to lose data when posting or commenting - see the sketch after this list), increasing the number of unicorns, and adjusting the time they wait for an answer.
  • Simplified bookmark listings, which were using lots of processing power. We'll be looking into revamping these in the future, but right now we've stripped them back to the basics to try to reduce the load on the site.
  • Cached the count of guest kudos so the number doesn't have to be fetched from the database every time there are new kudos (which was putting a big strain on the servers).
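
For those curious about the prioritisation mentioned above, here's a simplified Python sketch of the idea: posting actions jump ahead of browsing requests, so a slow patch is less likely to eat someone's comment. (This illustrates the concept only; our actual setup lives in the server configuration.)

```python
import heapq

POST, BROWSE = 0, 1  # lower number = served first
queue, counter = [], 0

def enqueue(request, is_post):
    global counter
    # The counter keeps equal-priority requests first-come, first-served.
    heapq.heappush(queue, (POST if is_post else BROWSE, counter, request))
    counter += 1

enqueue("GET /works/123", is_post=False)
enqueue("POST /works/123/comments", is_post=True)

print(heapq.heappop(queue)[2])  # the comment POST comes out first
```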

Implementing these changes has involved sustained work on the part of our sys-admins, coders and testers; in particular, the Squid caching involved a great deal of hard work in order to set up and test. Several members of the team worked through the night in the days leading up to the weekend (when we knew we would have lots of visitors) in order to implement the performance fixes. So, we're disappointed that the changes so far haven't done as much as we'd hoped to get rid of the performance problems - we were hoping to be able to restore site functionality quickly for our users, but that hasn't been possible.

What we're going to do next

Although the emergency fixes we've implemented haven't had as much impact as we'd hoped, we're confident that there are lots of things we can do to address the performance problems. We're now working on the following:

  • New search and browse code. As we announced in our previous post on performance issues, we've been working for some time on refactoring our search and browse code, which is used on some of the most popular pages and needs to be more efficient. This is almost ready to go -- in fact, we delayed putting it onto our test archive in order to test and implement some of the emergency fixes -- so as soon as we've tested it and verified that it's working as it should, we will deploy this code.
  • More Squid caching. We weren't able to cache as many things as we'd initially hoped because the Squid caching threw up some really tricky bugs. We're continuing to work on that and we'll implement more caching across the site once we've tested it more thoroughly.
  • More servers. We're currently looking at purchasing a more robust database server and moving our old database server (aka 'the Beast') into an application slot, giving us three app servers. We'll also be upgrading the database software we use so that we can make the most of this server power.

When we'll be able to implement the fixes

We're working as fast as we can to address the problems -- we poured all our resources into the emergency fixes this week to try to get things up and running again quickly. Now that we've implemented those emergency fixes, we think that we need to focus on making some really substantive changes. This means we will have to slow down a little bit in order to make the bigger changes and test them thoroughly (to minimise the chances of introducing new bugs while we fix the existing problems). Buying servers will also take us some time because we need to identify the right machines, order them and install them. For this reason, we expect it to take at least two weeks for us to implement the next round of major fixes.

We're sorry that we're not able to promise that we'll fix these problems right away. We're working as hard as we can, but we think it's better to take the time to fix the problems properly rather than experimenting with lots of emergency fixes that may not help. Since the AO3 is run entirely by volunteers, we also need to make sure we don't burn out our staff, who have been working many hours while also managing their day jobs. So, for the long term health of the site as a whole, we need to ensure we're spending time and resources on really effective fixes.

Invitations and the queue

As a result of the increasing demand for the site, we're experiencing a massive increase in requests for invitations: our invitations queue now stands at over 17,000. We know that people are very disappointed at having to wait a long time for an invitation, and we'd love to be able to issue them faster. However, the main reason we have an invitations system for creating accounts is to help manage the growth of the site -- if the 17,000 people currently waiting for an invitation all signed up and started posting works on the same day, the site would definitely collapse. So, we're not able to speed up issuing invitations at this time: right now we're continuing to issue 100 invitations to the queue each day, but we'll be monitoring this closely and we may consider temporarily suspending issuing invitations if we need to.

Until recently, we were releasing some invitations to existing users who requested them. However, we've taken the decision to suspend issuing invitations this way for the present, to enable us to better monitor site usage. We know that this will be a disappointment to many users who want to be able to invite friends to the site, but we feel that the fairest and most manageable way to manage account creation at present is via the queue alone.

What can users do?

We've been really moved by the amount of support our users have given us while we've been working on these issues. We know that it's incredibly annoying when you arrive at the Archive full of excitement about the latest work in your fandom, only to be greeted by the 502 error. We appreciate the way our users have reached out to ask if they can help. We've had lots of questions about whether we need donations to pay for our servers. We always appreciate donations to our parent Organization for Transformative Works, but thanks to the enormous generosity fandom showed in the last OTW membership drive, we aren't in immediate need of donations for new servers. In fact, thanks to your kindness in donating during the last drive, we're in good financial shape and we're able to buy the new server we need just as soon as we've done all the necessary work.

As we've mentioned a few times over the weekend, we can always use additional volunteers who are willing to code and test. If this is you or anyone you know, stop by Github or our IRC chat room #otw-dev!

There are a few things users can do when browsing which will make the most of the performance fixes we've implemented so far. Doing the following should ease the pressure on the site and also get you to the works you want to see faster:

  • Browse while logged out, and only log in when you need to (e.g. to leave comments, subscribe to a work, etc). Most of our caching is currently working for logged-out users, as those pages are easier to cache, so this will mean you get the saved copies which come up faster.
  • Go direct to works when you can - for example, follow the feeds for your favourite fandoms to keep up with new works without browsing the AO3 directly, so you can click straight into the works you like the sound of.

Support form

Our server problems have also made it difficult to access our support form at times. If you have an urgent query, you can reach our Support team via the backup Support form. It's a little more difficult to manage queries coming through this route, so we'd appreciate it if you'd avoid submitting feature requests through this form, to enable us to keep on top of bug reports. Thanks!

Thank you

We'd like to say a big, big thank you to all our staff who have been working really hard to address these problems. A particular shoutout to James, Elz, Naomi and Arrow, who have been doing most of the high level work and have barely slept in the last few days! We're also incredibly grateful to all our coders and testers who have been working on fixing issues and testing them, to our Support team, who have done an amazing job of keeping up with the many support tickets, and to our Communications folk who've done their best to keep our users updated on what's going on.

We'd also like to say a massive thank you to all our users for your incredible patience and support. It means so much to us to hear people sending us kind words while we work on these issues, and we hope we can repay you by restoring the site to full health soon.

A note on comments: We've crossposted this notice to multiple OTW news sites in order to ensure that as many people see it as possible. We'll do our best to keep up with comments and questions; however, it may be difficult for us to answer quickly (and on the AO3, the performance issues may also inhibit our responses). We're also getting lots of traffic on our AO3_Status Twitter! Thanks for your patience if we don't respond immediately.

Published: 2012-06-09 05:42:50 UTC

Welcome to our third release this week! Elz, James, and Naomi contributed code to this release, and Ariana, bingeling, Enigel, Jenn, and Kylie from our testing teams worked it over. Our sysadmins and coders have done more work to address the performance issues that have been affecting the Archive, along with several other bugfixes.

PLEASE NOTE: in the name of drastically improving performance, this deploy may have a few side effects that appear at first to be errors or confusing! Please do read over these release notes and make sure that they don't cover a problem you are experiencing before you contact support.

Further efforts to battle the 502 errors!

This release includes caching of most pages for guests using Squid! Squid will serve up saved versions of pages without hitting our database or application, which increases speed and decreases server load for everyone.

The tricky part is making this work with all of the dynamic elements of the site: skins, content that gets updated by users, personalized messages, etc. We decided to turn Squid on quickly to keep the Archive running smoothly, but we'll be working on finding the right balance between customization and performance as we go forward, so you may see some tweaks to different aspects of the site as we fine-tune this.

Current issues related to the caching:

  • Site skins have been disabled for logged out users for the time being - if you rely on this feature for accessibility needs, please contact support and we will get you an account ASAP so you can use the skins again.
  • Comments and kudos from guests may not show up at once for other guests. When a guest leaves kudos or posts a comment, they will see their comment/kudos added. If another guest then visits that same page (or the same guest reloads the page), however, they will see the most-recently-cached version, which may not yet include the new comment or kudos.
  • Guests may occasionally see a stray error message or notice at the top of a page that doesn't seem related to anything they've done. We are working to track all of these errors down, but it is hard to be sure we've gotten them all. The messages should not affect using the Archive.
  • Hits that are handled by Squid (most hits from guests) will not appear in the hit count immediately. The hit counts will be updated once a day from the Squid logs (there's a small sketch of the idea after this list).
  • Duplicate hits from the log files (for instance on page reloads by the same guest) will no longer be removed because of technical limitations, so hit counts may increase more quickly in some cases.
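
The once-a-day hit-count update works roughly like the following Python sketch: tally work page requests out of the Squid access log, then add the totals to the stored counts. (The log path, URL pattern, and update function are assumptions for illustration.)

```python
import re
from collections import Counter

WORK_URL = re.compile(r"GET /works/(\d+)\b")

def tally_hits(logfile):
    """Count hits per work id in one day's Squid access log."""
    hits = Counter()
    with open(logfile) as log:
        for line in log:
            match = WORK_URL.search(line)
            if match:
                hits[int(match.group(1))] += 1
    return hits

# daily = tally_hits("/var/log/squid/access.log")  # hypothetical log path
# for work_id, count in daily.items():
#     increment_hit_count(work_id, count)  # stand-in for the database update
```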

Squid will be enabled after we update the code, so you may not notice any changes right away.

For those interested in knowing more about Squid, see the detailed explanation below!

Changes to Subscription emails

We've gotten feedback about how people use their subscription emails, and in response we have adjusted the subject lines and message content to make the contents easier to identify. Each email will now bundle subscriptions of one type (author, series, or work), with the subject line giving the name or title of the first item together with the number of other updates.
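
In other words, the subject line follows a simple rule, sketched here in Python (the function and names are illustrative, not our actual mailer code):

```python
def subscription_subject(authors, items):
    """items: titles of one subscription type, e.g. '"Title" in Series'."""
    subject = "[AO3] %s posted %s" % (", ".join(authors), items[0])
    if len(items) > 1:
        subject += " and %d more" % (len(items) - 1)
    return subject

print(subscription_subject(["an_author"], ['"A Work Title"', '"Another One"']))
# [AO3] an_author posted "A Work Title" and 1 more
```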

Details

  • Subscriptions:
    • Email subjects will now say [AO3] instead of [Archive Of Our Own].
    • Subscriptions will be bundled by type with subject lines of the form [authors] posted [first item] and [#] more, where first item will be one of: [Work Title], [Chapter Title] of [Work Title], [Work Title] in [Series Name].
  • Performance:
    • Skin chooser is turned off for logged out users.
    • Nearly all pages will be cached for logged out users.
    • Comment forms and other forms that collect data from logged-out users will have their details remembered in cookies and filled in by JavaScript, rather than remembered in the page.
  • Bug Fixes:
    • 500 errors were appearing on some work listings because of an interaction between caching and time zone conversion - this should be fixed now.

Details About Squid

Senior coder Ana has written up some helpful information about Squid for those who are curious:

"Squid is a really powerful tool that does a lot of things, but we’re using it primarily as one thing: a reverse-proxy cache. A reverse-proxy cache is a system designed to cache (that is, store copies of) web pages. It sits between users’ requests and the rest of the site and stores the responses to some requests so that instead of making the server build the page from scratch again, Squid can check to see if someone’s looked at that page recently and pass on the cached version. This is really useful when you want to send the same page to lots and lots of users because it means that instead of forcing the servers to generate the pages over and over, we can store a copy and give that copy out to everyone.

Of course, sometimes pages change: an author edits a story, or someone leaves kudos, so you don’t want to let Squid keep those copies around forever. Right now we let Squid keep copies for 20 minutes, and then it throws them away and gets a new one. This feels like the right balance between keeping things up to date, but not overloading the servers.

In addition, logged in users get customization on every page, in the form of the user bar at the top of the page if nothing else, which means that we don’t want Squid to store or give pages to logged in users. If it did, then every user would see the user bar for whoever made the request that Squid saved, and it would only change every twenty minutes.

This same principle holds true for all on-page customization (such as the skin-chooser), and finding the right balance between customization and cacheability (how suitable a page is for storing and giving out to everyone) is going to be an ongoing project as we try to weigh site performance against nifty features and information."
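
To make Ana's description concrete, here's a stripped-down Python sketch of the caching rule she describes: guests get a stored copy for up to twenty minutes, and logged-in users always bypass the cache. (Squid itself is far more sophisticated; this is only the shape of the idea.)

```python
import time

CACHE_TTL = 20 * 60   # twenty minutes, as described above
cache = {}            # url -> (stored_at, page)

def fetch(url, cookies, build_page):
    if "session" in cookies:       # logged in: always build fresh, so
        return build_page(url)     # personal user bars never leak to others
    entry = cache.get(url)
    if entry and time.time() - entry[0] < CACHE_TTL:
        return entry[1]            # cache hit: no app or database work
    page = build_page(url)         # cache miss (or stale): rebuild and store
    cache[url] = (time.time(), page)
    return page
```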

These release notes written and compiled by Ana, Claudia, Elz, Enigel, Jenn, Lucy, and Naomi.

Published: 2012-06-07 19:21:07 UTC

Mini-release notes: Battling the 502 errors and responding to user feedback

Our coders and sys-admins have been working hard to deal with the performance issues we've been experiencing over the last few weeks. Releases 0.8.15 and 0.8.16, only a few days after deploying 0.8.14, see the introduction of several tweaks to our code and server setup that should help alleviate the site slowness we discussed in our recent post, AO3 performance issues.

New code by andreja, Ariana, Elz, Enigel, and Naomi. Tested by Elz, Enigel, Jenn, Kylie, Sarken.

Performance fixes

As noted on our Known Issues page, one of the major bottlenecks was the tag listings for very popular fandoms, which would either load very slowly or throw up 502 errors. We've added caching for the first five pages of results - these will expire when a work using that tag is posted or revised, so the listings will still be up to date.
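
In outline, the caching behaves like this Python sketch: the first five pages per tag are stored, and saving a work expires the pages for its tags (names here are illustrative, not our actual code):

```python
CACHED_PAGES = 5
listing_cache = {}  # (tag, page) -> rendered listing

def tag_listing(tag, page, render):
    if page > CACHED_PAGES:
        return render(tag, page)          # deeper pages aren't cached
    if (tag, page) not in listing_cache:
        listing_cache[(tag, page)] = render(tag, page)
    return listing_cache[(tag, page)]

def work_saved(work_tags):
    """Called when a work is posted or revised: expire its tags' pages."""
    for tag in work_tags:
        for page in range(1, CACHED_PAGES + 1):
            listing_cache.pop((tag, page), None)
```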

Another significant bottleneck was the bookmarks listings, both the main one and the bookmarks on tags. We've simplified their functionality while we look into ways of reworking the bookmarks for much better performance.

With the help of New Relic, a web service that monitors and analyzes site performance in great detail, we found another source of slowness: works with a large amount of (guest) kudos. Showing an updated kudos count for every work as it was accessed was putting an undue strain on the server, so for now the number of kudos on a work will be cached (i.e. not fetched from the database in real time) and only updated every five minutes.

As per a user suggestion (thank you, fydyan), we also looked into ways to prioritize certain user actions over others, so that trying to post a work or a comment is less likely to throw up an error (potentially taking all your input with it) than, say, accessing a user profile or browsing the site. Many thanks to Sidra from Systems for implementing this!

Invitation emails

We had a bug which was preventing notification emails for invitations from being sent. We've fixed this bug, and the emails which were affected have been resent. This may result in some people receiving their invitations twice - invitations can only be used once, so please take note of the invitation code and don't pass the extra email along to a friend if it's a code you've already used!

Plans for our next few deploys

We're continuing to work hard on performance fixes and will soon be implementing much more caching across the site. We'll be posting with more details about this shortly.

We've had lots of feedback about our recent changes to notification emails. Unfortunately, we cannot roll back these changes for performance reasons (for more details see our post on Email changes and USER STATS!). However, based on the feedback we've received so far, we will be adjusting the way subscription emails are batched and labeled, hopefully in the next deploy. Thanks for bearing with us while we work to improve this!

Details

  • Bug fixes:
    • Invite emails were not being sent; this has now been fixed and the delayed ones resent.
    • The work count on the front page used to vary depending on whether the user was logged in; it now shows the total of all published works.
    • Previewing a work before posting made the wordcount not show; this has been fixed.
  • Improvements:
    • Kudos counts on works are now cached, but will update immediately when logged-in users add new kudos; guest kudos counts will update every five minutes. This puts less load on the database.
    • Works listings under tags are now cached up to the fifth page. The listings will update when a work with that tag is posted or revised.
    • We've removed the grouping of bookmarks by work. Previously, the listings also tried to fetch bookmarks via tag synonyms, or bookmarks that weren't using the tag directly. The main bookmarks listing now grabs the most recent public bookmarks on the site, and the tag listings show the bookmarks that are tagged directly with the requested tag.
    • Text on the collections page has been changed to clarify what characters are allowed in collection names.
    • Small change to text on the tag edit pages: the label for the synonym field was changed to "Choose an existing tag or add a new tag name here to create a new canonical and make this tag its synonym." for the sake of clarity.

These release notes written and compiled by mumble, Enigel, Jenn and Lucy.

Published: 2012-06-01 06:38:41 UTC

As pretty much all of our users have no doubt noticed, we've been experiencing some problems with Archive loads: slowdowns and the appearance of the dreaded 502 page have become a regular occurrence. We're working on addressing these issues, but it's taking longer than we'd like, so we wanted to update you on what's going on.

Why the slowdowns?

Mostly because there's so much demand! The number of people reading and posting now is overwhelming - we're glad so many people want to be here, but sorry that the rapid expansion of the site is making it less functional than it should be.

We now get over a million and a half pageviews on an average day, often clustered at peak times in the evening (particularly when folks in the Western Hemisphere are home from work and school) - we were using a self-hosted analytics system to monitor site traffic, and we had to disable it because it was too overloaded to keep up. The traffic places high demands on our servers, and you see the 502 errors when the systems are getting more requests than they can handle. Ultimately we'll need to buy more servers to cope with rising demand, but there's ongoing work that we've done and need to continue to do to make our code more efficient. We've been working on long-term plans to improve our work and bookmark searching and browsing, since those are the pages that get the most traffic; right now, they present some challenges because they were designed and built when the site was much smaller. We've learned a lot about scaling over the years, but rewriting different areas of the code takes some time!

What are you doing to fix it?

Our Systems team are making some adjustments to our server setup and databases. Their first action was to increase the amount of tmp space for our MySQL database on the server - this has alleviated some of the worst problems, but doesn't really get at the underlying issues. They're continuing to investigate to see if there are additional adjustments we can make to the servers to help with the problems.

We're also actively working on the searching and browsing code: that's been a big project, and it will hopefully make a significant impact. Because it affects a lot of crucial areas of the site, we want to make sure we get everything right and do as much testing as we can to ensure that performance is where it needs to be before we release it. We're switching from the Sphinx search engine to Elasticsearch, which can index new records more rapidly, allowing us to use it for filtering. That will offer us more flexibility, get rid of some of our slower SQL queries, and take some pressure off our main database, and it also has some nice sharding/scaling capabilities built in.

We also try to cache as much data as we can, and that's something we're always looking to improve on. Systems and AD&T have discussed different options there, and we'll be continuing to work on small improvements and see what larger ones we may be able to incorporate.

When will it be fixed?

It's going to take us a few weeks to get through all the changes that we need to make. Our next code deploy will probably be within the next week - that will include email bundling of subscription and kudos notifications, so that we can scale our sending of emails better as well. After that, we'll be able to dedicate our resources to testing the search and browsing changes, and we're hoping to have that out to everyone by the end of June. We rely on volunteer time for coding and testing, so we need to schedule our work for evenings and weekends for the most part, but we're highly motivated to resolve the current problems, and we'll do our best to get the changes out to you as soon as we can.

Improving the Archive is an ongoing task, and after we’ve made the changes to search and browse we’ll be continuing to work on other areas of the site to enable better scalability. We’re currently investigating the best options for developing the site going forward, including the possibility of paying for some training and/or expert advice to cover areas our existing volunteers don’t have much experience with. (If you have experience in these areas and time to work closely with our teams, we’d also welcome more volunteers!)

Thanks for your patience!

We know it's really annoying and frustrating when the site isn't working properly. We are working hard to fix it! We really appreciate the support of all our users. ♥
