1. mike-clarke

    Recent service interruptions

    Posted on January 12 by mike-clarke

    Disqus has recently been going through a rough patch with service interruptions affecting many users. We’re very sorry about this. Here are the details of what happened and what we’ve been working on.

    As the new year began, we noticed an uptick in the amount of traffic loading Disqus. When traffic ramped up, our servers continued to respond with our typical sub-second response times. However, we began to see periods of lag in our database cluster, which meant that data in one database lagged behind the latest comments and changes by as much as several hours.

    User Impact

    Since last Monday, you might have noticed Disqus behaving in some strange ways. Our support team let moderators and site owners know that they could expect delays when interacting with Disqus. During these periods, users may have seen the following:

    • Comment counts and community profile activity did not match the latest user interactions.
    • Latest analytics data was not immediately available to moderators.
    • Notifications of new comments and reactions were delayed.
    • Recent comments and hot threads widgets did not reflect the latest data.
    • The moderation panel did not contain the latest comments from a forum.
    • Actions performed in the moderation panel did not appear to take effect immediately inside the panel.

    To be clear, there are many areas that were not impacted by the system issues:

    • The latest comments were always displayed to users logged in via Disqus.
    • Actions taken in the moderation panel were applied immediately, even if the value itself appeared delayed in the panel.
    • Moderation actions executed through the embed were immediately applied.
    • Comments that were posted appeared immediately, and no comments were “lost” during periods of delay.

    For techies

    Let’s talk about some technical details. At Disqus, we use PostgreSQL as the database of choice for storing data. To maximize database performance, we use Slony-I as the replication tool to mirror all data changes to read-only slave machines. Beginning last week, we found that the load on these slave machines had reached a critical point that prevented them from applying replicated data changes while also responding to clients. There were two solutions to this problem:

    • Reduce the number of changes that need to be replicated
    • Add capacity to reduce load on the read-only machines
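
To see why either option helps, replication lag can be thought of as a backlog: changes arrive from the master at some rate, and a replica applies them at whatever rate its spare capacity allows. The sketch below is a toy model in Python, not a description of Slony-I internals; the function name `simulate_backlog` and all of the rates are made-up, illustrative values.

```python
def simulate_backlog(incoming_per_min, apply_per_min, minutes):
    """Return the backlog of unapplied changes after each minute.

    Toy model: every minute `incoming_per_min` changes arrive and the
    replica applies at most `apply_per_min` of the pending ones.
    """
    backlog = 0
    history = []
    for _ in range(minutes):
        backlog = max(0, backlog + incoming_per_min - apply_per_min)
        history.append(backlog)
    return history


# Hypothetical numbers, chosen only to illustrate the three cases.
overloaded = simulate_backlog(incoming_per_min=1200, apply_per_min=1000, minutes=60)
fewer_changes = simulate_backlog(incoming_per_min=800, apply_per_min=1000, minutes=60)
more_capacity = simulate_backlog(incoming_per_min=1200, apply_per_min=1500, minutes=60)

print(overloaded[-1])     # backlog keeps growing while arrivals exceed capacity
print(fewer_changes[-1])  # no backlog once fewer changes need replicating
print(more_capacity[-1])  # no backlog once the replica can apply changes faster
```

In this model the backlog grows for as long as arrivals exceed what the replica can apply, which is consistent with lag building up to hours rather than settling at a small constant delay.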

    The easiest way for us to add capacity to these machines was a hardware memory upgrade. On Friday morning (January 7), we attempted to upgrade RAM during off-peak hours. The memory upgrade caused several OS failures during the next 12 hours, specifically at 4:30am, 7:00am, and 5:30pm PST. With one database down, we found that peak traffic could only be served at a degraded response time, and the extra load contributed further to replication lag.

    There are several steps we have taken to prevent something like this from happening again. Specifically:

    • Moved session data (the source of a significant number of data changes) to a separate database cluster, based on PostgreSQL 9 and the streaming replication feature it offers. This was completed yesterday and has eliminated the replication lag seen last week.
    • Restored N+1 database capacity by adding another server to our pool, so that we can absorb a hardware failure. We deployed this additional database server in the last 24 hours, and it gives us an extra layer of redundancy.
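
N+1 capacity means the pool can still serve peak traffic after any single server fails. A minimal sketch of that check follows; the function name `has_n_plus_one` and the capacity figures are hypothetical and are not taken from the incident.

```python
def has_n_plus_one(server_capacities, peak_load):
    """Return True if peak load still fits after losing the largest server."""
    if not server_capacities:
        return False
    # Worst case for a single failure: the highest-capacity server goes down.
    remaining = sum(server_capacities) - max(server_capacities)
    return remaining >= peak_load


# Hypothetical capacities in requests per second; not Disqus figures.
pool = [500, 500, 500]
print(has_n_plus_one(pool, peak_load=900))            # two servers still cover 900
print(has_n_plus_one(pool, peak_load=1200))           # traffic has outgrown the pool
print(has_n_plus_one(pool + [500], peak_load=1200))   # adding a server restores N+1
```

Because the check depends on peak load, a pool that passes today can fail after traffic grows, even if no hardware changes.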

    Lessons Learned

    In light of these problems, we’ve realized that N+1 redundancy is a critical yet moving target, and as our traffic continues to grow, our architecture needs to grow as well. We’re taking a closer look at other areas of our infrastructure to ensure we have sufficient redundancy, as well as continuing to find areas for optimization.

    In the coming weeks we will be launching a new status page to help ensure everyone has a better understanding of what’s going on when these events happen. We plan to include as much information as possible about each service under our umbrella. This should let us communicate future problems much more easily and quickly.

    - Follow @disqus for recent updates and the latest status information.

    If you’re experiencing anything out of the ordinary that was not discussed above, be sure to check our knowledge base for guides and solutions.

  2. giannii

    Updated WordPress plugin 2.43

    Posted on August 12 by giannii

    We’ve made some changes to our WordPress plugin to fix comments not loading for a small minority of people. If you’ve experienced this, we highly recommend updating to the latest plugin, found here: http://wordpress.org/extend/plugins/disqus-comment-system/

    If you have any questions or concerns, do not hesitate to contact our general support.

  3. danielha

    Styling issues fixed

    Posted on July 1 by danielha

    We’ve fixed the issues mentioned in the previous status post.

  4. danielha

    Styling issues appearing on some sites; we’re fixing this now

    Posted on July 1 by danielha

    We’re seeing some Disqus-enabled sites experiencing styling issues. We’re working on the fix now. Sorry about this.