Mail retrieval was out of service between Mar 8th, 11:35 p.m. and Mar 10th, 8:20 a.m.

The fault was caused by the mechanism that prepares the mail store for a valid backup, which ran on Mar 8th at 11:30 p.m.
For an unknown reason the snapshot function did not succeed. A probable cause is a heavy access load at the time of the snapshot process, which continued for at least 5 minutes.

IRC-IT's network monitoring detected the fault in time.
Because there is no on-call service, the fault could not be cleared before Monday morning.

All queued mail was delivered by 9:35 a.m.

14 Comments

  1. Dear Mr. Schmidt,

    Is there any way that IT support can be extended toward 24/7 support for services like email, network connectivity, etc.? While it is normal that problems occur, and downtimes of 1-2 hours might go unnoticed, I find it frustrating that a downtime of more than a day can happen, even on a Sunday. Many people rely on email for their daily tasks, so a whole university having no email for more than a day causes a real problem. I saw that the Nagios system already reported the IMAP/POP etc. services as critical on Saturday evening. Can we not get on-call support, given that there are 1000+ users who rely on email and other services?

    Thank you for your response!

    Regards,
    Vladislav

    1. Dear Vladislav Marinov,

      I regret the outage of last weekend. We will work on making the automatic scripts more fault tolerant, though the root cause of the incident is not clear yet.

      The university has 24/7 network support by an external contractor, which monitors our network and Internet connectivity and works proactively. Unfortunately, we are not able to extend this service to application-layer services.

      The email system will be overhauled during this year and migrated to a different architecture, leading to fewer issues, more features, and higher availability. See "Project IT Infrastructure Development (PITID) approved" for more information.

  2. Is there a way to prevent Cyrillic characters from being sometimes interpreted as viruses?

    1. I assume you mean spam instead of virus. Looking into the header of an email classified as spam, I see "malformed header" as the reason for the high spam probability, not "Cyrillic character":

      BAD HEADER Non-encoded 8-bit data (char C5 hex) in message header 'From'...

      This means the email doesn't conform to email standards. I do not think that non-Latin characters are an issue if used properly. If not used properly, they indicate spam. But I am only guesstimating here.
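
The "BAD HEADER" message above complains about raw 8-bit bytes in the 'From' header. As a rough illustration of what "used properly" can look like, the sketch below uses Python's standard `email` package to produce a 'From' value whose non-ASCII display name is wrapped in an RFC 2047 encoded word. The function names, the sample name, and the address are hypothetical and are not taken from any mail seen on this system.

```python
from email.utils import formataddr


def encode_from_header(display_name: str, address: str) -> str:
    """Build a From header value that contains only ASCII characters.

    formataddr() wraps a non-ASCII display name in an RFC 2047
    encoded word; a plain ASCII name is passed through unencoded.
    """
    return formataddr((display_name, address), charset="utf-8")


def has_raw_8bit(header_value: str) -> bool:
    """Return True if the header value contains any non-ASCII character."""
    return any(ord(ch) > 127 for ch in header_value)


if __name__ == "__main__":
    # Hypothetical sender, for illustration only.
    raw = "Владислав <user@example.org>"
    encoded = encode_from_header("Владислав", "user@example.org")
    print(has_raw_8bit(raw), has_raw_8bit(encoded))
    print(encoded)
```

Most mail clients do this encoding automatically; a header with raw 8-bit data usually comes from a script or bulk mailer that writes headers by hand, which is presumably why a spam filter treats it as suspicious.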

  3. I think the IT department should switch to Gmail for institutions (http://www.google.com/a/help/intl/en/admins/editions_spe.html), which is free for universities. It is much more reliable and cheaper to sustain.

    Kind regards,
    Alin Iacob

    1. We strive for a university-wide email/communication system, including faculty and admin staff. Hosting the university's emails (and contacts, and documents, and...) outside Germany is not an option. Generating ad revenue for third parties is also not beneficial to the university. Managing email accounts from a central account is also not possible.

      The same reasoning goes for the hosted university communication service Microsoft provides, similar to the Google one.

  4. I don't think Google is a good solution. On the other hand, it seems the university wants to migrate to a Microsoft-based solution, which I strongly believe is not the best decision either.

    Why doesn't the university rely on a Linux-backed infrastructure? Entire governmental institutions are changing their back end to Linux solutions for efficiency and cost reasons.

    1. The current email infrastructure actually is based on Linux.

      Entire governmental institutions, and industries, and universities, are also changing to Microsoft, so the existence of both migration paths is not a reason to follow either one.

      Actually, I am a friend of Linux, Solaris, and FreeBSD. But in our university context, based on my experience of the last years, the upcoming developments from CampusNet, and the effort we have to put into keeping systems and applications running and manageable (including this discussion :)), Windows is the way to go.

      But, after all, it's just infrastructure. The technology it is built on should not have much visible effect on the services it provides, if the same services are compared across both "worlds": an Exchange server can (and will) do IMAP4, POP3, and SMTP, so it won't differ from the client program's perspective regarding plain email.

      The exception is if you want to use additional features which are currently not available but will be soon: Outlook connectivity, ActiveSync for mobile device integration, planning common meetings with admin, faculty, and staff, ... You don't have to use them, but you can.

      The goal is to maximize the common set of IT services for the whole university (including undergraduate and graduate students, faculty, technical and admin staff!), with a minimum of local resources.

  5. Dear Mr. Schmidt, dear IRC-IT team,

    Although all mail server surveillance graphs show no signs of a problem at the moment, and the mailboxes can be accessed (mine by IMAP), it seems that no mail delivery has taken place for approximately two hours. Could you please investigate this observation, which several friends and I have made?
    Furthermore, let me use this post for one question: which authentication and encryption methods do the mail servers (IMAP, POP3 and SMTP) support, and which would be advisable to set up in our mail clients (in terms of SSL, TLS, or nothing)?

    Thank you,

    Thomas Ponath

    1. I can see some kind of a gap in the sent messages graph for the period March 11th, 3-5 p.m.: http://mailhost.jacobs-university.de/mailgraph/mailgraph.cgi
      And it seems that I get some (but not all) emails with about 2 hours of delay.

      Vladislav

      1. The absence of peaks in the graph is just the absence of mailing list posts - but of course, these list posts are delayed as well.

        Actually, the only place where one could see that something fishy is going on is the MX mailgraph at http://hermes.jacobs-university.de/mailgraph/mailgraph.cgi (location is subject to change in the next days).

    2. There was a premature attempt to flood our mail system around 2:30 p.m. Since then, the amount of mail queued for delivery on the MX machine has been decreasing only slowly, because further along the mail path there is an already identified bottleneck: the virus filter/spam scanner.

      I am working on a more scalable setup, which is due to be put into effect Thursday morning. The virus filtering/spam scanning task will then be distributed across multiple machines, increasing the number of mails that can be processed per unit of time.

      Right now the offending mails have been put on hold and will be delivered later. The virus/spam test has been modified slightly to decrease scan times. As far as I can see at 7:40 p.m., higher delivery delays will continue for another two hours.

      Greetings,

      Stefan Schmidt

  6. It is the weekend again and the mail system is down. Nagios says it went down around midnight, and I assume that it won't get fixed before tomorrow morning. We are all aware of the new infrastructure that Mr. Schmidt already pointed to, but until it comes into action, can we hope for more reliability in email? What about actually getting on-call support, be it from a staff member, student assistant, or contractor?

    Best,
    Vladislav

    1. The supporter would need special knowledge of the installation.
      I do lots of things in my spare time, like yesterday, when one of our name service machines crashed. Also, there was a problem with the wireless network controller.
      Before I talk about an on-call service, I will talk about a salary increase.

      From today on, we will forgo consistent backup sets on the mail server. There might be damage to certain mailbox settings, but the system will run more reliably, because the backup preparation step (shutting down the mail system for SMTP, POP3 and IMAP, creating a snapshot of the filesystem, and restarting the mail system) is the critical one.

      Greetings,

      Stefan
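
The backup preparation step described in the comment above (shut down the mail system, snapshot the filesystem, restart the mail system) can be sketched so that the restart always runs, even when the snapshot fails. This is only an illustration under stated assumptions: the actual IRC-IT scripts are not shown on this page and the root cause of the incident is not known. `prepare_backup` and its three callables are hypothetical names standing in for whatever commands the real installation uses.

```python
from typing import Callable


def prepare_backup(
    stop_mail: Callable[[], None],
    create_snapshot: Callable[[], None],
    start_mail: Callable[[], None],
) -> bool:
    """Stop the mail services, take a snapshot, and always restart.

    Returns True if the snapshot succeeded and False if it raised an
    error. The three callables are hypothetical placeholders for the
    site-specific commands.
    """
    stop_mail()
    try:
        create_snapshot()
        return True
    except Exception:
        # The snapshot failed: report it, but do not leave mail down.
        return False
    finally:
        # Runs on success and on failure, so the services come back up.
        start_mail()
```

A caller could skip the backup for that night when the function returns False; the trade-off is one missing backup set rather than a mail system that stays offline until someone intervenes.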