Saturday, March 23, 2013

Analyzing wtmpx on Sun Ray servers (part 2)

Following the previous blog entry in this article I'll take a deeper look into wtmpx of a Sun Ray server over time.

At my former company there were hundreds of Sun Ray servers deployed all over the world. In this example I picked one to show a couple of ways what to do with wtmpx data of Sun Ray servers.
Looking at wtmpx data one could get a first impression on how well Sun Ray servers are being used. Of course the sheer existence of a user session does not say much about what is being done in a session, which applications are being run by the user, is the user typing at all or is only the screenlock running but it is a first picture of what is going on on a machine.

All of the graphs below have been created out of one wtmpx file on one particular server in Europe (I have disguised the name).

In the graphs below I'll show certain takes on the data:

  • all sesssions and their durations (displayed as vectors. I chose two colours simply for illustration purposes, red and blue don't have any other meaning).
  • the count of running sessions on a particular day
  • Histograms to show the number of sessions which lasted a particular time. I chose time ranges 1-10, 11-20 etc. and an additional '1' which contains all sessions lasting less than one day.

    I am showing in three tables the graphs

  • for one full year April 2006 - March 2007
  • for one month in that year (December 2006)
  • for one week in that year (first week of December 2006)

    Year view
    April 2006 - March 2007
    All displays
    Each Sun Ray session is assigned a DISPLAY number. Looking at the distribution of DISPLAY numbers the graphs shows a not quite even distribution. From time to time there are gaps. Few high numbered DISPLAYs have been chosen and the assignment is not continuous. Particularly interesting to me are the few above 120. How did they get into place at all?
    This would require some digging into the Sun Ray server details, something I haven't done.
    Displays
    0-100 only

    Same as before but restricted to DISPLAY numbers up to 100.
    Count
    In the latter half of 2006 more Sun Ray servers were deployed in this location which explains the count decrease after August.
    Each week shows a fuzzy peak which is due to additional sessions started and stopped during that week.
    Histogram
    Very clearly the majority of sessions did last less than one day.
    But there were also some very long lasting sessions (several months, look at the 101-110 bin in the next histogram) and in fact there is no technical reason to login/logout the same day: it was company policy to advise people to logout at the end of their working day, one of the reasons being not to leave behind unsaved data in open windows in case one of the rare event of a server crash. Another reason was that some applications did show memory leaks over time (e.g. web browsers) and in order to prevent that one would have needed to restart these applications regularly. The easier advice was to restart the session and thus restart the applications automatically.
    A good portion of the employees did not follow the daily logout advice as can be seen in the next graph.
    Histogram
    more than 10 days

    This shows the sessions which took longer than 10 days (the previous graph without the first two columns).

    As one can see the Sun Ray servers first of all could run for years without having to reboot and could sustain user sessions equally long. The longest user session I could find (not on this machine) was about 400 days i.e. well over a year. In my experience there were more reboots caused by power issues in the building or server room reorganizations than by actual machine or software failures (something which I couldn't say for my Windows machines nor my Macs).

    Month view
    December 2006
    Displays
    0-100 only

    In the month view the sessions (display numbers and duration) are better viewable and distinguishable than in the year view. One can clearly see numerous short lived sessions, weekly sessions but also longer sessions going beyond the end of the month.
    Count
    In this view the weekly pattern becomes more obvious. Above a certain level of seemingly permanent sessions each working week day shows some peaks. At the end of the month after December 23rd there is a significant drop in session activity (many people seem to have logged out for Christmas) but the last working days in December (27, 28 and 29, the 30th was a Saturday in 2006) show similar peaks as before.
    Histogram
    The histograms show the same pattern than in the year view.
    Histogram
    more than 10 days

    Week view
    December 1-8 2006
    Displays
    0-100 only
    Count
    December 2 and 3 2006 were a weekend.
    On December 4 one can see the session increase at around a third of the day column which is approximately 8:00. Later that day there is a drop (I would say at around 17:00) but not to the level of the weekend so out of all the new sessions at the start of the week some logged out again and others stayed logged on.
    At Friday December 7 in the afternoon the number of sessions reaches approximately the level of the weekend before.

    A few more interesting things: during the weekend December 2-3 there is also a slight fluctuation, why? There is a small chance that people had come to the office to do some work but a better explanation is that Sun Microsystems had established at that time a functioning "work from home" model which included provisioning employees with Sun Ray clients at home. Running a user session on a Sun Ray server in any location and accessing it from home was standard working practice, in my case server and home were hundreds of kilometers apart and even situated in different countries. So going to your home Sun Ray client and either starting a new session or switching to your existing one could be done and some people did so over the weekend when needed.
    In this example the weekend fluctuation is two or three sessions.

    Histogram
    Same patterns than in the year and month view: many single day sessions have been run and a few longer ones have been started in this week.

    All of the above relate to one machine. What one could do now is of course a broader analysis. Compare this machine to the other machines in the same location. Compare this location with other locations in the same country and in other countries.
    What are the differences? What are the similarities?
    Increase the time range and look at several years.
    If there are any changes what factors could have influenced them?
    Taking into account people data one could introduce groupings (e.g. by department, office building, ...) and look at these groups separately and again try to compare, find patterns etc.

    About the gnuplot code

    Each of the three types of graphs had a different data source.
  • display duration: the data file as explained in the previous blog
  • count and histogram: their files were generated by parsing the original data file with a little awk

    The graphs with display number limits or different periods of time were simply using new yrange or xrange settings (and of course different titles and output file names).

    The histogram graph for durations greater than 10 used the histogram data file minus the first two lines.

    Code for counts based on data files like this (timestamp and count):

    20060405130501 35 
    20060405135502 34 
    20060405150057 33 
    20060405151213 32 
    20060405152449 33 
    20060405164318 34 
    20060405170129 33 
    20060405170954 32 
    20060405172414 31 
    20060405173537 30 
    
    
    set title "wtmpx - server name\nSession count\n1 year"
    set key off
    set ylabel "Number of sessions"
    set yrange [0:99]
    set timefmt "%Y%m%d%H%M%S"
    set xlabel "Date"
    set xdata time
    set xrange ['20060401000000':'20070401000000']
    set format x "%Y\n%m/%d"
    set output "filename_count.png"
    plot '...filename...' using 1:2 with lines lc rgb "#800000"
    

    Code for histograms based on data files like this (bins and frequencies):

    1 43 
    1-10 24 
    11-20 10 
    21-30 1 
    31-40 1 
    41-50 0 
    51-60 1 
    61-70 0 
    71-80 0 
    
    set title "wtmpx - server name\nHistogram\n1 year"
    set key off
    set ylabel "Frequencey"
    set xlabel "Duration of session\n(in n days or less)"
    set style data histograms
    set style fill solid border -1
    set output "filename_histo.png"
    plot '...filename...' using 2:xtic(1) lc rgb "navy"
    
    

    All of this was done in gnuplot 4.4 on a Solaris 10 machine.

  • Friday, March 22, 2013

    Analyzing wtmpx on Sun Ray servers (part 1)

    Sun Rays are Sun Microsystems (now Oracle) thin clients and they basically consist of a Sun Ray server (which is actually not a particular type of hardware but a piece of software which can run on various platforms. In this article Sun Ray server will refer to both the machine and the software, the context should make it clear.) and the respective Sun Ray clients (which used to be a particular device which came in various flavours and later was complemented by a soft client in order to give e.g. notebook or tablet users a virtual option).

    When users are logging into the Sun Ray server the respective entries in utmpx and wtmpx need to be created and there is a distinction between logins coming from a Sun Ray client (hard or soft) or other sources (e.g. console or remote logins).

    In the next two articles I will take a deeper look into the wtmpx entries created by Sun Ray clients.

    How to identify a Sun Ray client login

    When looking at the members of 'struct futmpx' in /usr/include/utmpx.h there are two entries which identify a Sun Ray client login.
    FIrst of all ut_line is set to dtlocal and secondly the ut_host entry contains the name of the DISPLAY i.e. a colon and a number e.g. ':21'.
    An entry in wtmpx is created for each login but also for each logout action.
    Aside from user name, date and time etc. the login entry is identified by ut_type being set to 7 (USER_PROCESS) and the corresponding logout enty has ut_type set to 8.
    Here is the example of a login/logout pair of lines for user 'joes' but of course these lines are not adjacent since many more entries have been happening after the login (the example is the output as shown by fwtmp).
    joes   dt6q  dtlocal  40881  7 0000 0000 1164719643 0 0 4  :21  Tue Nov 28 14:14:03 2006
    joes   dt6q  dtlocal  40881  8 0000 0000 1167177826 0 0 4  :21  Wed Dec 27 01:03:46 2006
    
    i.e. user 'joes' logged in on Nov 28 with process id 40881 and his DISPLAY was assigned ':21'. He logged out again about one month later.

    There are a couple of cases when no corresponding logout entry can be found in wtmpx: the server crashed unexpectedly (quite rare) or the admins have set up regular backup and re-init of wtmpx files (very often retained as wtmpx.1, wtmpx.2 a.s.o.) so the login entry would be in another file than the logout entry. The code in this article will assume that corresponding login/logout pairs can be found in one file.

    My code will now scan wtmpx for all login/logout pairs of Sun Ray entries and create a data file which can be used by gnuplot to visualize the findings.
    The graph will show the timeline on the x-axis and the DISPLAY numbers on the y-axis.
    A login/logout pair will be represented by a vector which starts at login time and ends at logout time, the length of the vector being the duration of the session. The example above does look like this.

    joes  21    20061128141403 20061227010341 682.83
    
    It contains the username, the DISPLAY number, start and end time and the duration in hours. The graph would be this:

    And here is the gnuplot code:
    set title 'wtmpx - Example'
    set key off
    set grid
    
    set ylabel 'Display number'
    set ytics nomirror
    set yrange [0:30.5]
    
    set timefmt "%Y%m%d%H%M%S"
    set xlabel "Date"
    set xdata time
    set xrange ['20061101000000':'20070101000000']
    set format x "%Y\n%m/%d"
    
    set terminal png small size 600,300
    set output "Example.png"
    
    set style arrow 1 head filled size screen 0.01,20,65 ls 1 lc rgb "red"
    
    plot '...filename...' using 3:2:($5*3600):(0) with vectors arrowstyle 1
    
    Almost the same code will be used to create the graphs for many user sessions over a longer period of time.

    Create the gnuplot data file out of wtmpx

    wtmpx will require data files with entries like this:

    tmnsn    30 20060403075755 20060410153215 175.57
    tt12339  79 20060412085413 20060412180126 9.12
    nm8720    8 20060412095225 20060412180421 8.20
    rr13447  84 20060412141539 20060412183617 4.34
    oo2006  101 20060412085250 20060412201402 11.35
    zpowv    53 20060403091259 20060413010213 231.82
    

    Using the Perl module Convert::Binary::C (which I discussed in a previous article) and the knowledge about which wtmpx entries are Sun Ray entries and which can be skipped the following code creates a valid data file. After declaring the Convert::Binary::C settings (as in my previous wtmpx blog) the while loop which reads the entries follows this logic:
    it stores login entries in some data structures.
    When a corresponding logout entry is found a line of data is printed.
    When a system reboot entry is found all currently stored login entries will transformed into data using the reboot time as the time of logout for all entries.
    After having read the complete wtmpx file the remaining login entries correspond to sessions which are still active. They will be transformed into data lines too.

    use strict;
    use Convert::Binary::C;
    
    my $utmpxh = "utmpx.h";               # include file
    
    # two OS specific settings
    my $struct = "futmpx";
    my $wtmpx  = "/var/adm/wtmpx";        # on Solaris
    
    my $c = Convert::Binary::C->new(
               Include => ['/usr/include', '/usr/include/i386-linux-gnu'],
               Define => [qw(__sparc)]
            );
    $c->parse_file( $utmpxh );
    
    # Choose native alignment
    $c->configure( Alignment => 0 );      # the same on both OSs
    my $sizeof = $c->sizeof( $struct );   # on Solaris (=372)
    
    $c->tag( $struct.'.ut_user', Format => "String" );
    $c->tag( $struct.'.ut_line', Format => "String" );
    $c->tag( $struct.'.ut_host', Format => "String" );
    
    my %start;                       # hash to store session start times per display
    my %user;                        # hash to store session user names per display
    
    open(WTMPX, $wtmpx) || die("Cannot open $wtmpx\n");
    my $buffer;
    
    # Read wtmpx line by line
    while( read(WTMPX, $buffer, $sizeof) == $sizeof ) {
      my $unpacked = $c->unpack( $struct, $buffer);   # Solaris
    
      # We need these 5 members of 'struct futmpx'
      my $ut_user = $unpacked->{ut_user} ;          # user name
      my $ut_line = $unpacked->{ut_line} ;          # looking for 'dtlocal'
      my $display = $unpacked->{ut_host} ;          # display name like ':52'
      my $ut_type = $unpacked->{ut_type} ;          # type of entry 7=login, 8=logout
      my $epoch   = $unpacked->{ut_tv}->{tv_sec};   # the timestamp of the entry in UNIX time
    
      # If a system restart happens then all previous sessions should be
      # printed and variables re-initialized
      if($ut_line eq "system boot" || $ut_line eq "system down" ) {
        foreach my $disp (keys %start) {
          print_row($disp,$epoch);
          delete $start{$disp};
        }
      }
    
      # Skip any entry which is not 'dtlocal'
      next  if( $ut_line ne "dtlocal" );
    
      # Login entries
      if($ut_type==7) {
        # Set the start time and user for 'display' session
        $start{$display}    = $epoch;
        $user{$display}     = $ut_user;
      }
    
      # Logout entries
      if($ut_type==8) {
        # Check if a corresponding and valid 'login' entry exists
        if($start{$display} && $user{$display} eq $ut_user && $start{$display}<=$epoch) {
          print_row($display,$epoch);
          # After printing a line the 'display' can be reused
          delete $start{$display};
          delete $user($display};
        }
      }
    
    }
    close(WTMPX);
    
    # What is left now are all sessions which are still running
    my $epoch       = time();
    foreach my $disp (keys %start) {
      print_row($disp,$epoch,"ongoing");
      delete $start{$disp};
    }
    
    exit 0;
    
    # Convert epoch time to string
    sub epoch_to_date {
      my ($epoch)   = @_;
      my ($seconds, $minutes, $hours, $day_of_month, $month, $year, $wday, $yday, $isdst) = localtime($epoch);
      # return something like  20060428091531 = April 28 2006 09:15:31
      return sprintf("%04d%02d%02d%02d%02d%02d",
        $year+1900, $month+1,$day_of_month,
        $hours, $minutes, $seconds,
      );
    }
    
    # Print a data entry
    sub print_row {
      my ($disp,$epoch,$ongoing)    = @_;
    
      # Session duration in seconds
      my $duration = $epoch - $start{$disp};
    
      # Remove the colon in 'display'
      (my $d = $disp) =~ s/://;
    
      # Set 'end' to either the real end time of the session or to 'ongoing'
      my $end = epoch_to_date($epoch);
      $end = $ongoing    if($ongoing);
    
      # Now print the line
      printf "%-10s %4s %6s %s %s %.2f\n",
            $user{$disp}, $d, epoch_to_date($start{$disp}), $end, $duration/3600;
            ;
    }
    

    With data files now in place it's time to create some graphs in the next blog.

    Saturday, March 9, 2013

    How LC_NUMERIC affects various awk versions

    Many people write scripts to be executed just in one place by one person: on the script writer's machine. Writing portable scripts (to be executed by many people on their machines) poses bigger challenges since one has to think about probable dependencies which could influence the proper execution of scripts.

    In this article I will discuss the relationship between the environment variable LC_NUMERIC and various versions of awk.

    My example is a very simple one: calculate the integer part and the square root of a given value

    echo 5.29 | awk '{ print int($1), sqrt($1) }'
    
    The expected output:
    5 2.3
    

    Now imagine a user in Germany who is using a 'german' environment and part of it is the following setting:

    LC_NUMERIC=de_DE.UTF-8
    
    The output of the code above will be this:
    0 0
    
    What a surprise one might think. Whatever your real code would have been doing with these values it would have ended up somewhere else where you wanted it to be.
    (The examples above have been run in Solaris (10 or 11) with the old original awk.)

    Let's see what we get with Solaris' nawk, MacOS' awk (on 10.5.8) and Linux' (Ubuntu 11) mawk:

    LC_NUMERIC
    not set or e.g. en_USde_DE.UTF-8
    (de_DE.utf8 on Ubuntu)
    Solaris awk5 2.30 0
    Solaris nawk or MacOS awk5 2.35 2,23607
    Ubuntu mawk5 2.35 2.3

    So what you should note:

  • with LC_NUMERIC not set or being set to english everything is ok
  • with LC_NUMERIC being set to german the results do not just differ from the english setting but also differ extremely depending which version of awk one is using
  • the output of square root for Solaris nawk and MacOS awk contains a comma
  • Ubuntu's mawk is the only one to create the same result in both cases

    How can these seemingly strange results be explained?

    Well, languages and cultures differ in many things and one of them is the mark which is used as decimal separator. In the english speaking world the '.' (dot) is being used whereas others like e.g. germans are using ',' (comma). That has important consequences when working in a german environment: a string like '5.29' is not viewed as a real number but as a string. nawk and MacOS awk try to get as much 'number' out of this string as possible which is simply the '5' before the dot and the integer and square root functions are applied to '5' (not to '5.29') and of these two functions the square root is off track.
    The original awk for some reasons treats the whole '5.29' as a string and numeric functions applied to a string lead to zero for both functions.
    Why Ubuntu's mawk does not honor locale settings at all seems strange to me after having seen how the other awks behave. I have not researched though whether this is known, a bug or considered to be a feature and I haven't tested on newer Ubuntu or other Linux versions.

    Conclusion: even when writing seemingly simple scripts the environment (or more precise: locale setting) plays an important role in the well-functioning and portability.
    If one is working for some time on a particular system one tends to take for granted some of the functionality which is not present or different on other systems.

    An input value of '5,29' would conversely work well with the german setting but not the english one.

    Note that also the LANG parameter has the same influence on the proper working of real numbers.

    Food for thought:
    the decimal separator issue (and others as well e.g. date format) can create bigger problems. Think about the simple case of a text file which was created by users entering decimal numbers for this and that. You have one text file from the US and one from Germany and you want to use its contents with one script. But one file delivers dot separated and the other comma separated input fields! A new hurdle.
    Also a CSV file in its original sense 'comma separated' does not work well in a german environment where the field separator would be equal to the decimal separator and thus another non-textual field separator should be chosen (TAB is preferrable anyway).
    Therefore more complex business applications with attached databases or even the Office programs like LibreOffice (calc) store numbers in just one format and they use the presentation layer to show numbers in whichever way the actual user wants to see them. This guarantees that the raw data stored in the databases can be mixed and used together regardless of the language setting of the users.

    And of course there are a couple of more languages out there with other features being represented in the locale settings and thus creating possible pitfalls to the unaware coder.


    Update April 2017

    Having familiarized myself a bit more with gawk the issue is how awk versions interpret input and output (which can be different, at least in gawk). gawk supports a particular option --use-lc-numeric to interpret the output. Below I am comparing inputs in C numeric with dot as a decimal separator and de_DE.utf8 numeric with comma as a decimal separator. For all commands I had set LC_NUMERIC="de_DE.utf8".
    InputCommandOutput
    C5.29echo 5.29 |
    awk '{ print int($1), sqrt($1) }'
    C5 2.3
    C5.29echo 5.29 |
    awk --use-lc-numeric '{ print int($1), sqrt($1) }'
    de_DE.utf85 2,23607
    de_DE.utf85,29echo 5.29 |
    awk '{ print int($1), sqrt($1) }'
    C5 2.23607
    de_DE.utf85,29echo 5.29 |
    awk --use-lc-numeric '{ print int($1), sqrt($1) }'
    de_DE.utf85 2,3

  • Per default gawk assumes that input is in C numeric and it will output in C (row 1).
  • If your input is not in C numeric you need to tell gawk explicitly to consider the LC_NUMERIC setting (row 4). gawk does not honour LC_NUMERIC automatically.
  • All other combinations lead to wrong results (rows 2 and 3)