General UNIX Troubleshooting Information

Last update: 7/18/2006
Very out of date!

Source: http://www.glue.umd.edu/~davida/training/unix-troubleshooting.html

Preface

This page contains general information on troubleshooting the UNIX (TerpConnect/Glue) infrastructure. It is useful to Help Desk staff (and others) who are troubleshooting system or user problems, or who simply need to gather certain information.

As a preface to this material, you should familiarize yourself with the concepts in this document:

  UNIX Training

This document is a product of my old UNIX Short Course with some updates for the TerpConnect/Glue systems.

Basic introduction to programs specific to TerpConnect/Glue
How AFS-hosted accounts are laid out
Determining a user's quota and whether their home space is in AFS or is NFS-mounted
How to resolve at/over quota issues
Dealing with user mail issues
How to restore an account to the default login scripts or What to do when no commands work
Determining the host which serves an NFS-mounted home directory

Getting information on a Glue userid
Getting information on a Glue host computer
Departmental contact information

Basic Introduction

The TerpConnect/Glue system uses certain software packages/systems which are an integral part of the system, including:

AFS - Used to stand for the Andrew File System, but now simply stands for "the AFS file system", which controls the majority of the user disk space on TerpConnect/Glue. The exceptions are accounts hosted on departmental servers, usually for the ENG/IPST departments. AFS is a distributed file system, allowing multiple hosts to access the same directories/files. It also supports various levels of permissions; all permissions are set at the directory level (instead of the file level, as in standard UNIX file systems), and you can allow or restrict access by individuals or by groups. It also allows for "snapshot" backups, which are done on TerpConnect/Glue at 04:00 each morning. It also uses caching, so network traffic is reduced because copies are cached (stored) and accessed locally instead of being fetched across the network each time they're referenced. Tied in with AFS is the Kerberos authentication system.
Kerberos - A user authentication system. Users are granted Kerberos tickets when they log in (or use the renew or kinit commands); these tickets have a specific lifetime (defaults to 25 hours on TerpConnect/Glue), after which they expire and the user must re-authenticate to keep their access rights.
Hesiod - An information database on users and hosts. Lets you determine specifics about users, such as the groups to which they belong, which in turn determine the hosts to which they can log in.
Zephyr - A messaging system. Used to locate and communicate between people logged into the system.


How AFS-hosted accounts are laid out

Accounts in the UNIX AFS systems are laid out in a specific manner. The top-level directory for the user resides in:

  /users/USERID

The directories you'll see in this directory are:

home - The user's home directory, where the user is placed when they login. Can also be referred to via the shortcut "/homes/USERID".

mail - The user's mail directory, where incoming mail is placed. Also where ".forward" files are located, vacation scripts, etc. Can also be referred to via the shortcut "/mail/USERID".

pub - The user's pub (public access) directory; usually used for Web pages, but can be used for any file(s) the user wants to make public. Can also be referred to via the shortcut "/pub/USERID".

backup - The user's backup directory; refreshed via the AFS snapshot at 04:00 each morning and backed up to tape during the day. There is no shortcut to this directory, as it's not normally referenced unless something needs to be recovered.

.lli - The directory containing last login information, used by the system.

The /users directory is actually just a link to someplace in the AFS filesystem, depending upon whether they're a "standard" (OSL) user or a departmental user. For example:

  /users/davida

is a reference to the directory:

  /afs/glue.umd.edu/home/oit/d/davida


User quotas and determining whether a user's home space is in AFS or NFS

To determine how much disk space a TerpConnect/Glue user in AFS is using, type:

  fs  lq  ~USERID

where USERID is the login ID of the user. If the user is in AFS, you'll get the volume the directory lives in, the quota, amount & percent used, and how full that partition is:

  fs  lq  ~djcarter
  Volume Name                   Quota      Used %Used   Partition
  h.oit.djcarter              1000000      3353    0%         12%

The quota is measured in kilobytes, so 1000000KB is 1GB. If you get the response (on Glue) of:

  fs  lq  ~USERID
  Volume Name                   Quota      Used %Used   Partition
  fs: Invalid argument; it is possible that /homes/USERID is not in AFS.

it means the user's home directory is not in AFS space, and is probably in an NFS-mounted filesystem. (More on this later.)

The same command can be used to check the quota of any directory, not just a home directory. For example, on the Glue system it's possible for someone to have their home directory in departmental NFS space, while their mail directory (/mail/USERID) resides in AFS space. You can check this with:

  fs  lq  /mail/USERID
  Volume Name                   Quota      Used %Used   Partition
  h.glue.USERID                 25000     25020  100%         67%

In this case, while USERID may not have their home directory in AFS space, they do have their mail spool directory in AFS space, and they're over quota.
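
If you check quotas often, the "fs lq" output can be summarized with a small shell function. The following is only a sketch: quota_summary is a hypothetical helper name (not a Glue command), and it assumes the two-line output layout shown in the examples above.

```sh
# Summarize `fs lq` output read from standard input.
# Assumes the layout shown above: a "Volume Name" header line, then
# one line of: volume, quota (KB), used (KB), %used, partition %.
quota_summary() {
    awk '
        $1 == "Volume" { next }                  # skip the header line
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            state = ($3 + 0 >= $2 + 0) ? "AT/OVER QUOTA" : "ok"
            printf "%s: %d of %d KB used (%s)\n", $1, $3, $2, state
        }
    '
}

# Usage (on a host with the AFS client installed):
#   fs lq /mail/USERID | quota_summary
```

Lines that do not carry numeric quota and usage values, such as the "Invalid argument" error for non-AFS directories, produce no output.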

It can also be used to determine the volume name in which the directory resides:

  fs  lq  .
  Volume Name                   Quota      Used %Used   Partition
  d.oit.us.web.docs.N          150000     16019   11%         46%

for the purposes of releasing volumes when editing web pages.


Determining where a user's space is being used, and how to resolve it

If you have determined that a user is near, at or over their quota, the next step is to determine where the space is being used, and how to resolve the situation.

The first step is to go to the top of the user's filespace; this is the directory above their home directory. To get to a user's home directory, type:

  cd  ~USERID

You'll then want to go to the next directory level up, which you can do via:

  cd  ..

Once you're at the top level of the directory structure the user can write into, the next step is to determine where the space is going. The easiest way to do this is to use the du (disk usage) command, along with sorting and restricting the output. The best set of commands you can use to determine where space is being used is:

  du  -sk  *  .??*  |  sort  -n  |  tail

This reports disk usage for each file and directory at the top level of the user's space. The "*" pattern matches the regular entries, and ".??*" matches hidden entries (names starting with a dot followed by at least two more characters), which excludes "." and "..". The "-s" option means "give me a summary if it's a directory" (rather than individual file information for each file within that directory), and "-k" means to present the output values in kilobytes rather than blocks (block sizes can vary from system to system and can be confusing; our systems tend to use 512-byte blocks, or 0.5 KB). The "sort -n" part sorts the output in numeric order (the default is character order); since the output has numbers in column 1, you need to sort numerically. Lastly, the "tail" command shows only the last ten lines of the list, which after sorting are the ten largest files/directories. These are usually the ones you're most interested in when trying to free up space.
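
Since you will usually repeat this pipeline in several directories, it can be wrapped in a small function. This is a sketch only; top_usage is a hypothetical helper name, not an installed Glue script.

```sh
# List the N largest entries (default 10) directly under a directory,
# smallest first, using the same du | sort | tail pipeline.
top_usage() {
    dir=${1:-.}
    count=${2:-10}
    (
        cd "$dir" || exit 1
        # A pattern that matches nothing is passed to du literally;
        # discard the resulting "No such file" errors.
        du -sk * .??* 2>/dev/null | sort -n | tail -n "$count"
    )
}

# Usage:
#   top_usage /users/USERID        # ten largest entries
#   top_usage /users/USERID/home 5 # five largest entries
```

The subshell keeps the "cd" from changing your own working directory.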

Here's an example of the du command being used in a top-level TerpConnect/Glue directory:

  du  -sk  *  .??*  |  sort  -n  |  tail
  4       .lli
  4       .ver
  138     mail
  3943    pub
  8489    home
  12584   backup

When determining quota usage, ignore the "backup" directory; it does not count against the user's quota. From this list, the place where most of the quota is being used (about 8.5 MB) is the user's home directory, so the next step would be to:

  cd  home

and then re-run the du command there. If you see large amounts of space being taken up in any of the:

  .ntprofile
  .netscape
  .ntnetscape
  .microsoft

directories, one culprit might be left-over files in the browser cache. The simple way to check & take care of this is to run the:

  clearcache

command, which will remove any browser cache files from these directories. If this doesn't help, you'll have to repeat the process by directory until you've resolved the issue (or at least identified the files taking up space).

This is also discussed in the Help Desk web page titled Information On How To Reclaim Disk Space on UNIX Systems.


Dealing with user mail issues

On the TerpConnect/Glue systems, the mail files reside in the directory:

  /mail/USERID

The mail spool file is stored as part of the user's file space, so mail that has not been seen yet still counts against a user's quota.

There should be two basic files in the directory:

USERID - This is the file into which new mail is placed, until the user checks their mail. This is also called the mail spool file. When the user runs a mail program/client, the contents of this file are appended to the "mbox" file, which is then opened (usually) as INBOX.
mbox - This file contains mail that has previously been accessed by a mail client, regardless of whether it has been seen or not. Messages which have not been read yet are usually marked as "unread"; different mail clients may show this in different ways.

Here is an explanation of how files are handled when a user reads their mail. If the user does not have enough free disk space available to perform the copy procedure, they will probably get a read-only INBOX. One solution for this is to run the:

  catmail

program (which lives in the "/usr/local/scripts" directory), which will perform the copy/append procedure without using the user's own disk space.

There may be other files there, like ".forward", mail lock files (either current or old), etc.

When you read your mail with any client, a lock file is created. This is to prevent corruption by two (or more) mail clients writing to the mail folder at the same time. If a mail client detects a lock file when it is launched, it will open the mail file in read-only mode, not allowing the second mail process to write to the file. Occasionally, these files can be left behind if a mail client terminates abnormally. These lock files can take many forms:

  .21460052

  .__afs06F4

  mbox.lock

  mbox.lock.NNNNNNNNNN.PID.imap_server

  USERID.lock

  USERID.lock.NNNNNNNNNN.PID.imap_server

where USERID is the user's login ID, PID is the process id of the mail process on the mail server, NNNNNNNNNN is a 10-digit identifier, and imap_server is the name of the IMAP server via which the connection was made.

If you see any of these files with a date earlier than the current date, they are most likely left over, and can be removed. Make sure you check with the user to ensure they're not running a mail client before you remove any lock files which are close to the current date/time. If they have a mail client running and you remove the lock file(s), results will be unpredictable.
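
To spot left-over lock files quickly, you could list the candidates with find. This is a sketch under some assumptions: stale_mail_locks is a hypothetical helper name, the name patterns are taken only from the examples above, and "older than 24 hours" stands in for "earlier than the current date". It only lists files; it never removes anything.

```sh
# Print lock-file candidates in a mail directory that were last
# modified more than 24 hours ago. Nothing is deleted.
# -maxdepth is supported by GNU and BSD find but is not in POSIX.
stale_mail_locks() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*.lock' -o -name '*.lock.*' \
           -o -name '.__afs*' -o -name '.[0-9]*' \) \
        -mtime +0 -print
}

# Usage:
#   stale_mail_locks /mail/USERID
```

Review the list and check with the user before removing anything it reports.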

In very few cases, the mail spool file is hosted on a departmental non-AFS server, and resides in:

  /usr/spool/mail/USERID

This filesystem is not part of the user's file space, so mail that has not been accessed by a mail client does not count against a user's quota.


How to restore an account to the default login scripts, or

What to do when no commands work

If you have an account where the user has modified or removed some/all of their login scripts (like ".cshrc", ".login", etc.), or if the login does not look right and most (if not all) commands are not found, you can restore the account to its default login actions.

In some (rare) cases, the problem is that the user has very old login files (pre-1995). It was about this time the system was updated and the login scripts replaced, and while we made a great effort to inform all users to update their files, not all did. You can tell that a user has these very old login scripts if you get the message:

  machine: command not found

when you login to their account. This is a leftover from the previous script, which needs to be replaced.

The solution to this is to simply run the newdefaults command. For them, you'll probably have to specify the whole pathname:

  /usr/local/scripts/newdefaults

After that, have them log off and back on, and voila, things should be as expected. You should also warn the user against removing any files whose function they're not sure of.

Note: The one thing you need to remember is to check their quota to make sure the user is not at/over 100%, else the command will fail.


Determining where an NFS-mounted (home) directory resides

To determine where an NFS-mounted home directory is being served from (what host has the directory), you need to use the hesinfo command twice; once to get the group, then again to get the server. First, to get the group, type:

  hesinfo  USERID  homes.amd

This will return a string containing the path of the user's home directory, which includes the group name:

  hesinfo  bob  homes.amd
  fs:=/home/deans2/bob

Then take the group name (in this case "deans2") and run a slightly different hesinfo to give you the host which serves that group:

  hesinfo  group  home.amd

For example:

  hesinfo  deans2  home.amd
  hostd!=topaz.deans.umd.edu;rhost:=topaz.deans.umd.edu || 
         hostd==topaz.deans.umd.edu;type:=link;fs:=/export/home/deans2

The serving host will be the "hostd" value, in this example "topaz.deans.umd.edu". Once you've determined the host which serves a user's home directory, you can investigate the problem further. For example, the directory above resides under the "/export/home/deans2" filesystem on "topaz.deans.umd.edu".
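
The two hesinfo lookups can be chained with a pair of small filters. This is a sketch that assumes the record formats shown in the examples above; amd_group and amd_server are hypothetical helper names.

```sh
# Extract the group from a homes.amd record such as
#   fs:=/home/deans2/bob
# (assumes the /home/GROUP/USERID layout shown above).
amd_group() {
    sed -n 's|^fs:=/home/\([^/]*\)/.*|\1|p'
}

# Extract the serving host from a home.amd record by taking the
# first hostd value (either hostd!=HOST or hostd==HOST).
amd_server() {
    sed -n 's/^[^;]*hostd[!=]=\([^; ]*\).*/\1/p' | head -n 1
}

# Usage (on a host with hesinfo installed):
#   group=$(hesinfo USERID homes.amd | amd_group)
#   hesinfo "$group" home.amd | amd_server
```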


Information on Glue userids

To find out to which Hesiod groups a user belongs, use the hesinfo command:

  hesinfo  USERID  ngbyuser

For example:

  hesinfo  bob  ngbyuser
  OSL

shows us that user "bob" is a member of the OSL (Open System Labs) group, which means they can login to any publicly-available workstation or telnet/slogin to glue.umd.edu. If you see "Restricted" as one of the values returned, it means their account has been disabled:

  hesinfo  bob  ngbyuser
  OSL,Restricted

This may be due to the person not showing up in campus records, or in a few cases, for disciplinary reasons. If you run accadmin and look at the "Status" and "Notes" fields it should explain why the user is disabled.
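
If you need to test for a disabled account in a script, the group list can be checked for an exact "Restricted" entry. A minimal sketch, assuming the comma-separated output format shown above; is_restricted is a hypothetical helper name.

```sh
# Succeed (exit status 0) if the comma-separated group list read from
# standard input contains the exact entry "Restricted".
is_restricted() {
    tr ',' '\n' | grep -qx 'Restricted'
}

# Usage (on a host with hesinfo installed):
#   if hesinfo USERID ngbyuser | is_restricted; then
#       echo "account is disabled"
#   fi
```

Matching the whole entry (grep -x) avoids false hits on any group whose name merely contains the word.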

To find all members of a particular group, use:

  ngquery group

For example, to find all the members of the group "Elves" you would type:

  ngquery  Elves
  Members of "Elves":
  arensb davida gollum jay kevin mpilar pkd reuss sfuentes sneeri srs sturdiva
  erics jwchurch rmaxwell

To find all hosts that members of a particular Hesiod group can log in to, use:

  grep   group   /:/system/config/hesiod/auto/restrict.db

For example, using "USS" as the groupname:

  grep USS /:/system/config/hesiod/auto/restrict.db
  grace   HS TXT  "allow +@USS,+@USS-students,+@GRACEusers,+@GRACE-fa06-cmsc411-0101,
  +@GRACE-s206-cmsc411-0201,+@GRACE-s106-cmsc420-0101,+@GRACE-s106-cmsc330-0101,-"
  uss     HS TXT  "allow +@EIS-admin,+@USS,-"
  altair.umd.edu  HS TXT  "allow +@USS-admin,+@ATC,+@STAT_lab,+@SLIC,+@ShadyGrove,
  +@RHSmith,+@USS,+@USS-students"

We see that the people in the USS group can log in to hosts in the "grace" and "uss" clusters and also on the host altair.umd.edu.
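
Note that a plain grep for "USS" also matches records that only mention related groups such as "USS-admin" or "USS-students". A stricter filter could compare each entry of the allow list exactly. This is a sketch: hosts_for_group is a hypothetical helper name, and it assumes each record occupies a single line in the file (the sample output above is wrapped for display).

```sh
# Read restrict.db-style records on standard input and print the
# host or cluster name of each record whose allow list contains
# exactly +@GROUP.
hosts_for_group() {
    awk -v want="+@$1" '
        {
            list = $0
            sub(/^[^"]*"allow[ \t]+/, "", list)  # drop text before the list
            sub(/".*$/, "", list)                # drop the closing quote
            n = split(list, entry, ",")
            for (i = 1; i <= n; i++)
                if (entry[i] == want) { print $1; break }
        }
    '
}

# Usage:
#   hosts_for_group USS < /:/system/config/hesiod/auto/restrict.db
```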

Other ways to collect information on a userid:

To find out where a user's mailbox resides:
```
  hesinfo userid mailhost
```

To find out user mail-related information (including mailhost), type:
```
  hesinfo userid pobox
```
(The mailbox may be in /var/spool/mail/userid/userid instead of /mail/userid.) If the mail host comes up as "eng.umd.edu" or "glue.umd.edu" (instead of an actual hostname like "altair.umd.edu"), you then need to find the specific system. To do this, type:
```
  dig  hostname  mx  any
```
and look through the resulting information for the systems noted with an "MX", like:
```
  dig  glue.umd.edu  mx  any
     [ ... ]
  glue.umd.edu.   7609    MX      10 distortion.ENG.UMD.EDU.
     [ ... ]
```
There may be more than one MX record for the system; you can use any which are noted with "MX".
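
To pull just the mail exchanger names out of the dig output, a small filter can help. This is a sketch: mx_hosts is a hypothetical helper name, and it assumes each answer line contains the literal field "MX" followed by a preference value and a host name, as in the example above. Hosts are listed lowest preference value first, which is the order in which mail software normally tries them.

```sh
# Print the mail exchanger host names from dig output read on
# standard input, lowest preference value first.
mx_hosts() {
    awk '
        /^;/ { next }                        # skip dig comment lines
        {
            for (i = 1; i < NF - 1; i++)
                if ($i == "MX") { print $(i + 1), $(i + 2); break }
        }
    ' | sort -n | awk '{ print $2 }'
}

# Usage:
#   dig glue.umd.edu mx | mx_hosts
```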

To find out which server a user's home directory is on (if in AFS):

  fs whereis ~userid

For example:

  fs whereis ~davida
  File /homes/davida is on host pride.umd.edu
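
If you only want the server name (for example, to use in another command), the output can be trimmed with sed. A minimal sketch, assuming the "File ... is on host ..." wording shown above; afs_servers is a hypothetical helper name.

```sh
# Print the file server name(s) from `fs whereis` output on stdin.
# Accepts both "is on host NAME" and "is on hosts NAME NAME ...".
afs_servers() {
    sed -n 's/^File .* is on hosts\{0,1\} *//p'
}

# Usage (on a host with the AFS client installed):
#   fs whereis ~USERID | afs_servers
```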

To get the Hesiod password entry for a user (without the actual password), type:
```
  hesinfo userid passwd
```


Information on Glue system host computers

To determine if a UNIX host is part of the Glue system, use the hosti command. The general syntax is:

  hosti  hostname  Hesiod_tag

For example, you can determine which department a computer is controlled by with:

  hosti  syrinx.umd.edu  department
  Default department

  Department "oit" department

  Cluster "uss" department

  Machine "syrinx.umd.edu" department

This shows that the host "syrinx.umd.edu" is owned by the Division of IT (formerly OIT). Another example of the hosti command is to determine which logical cluster a hostname is part of:

  hosti  syrinx.umd.edu  clustername
  Default clustername

  Department "oit" cluster

  Cluster "uss" cluster

  Machine "syrinx.umd.edu" cluster
  uss

This shows that "syrinx.umd.edu" is part of the "uss" cluster. Yet another example of the hosti command lets you determine who is allowed to log in to a particular host:

  hosti  syrinx.umd.edu  restrict
  Default restrict

  Department "oit" restrict

  Cluster "us-consult" restrict
  allow +@EIS-admin,+@USS,-

  Machine "syrinx.umd.edu" restrict
  allow +@CLAB,+register,+eileena
  allow.ftp +

This shows that members of the groups "EIS-admin" and "USS" (allowed at the cluster level) and "CLAB" (allowed on this machine) can log in, as well as the userids "register" and "eileena". Additionally, this host allows ftp connections.
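
When an allow list is long, it can be easier to read with one entry per line. The following sketch labels "+@NAME" entries as groups and "+NAME" entries as userids, following the interpretation given above; allow_entries is a hypothetical helper name, and other entries (such as the trailing "-" or the "allow.ftp" line) are simply skipped.

```sh
# Split each "allow" line of `hosti HOST restrict` output into one
# entry per line, labelling +@NAME as a group and +NAME as a userid.
allow_entries() {
    sed -n 's/^ *allow  *//p' | tr ',' '\n' | sed -n \
        -e 's/^+@\(..*\)$/group \1/p' \
        -e 's/^+\([^@].*\)$/user \1/p'
}

# Usage (on a host with hosti installed):
#   hosti syrinx.umd.edu restrict | allow_entries
```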

A good reference on what you can use the hesinfo and hosti commands to do is at the URL http://www.glue.umd.edu/admin/hesiod_query.html.


Departmental contact information

The Glue system has what are known as Glue Lab Managers. These are people with a range of expertise, from basic contact people to departmental system administrators. For some issues the users (or their professor or advisor in the case of students) will need to go through the Lab Manager(s). For example, if a user wants their home space moved to departmental space, they'll need to work that out with their Glue Lab Manager. Here is a list of the Glue Lab Managers.
