Last update: 7/18/2006
Very out of date!
Source: http://www.glue.umd.edu/~davida/training/unix-troubleshooting.html
This page contains general information on troubleshooting the UNIX (TerpConnect/Glue) infrastructure. It is intended for Help Desk (or other) staff trying to troubleshoot system or user problems, or simply to gather information about an account or host.
As a preface to this material, you should familiarize yourself with the concepts in this document:
UNIX Training
This document is a product of my old UNIX Short Course with some updates for the TerpConnect/Glue systems.
The TerpConnect/Glue system relies on several software packages/systems that are an integral part of the system, including Kerberos, AFS and Hesiod.
Basic Introduction
Users authenticate to the TerpConnect/Glue system with Kerberos tickets (which can be obtained or refreshed with the renew or kinit commands); these tickets have a specific lifetime (25 hours by default on TerpConnect/Glue), after which they expire and the user must re-authenticate to keep their access rights.
How AFS-hosted accounts are laid out
Accounts in the UNIX AFS systems are laid out in a specific manner. The top-level directory for the user resides in:
/users/USERID
The directories you'll typically see in this directory include "home", "pub", "mail" and "backup".
Each entry in the /users directory is actually just a link to someplace in the AFS filesystem, depending on whether the user is a "standard" (OSL) user or a departmental user. For example:
/users/davida
is a reference to the directory:
/afs/glue.umd.edu/home/oit/d/davida
User quotas and determining whether a user's home space is in AFS or NFS
To determine how much disk space a TerpConnect/Glue user in AFS is using, type:
fs lq ~USERID
where USERID is the login ID of the user. If the user is in AFS, you'll get the volume the directory lives in, the quota, the amount and percentage used, and how full that partition is:
fs lq ~djcarter
Volume Name Quota Used %Used Partition
h.oit.djcarter 1000000 3353 0% 12%
The quota and usage are measured in kilobytes, so a quota of 1000000 KB is about 1 GB. If you get the response (on Glue) of:
fs lq ~USERID
Volume Name Quota Used %Used Partition
fs: Invalid argument; it is possible that /homes/USERID is not in AFS.
it means the user's home directory is not in AFS space, and is probably in an NFS-mounted filesystem (see "Determining where an NFS-mounted (home) directory resides" below).
The fs lq command works on any AFS directory, not just home directories. For example, on the Glue system it's possible for someone to have their home directory in departmental NFS space, while their mail directory (/mail/USERID) resides in AFS space. You can check this with:
fs lq /mail/USERID
Volume Name Quota Used %Used Partition
h.glue.USERID 25000 25020 100% 67%
In this case, while USERID may not have their home directory in AFS space, they do have their mail spool directory in AFS space, and they're over quota (25020 KB used against a 25000 KB quota).
Run in the current directory ("."), fs lq shows that directory's quota and the name of the volume in which it resides, which you need to know when releasing volumes after editing web pages:
fs lq .
Volume Name Quota Used %Used Partition
d.oit.us.web.docs.N 150000 16019 11% 46%
Determining where a user's space is being used, and how to resolve it
If you have determined that a user is near, at or over their quota, the next step is to determine where the space is being used, and how to resolve the situation.
The first step is to go to the top of the user's filespace; this is the directory above their home directory. If you are logged in to the user's account, simply type:
cd
(from any other account, use cd ~USERID instead). You'll then want to go up one directory level, which you can do via:
cd ..
Once you're at the top level of the directory structure the user can write into, the next step is to determine where the space is going. The easiest way to do this is to use the du (disk usage) command, along with sorting and restricting the output. The best set of commands you can use to determine where space is being used is:
du -sk * .??* | sort -n | tail
This gives you the disk usage of each file and directory at this level. "-s" means "give me a summary if it's a directory" (rather than individual information for each file within that directory), and "-k" means to present the output values in kilobytes rather than blocks (block sizes vary from system to system and can be confusing; our systems tend to use 512-byte blocks, or about 0.5 KB). The ".??*" pattern picks up hidden (dot) files and directories while skipping "." and "..". The "sort -n" part means to sort the output in numeric order (the default is character order); since the output has numbers in column 1, you need to sort numerically. Lastly, the "tail" command gives you only the last, and therefore largest, ten files/directories in the list. These are usually the ones you're most interested in for reasons of space reduction.
Here's an example of the du command being used in a top-level TerpConnect/Glue directory:
du -sk * .??* | sort -n | tail
4 .lli
4 .ver
138 mail
3943 pub
8489 home
12584 backup
When determining quota usage, ignore the "backup" directory; it does not count against user quotas. From this list, the place where most of the quota is being used (about 8.4 MB) is the user's home directory, so the next step would be to:
cd home
and then re-run du there. If you see large amounts of space being taken up in any of the:
.ntprofile
.netscape
.ntnetscape
.microsoft
directories, one culprit might be left-over files in the browser cache. The simple way to check and take care of this is to run the:
clearcache
command, which will remove any browser cache files from these directories. If this doesn't help, you'll have to repeat the process directory by directory until you've resolved the issue (or at least identified the files taking up space).
This is also discussed in the Help Desk web page titled Information On How To Reclaim Disk Space on UNIX Systems.
Dealing with user mail issues
On the TerpConnect/Glue systems, the mail files reside in the directory:
/mail/USERID
The mail spool file is stored as part of the user's file space, so mail that has not been seen yet still counts against the user's quota.
There should be two basic files in the directory:
Here is an explanation of how files are handled when a user reads their mail. If the user does not have enough free disk space available to perform the copy procedure, they will probably get a read-only INBOX. One solution for this is to run the:
catmail
program (which lives in the "/usr/local/scripts" directory), which will perform the copy/append procedure without using the user's own disk space.
There may be other files there, like ".forward", files with names such as ".21460052" or ".__afs06F4", and mail lock files (either current or old).
When you read your mail with any client, a lock file is created. This is to prevent corruption by two (or more) mail clients writing to the mail folder at the same time. If a mail client detects a lock file when it is launched, it will open the mail file in read-only mode, so that the second mail process cannot write to the file. Occasionally, these files can be left behind if a mail client terminates abnormally. These lock files can take several forms:
mbox.lock
mbox.lock.NNNNNNNNNN.PID.imap_server
USERID.lock
USERID.lock.NNNNNNNNNN.PID.imap_server
where USERID is the user's login ID, PID is the process ID of the mail process on the mail server, NNNNNNNNNN is a 10-digit identifier, and imap_server is the name of the IMAP server via which the connection was made.
If you see any of these files with a date earlier than the current date, they are most likely left over and can be removed. Before you remove any lock files dated close to the current date/time, check with the user to make sure they're not running a mail client; if they have a mail client running and you remove the lock file(s), the results will be unpredictable.
In a very few cases, the mail spool file is hosted on a departmental non-AFS server, and resides in:
/usr/spool/mail/USERID
This filesystem is not part of the user's file space, so mail that has not been accessed by a mail client does not count against the user's quota.
How to restore an account to the default login scripts, or what to do when no commands work
If you have an account where the user has modified or removed some or all of their login scripts (like ".cshrc", ".login", etc.), or if the login does not look right and most (if not all) commands are not found, you can restore the account to its default login actions.
In some (rare) cases, the problem is that the user has very old login files (pre-1995). Around that time the system was updated and the login scripts replaced, and while we made a great effort to inform all users to update their files, not all did. You can tell that a user has these very old login scripts if you get the message:
machine: command not found
when you log in to their account. This is a leftover from the previous script, which needs to be replaced.
The solution to this is simply to run the newdefaults command. For accounts whose commands are not found, you'll probably have to specify the whole pathname:
/usr/local/scripts/newdefaults
After that, have the user log off and back on, and voila, things should be as expected. You should also warn the user against removing any files whose function they're not sure of.
Note: The one thing you need to remember is to check their quota to make sure the user is not at/over 100%, otherwise the command will fail.
Determining where an NFS-mounted (home) directory resides
To determine where an NFS-mounted home directory is being served from (what host has the directory), you need to use the hesinfo command twice: once to get the group, then again to get the server. First, to get the group, type:
hesinfo USERID homes.amd
This will return a string containing the path to the user's home directory, which includes the group:
hesinfo bob homes.amd
fs:=/home/deans2/bob
Then take the group name (in this case "deans2") and run a slightly different hesinfo command to get the host that serves that group:
hesinfo group home.amd
For example:
hesinfo deans2 home.amd
hostd!=topaz.deans.umd.edu;rhost:=topaz.deans.umd.edu ||
hostd==topaz.deans.umd.edu;type:=link;fs:=/export/home/deans2
The serving host will be the "hostd" value, in this example "topaz.deans.umd.edu". Once you've determined the host that serves a user's home directory, you can investigate the problem further. For example, the directory above resides under the "/export/home/deans2" filesystem on "topaz.deans.umd.edu".
Information on Glue userids
To find out which Hesiod groups a user belongs to, use the hesinfo command:
hesinfo USERID ngbyuser
For example:
hesinfo bob ngbyuser
OSL
shows us that user "bob" is a member of the OSL (Open System Labs) group, which means they can log in to any publicly-available workstation or telnet/slogin to glue.umd.edu. If you see "Restricted" as one of the values returned, it means their account has been disabled:
hesinfo bob ngbyuser
OSL,Restricted
This may be due to the person not showing up in campus records, or in a few cases, for disciplinary reasons. If you run accadmin and look at the "Status" and "Notes" fields, they should explain why the user is disabled.
To find all members of a particular group, use:
ngquery group
For example, to find all the members of the group "Elves" you would type:
ngquery Elves
Members of "Elves":
arensb davida gollum jay kevin mpilar pkd reuss sfuentes sneeri srs sturdiva
erics jwchurch rmaxwell
To find all hosts that members of a particular Hesiod group can log in to, use:
grep group /:/system/config/hesiod/auto/restrict.db
For example, using "USS" as the group name:
grep USS /:/system/config/hesiod/auto/restrict.db
grace HS TXT "allow +@USS,+@USS-students,+@GRACEusers,+@GRACE-fa06-cmsc411-0101,
+@GRACE-s206-cmsc411-0201,+@GRACE-s106-cmsc420-0101,+@GRACE-s106-cmsc330-0101,-"
uss HS TXT "allow +@EIS-admin,+@USS,-"
altair.umd.edu HS TXT "allow +@USS-admin,+@ATC,+@STAT_lab,+@SLIC,+@ShadyGrove,
+@RHSmith,+@USS,+@USS-students"
We see that the people in the USS group can log in to hosts in the "grace" and "uss" clusters and also on the host altair.umd.edu. Because grep matches substrings, the search can also return entries that only mention related groups such as "USS-admin" or "USS-students", so check that the exact group name appears in each allow list.
Other ways to collect information on a userid:
To find the user's mail host:
hesinfo userid mailhost
To find the user's post office box:
hesinfo userid pobox
(possibly in /var/spool/mail/userid/userid instead of /mail/userid)
If the system comes up as "eng.umd.edu" or "glue.umd.edu" (instead of an actual hostname like "altair.umd.edu"), you then need to find the specific system. To do this, type:
dig hostname mx any
and look through the resulting information for the systems noted with an "MX", like:
dig glue.umd.edu mx any
[ ... ]
glue.umd.edu. 7609 MX 10 distortion.ENG.UMD.EDU.
[ ... ]
There may be more than one MX record for the system; you can use any of the systems noted with "MX".
To find which AFS file server holds the user's home directory:
fs whereis ~userid
For example:
fs whereis ~davida
File /homes/davida is on host pride.umd.edu
To see the user's password entry:
hesinfo userid passwd
Information on Glue system host computers
To determine if a UNIX host is part of the Glue system, use the hosti command. The general syntax is:
hosti hostname Hesiod_tag
For example, you can determine which department controls a computer with:
hosti syrinx.umd.edu department
Default department
Department "oit" department
Cluster "uss" department
Machine "syrinx.umd.edu" department
This shows that the host "syrinx.umd.edu" is owned by the Division of IT (formerly OIT). Another use of the hosti command is to determine which logical cluster a hostname is part of:
hosti syrinx.umd.edu clustername
Default clustername
Department "oit" cluster
Cluster "uss" cluster
Machine "syrinx.umd.edu" cluster
uss
This shows that "syrinx.umd.edu" is part of the "uss" cluster. Yet another use of the hosti command is to determine who is allowed to log in to a particular host:
hosti syrinx.umd.edu restrict
Default restrict
Department "oit" restrict
Cluster "us-consult" restrict
allow +@EIS-admin,+@USS,-
Machine "syrinx.umd.edu" restrict
allow +@CLAB,+register,+eileena
allow.ftp +
This shows that members of the groups "EIS-admin", "USS" and "CLAB" are allowed to log in, as well as the userids "register" and "eileena". Additionally, this host allows ftp connections.
A good reference on what you can use the hesinfo and hosti commands to do is at the URL http://www.glue.umd.edu/admin/hesiod_query.html.
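When checking many hosts, it can help to reduce the "allow" lines from the hosti restrict output shown earlier in this section to a simple summary. The following is a minimal Python sketch of that idea, not a description of how hosti or Hesiod actually evaluate these rules. The token meanings are assumptions inferred only from the examples on this page: "+@NAME" admits a group, "+NAME" admits one userid, "+" admits everyone, and a trailing "-" denies everyone else. Merging the cluster-level and machine-level lines into one rule set is also an assumption. The names parse_restrict and Rule are hypothetical.

```python
"""Hypothetical helper (not part of hosti): summarize the "allow" lines
printed by `hosti HOST restrict`.

Token meanings are ASSUMED from the examples on this page, not taken from
any hosti/Hesiod documentation:
  +@NAME  members of the Hesiod group NAME are allowed
  +NAME   the single userid NAME is allowed
  +       everyone is allowed
  -       anyone not matched earlier is denied
"""
from dataclasses import dataclass, field


@dataclass
class Rule:
    groups: list[str] = field(default_factory=list)
    users: list[str] = field(default_factory=list)
    allow_all: bool = False
    deny_rest: bool = False


def parse_restrict(output: str) -> dict[str, Rule]:
    """Merge the rules from every level (cluster, machine, ...) per keyword.

    Merging levels is itself an assumption, based on how the syrinx.umd.edu
    example is read in the text.
    """
    rules: dict[str, Rule] = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Rule lines look like 'allow +@USS,-' or 'allow.ftp +'; header lines
        # such as 'Machine "syrinx.umd.edu" restrict' are skipped.
        if len(parts) != 2 or not parts[0].startswith("allow"):
            continue
        keyword, spec = parts
        rule = rules.setdefault(keyword, Rule())
        for token in (t.strip() for t in spec.split(",")):
            if token == "+":
                rule.allow_all = True
            elif token == "-":
                rule.deny_rest = True
            elif token.startswith("+@"):
                rule.groups.append(token[2:])
            elif token.startswith("+"):
                rule.users.append(token[1:])
    return rules


if __name__ == "__main__":
    sample = """\
Default restrict
Department "oit" restrict
Cluster "us-consult" restrict
allow +@EIS-admin,+@USS,-
Machine "syrinx.umd.edu" restrict
allow +@CLAB,+register,+eileena
allow.ftp +
"""
    for keyword, rule in parse_restrict(sample).items():
        print(keyword, rule)
```

Run against the syrinx.umd.edu output, this reports the groups EIS-admin, USS and CLAB and the userids register and eileena under "allow", plus an allow-everyone rule under "allow.ftp". If the real semantics differ (for example, if a machine-level line replaces rather than extends the cluster-level one), the merge step is the part to change.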
Departmental contact information
The Glue system has what are known as Glue Lab Managers. These are people with a range of expertise, from basic contact people to departmental system administrators. For some issues the users (or their professor or advisor in the case of students) will need to go through the Lab Manager(s). For example, if a user wants their home space moved to departmental space, they'll need to work that out with their Glue Lab Manager. Here is a list of the Glue Lab Managers.