CUCC Expedition Handbook

Logbooks Import

Importing the logbook into troggle

This is usually done after expo, but it is an excellent idea to have a nerd do it a couple of times during expo, to discover problems while the people involved are still around to ask.

The nerd needs to log in to the expo server using their own userid, not the 'expo' userid. The nerd also needs to be in the group that is allowed to do 'sudo'.

The 'parser'

This is rather a grand word for the hacked-about spaghetti of regexes in troggle/parsers/logbooks.py. It is not a proper parser, just a phrase recogniser, and is horribly, horribly fragile. On the bright side, we now have only one of these instead of five.

Ideal situation

Ideally this would all be done on a stand-alone laptop, so that the bugs in the logbook parsing are sorted out before we upload the corrected file to the server. Unfortunately this requires a full troggle software-development laptop, as the parser is built into troggle. The expo laptop in the potato hut is set up to do this (2023) but requires more nous than is convenient to describe here.

However, the expo laptop (or any 'bulk update' laptop) is configured to allow an authorized user to log in to the server itself and to run the import process directly on the server. DON'T DO THIS. The slightest mistake in formatting will kill logbook functionality on the server for everyone.

Importing the Blog

During expo lots of people post text and photos to the UK Caving (rope competition) website. During the winter after expo, an extra nerd task is to fold all those entries into the main logbook so that the trips are indexed and we can see who was doing what, where.

This is sufficiently complicated that it is documented in another page. But read this page first.

Current situation

With the new data entry form we should have far fewer problems with inventive hacks trying to do clever things with HTML, but it is entirely possible that the form can be used to input text which will then break the parser, most obviously by putting in a <hr /> which is the separator between entries. This is not clever.
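To see why a stray <hr /> is destructive, here is a minimal sketch of the general idea. This is not the real troggle parser (that lives in troggle/parsers/logbooks.py) and the trip text is made up; it just shows what happens when a regex splitter treats every <hr /> as an entry boundary:

```python
import re

# Minimal sketch, NOT the real troggle parser: entries in logbook.html
# are separated by <hr /> tags, so a splitter like this treats every
# <hr /> as an entry boundary. The trip text here is invented.
logbook_html = (
    "Trip one: rigged the entrance pitch."
    "<hr />"
    "Trip two: surveyed the new passage <hr /> then derigged."
)

entries = [e for e in re.split(r"<hr\s*/?>", logbook_html) if e.strip()]
print(len(entries))  # 3 chunks from only 2 real entries: the stray
                     # <hr /> inside trip two split the entry in half
```

The second <hr /> was meant as decoration inside an entry, but the splitter cannot tell the difference, so everything after it is treated as a new entry with no date and no author.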

The nerd needs to do this:

  1. Look at the list of pre-existing old import errors at Data Issues
  2. You need to get the list of people on expo sorted out first.
    This is documented in the Folk Update process.
  3. Log in to the expo server and run the update script (see below for details)
  4. Watch the error messages scroll by; they are more detailed than the messages archived in the old import errors list
  5. Edit the logbook.html file to fix the errors. These are usually typos, too-clever HTML, or unrecognised people. Some unrecognised people will mean that you first have to fix them using the Folk Update process.
  6. Re-run the import script until you have got rid of all the import errors.
  7. Pat self on back. Future data managers and people trying to find missing surveys will worship you.

The procedure is like this. It will be familiar to you because you will have already done most of this for the Folk Update process.

ssh  expo@expo.survex.com
cd troggle
python databaseReset.py logbooks

It will produce a list of errors like those below, starting with the most recent logbook, which will be the one for the expo you are working on. You can abort the script (Ctrl-C) once you have the errors for the current expo that you are going to fix.

Loading Logbook for: 2017
 - Parsing logbook: 2017/logbook.html
 - Using parser: Parseloghtmltxt
Calculating GetPersonExpeditionNameLookup for 2017
   - No name match for: 'Phil'
   - No name match for: 'everyone'
   - No name match for: 'et al.'
("can't parse: ", u'\n\n<img src="logbkimg5.jpg" alt="New Topo" />\n\n')
   - No name match for: 'Goulash Regurgitation'
   - Skipping logentry: Via Ferata: Intersport - Klettersteig - no author for entry
   - No name match for: 'mike'
   - No name match for: 'Mike'

Errors are usually: misplaced or duplicated <hr /> tags; names which are not specific enough to be recognised by the parser (though it tries hard), such as "everyone" or "et al.", or which are simply missing; or a bit of description which has been put into the names section, such as "Goulash Regurgitation".
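The name errors come from a lookup table that troggle builds from the folk list for each expedition year. The sketch below is a guess at the general idea, not troggle's actual GetPersonExpeditionNameLookup, and all the people in it are hypothetical: a variant (such as a bare first name) shared by two people is dropped so it can never match, and anything not derived from the folk list fails outright. In the sample output above even 'mike' failed, which typically means no matching person was registered for that year; that is what the Folk Update process fixes.

```python
from collections import defaultdict

# Hedged sketch of the idea behind name matching -- NOT troggle's actual
# GetPersonExpeditionNameLookup. All people here are hypothetical.
people = ["Phil Example", "Phil Sample", "Mike Demo"]

candidates = defaultdict(set)
for full in people:
    candidates[full.lower()].add(full)             # full name is a variant
    candidates[full.split()[0].lower()].add(full)  # bare first name too

# Keep only unambiguous variants: anything claimed by two people is dropped.
lookup = {v: next(iter(who)) for v, who in candidates.items() if len(who) == 1}

print(lookup.get("mike"))      # 'Mike Demo' -- unique first name matches
print(lookup.get("phil"))      # None -- two Phils, so 'Phil' is ambiguous
print(lookup.get("everyone"))  # None -- not on the folk list at all
```

So "No name match" does not always mean a typo: it can also mean the name is ambiguous for that year, or that the person was never added to that year's folk list.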

When you have sorted out the logbook formatting and the import is no longer complaining, you will need to do a full database reset, as the import will have trashed the online database and none of the troggle webpages will be working:

ssh  expo@expo.survex.com
cd troggle
python databaseReset.py reset
which takes between 300 seconds (five minutes) and 15 minutes on the server.

Back to Logbooks for Cavers documentation.
Forward to Logbook internal format documentation.
Forward to Importing the UK Caving Blog.