Pre-Jane-athon testing
I wanted to see how things might go on The Day after the inputting, etc., was over, especially the use of Windows Explorer to join the files from each user/table, so I constructed the following test scenario.
First, I searched these MARC sources to gather some sample data:
- LC MARC English (7M records)
- LC MARC Foreign (5M records)
The search was for any occurrence in any field of 'Jane Austen' or 'Austen, Jane'.
I took the matching records, concatenated them together, and deduped them on LCCN.
In the end I assembled 1188 MARC records in a file called 'jane-all-deduped.mrc'.
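The dedupe step could be sketched roughly as follows. This is not the tool actually used for the test (the notes do not say what was used); it is a minimal standard-library sketch that assumes ISO 2709 binary MARC, reads the LCCN from 010 $a, keeps the first record seen for each LCCN, and keeps any record that has no LCCN. The function names are hypothetical.

```python
FIELD_END = b"\x1e"   # ISO 2709 field terminator
SUBFIELD = b"\x1f"    # ISO 2709 subfield delimiter


def split_records(data):
    """Yield raw records, trusting the 5-digit record length in each leader."""
    pos = 0
    while pos + 24 <= len(data):
        length = int(data[pos:pos + 5].decode("ascii"))
        if length < 24:          # malformed leader: stop rather than loop forever
            break
        yield data[pos:pos + length]
        pos += length


def lccn_of(record):
    """Return the text of 010 $a with padding stripped, or None if absent."""
    base = int(record[12:17].decode("ascii"))   # base address of data
    directory = record[24:base - 1]             # drop the directory terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] != b"010":
            continue
        length = int(entry[3:7].decode("ascii"))
        start = int(entry[7:12].decode("ascii"))
        field = record[base + start:base + start + length]
        for sub in field.split(SUBFIELD)[1:]:
            if sub[:1] == b"a":
                text = sub[1:].rstrip(FIELD_END)
                return text.decode("utf-8", "replace").strip()
    return None


def dedupe_on_lccn(data):
    """Keep the first record for each LCCN; records without an LCCN are kept."""
    seen = set()
    kept = []
    for record in split_records(data):
        key = lccn_of(record)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append(record)
    return b"".join(kept)
```

Usage would be along the lines of reading the concatenated file as bytes, passing it to `dedupe_on_lccn`, and writing the result out as 'jane-all-deduped.mrc'.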
Next I created 11 folders in my RIMMF data area named: jata … jatk
Then I copied Deborah's 75 janebase records into each folder.
Next, I repeated the following steps, once for each folder (11 times):
- opened the rimmf .ini file
- changed the record prefix: (1. jaia … 11. jaik)
- changed the data folder: (1. jata … 11. jatk)
- saved the .ini
- started RIMMF, went to F3
- batchloaded* the next 100 records from the jane.mrc file
- closed the program
Notes:
For the last folder, I batchloaded 188 records instead of the usual 100.
The time to run 100 records through F3 ranged from 15 to 18 minutes (Saturday afternoon).
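The per-folder settings and record ranges used in the steps above can be tabulated with a short script. This is only a planning sketch: it does not touch the RIMMF .ini file (whose key names are not given here), and the function and dictionary key names are hypothetical. The folder names, prefixes, batch size, and record count come from the notes.

```python
import string

TOTAL_RECORDS = 1188   # records in the deduped MARC file
BATCH = 100            # records per folder, except the last
FOLDERS = 11           # jata ... jatk


def build_plan(total=TOTAL_RECORDS, batch=BATCH, folders=FOLDERS):
    """Return one entry per folder: data folder, record prefix, record range."""
    plan = []
    for i in range(folders):
        letter = string.ascii_lowercase[i]
        start = i * batch
        # The last folder takes whatever is left (188 records here).
        end = total if i == folders - 1 else start + batch
        plan.append({
            "folder": "jat" + letter,
            "prefix": "jai" + letter,
            "first": start + 1,   # 1-based position in the MARC file
            "last": end,
        })
    return plan


for step in build_plan():
    count = step["last"] - step["first"] + 1
    print(step["folder"], step["prefix"], step["first"], step["last"], count)
```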
The results of this processing:
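| folder | prefix | jbase | marc | import | folder total |
|--------|--------|-------|------|--------|--------------|
| jata | jaia | 75 | 100 | 380 | 455 |
| jatb | jaib | 75 | 100 | 336 | 411 |
| jatc | jaic | 75 | 100 | 369 | 444 |
| jatd | jaqid | 75 | 100 | 386 | 461 |
| jate | jaie | 75 | 100 | 326 | 401 |
| jatf | jaif | 75 | 100 | 338 | 413 |
| jatg | jaig | 75 | 100 | 326 | 401 |
| jath | jaih | 75 | 100 | 350 | 425 |
| jati | jaii | 75 | 100 | 377 | 452 |
| jatj | jaij | 75 | 100 | 365 | 440 |
| jatk | jaik | 75 | 188 | 588 | 663 |
| tot | | 825 | 1188 | 4141 | 5329 |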
In addition, since it was the main point of the test, I tracked the links too:
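| folder | prefix | J2J | O2J | total links |
|--------|--------|-----|-----|-------------|
| jata | jaia | 148 | 63 | 211 |
| jatb | jaib | 148 | 69 | 217 |
| jatc | jaic | 148 | 87 | 235 |
| jatd | jaqid | 148 | 77 | 225 |
| jate | jaie | 148 | 65 | 213 |
| jatf | jaif | 148 | 92 | 240 |
| jatg | jaig | 148 | 129 | 277 |
| jath | jaih | 148 | 110 | 258 |
| jati | jaii | 148 | 84 | 232 |
| jatj | jaij | 148 | 112 | 260 |
| jatk | jaik | 148 | 178 | 326 |
| tot | | 1628 | 1066 | 2694 |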
J2J is the number of 'janebase' to 'janebase' links.
O2J is the number of 'jai.' to 'janebase' links, i.e. links added to a folder by the MARC import processing.
Next, I ran the test of the Jane-athon 'merge':
- created a new folder named 'jane1'
- copied the 75 records from the janebase into it
- for each of the 11 'jat.' folders, dragged and dropped its contents into 'jane1'
When asked how to handle the 75 janebase records (i.e. overwrite or skip), I chose 'Don't copy'.
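The same merge could be scripted instead of dragged and dropped. The sketch below is an illustration under stated assumptions, not part of RIMMF: it assumes each record is a single file sitting directly in its folder, that the record prefixes keep file names unique across folders, and that "Don't copy" means leaving any file that already exists in the target untouched. The function name is hypothetical.

```python
import shutil
from pathlib import Path


def merge_folders(sources, dest):
    """Copy files from each source folder into dest, skipping existing names.

    Returns (copied, skipped) counts.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = skipped = 0
    for source in sources:
        for item in sorted(Path(source).iterdir()):
            if not item.is_file():
                continue
            target = dest / item.name
            if target.exists():
                # Equivalent of choosing "Don't copy" in Explorer.
                skipped += 1
                continue
            shutil.copy2(item, target)
            copied += 1
    return copied, skipped
```

With 'jane1' already holding the 75 janebase records, a call such as `merge_folders(["jata", "jatb"], "jane1")` would be expected to skip the 75 shared files from each source folder and copy only the imported ones.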
Next, I loaded 'jane1' into RIMMF.
The first time, I thought it was not going to work. It took a long time, a very long time. Well over an hour, maybe an hour and a half. Not looking too good.
But the next day I started looking for bottlenecks. The main one was the 'feature' whereby RIMMF reconstructs missing links when it loads the EI. Completely rewriting that routine knocked off an hour. Getting better. That in turn made it possible to step through the code, find more bottlenecks, and fix them.
The stats for the resulting EI are:
| folder | jbase | marc import records | J2J | O2J | total links |
|--------|-------|---------------------|-----|-----|-------------|
| jane1 | 75 | 4216 | 148 | 1066 | 1214 |
The cumulated 'O2J' count seems to indicate that the merge using Explorer will work.
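The reasoning behind that check is simple arithmetic: the J2J links live in the 75 janebase records, which are copied only once, so they should not multiply, while the O2J links from every folder should all survive the merge. A quick sketch using the per-folder figures reported above (the variable names are just for illustration):

```python
# O2J links per folder, as reported in the per-folder link counts.
O2J = {
    "jata": 63, "jatb": 69, "jatc": 87, "jatd": 77, "jate": 65, "jatf": 92,
    "jatg": 129, "jath": 110, "jati": 84, "jatj": 112, "jatk": 178,
}
J2J_PER_FOLDER = 148   # janebase-to-janebase links, identical in every folder

o2j_sum = sum(O2J.values())                   # links added by the MARC imports
j2j_all_copies = J2J_PER_FOLDER * len(O2J)    # J2J counted once per folder
expected_merged_total = J2J_PER_FOLDER + o2j_sum

print(o2j_sum)                 # 1066, matches O2J for jane1
print(j2j_all_copies)          # 1628, the per-folder J2J total
print(expected_merged_total)   # 1214, matches total links for jane1
```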
Currently, the RIMMF start up time is 2 mins 30 secs. This includes adding all the missing links, plus making the EI and all of the support tables.
This isn't great, but workable. Knock a minute off that time on a fast computer, but what about a laptop? I will try it on my laptop this week and see how much slower it is.
I have also added support for EI caching to RIMMF3 (which we had in RIMMF2, but which has been turned off for RIMMF3 thus far).
Thus, with caching, after the initial build, the program will be ready in about 2 or 3 seconds the next time it is started. The cache files can be distributed with the results if necessary (they are machine-independent).
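The general idea of this kind of caching is "build once, reload afterwards". The sketch below illustrates the pattern only; it is not how RIMMF stores its EI cache. The function name is hypothetical, JSON is used simply because it is a machine-independent format, and there is no invalidation when records change, which a real cache would need.

```python
import json
from pathlib import Path


def load_or_build_index(cache_path, build):
    """Return the index from cache_path if present; otherwise build and save it.

    `build` is any zero-argument function returning a JSON-serializable index.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: skip the expensive build entirely.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    index = build()   # slow path, taken only on the first start
    cache_path.write_text(json.dumps(index), encoding="utf-8")
    return index
```

Because the cache is just a file, it can be zipped up and shipped alongside the data folder, which is the distribution scenario described above.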
In addition to the above, note that running the program with the 4200 records is noticeably slower in some areas (all of which need to be addressed):
- F3/Import (the EI dupe-checking routines are slow)
- Loading an RTree that includes Jane Austen
- Discarding a set of imported records from the RTree
I exported the 'jane1' EI as RDF/N-Triples (the export process was extremely slow at first and was later optimized down to about a minute and a half, but it still needs improvement) and tried to run it through various web validators. None of these tools seems able to handle a file of any size, let alone produce a visual graph, which is what I was hoping for.
By way of validation, I ran it through Raptor conversions on our server, submitted it to an RDF distiller, loaded it into an in-house triplestore (AllegroGraph), etc. There are about 72K statements, excluding the reifications and 'confusing' metadata but including RDA classes, rimmfIdentifiers, and labels for them.
One problem found during validation came from importing tag 670 from the LC authority records (which we map to 'sourceConsulted…'). These MARC fields might contain URLs, but in a few cases the way they are entered in the MARC fields confused RIMMF, so that the resulting RDF object was output as a URI instead of a string containing a URL.
Here's an example from the MARC:
670 $a http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:
and the resulting RDF object in RIMMF:
<http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson> <http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:>
This was fixed (since the validators trip on it) by inserting a label at the head of the text during mapping, so that it now looks like this:
<http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson> "Url: http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:"
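A minimal sketch of that kind of mapping rule follows. It is not RIMMF's code: the function names are hypothetical, and the rule for detecting a URL (text starting with http:// or https://) is an assumption. The point is that the value is always emitted as a quoted N-Triples literal, with the 'Url: ' label in front, rather than as an angle-bracketed IRI.

```python
def nt_literal(text):
    """Return text as an N-Triples string literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def source_consulted_object(text):
    """Map a 670 $a value to an RDF object, labelling URL-led text."""
    value = text.strip()
    if value.lower().startswith(("http://", "https://")):
        # Assumed rule: prefix a label so the value is never taken for an IRI.
        value = "Url: " + value
    return nt_literal(value)


marc_670 = (
    "http://www.english.ox.ac.uk/about-faculty/faculty-members/"
    "research-centre-college-staff/byrne-dr-sandie, May 9, 2014:"
)
print(source_consulted_object(marc_670))
```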
Data available for download
- http://rimmf.com/data/jane-all-deduped.mrc - 1188 MARC records about Jane
- http://rimmf.com/data/jane-4k.nt - the jane1 folder exported as RDF
- http://rimmf.com/data/jane-rimmf-test.zip - the 11 inputs and the 'jane1' result folder; unzip to RIMMF3\data
- http://rimmf.com/data/jane1-ei-cache.zip - a pre-built EI for the jane1 folder; unzip to RIMMF3\tables
- http://www.marcofquality.com/sft/setupRimmf3-150119b.exe - RIMMF update that includes the DB improvements mentioned above