Fifteen days was the estimate I gave for completing an analysis of roughly 450,000 files we were holding at Archives New Zealand, at approximately three seconds per file for each round of analysis:
3 x 450,000 = 1,350,000 seconds
1,350,000 seconds = 15.625 days
My bash script included three calls to Java applications per file: Apache Tika, 1.3 at the time, was called twice, with the -m and -d flags:
-m or --metadata    Output only metadata
-d or --detect      Detect document type
The third call was to Jhove 1.11 in standard mode. In addition, the script calculates a SHA1 hash per file, both for de-duplication purposes and to match Archives New Zealand's chosen fixity standard; computes a V4 UUID per file; and outputs the result of the Linux file command in two modes: standard, and with the -i flag to attempt to identify MIME type.
Each application receives a path to a single file as an argument from a directory manifest. The script outputs five CSV files that can be further analysed.
The main function used in the script is as follows:
dp_analysis () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}file-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}tika-md-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}tika-type-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}jhove-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}sha-one-analysis.log
}
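The function assumes $file has been set by the calling loop. A minimal sketch of that driving loop, reading one path per line from a manifest, might look like the following; the manifest name is an assumption for illustration, not taken from the original script:

# Hypothetical driving loop; "manifest.txt" is an assumed name
while IFS= read -r file
do
   dp_analysis
done < manifest.txt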
What I hadn't anticipated was the expense of starting the Java Virtual Machine (JVM) three times per loop, over 450,000 iterations. The performance cost was prohibitive, so I immediately set out to find a solution: either cut down the number of tools I was using, or figure out how to avoid starting the JVM each time. Fortunately a Google search led me to a solution, and a name I had heard before – Nailgun.
Nailgun has been mentioned on various forums, including comments on various OPF blogs, and it even appears in the FITS release notes. The name resonated, and it turned out to provide a simple and accessible approach to doing what we need.
One thing we haven't yet seen is a guide to using it within the digital preservation workflow. In the remainder of this blog I'll describe how to make best use of this tool and try to demonstrate its benefits.
For testing purposes we will be generating statistics on a laptop that has the following specification:
Product: Acer Aspire V5-571PG (Aspire V5-571PG_072D_2.15)
CPU: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
Width: 64 bits
Memory: 8GiB
OS: Ubuntu 13.10
Release: 13.10
Codename: saucy
Java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
JVM Startup Time
First, let's demonstrate the startup cost of the JVM. If we take two functionally equivalent programs, the first in Java and the second in C++, we can look at the time taken to run them 1000 times consecutively.
The purpose of each application is to run and then exit with a return code of zero.
Java: SysExitApp.java:
public class SysExitApp {
    public static void main(String[] args) {
        System.exit(0);
    }
}
C++: SysExitApp.cpp:
int main() {
    return 0;
}
The script to run both, and output the timing for each cycle, is as follows:
#!/bin/bash

time (for i in {1..1000}
do
   java -jar SysExitApp.jar
done)

time (for i in {1..1000}
do
   ./SysExitApp.bin
done)
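The README referenced below has the definitive build instructions; purely as a hedged sketch, the two programs might be compiled along these lines:

# Assumed build steps for illustration; see the repository README for the definitive ones
javac SysExitApp.java
jar cfe SysExitApp.jar SysExitApp SysExitApp.class   # 'e' records the Main-Class in the manifest
g++ -O2 -o SysExitApp.bin SysExitApp.cpp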
The source code can be downloaded from GitHub, and further information about how to build the C++ and Java applications is available in the README file. The output of the script is as follows:
real    1m26.898s
user    1m14.302s
sys     0m13.297s

real    0m0.915s
user    0m0.093s
sys     0m0.854s
With the C++ binary, the average time per execution is 0.915ms. For the Java application this rises to 86.898ms on average. One can reasonably put the difference down to the cost of the JVM startup.
Both C++ and Java are compiled languages. C++ compiles down to machine code: instructions that can be executed directly by the CPU (Central Processing Unit). Java compiles down to bytecode, which lends itself to portability across many devices; the JVM provides an abstraction layer that handles differences in hardware configuration before translating the bytecode into machine code.
A good proportion of the tools in the digital preservation toolkit are implemented in Java, e.g. DROID, Jhove, Tika, and FITS. As such, we currently have to take this performance hit, and optimizations must focus on handling it effectively.
Enter Nailgun
Nailgun is a client/server application that removes the overhead of starting the JVM for every call by starting it once, within the server, and running all command-line based Java applications inside that single instance. The Nailgun client forwards an application's command line (and stdin) to the server, just as one would normally invoke the application itself, e.g. running Tika with the -m flag and passing it a reference to a file. The application runs on the server, and Nailgun directs its stdout and stderr back to the client, which outputs them to the console.
With the exception of the command line being executed within a call to the Nailgun client, behaviour remains consistent with that of the standalone Java application. The Nailgun background information page provides a more detailed description of the process.
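To make the equivalence concrete, here is a sketch of the two styles of invocation for Tika; the file name is illustrative, and the Nailgun call assumes Tika's main class has been loaded into the server, as covered below:

# Standalone: a fresh JVM starts for this single call
java -jar tika-app-1.5.jar -m example.pdf

# Via Nailgun: the call runs inside the already-running server JVM
ng org.apache.tika.cli.TikaCLI -m example.pdf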
How to build Nailgun
Before running Nailgun it needs to be downloaded from GitHub and built: Apache Maven builds the server, and the GNU Make utility builds the client. The instructions in the Nailgun README describe how this is done.
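As a rough sketch (the README is authoritative, and the repository location may have changed since this was written), the build looks something like:

git clone https://github.com/martiansoftware/nailgun.git
cd nailgun
mvn clean install   # builds the server jar under nailgun-server/target/
make                # builds the ng client binary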
How to start Nailgun
Once compiled, the server needs to be started. The command line to do so looks like this:
java -cp /home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar -server com.martiansoftware.nailgun.NGServer
The classpath needs to include the path to the Nailgun server Jar file. The command to start the server can be expanded to include any further application classes you want to run, and modified in other ways besides; for further information please refer to the Nailgun Quick Start Guide.
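For example, a sketch of an expanded startup command that puts Tika on the server's classpath from the outset, using the paths that appear elsewhere in this post:

java -cp /home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar:/home/digital/dp-toolkit/tika-1.5/tika-app-1.5.jar -server com.martiansoftware.nailgun.NGServer

For simplicity, though, we start the server using the basic startup command and load tools through the client instead.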
Loading the tools (Nails) into Nailgun
As mentioned above, the tools you want to run can be loaded into Nailgun at startup. For my purposes, and to provide a useful and simple overview for all, I found it easiest to load them via the client application.
Applications loaded into Nailgun need to have a main class. You can find out whether an application has one by opening the Jar in an archive manager capable of reading Jars, such as 7-Zip. Locate the META-INF folder and, within that, the MANIFEST.MF file. This will contain a line similar to the following example from the MANIFEST.MF in tika-app-1.5.jar.
Main-Class: org.apache.tika.cli.TikaCLI
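If you would rather stay on the command line than open an archive manager, the same check can be made with unzip, assuming it is installed:

# Print the manifest from the Jar and look for a Main-Class entry
unzip -p tika-app-1.5.jar META-INF/MANIFEST.MF | grep Main-Class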
Confirmation of a main class means that we can load Tika into Nailgun with the command:
ng ng-cp /home/digital/dp-toolkit/tika-1.5/tika-app-1.5.jar
Before working with our digital preservation tools, we can try running the Java application created to baseline the JVM startup time, alongside the functionally comparable C++ application.
MANIFEST.MF within the SysExitApp.jar file reads as follows:
Manifest-Version: 1.0
Created-By: 1.7.0_21 (Oracle Corporation)
Main-Class: SysExitApp
As it has a main class we can load it into the Nailgun server with the following command:
ng ng-cp /home/digital/Desktop/dp-testing/nailgun-timing/exit-apps/SysExitApp.jar
The command ng-cp tells Nailgun to add the Jar to its classpath. We provide an absolute path to the Jar we want to execute, and can then call its main class from the Nailgun client.
Calling a Nail from the Command Line
Following that, we want to call our application from the terminal. Previously we used the command:
java -jar SysExitApp.jar
This calls Java directly and thus the JVM. We can replace this with a call to the Nailgun client and our application's main class:
ng SysExitApp
We don't expect to see any output at this point; provided no error occurs, it will simply return a new prompt in the terminal. On the server, however, we will see the following:
NGSession 1: 127.0.0.1: SysExitApp exited with status 0
And that's it. Nailgun is up and running with our application!
We can begin to see the performance improvement gained by removing the expense of the JVM startup when we execute this command using our 1000 loop script. We simply add the following lines:
time (for i in {1..1000}
do
   ng SysExitApp
done)
This generates the output:
real    0m2.457s
user    0m0.157s
sys     0m1.312s
Compare that to running the Jar directly, and the compiled binary, from before:
real    1m26.898s
user    1m14.302s
sys     0m13.297s

real    0m0.915s
user    0m0.093s
sys     0m0.854s
It is not as fast as the compiled C++ code, but at 2.457s for 1000 executions versus 1m26.898s, it represents an improvement of well over a minute compared to starting the JVM on each loop.
The Digital Preservation Toolkit Comparison Script
With Nailgun up and running, we can now baseline it using the script that runs our digital preservation analysis tools.
We define two functions: one that calls the Jars we want to run without Nailgun, and the other to call the same classes, with Nailgun:
dp_analysis_no_ng () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}${TIKAMDLOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}${TIKATYPELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}${JHOVELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
}

dp_analysis_ng () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(ng org.apache.tika.cli.TikaCLI -m "$file") >> ${LOGNAME}${TIKAMDLOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(ng org.apache.tika.cli.TikaCLI -d "$file") >> ${LOGNAME}${TIKATYPELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(ng Jhove "$file") >> ${LOGNAME}${JHOVELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
}
Before the functions are called, the script loads the applications into Nailgun with the following commands:
#Load JHOVE and TIKA into Nailgun CLASSPATH
$(ng ng-cp ${JHOVE_HOME}/bin/JhoveApp.jar)
$(ng ng-cp ${TIKA_HOME}/tika-app-1.5.jar)
The complete script can be found on GitHub with more information in the README file.
Results
For the purpose of this blog I reworked the Open Planets Foundation Test Corpus and ran the script across a branch of it. The corpus contains 324 files in numerous different formats. The script produces the following results:
real    13m48.227s
user    26m10.540s
sys     0m59.861s

real    1m32.801s
user    0m4.548s
sys     0m16.847s
Stderr is redirected to a file called errorlog.txt using ‘2>’ syntax, to capture the output of all the tools and to avoid the expense of printing to the screen. Errors shown in the log relate to the tools’ ability to parse certain files in the corpus rather than to Nailgun; they should be reproducible with the same format corpus and tool set.
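For reference, the invocation looks something like this; the script name here is illustrative:

# '2>' redirects stderr to a file; stdout still reaches the CSV logs via the script's '>>' redirections
./dp-comparison-script.sh 2> errorlog.txt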
There is a marked difference in performance when running the script with Nailgun and without. Running the tools as-is, each pass takes approximately 2.56 seconds per file on average (828.227s / 324 files).
Using Nailgun this is reduced to approximately 0.29 seconds per file on average (92.801s / 324 files).
Conclusion
The timing results collected here will vary quite widely across different systems, and even between runs on the same system, but the disparity between starting the JVM for every application call and running the applications through Nailgun should remain clearly visible.
While I hope this blog provides a useful Nailgun tutorial, my concern, now that I have worked in anger with the tools we talk about daily in the digital preservation community, is understanding what smaller institutions, with smaller IT departments and potentially fewer IT capabilities, are doing, and whether they are even able to make use of the tools out there given the overheads described.
It is possible to throw more technology and more resources at this issue, but it can't be expected that this will always be an option. The reason I sought this workaround is that I can't see that capability being developed at Archives New Zealand without significant time and investment, and such capability can't always be delivered in short order within the constraints of working in government. My analysis, on a single collection of files, needs to be complete within the next few weeks; to do that I need tools that are easily accessible and far more efficient.
It is something I'll have to think about some more.
Nailgun gives me a good short-term solution, and hopefully this blog opens it up as a solution that will prove useful to others too.
It will be interesting to learn, following this work, how others have conquered similar problems, or equally interesting, if they are yet to do so.
—
Notes:
Loops: I experimented with various loops for recursing the directories in the opf-format-corpus, expecting to find differences in performance between them. Using the Linux time command I was unable to find any material difference between the loops. The script used for testing is available on GitHub. The loop executes a function that calls two Linux commands, ‘sha1sum’ and ‘file’. A larger test corpus may help to reveal differences between the approaches. I opted to stick with iterating over a manifest, as this more closely mirrors processes within our organization. The two loop styles are sketched below.
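A hedged sketch of the two loop styles compared, with file and directory names as illustrative assumptions:

# Style 1: iterate over a manifest, one path per line (the approach used in this post)
while IFS= read -r file
do
   sha1sum -b "$file"
   file -b -n -p "$file"
done < manifest.txt

# Style 2: recurse the directory tree directly
find opf-format-corpus -type f | while IFS= read -r file
do
   sha1sum -b "$file"
   file -b -n -p "$file"
done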
Optimization: I recognize a naivety in my script; it was produced to collect quick-and-dirty results from a test set that I only have available for a short period of time. The first surprise in running the script was the expense of the JVM startup. Having found a workaround for that, I now need to look at other optimizations in order to keep approaching the analysis this way. Failing that, I need to understand from others why this approach might not be appropriate, and/or sustainable. Comments and suggestions along those lines as part of this blog are very much appreciated.
And Finally…
All that glisters is not gold: Nailgun comes with its own overhead. Running the tool on a server at work with the following specification:
Product: HP ProLiant ML310 G3
CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz
Width: 64 bits
Memory: 5GiB
OS: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid
Java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) Client VM (build 24.51-b03, mixed mode)
We find it running out of heap space around the 420,000th call to the server, with the following message:
java.lang.OutOfMemoryError: Java heap space
If we look at the system monitor we can see that the server has maxed out the amount of RAM it can address; the amount of memory it uses grows with each call to the server. I don't have a mechanism to avoid this at present, other than chunking the file set and restarting the server periodically. Users adopting Nailgun might want to take note of this issue up-front. Throwing memory at the problem will help to some extent, but a more sustainable solution is needed, and indeed welcomed. This might require optimizing Nailgun or, instead, further optimization of the digital preservation tools that we are using.
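As a stop-gap, the chunk-and-restart workaround might be scripted along these lines. This is a sketch rather than a tested solution; the chunk size, heap setting (-Xmx is a standard JVM option), and manifest name are illustrative assumptions:

# Process the manifest in chunks, restarting the Nailgun server between chunks to reclaim heap
NGSERVER_JAR=/home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar
split -l 100000 manifest.txt chunk-
for chunk in chunk-*
do
   java -Xmx4g -cp ${NGSERVER_JAR} com.martiansoftware.nailgun.NGServer &
   sleep 5   # give the server time to start listening
   ng ng-cp ${TIKA_HOME}/tika-app-1.5.jar
   ng ng-cp ${JHOVE_HOME}/bin/JhoveApp.jar
   while IFS= read -r file
   do
      dp_analysis_ng
   done < "$chunk"
   ng ng-stop   # ng-stop is a built-in nail that shuts the server down
done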
johan
March 12, 2014 @ 2:59 PM Europe/Berlin
On a related note: the other week I wanted to get a more detailed look at Tika's mimetype detection mechanisms. Not being a Java developer myself, I was curious if there was any way to access the API through Python. Then I stumbled upon this, which uses the PyJnius library:
http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html
Using that code as a starting point, I came up with this simple demo script that does mimetype detection on all files in a directory tree:
https://github.com/bitsgalore/tikadetect
Of course this doesn't apply directly to your use case, but for Python users this looks like an interesting and fairly easy way to use Java libraries directly, without running into the multiple-JVM problem. I must add I'm not sure how stable this is (I've only done a few tests so far), but this might be worth further investigation.
andy jackson
February 25, 2014 @ 12:14 PM Europe/Berlin
It's worth noting that 'Java first' is also the approach FITS takes, and in general FITS seems to be very close to what Ross is trying to do.
An alternative option would be to go for 'Java first' but use a JVM scripting language as the glue, e.g. Jython or JRuby. This would allow the aggregator code to be easily modifiable but would mean only a single JVM start-up per invocation.
Recursing the file-system from one JVM process (or perhaps allowing a stream of filenames on STDIN if you don't trust the recursion) will give you a massive speed-up and allow the JIT optimiser to get to work.
willp-bl
February 25, 2014 @ 10:55 AM Europe/Berlin
Hi Ross,
Very interesting post.
An alternative method for speeding up execution would be to write a simple Java program that traversed a directory tree and just called the Java libraries directly.
* The magic used by `file` can be used in this Java library: https://github.com/openplanets/libmagic-jna-wrapper
* Tika & Jhove can be used as Java libraries (Jhove with some work, an example is here: https://github.com/bl-dpt/drmlint/blob/master/src/main/java/uk/bl/dpt/qa/drmlint/wrappers/Jhove1Wrapper.java)
* Checksums can be calculated with Apache Commons-Codec
This will use just one JVM and will be much faster than the many-JVM approach. It might approach the speeds you are getting with Nailgun.
We are using the Java libraries libmagic/Tika & Droid (via Nanite-core) in a tool for characterising the content of web archives, code here: https://github.com/openplanets/nanite/blob/master/nanite-hadoop/src/main/java/uk/bl/wap/hadoop/profiler/FormatProfilerMapper.java. On our cluster this code runs libmagic/Tika/Nanite-core in 1.5ms/file on average (however, there are 28 nodes/mappers). This code currently just gets the mimetypes, but I am adding code to get characterisation information from the Tika parsers (and might add Jhove).
Will
ross-spencer
February 25, 2014 @ 3:10 AM Europe/Berlin
Hi Carl,
Thanks for the comment and the compliment on the post. Asking from a rather naive point of view, if my network is secured, what security concerns should I be most worried about when running Nailgun? What scenarios should I be looking out for?
The performance is hard to ignore, but a significant security impact would easily outweigh our desire to reap the benefits of it.
Many thanks,
Ross
carl
February 24, 2014 @ 10:47 PM Europe/Berlin
So some interesting stuff here; I've looked at Nailgun a few times. I was unaware of the memory problem, but would also mention that it's not particularly secure, as all calls share the privileges of the Nailgun server process. Since it's written in Java, the memory issue might be addressable without too much effort; creating a usable security model may be a bigger challenge. Nice post in all.