[jr3] Sequential Node Ids - 16 messages in org.apache.jackrabbit.dev

From               Sent on                 Attachments
Thomas Mueller     Dec 22, 2010 12:51 am   .png
Michael Dürig      Dec 23, 2010 6:00 am
Thomas Mueller     Dec 23, 2010 6:56 am    .png
Marcel Reutegger   Jan 11, 2011 12:13 am
Thomas Mueller     Jan 11, 2011 12:50 am
Michael Dürig      Jan 11, 2011 2:08 am
Thomas Mueller     Jan 11, 2011 2:34 am
Michael Dürig      Jan 11, 2011 2:51 am
Thomas Mueller     Jan 11, 2011 3:13 am
Jukka Zitting      Jan 11, 2011 8:51 am
Thomas Mueller     Jan 12, 2011 12:29 am
Thomas Mueller     Jan 12, 2011 12:58 am
Jukka Zitting      Jan 12, 2011 2:03 am
Thomas Mueller     Jan 12, 2011 2:40 am
Jukka Zitting      Jan 12, 2011 2:59 am
Thomas Mueller     Jan 12, 2011 3:36 am
Subject: [jr3] Sequential Node Ids
From: Thomas Mueller (muel@adobe.com)
Date: Dec 22, 2010 12:51:25 am
List: org.apache.jackrabbit.dev
Attachments:

Currently, Jackrabbit uses random node ids (randomly generated UUIDs). For
Jackrabbit 3, and possibly Jackrabbit 2.x, I propose to use sequential node ids.

The main advantages of random node ids (the current default) are that they are
easy to implement and that merging data from multiple repositories is simple.

The disadvantages of random node ids are performance and space. By performance I
don't mean that randomly generated UUIDs are slower to generate (although they
are); I mean the performance of reading and writing data on disk. The main
problem is reading and writing (indexing) random keys, which is slow because the
data of adjacent (nearby) nodes is not adjacent in the index. This slows down
index access (read and write) a lot, especially once the data no longer fits in
the buffer cache (file system cache).
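
To illustrate the idea (this is only a sketch, not the actual patch; the class
and field names are made up): a sequential id can keep a fixed random prefix in
the high bits and an incrementing counter in the low bits, so ids created one
after the other also sort next to each other and therefore land on the same
b-tree pages:

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: the high 64 bits are a random, per-process prefix;
// the low 64 bits are an incrementing counter, so consecutive ids
// sort consecutively in an index.
public class SequentialIdFactory {
    private static final long HIGH_BITS =
            UUID.randomUUID().getMostSignificantBits();
    private static final AtomicLong COUNTER = new AtomicLong();

    public static UUID nextId() {
        return new UUID(HIGH_BITS, COUNTER.incrementAndGet());
    }
}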

I wrote a simple test case for Jackrabbit trunk. I also patched Jackrabbit to
allow sequential UUIDs. Creating 1 million nodes is about 80% faster when using
sequential UUIDs instead of random UUIDs. I don't have the data yet, but I'm
sure the difference is much, much bigger when reading from larger repositories
once the data doesn't fit in RAM (10 million nodes or so). I will expand the
test case to support reading.

The following diagram shows the operations per second (bigger is better) after
inserting x nodes (0 to 1 million). With sequential UUIDs, performance improves
at first and then stays constant (which is expected). With random UUIDs, which
are currently the default, performance first goes up and then degrades. The
resulting database (I used the H2 database) was about 145 MB with sequential
UUIDs and 169 MB with random UUIDs (the reason for the difference is that the
b-tree fill factor is better with sequential UUIDs). All the data still fits in
the file system's buffer cache without problems.
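
In other words, the database with random UUIDs is about 17% larger
(169 / 145 ≈ 1.17), which matches the lower b-tree fill factor you get with
random inserts.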

[attached .png chart: operations per second vs. number of nodes inserted, sequential and random UUIDs]

Something else I noticed while running the test case is that the Jackrabbit
internals are relatively slow. When profiling the test case, I saw that database
access is not the dominant factor. Creating the nodes through the JCR API took
285 / 526 seconds, while re-creating the database from a SQL script took 55
seconds (this includes reading the 200 MB SQL script file and parsing the SQL
statements, one insert statement per row; this is for the random UUIDs). I would
expect Jackrabbit to be faster than 55 seconds, because it doesn't have to read
the SQL script from disk and parse statements.
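
Put differently: the JCR API run is roughly 5 to 10 times slower than the raw
SQL import (285 / 55 ≈ 5.2 and 526 / 55 ≈ 9.6), even though the import has the
extra work of reading and parsing the script. Assuming the 285 / 526 seconds are
the sequential and the random run, that is about 1'000'000 / 285 ≈ 3'500 and
1'000'000 / 526 ≈ 1'900 nodes per second through the JCR API.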

The test case creates 1 million nodes, saves after every 10'000 nodes, and uses a
fan-out of 20 child nodes per node. I disabled the search index in the
repository configuration. Source code:

import java.io.File;

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.id.NodeId;
import org.h2.util.Profiler;

public class TestLoop {

    long start;
    Profiler prof;

    public static void main(String... args) throws Exception {
        new TestLoop().run();
    }

    void startProf() {
        // profiling is currently disabled (Profiler is org.h2.util.Profiler)
        // if (prof != null) {
        //     prof.stopCollecting();
        //     System.out.println(prof.getTop(3));
        // }
        // prof = new Profiler();
        // prof.interval = 1;
        // prof.depth = 32;
        // prof.startCollecting();
    }

    void run() throws Exception {
        // property added by the patch: "true" = sequential UUIDs,
        // "false" = random UUIDs (the current default)
        System.setProperty("jackrabbit.sequentialUUIDs", "false");
        startProf();
        // start from an empty repository with a fresh copy of the configuration
        FileUtils.deleteDirectory(new File("repository"));
        FileUtils.copyFile(new File("repository copy.xml"),
                new File("repository.xml"));
        TransientRepository rep = new TransientRepository();
        Session s = rep.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        int nodes = 1000000;
        int nodesPerFolder = 20;
        int saveEach = 10000;
        // print a newly generated node id
        System.out.println(new NodeId());
        // tree depth needed for 1 million nodes at a fan-out of 20:
        // ceil(log(1000000) / log(20)) = 5
        int levels = (int) Math.ceil(Math.log(nodes)
                / Math.log(nodesPerFolder));
        start = System.currentTimeMillis();
        startProf();
        createNodes(s.getRootNode(), 0, nodes,
                saveEach, levels, nodesPerFolder);
        System.out.println("took: " + (System.currentTimeMillis() - start));
        s.logout();
    }

    // recursively builds the node tree; saves and prints the throughput
    // after every 'saveEach' nodes
    private int createNodes(Node parent, int x, int nodes,
            int saveEach, int depth, int nodesPerFolder)
            throws Exception {
        for (int i = 0; x < nodes && i < nodesPerFolder; i++) {
            if (depth > 1) {
                Node n = parent.addNode("f" + x, "nt:unstructured");
                x = createNodes(n, x, nodes,
                        saveEach, depth - 1, nodesPerFolder);
            } else {
                parent.addNode("n" + i, "nt:unstructured");
                x++;
                if ((x % saveEach) == 0) {
                    parent.getSession().save();
                    long t = System.currentTimeMillis() - start;
                    startProf();
                    System.out.println((x * 1000L / t) + " op/s at " + x);
                    return x;