[darcs-users] performance; clone?

David Roundy droundy at abridgegame.org
Tue Apr 8 21:52:14 UTC 2003


On Tue, Apr 08, 2003 at 02:41:54PM -0400, Dan Egnor wrote:
> Darcs is very interesting.

Thank you!  :)

> We noticed that if you fetch the darcs repository, undo a few dozen
> patches, and then pull the repository again, it takes a very long
> time (we killed it after running for 45 minutes) -- much longer than it
> took to get the repository in the first place.  Is this known to be slow?

No, that is not known to be slow... well, it would depend on which
patches you undid.  If you undid the most recent patches, it should be
fast, though that also depends partly on how you performed the undo.
Ideally you should do a series of unrecords followed by a revert.  If you
leave out the revert, darcs will have to try to resolve a large number of
conflicts in your working directory, which could take some time.
Conflict resolution can take truly nasty amounts of time.  I'd love to make
the conflict resolution faster, but the most important thing is for it to
be correct, which is hard just by itself.  If you did do the revert, darcs
whatsnew should show no changes.
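
Concretely, the sequence I have in mind is something like this:

    darcs unrecord   (once per patch you want to undo)
    darcs revert     (throw away the now-unrecorded changes)
    darcs whatsnew   (should now report no changes)
    darcs pull       (fetch and merge only the patches you're missing)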

If you undid some of the early patches (assuming here that you did do a
revert), then when you pull, darcs will have to reorder the patches,
which means downloading pretty much all the patches again.  So this
should be about as slow as the get was in the first place, and may be
slower; it's not something I've tried.  Actually, the more I think about
it, I think you can get O(mn) time this way, where n is the number of
patches in the repo and m is the number of patches you need to pull....
But I think the prefactor on the O(mn) term should be small, and at least
the downloading should only be O(n).
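
To put rough (entirely made-up) numbers on that, since each of the m
pulled patches may have to be commuted by on the order of n existing
patches:

    total work  ~  m*n commutes (small prefactor)  +  O(n) downloads
    e.g. n = 1000 patches in the repo, m = 50 patches to pull:
         up to ~50,000 commutes, but still only on the order of
         1000 patch downloads.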

> Given the storage format, it actually surprises me that anything is
> fast; I can edit a file and run "whatsnew" and it doesn't take a long
> time.  I wonder how that works.

Ah, there is redundant information stored.  Actually, the repo is almost
precisely doubly redundant in the most recent information.  What you've
overlooked is the contents of the '_darcs/current' directory, which
contains a copy of the current (recorded) version of every file in the
repo.  (By the way, this redundant info is what 'darcs check' checks, and
it means that in many cases, if your repo somehow gets corrupted, it
should be possible to repair it.)
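
So a repo looks roughly like this (simplified; 'foo.c' is just an
example file, and a few other files live under _darcs as well):

    myrepo/
        foo.c               <- working copy, the one you edit
        _darcs/
            patches/        <- the recorded patches themselves
            current/
                foo.c       <- recorded copy that whatsnew compares against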

There are a couple of other tricks used to make whatsnew (and of course,
record) fast.  One is that I set the modification times of the files in
'current' equal to those in the working directory if the two are identical,
so I don't have to check again later unless one has changed.

The other trick (well, not really a trick), which turned out to be
crucial, is that I only traverse the 'current' directory tree, not the
working directory, when looking for files to compare.  This makes a huge
difference if you have thousands of other files and subdirectories in
your working directory, as a couple of friends of mine did.  It took me a
while to figure out that darcs was taking two minutes just to read all
the file modification times!
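
To make those two tricks concrete, here is a rough Python sketch of what
the comparison loop amounts to (darcs itself is written in Haskell, and
this sketch ignores added and removed files):

    import os

    def changed_files(repo):
        # Walk the pristine tree ('_darcs/current'), not the working
        # directory, so thousands of unrelated working-directory files
        # never even get stat'ed.
        pristine_root = os.path.join(repo, "_darcs", "current")
        changed = []
        for dirpath, _dirs, filenames in os.walk(pristine_root):
            for name in filenames:
                pristine = os.path.join(dirpath, name)
                rel = os.path.relpath(pristine, pristine_root)
                working = os.path.join(repo, rel)
                # Equal mtimes mean we already compared these files and
                # found them identical, so skip the read entirely.
                if os.path.getmtime(pristine) == os.path.getmtime(working):
                    continue
                with open(pristine, "rb") as a, open(working, "rb") as b:
                    identical = a.read() == b.read()
                if identical:
                    # Sync the pristine mtime to the working file's, so
                    # the next whatsnew can skip the comparison.
                    st = os.stat(working)
                    os.utime(pristine, (st.st_atime, st.st_mtime))
                else:
                    changed.append(rel)
        return changed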

> Do you believe that all operations can be made relatively fast with
> the addition of elbow grease (perhaps by maintaining a cached index
> file, updated or reconstructed as needed)?

I think so, although I'm a bit skeptical about the merge conflict issue.
The problem is that dealing with merge conflicts is a really nasty,
annoyingly complex problem (it took me about three months, working
probably three to six hours every morning before work), which basically
means I don't want to touch a working solution... and unfortunately I
can't even prove that my current solution is correct.  I just can't find
any counterexamples.

But fortunately, the cases in which the merge conflict is truly
horrendous should be rare (I sincerely hope) in real projects.  The
problem arises when massive numbers of patches to be merged all modify
overlapping lines of the same file.  I hope this can be handled in a
reasonable time, but I haven't done any real timing, largely because I
almost burnt myself out on the merge issue.

For the rest, though (things like pulls, etc.), I'm confident that darcs
can achieve reasonable timings.  Optimal speed in every case (e.g. when
you're missing the very first patch, which no other patches depend upon,
and perform a pull) would require a smart server, but I think acceptable
speeds can be achieved with a dumb server.  The worst-case pull should be
only as slow as a get (at least as far as download time goes... there'll
be some additional commutation time).

> I was also wondering how difficult it would be to support a "clone"
> mode of operation, where you could "get" or "pull" from a local
> repository (rather than a remote URL) and instead of copying the
> patch files it would create a (symbolic? hard?) link.  That way you
> could create local branches relatively quickly on a shared system,
> and the filesystem buffer cache would be shared between the original
> and the new copy.  (Patch files are never modified after being recorded,
> right?)

That's something I've thought about quite a bit, and am torn on.  If you
guarantee that patch files don't get modified (which currently is the
normal case, and quite achievable), then a clone operation would be great.
The only catch that comes to mind is that the file modification time trick
for speeding up the whatsnew operation wouldn't work, since the two working
directory versions of the same file would have different modification
times.
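
As a rough Python sketch of what a clone would amount to (the
'_darcs/patches' path is real; the function itself is purely
illustrative, and copying the working files and '_darcs/current' is
left out):

    import os

    def clone_patches(src, dst):
        # Instead of copying the patch files, hard-link them, so both
        # repos share a single copy on disk (and in the buffer cache).
        # This relies on patch files never being modified once recorded.
        src_patches = os.path.join(src, "_darcs", "patches")
        dst_patches = os.path.join(dst, "_darcs", "patches")
        os.makedirs(dst_patches)
        for name in os.listdir(src_patches):
            os.link(os.path.join(src_patches, name),
                    os.path.join(dst_patches, name))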

The problem is that keeping the patch files fixed means keeping the patch
order fixed, and that prevents you from performing some optimizations that
can really make a big difference in the speed of pulls.  When performing
a pull, ideally you'd like the set of patches your local repo has in
common with the remote repo to sit at the beginning of your patch list.
If this is the case, the pull simply consists of downloading the patches
you're missing, merging them with the patches the remote is missing, and
saving the resulting files.

If you have patches that the remote is missing, but they are located
towards the beginning of your repo, you have to commute these patches by
all the patches after them, which could mean reading in your entire repo.
It's even worse if the remote has patches that you are missing located at
the beginning of its repo, since you'll have to download all the
subsequent patches in order to commute the missing patches by them.  So
the point is that, at least in some cases, you'd like to be able to
reorder your repo so that it matches a given remote repo.
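
As a made-up example: suppose both repos share patches A, B, C and D.
If your order is [A B C D E] and the remote's is [A B C D F], pulling F
is cheap: the common patches all come first, so F just gets downloaded
and merged with E.  But if your order were [E A B C D], you'd first have
to commute E by A, B, C and D, which means reading them all; and if the
remote's order were [F A B C D], you'd have to download A through D just
to commute F by them.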

I guess if hard links are used for the clone mode, and you wanted to
reorder, all you'd have to do is rename the patch files rather than
overwriting them (a good idea anyways), and you wouldn't mess up the other
repo...  Hmmmm.  I think this will work (with hard links)!  :)  And in
fact, I think it may even be safe enough that I could make 'get' itself
create a clone if it notices that it is getting from a repository that is
on the same filesystem.
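
To sketch the rename-rather-than-overwrite point (illustrative Python;
the function and its arguments are hypothetical):

    import os

    def replace_patch(repo, old_name, new_name, new_contents):
        # When reordering changes a patch's representation, write the
        # commuted version under a fresh name and drop our own directory
        # entry for the old one.  A repo that hard-links the old file
        # still sees the original, untouched contents.
        patches = os.path.join(repo, "_darcs", "patches")
        with open(os.path.join(patches, new_name), "wb") as f:
            f.write(new_contents)
        os.unlink(os.path.join(patches, old_name))  # removes only our link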

Soft links would require a promise from the repo being linked to that it
won't reorder (or delete patches that might be unrecorded).  "Remote soft
links" (i.e. links to a URL) would be particularly nifty, as they could
be used with a not-yet-implemented feature allowing you to download a
tarball of a tagged version, so that a 'get' wouldn't need to download
and store the entire history of the repository.  So I may want to add a
'static' flag, or something like it, indicating that the repo promises
not to delete any patches (currently it doesn't delete unrecorded patches
anyways) or reorder them.

I guess that's about all.  Thanks for the suggestions, and if you can
reproduce (and give me enough information for me to be able to reproduce)
anything that's weirdly slow, I'd love to hear about it.
-- 
David Roundy
http://civet.berkeley.edu/droundy/



