[darcs-users] so long and thanks for all the darcs

Stephen J. Turnbull stephen at xemacs.org
Thu Mar 29 08:08:46 UTC 2018


Ben Franksen writes:

 > > The refs are supposed to all be copied to refs/remotes/origin,

 > Hm, that may clarify a few things for me. So a "ref" is a file which
 > contains a hash that references an object.

That's how it's made persistent.  However, there are older methods
(symlinks, for example) that you might find in very old repos, and
abstractly a ref is any pointer into the DAG.  There is also a ref
"algebra" for computing relative refs (eg, HEAD^ and HEAD~2) which is
very frequently used in commands.
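
A minimal sketch of both points, assuming git's default "files" ref storage and a freshly written (not yet packed) ref. The branch name "demo" and the throwaway repository are made up for illustration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)                        # throwaway repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/demo    # hypothetical branch name

for n in 1 2 3; do
    echo "$n" > file
    git add file
    git commit -q -m "commit $n"
done

# The persistent form of a branch ref: a file holding a commit hash.
ref_file=$(cat .git/refs/heads/demo)
head=$(git rev-parse HEAD)

# Relative refs: HEAD^ is the first parent, HEAD~2 the grandparent.
parent=$(git rev-parse HEAD^)
grandparent=$(git rev-parse HEAD~2)
also_grandparent=$(git rev-parse HEAD^^)

echo "ref file: $ref_file"
echo "HEAD:     $head"
echo "HEAD^:    $parent"
echo "HEAD~2:   $grandparent"
```

The file content and HEAD resolve to the same hash, and HEAD~2 is just shorthand for HEAD^^.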

 > The content of a ref is globally valid and thus can be (and is)
 > copied between repos, but the name of that file (including the
 > directory in which it is located) is a purely local property.
 > Correct?

Yes.

 > If yes, then I begin to understand why as a Darcs user I found it so
 > difficult to become familiar with git. Because this concept of a "ref"
 > has no (user visible) counterpart in Darcs. It doesn't exist because it
 > is not needed (for the user). We /could/ add something like it so we can
 > refer to patches symbolically, but AFAIK nobody has ever found it useful
 > enough to request it as a feature.

That sounds almost right to me.  The exception is a tag, which is
present in Darcs and induces a version via its dependencies, whereas
in a DAG-based VCS it is a ref, and points to a version in the history
graph.

 > Whereas in git the concept is essential because many of the high-level
 > features that make git usable as a tool for day-to-day work are built on
 > it.

In fact, any DAG-based VCS requires refs.  Mercurial and Bazaar add a
sequence number ref type (and in Bazaar it's actually a structured
sequence number which identifies the branch and version relative to
the branch point, recursively \o/).

 > The core of git is sound, simple, and elegant; but the high-level
 > features that build on it have been developed in an ad-hoc manner
 > without an over-arching and similarly elegant abstraction to guide the
 > design, so it remains necessary to understand the mechanism behind them
 > in order to use (and appreciate) them appropriately. I think /this/ is
 > the deeper reason behind git's "bad UI" reputation.

From my point of view, all you've said is "people don't grok DAGs". :-)
For some reason nobody gets this, but git is just a Scheme for
managing version control.  (Mercurial is a Common Lisp, and Bazaar is
a Common Lisp with everything stuffed into tiny packages recursively).

 > > but only the currently checked out one at remote is linked to a local
 > > branch, and checked out locally. Configuration ("core" options in 
 > > .git/config) comes from your local template, I believe.

 > Okay. I would expect that all local branches are initially linked to
 > their remote counterpart.

This would be really ba-a-ad if you were working on the kernel where
everybody is cloning everybody else's repos all the time.  The git
developers all are kernel developers. ;-)

I would think "link 'em all" is a better default for most projects,
except that in git branch refs are really lightweight, so developers
are likely to have a bunch of obsolete or experimental branches lying
around that you don't want.  So except for really small projects with
very few developers, I think on balance I don't want link 'em all.
YMMV, of course, even with that caveat.

 > > "Relative to a repo URL" *is* a namespace.

 > Exactly. No need for any naming convention, since a perfectly natural
 > namespace already exists. Except that git allows to arbitrarily rename
 > remote branches, circumventing "qualification" with the remote URL, so
 > they look like local ones. This should not be allowed (IMHO).

This is how Subversion works (and CVS before it and Bazaar
"lightweight checkouts" after it).  With that restriction, distributed
development is painful.  Avoiding that restriction is why Arch,
BitKeeper, git, Mercurial, Monotone, Bazaar, ... were developed.
Darcs, too. :-)

Most people prefer working within that restriction to dealing with
concurrency, I admit, but highly concurrent development is really
painful with it.  These systems all force you to rebase all your work
on top of the official version before you can commit even once,
because when concurrent development is taking place, there are
multiple branches with the same name in this model: one local, one
official, and possibly others local to other developers (or even
yourself!).  Determining whether you are synched to official requires a
remote query with an irremediable race condition, and it's impossible
to know if you're synched with third party branches with that name.

Darcs avoids all this by modeling a branch as a history-less set of
patches.  Of course the semantics of text require certain implicit
dependencies (you can't delete a line that doesn't exist in the text).
Also of course you want semantic dependencies (don't add a patch
calling foo() in module bar if you don't have the patch that adds
module foo, for example).  History-based VCSes satisfy the text
requirement automatically, and mostly human programs do satisfy the
semantic requirement too, but of course they also drag in a pile of
spurious dependencies.  Darcs avoids the spurious dependencies at the
cost of requiring explicit specification of semantic dependencies --
but again the natural human tendency to do first things first means
that most of the time you don't need to specify them: you won't try to
commute the call to foo() backward past the add of module foo.

So git's practice of creating tracking branches with a specific naming
convention is recognition of two facts about practical distributed
development:

(1) branches that are intended to be a single line of development do
    diverge and must be merged

(2) the same branch has multiple URIs (in git there are git:, ssh:,
    http:, and https: URIs at least, and they frequently have
    different paths), which is why URI naming isn't good enough.

You don't have to like it, but there are strong reasons for doing it
this way if you want your development organization to scale to many
developers working independently on anything they want to.

 > >> No, this is not what I find natural. What I find natural is that in
 > >> my clone the beasts have the same name as in the remote repo from
 > >> which I cloned, at least by default.
 > > 
 > > I don't understand.
 >
 > You seem to associate branches with an owner ("it's not my branch"). An
 > interesting aspect I haven't considered yet. So there is a person at the
 > remote repo who controls (or likes to control) the history of this
 > branch.

My point is subtly different.  It's that people think of a branch as a
single entity (what I call "reification"), and if there's only one
name, only one version can be "the" branch.  Somebody does "own" that
(remote) name, and if you think of this as a "remote" branch,
implicitly that somebody isn't you.

 > He or she is upset if I push a commit to this branch?

I wouldn't talk about "upset".  I think about it like this: There are
about 3 billion Internet users.  Should they all be allowed to push?
If not, who grants permission?  In practice, somebody (or some
organization) does.  If there are multiple people with push
permission, your *VCS* will need a conceptual way of referring to
content that is intended to end up in the "official" branch that
diverges from other content also destined for that branch (or already
incorporated in that branch).

 > I should rather have created my own branch and committed there, so
 > the remote owner of the branch can integrate my changes with a
 > merge?

I'm not saying "you should", I'm saying "you do".  In a DVCS, by
committing locally you *do* create your own branch.  Its content is
*not* identical to the remote branch.  This is just as true in Darcs:
your repo contains a patch not in the upstream repo.  You don't know
that your patch is *the* extension of the branch because of the race
condition.  You may need to rebase or merge (in Darcs, amend the
patch) before the push.  In both systems, evidently we intend a merge
and push, but at the moment of the commit, the fact is that the repos
(including those of third parties we may or may not know about) *are*
divergent.
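
The divergence can be made visible with a small sketch. The repository names, the branch name "demo" and the commit() helper are hypothetical, and a local directory stands in for the upstream repo.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
commit() {   # commit <repo> <message>: hypothetical helper
    echo "$2" >> "$1/log.txt"
    git -C "$1" add log.txt
    git -C "$1" commit -q -m "$2"
}

git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/demo
commit "$work/upstream" "base"
git clone -q "$work/upstream" "$work/mine"

commit "$work/mine" "my local work"             # committed, not pushed
commit "$work/upstream" "someone else's work"   # lands upstream meanwhile

# Count commits only in origin/demo (left) and only in demo (right).
before=$(git -C "$work/mine" rev-list --left-right --count origin/demo...demo)
git -C "$work/mine" fetch -q origin
after=$(git -C "$work/mine" rev-list --left-right --count origin/demo...demo)

echo "before fetch (behind ahead): $before"
echo "after fetch  (behind ahead): $after"
```

Before the fetch the clone only knows that it is ahead; that it is also behind shows up only once the tracking ref has been refreshed.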

 > > You don't see how anyone would commit to a branch they didn't
 > > intend to, or you don't see how unintentional commits are a
 > > problem?

 > I don't see how a commit can be problematic even if it was made
 > unintentional. You commit explicitly by issuing a command,
 > presumable after making some changes. This creates a new version
 > (commit object).  If this was indeed unintentional, what's the
 > problem?
 >
 > However, I see that if you accidentally push these changes, this can be
 > problematic

This is what I meant, but did not write.  :-(

 > (if you do not "own" the branch).

Ownership is not important.  It's quite possible to push stuff that is
not ready for prime time or "sensitive" to your own public branch.
This is part of your "three repo" scenario below, in fact.

 > Because apparently in git if you push, then the remote branch ref
 > is updated to where it points to locally. Right?

Yes: "push" implies "to a specific branch".  I don't understand the
"apparently"; what else would you expect as a Darcs user?  When you
push, the remote repo gets updated so that someone who clones or pulls
right away gets the same repo you have, no?  Isn't this communication
of new content to a specific line of development in a controlled
fashion the whole point of push?

 > (Sigh. Push and pull in Darcs have so much simpler semantics...)

Only because you don't have multiple branches in one repo, so URL of
repo == name of branch == only ref that ever matters to you, and it's
mostly trivial to keep track of "here vs there".

 > >>> Second, whatever the name, you don't want to commit to those
 > >>> branches,
 > >> 
 > >> Why not?
 > > 
 > > Because that ref is the local copy of the remote branch's state, 
 > > needed for a rollback if there's a problem.
 >
 > You've lost me there, partly. What is a rollback? In Darcs, rollback
 > means "apply selected parts of selected changes to the working tree in
 > reverse" but apparently in git it means something different.

I mean it in the database sense: rollback the pending transaction
("pending" meaning "not yet pushed").  In Darcs, this would be
"obliterate", I believe.

 > Neither makes the rest of the statement any sense to me: what you
 > committed and how to get back to where you started could be
 > calculated by comparing the local with the remote DAG, right? So
 > what's the problem?

You don't know when the ref has moved in the remote DAG (git doesn't
record timestamps for push, and both author and committer commit
timestamps can be forged at commit time, which is different from push
time), so that's not useful.

You do know it has moved in the local DAG, and you know that it was an
ancestor commit, but *git* doesn't know by how many commits.  You can
perform such a rollback by hand, but git cannot implement a rollback
command.  If you cannot commit to the tracking branch (as in current
git) you can implement a rollback command.  Git doesn't call it
"rollback", it's spelled "git reset origin/master".

 > I never claimed you couldn't make things even more complex...

ROTFL!

Concurrency itself is complex.  My claim is that git's basic set of
operations:

- init
- clone        # exceptional: takes URL argument
- fetch
- push
- commit
- branch
- checkout
- reset
- merge

with no options (except for reset --hard) and only ref arguments
(except for clone) is about as simple as you can get for maintaining a
history DAG under the requirements that development is concurrent, it
is not coordinated, and you want full control of your branch refs (in
modern git these are all local).  Of course you also need things like
diff, log, and tag, and pull is convenient and intuitive.  But the 9
operations above are really all you need for most development, same as
Darcs or Mercurial.

The problem is that git, because it's so simple and unopinionated,
allows you to play all kinds of tricks with the DAG that in other
systems would require applying and unapplying patches, and might
impose ugliness on history.  

Folks who favor Bazaar and Mercurial often argue to the contrary that,
as a matter of pragmatic software engineering practice, frequent use
of rebase results in a mainline that is only tested intermittently, or
perhaps not at all, despite 100% testing of pre-rebased commits.  I
understand the point but I think this is mostly an organizational
workflow problem, and to some extent an issue of testing resources.
Note: this issue was first recognized by Linus himself when he flamed
David Miller for excessive rebasing.

 > This really starts to remind me of all those programming language
 > debates.

Sure.  To my mind, as long as we always remember that everybody's
tastes in tools are valid, and seek to adopt good features from
systems that we don't use or even dislike, these discussions are
valuable.  Sometimes we even change our minds about what systems seem
good to us, and that may be most valuable of all!

 > > git's approach provides less automation than Bazaar, but it's almost 
 > > entirely transparent (you don't need to refer to the tracking
 > > branch, except to link it to a local branch, or peripherally when you
 > > need to refer to a remote, for example when you get push rights and
 > > need to add an ssh URL).

 > I dislike this terminology of "tracking branch". It suggests some
 > sort of magical (behind the scenes) coupling of local and remote
 > branches,

In modern git, it's automatic.  All fetches add new commits to the
DAG, then update the tracking branch ref.  This ref is semantically a
tag (checking out and committing to a tracking branch ref creates a
detached HEAD, leaving the tracking ref unchanged).  A successful push
sends commits to the remote, attempts to update the remote branch ref,
and if it succeeds updates the local tracking branch ref.  All of this
is transparent to the git user who normally never needs to know about
the tracking branch (except to configure the tracking-local link.)  As
I understand it, this is what you would like "tracking" to mean.
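
A sketch of that behaviour, under the assumption of a reasonably recent git. The repositories ("upstream.git", "mine", "theirs") and the branch "demo" are invented for the example.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/demo
echo base > "$work/seed/file"
git -C "$work/seed" add file
git -C "$work/seed" commit -q -m base
git clone -q --bare "$work/seed" "$work/upstream.git"
git clone -q "$work/upstream.git" "$work/mine"
git clone -q "$work/upstream.git" "$work/theirs"

# Another developer pushes to the shared branch.
echo theirs >> "$work/theirs/file"
git -C "$work/theirs" commit -q -a -m "their work"
git -C "$work/theirs" push -q origin demo

cd "$work/mine"
before=$(git rev-parse origin/demo)
git fetch -q origin
after_fetch=$(git rev-parse origin/demo)   # the tracking ref moved
local_head=$(git rev-parse demo)           # the local branch did not

# Checking out the tracking ref detaches HEAD; a commit made there
# leaves the tracking ref where it was.
git checkout -q origin/demo
echo scratch >> file
git commit -q -a -m "commit on detached HEAD"
if git symbolic-ref -q HEAD >/dev/null; then state=attached; else state=detached; fi
after_commit=$(git rev-parse origin/demo)

# A successful push updates the local tracking ref as well.
git checkout -q demo
git merge -q --ff-only origin/demo
echo mine >> file
git commit -q -a -m "my work"
git push -q origin demo
after_push=$(git rev-parse origin/demo)
echo "HEAD was $state; origin/demo now at $after_push"
```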

A pull first executes a fetch, then checks if the linked local branch
head is an ancestor of the tracking branch head.  If so, it does a
fast forward.  If not, it attempts a merge.  If the merge fails, you
can trivially roll back the breakage with

    git checkout <branch>
    git reset --hard origin/<branch>

(I'm pretty sure there's a single command for this but I didn't bother
to find out.  Also, the <branch> names can differ between local and
tracking but that's unusual.)
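
Spelled out step by step, the logic looks roughly like the sketch below. This is not how pull is implemented internally, only an approximation of its decisions; pull_by_hand(), change() and the branch name "demo" are hypothetical. Note that the reset also discards the unpushed local commit.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
change() {   # change <repo> <text>: hypothetical helper, appends and commits
    echo "$2" >> "$1/file"
    git -C "$1" add file
    git -C "$1" commit -q -m "$2"
}
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/demo
change "$work/upstream" "base"
git clone -q "$work/upstream" "$work/mine"
cd "$work/mine"

pull_by_hand() {
    git fetch -q origin
    if git merge-base --is-ancestor demo origin/demo; then
        git merge -q --ff-only origin/demo && echo fast-forward
    elif git merge -q --no-edit origin/demo >/dev/null 2>&1; then
        echo merged
    else
        git reset -q --hard origin/demo   # the rollback described above
        echo rolled-back
    fi
}

change "$work/upstream" "upstream 1"
first=$(pull_by_hand)            # local head is an ancestor

change "$work/upstream" "upstream 2"
change "$work/mine" "mine 2"     # conflicts with "upstream 2"
second=$(pull_by_hand)

echo "first: $first, second: $second"
```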

 > whereas you explained to me that there is nothing like that going
 > on. In my (Darcs influenced) way of thinking there is merely a
 > default target/source for the push, pull, and send commands (which
 > applies if none is given explicitly).

Sure, and once again this is not going to work as written when you
have multiple branches in one repository, which are going to have
different target/sources.

 > Besides, the grammar contradicts the intended meaning (AFAIU): the
 > local branch is tracking the remote one, so would rather be the
 > "tracking branch", whereas the corresponding remote branch should,
 > if anything, be called the "tracked branch".

That's exactly right.  Remember there are *three* branches here: the
remote branch (which is not called "tracked" because the tracking is
automatic and implicit), the tracking branch ref (which gets out of
synch when the remote branch gets updated by another developer, and
so needs to be updated by fetch), and the local "working" branch.

 > I'd rather use a completely abstract term (like "monad") than one
 > that is catchy but has misleading connotations.]

The categorists I know think that Haskell's use of "monad" is an
abomination, because Haskell doesn't enforce the monad laws; the
ensuing bugs when a programmer fails to enforce them force the
programmer to go back and DTRT. ;-)

 > A second requirement for me would be to fully internalize the
 > namespacing so that remote branches can _only_ be referred to as
 > remote-repo<separator>branch. But this is not how things work in
 > git as I understand (now).

You're right, it doesn't work that way, and because of the multiple
URLs referring to a single repo issue, it never can.  You need a
convention so that git can always do the right thing once the
configuration is what you want.

 > I mix them up in my head because they look the same. And I also
 > detest that I have to register remote repos locally in order to
 > refer to them in commands, giving them some arbitrary local name,
 > when they already have a perfectly good universally valid name (the
 > URL).

s/the/a typically non-unique/ ;-)

But as far as I know, with the exception of diff, all commands where
you want to refer to a remote allow you to use any of the URLs that
refer to it.

 > In Darcs I push and pull between different repos quite often and I
 > would find it extremely annoying if I had to set up remote repo
 > tracking each time. I also rely on command line completion for
 > that. (But I have to admit that this is in part due to different
 > clones representing what in git you would use local branches for.)

Yup.  Use branches: annoyance evaporates! ;-)

 > > Specifically, in my own use I clone, set up fetch all, and I'm done.

 > How does one "set up fetch all"?

The simplest way I know is

    cd git-repo
    # Link every remote branch except HEAD and master to a local branch
    # of the same name.
    for ref in $(git branch -r | grep -Ev 'HEAD|master'); do
        git branch --track "$(basename "$ref")" "$ref"
    done

(The "basename" is a cute way of parsing "remote/name".)

 > I can't remember it being mentioned in the introductions to git.

It's not.  git users usually have a bunch of obsolete or experimental
branches lying about, that you would not be interested in tracking.
So I do it by hand because "all" really means "all interesting", and
there's usually only one or two of those.

 > I suppose it means that a fetch not only fetches objects referenced
 > by the corresponding remote branch but all objects?

fetch's basic syntax is "git fetch [options] [repository [ref ..]]".

There's a default repository (usually "origin").  git fetches all
objects referred to by all configured branches for that repository
(usually all of them), unless refs are specified, then it limits to
those remote branches.
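
A sketch of the difference, assuming a git recent enough (1.8.4 or later, as far as I know) that a ref-limited fetch also updates the matching tracking ref. The branches "demo" and "topic" are invented for the example.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/demo
echo base > file
git add file
git commit -q -m base
git branch topic
git clone -q "$work/upstream" "$work/mine"

# Upstream moves both branches.
echo more >> file
git commit -q -a -m "demo moves"
git checkout -q topic
echo topic >> file
git commit -q -a -m "topic moves"
git checkout -q demo

cd "$work/mine"
demo_before=$(git rev-parse origin/demo)
topic_before=$(git rev-parse origin/topic)

git fetch -q origin topic           # limited to one remote branch
demo_mid=$(git rev-parse origin/demo)
topic_mid=$(git rev-parse origin/topic)

git fetch -q origin                 # all configured branches
demo_after=$(git rev-parse origin/demo)
echo "origin/topic: $topic_before -> $topic_mid"
echo "origin/demo:  $demo_before -> $demo_after"
```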

 > > I almost never need to refer to a remote or a tracking branch.

 > Suppose you have a local clone of the remote that you share with
 > colleagues and where you may have changes, only some of which you want
 > to share with upstream. (Other changes may be site specific adaptions or
 > configuration). You clone from that and work on it. There are already
 > three repos involved. I don't find this unusual.

Well, for me there would be two repositories (local and remote) and
four branches: the remote master, an explicit mirror of master, my
feature branch for publication, and my local branch for local-only
configuration.

The workflow would be something like

(0) clone remote
(1) update to appropriate base version (which might be something like
    the most recent release branch if it's not 'master')
(2) make the 'feature' branch at master/HEAD
(3) make the 'config' branch at master/HEAD
(4) checkout 'config' and do the local-only changes, and commit
(5) checkout 'feature' and do the work for publication, and commit
(6) rebase 'config' onto 'feature', checkout 'config', and test
(7) if bad: goto (5)
(8) checkout 'master'
(9) pull from upstream
(10) rebase 'feature' on 'master'
(11) push
(12) if done: stop, else: goto (5)

One could also do this with three branches and rebase --interactive,
but that would require remembering which patches are purely local when
updating 'config'.  The way I do it means that (assuming no
inadvertent commits, which are typically easy to identify even if they
happen) everything is completely stylized as above, git does all the
VCS-mongering for me.  Usually the name of the feature branch is
relevant to the work for publication, but what I called "config" to
follow your example in practice is almost always called "local",
except when I am testing with multiple non-default configurations, in
which case they get (hopefully) mnemonic names.
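
One pass through the numbered steps might look like the sketch below. The file names, the bare "upstream.git" stand-in and the grep "tests" are hypothetical, and pushing 'feature' to the upstream master in step (11) is an assumption about what "push" means here.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q "$work/seed"                     # stand-in for the remote
git -C "$work/seed" symbolic-ref HEAD refs/heads/master
echo core > "$work/seed/app.txt"
echo default > "$work/seed/settings.txt"
git -C "$work/seed" add .
git -C "$work/seed" commit -q -m initial
git clone -q --bare "$work/seed" "$work/upstream.git"

git clone -q "$work/upstream.git" "$work/mine"   # (0)
cd "$work/mine"                                  # (1) master is the base here
git branch feature master                        # (2)
git branch config master                         # (3)

git checkout -q config                           # (4) local-only change
echo site-specific > settings.txt
git commit -q -a -m "local configuration"

git checkout -q feature                          # (5) work for publication
echo "new feature" >> app.txt
git commit -q -a -m "add feature"

git rebase -q feature config                     # (6) leaves 'config' checked out
grep -q "new feature" app.txt                    #     stand-ins for a real test run
grep -q site-specific settings.txt

git checkout -q master                           # (8)
git pull -q --ff-only origin master              # (9)
git rebase -q master feature                     # (10)
git push -q origin feature:master                # (11) assumed push target
```

Because 'config' sits on top of 'feature', the local-only commit never reaches upstream, and repeating step (6) after more feature work keeps it there.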

 > > Once or twice a week I use a refs/remotes/origin ref in a diff. Once
 > > a quarter or less I need to look up the incantations for exposing a 
 > > non-default remote branch in my local namespace.

 > Referring to any arbitrary remote should be just
 > 
 >   <remote URL> <separator> <branch>
 > 
 > without having to set up anything. That's my HO, at least.

Well, "origin" *is* a URI, relative to the local repository, if
you're in one.  As for set up, origin is just an alias.  If you want
to fetch or pull with a full URL, you can do that.  Nobody I know
does, though.

Diffing you cannot do that way, because all diffing is done locally.
This is true in all VCSes: you have to copy (download) the content to
do the diff.  The difference with Bazaar and Mercurial, and I guess
Darcs, is that git makes this explicit, and requires a local ref for
diff.
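
A sketch of fetching by location rather than by remote name and then diffing against what arrived. A local path stands in for a full URL, and the branch "demo" is invented; FETCH_HEAD is the ref git writes after any fetch.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/demo
echo base > "$work/upstream/file"
git -C "$work/upstream" add file
git -C "$work/upstream" commit -q -m base
git clone -q "$work/upstream" "$work/mine"

echo newer >> "$work/upstream/file"
git -C "$work/upstream" commit -q -a -m "upstream change"

cd "$work/mine"
url="$work/upstream"          # stand-in for a git:, ssh: or https: URL

# Fetch by URL: the content is downloaded and recorded in FETCH_HEAD.
git fetch -q "$url" demo
fetched=$(git rev-parse FETCH_HEAD)

# The diff itself is purely local and needs a local ref to name the content.
changed=$(git diff --name-only HEAD FETCH_HEAD)
echo "files that differ from $url: $changed"
```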

 > I haven't thought about it in depth, yet. The problem is that
 > subrepos (as I would rather name them)

FWIW, that's what Mercurial calls them IIRC.

 > It may be possible to make sense of it in Darcs by adding another
 > kind of primitive patch for adding and removing subrepos, similar
 > to (but distinguished from) adding/removing directories.

But this is quite analogous to what git does.  If a command doesn't
require history metadata for its argument, then you can always use a
tree or a commit that refers to that tree indifferently.  If a
commit's *content* object is a commit, then git recognizes it as a
subrepo, and stops (for most history-using commands), recurses (for
the submodule command), or dereferences (for content-using commands).

It just happened that using a commit instead of designing a new object
suggests pretty much exactly the semantics of "recursive DAG", which
is an immediately plausible way to think about subrepos in git.
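
What this looks like in the object database can be sketched with plumbing commands. For illustration only, the example records a pointer to a commit of the same repository under the made-up path "sub"; a real submodule would point at a commit of another repository.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
echo content > file
git add file
git commit -q -m "ordinary commit"

sub_commit=$(git rev-parse HEAD)    # any commit ID will do for the sketch

# Mode 160000 marks a tree entry whose object is a commit.
git update-index --add --cacheinfo 160000,"$sub_commit",sub
git commit -q -m "record a subrepo pointer"

file_entry=$(git ls-tree HEAD file)
sub_entry=$(git ls-tree HEAD sub)
echo "$file_entry"
echo "$sub_entry"
```

The listing shows "blob" for the ordinary file and "commit" for the subrepo entry.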

 > The ideal candidate for such an integration would be the
 > experimental variant of primitive patch types where we assign UUIDs
 > to all tree objects (files, dirs) as soon as they are created
 > (recorded) for the first time, an idea I previously mentioned in
 > passing. This gives them an identity independent from their name
 > (path).

This is the role played by "blob" objects (for files) and "tree"
objects (for directories) in git.  The UUID is just the SHA1 of the
object's content.  You may prefer a true UUID such as <SHA of
content>-<committer email>-<timestamp>, but this would involve
additional logic and communication to determine whether two objects
with the same SHAs but different disambiguators reference the same
object or not (you need to do a diff), and some algorithm for choosing
one to give precedence to improve the chance that a "canonical" ID
will propagate.  Linus decided to bet that SHA1s won't collide in his
lifetime.
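
A small sketch of that identity, using made-up file names: the ID depends on the content alone, not on the path.

```shell
set -e
dir=$(mktemp -d)
cd "$dir"

printf 'hello\n' > a.txt
printf 'hello\n' > renamed-copy.txt    # same content, different name
printf 'hello!\n' > other.txt          # different content

# hash-object computes the blob ID without storing anything.
id_a=$(git hash-object a.txt)
id_copy=$(git hash-object renamed-copy.txt)
id_other=$(git hash-object other.txt)

echo "a.txt:            $id_a"
echo "renamed-copy.txt: $id_copy"
echo "other.txt:        $id_other"
```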

 > This doesn't sound too bad, as a rough idea, but there are many, many
 > details left to fill in.

As far as I can see, it's so analogous to the way git does things that
a straightforward implementation will be, well, straightforward.  For
serious use, you'll probably want to optimize, and that's always tricky.

 > Anyway, they are considering to revert that decision because making
 > changes that overlap multiple submodules is now a lot more painful.

Sometimes making widespread changes painful to do is a good thing! ;-)

 > > git has one, per directory in the working tree, *per commit*.
 > 
 > My first thought was that must cost a huge amount of disk space but of
 > course that's not true since all identical objects are shared in the
 > database, right?

Exactly.  Some people refer to git as a "filesystem" with SHA1s as
inodes (and of course that's why the consistency check command is
called "fsck").
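
The sharing can be checked directly in a throwaway repository. The paths are hypothetical; the second commit touches only code.txt.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir docs
echo stable > docs/readme.txt
echo v1 > code.txt
git add .
git commit -q -m "first"
echo v2 > code.txt
git commit -q -a -m "second"

# Unchanged file and unchanged directory: same objects in both commits.
blob1=$(git rev-parse HEAD~1:docs/readme.txt)
blob2=$(git rev-parse HEAD:docs/readme.txt)
tree1=$(git rev-parse HEAD~1:docs)
tree2=$(git rev-parse HEAD:docs)

# The root tree did change, because code.txt changed.
root1=$(git rev-parse 'HEAD~1^{tree}')
root2=$(git rev-parse 'HEAD^{tree}')

git fsck                               # the consistency check
echo "docs tree in both commits: $tree1 $tree2"
```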

 > Our future (hopefully) UUID based patch theory could have an advantage
 > here.

I don't see why patch theory itself would be UUID-based?  

 > I guess we could drive the "deep integration" to the extreme and
 > even allow to "rename" tree objects from a repo to a subrepo (or
 > the in other direction, or between different subrepos; and what
 > about nested subrepos?).

This is all pretty trivial, it turns out.  git already has very
similar operations.  But you're right about "'could' is not an
imperative, think about 'should' first!"

 > > In Darcs the annoyance would still be present, but the tedium could
 > > be handled with sed: just a matter of moving patches from one to the 
 > > other and then fixing up the paths naming the files involved (the 
 > > roots change). I imagine fixing up the inventories there would 
 > > already be a command ("repair", maybe)?
 > 
 > I am more inclined to consider a "deep integration" of subrepos such as
 > I sketched above, where we track such movements as kind of a primitive
 > patch and thus don't have to "rewrite history", just commute patches as
 > usual.

I think you're right.  When I wrote the above, I was not really
thinking about how Darcs handles moving files.

 > > Described that way, sounds like it would make me nervous, too. :-)
 > > On the other hand, in practice I generally have refs for things I
 > > refer to,

The "nervous people" I'm talking about are frequently not considered
exactly human and rarely are developers, and are worrying about content
that they make the developers find and identify refs for: lawyers. :-)

Regards,
Steve


