[darcs-users] [patch294] Use cache while getting packed repository

Alexey Levan exlevan at gmail.com
Wed Jul 14 11:04:05 UTC 2010


Hi, sorry for the delay in replying.

>
>> hunk ./src/Darcs/Commands/Optimize.lhs 148
>>  optimizeCmd origopts _ = do
>>      when (UpgradeFormat `elem` origopts) optimizeUpgradeFormat
>>      withRepoLock opts $- \repository -> do
>> -    when (OptimizeHTTP `elem` origopts) doOptimizeHTTP
>> +    when (OptimizeHTTP `elem` origopts) $ doOptimizeHTTP repository
>>      if (OptimizePristine `elem` opts)
>>         then doOptimizePristine repository
>>         else do cleanRepository repository
> Just an extra parameter.
>
>> hunk ./src/Darcs/Commands/Optimize.lhs 372
>>        gzs <- filter ((== ".gz") . takeExtension) `fmap` getDirectoryContents "."
>>        mapM_ removeFile gzs
>>
>> -doOptimizeHTTP :: IO ()
>> -doOptimizeHTTP = do
>> +doOptimizeHTTP :: RepoPatch p => Repository p C(r u t) -> IO ()
>> +doOptimizeHTTP repo = do
>>    rf <- either fail return =<< identifyRepoFormat "."
>>    unless (formatHas HashedInventory rf) . fail $
>>      "Unsupported repository format:\n" ++
>> hunk ./src/Darcs/Commands/Optimize.lhs 379
>>      "  only hashed repositories can be optimized for HTTP"
>>    createDirectoryIfMissing False packsDir
>> -  ps <- dirContents' "patches" $ \x -> all (x /=) ["unrevert", "pending",
>> -    "pending.tentative"]
>> +  ps <- mapRL hashedPatchFileName . newset2RL <$> readRepo repo
> We use inventories to get a list of patches instead of listing directory
> contents -- this should be indeed more robust. OK. We could also do
> something like this to get a list of the actually relevant inventories:
> these tend to accumulate a fair amount of garbage, actually, so it may
> be even more useful in that case. Check out readInventories and
> readInventoryPrivate in Darcs.Repository.HashedRepo.  (The latter reads
> a single inventory file, and gives you a list of PatchInfos and a
> Maybe String (= hash) of the next inventory in the list. Nothing here
> means there are no more inventories. I should probably turn that into
> a haddock. Hm.)

OK, sounds good.
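
Just to check I've got the shape of it right, something like this is
what I'd write. This is a sketch only: the signature of
readInventoryPrivate below is my guess from your description, I haven't
looked at Darcs.Repository.HashedRepo yet.

  -- Sketch: collect the hashes of the inventories actually reachable
  -- from hashed_inventory, instead of listing the whole directory.
  -- Assumed (unchecked) signature, per your description:
  --   readInventoryPrivate :: FilePath -> IO ([PatchInfo], Maybe String)
  relevantInventories :: IO [String]
  relevantInventories = follow (darcsdir </> "hashed_inventory")
    where
      follow path = do
        (_infos, next) <- readInventoryPrivate path
        case next of
          Nothing   -> return []           -- no more inventories
          Just hash ->
            (hash :) <$> follow (darcsdir </> "inventories" </> hash)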

>>    BL.writeFile (patchesTar <.> "part") . compress . write =<<
>>      mapM fileEntry' ps
>>    renameFile (patchesTar <.> "part") patchesTar
>> hunk ./src/Darcs/Commands/Optimize.lhs 383
>> -  let i = darcsdir </> "hashed_inventory"
>> -  is <- dirContents "inventories"
>> -  pr <- dirContents "pristine.hashed"
>> -  BL.writeFile (basicTar <.> "part") . compress . write =<<
>> -    mapM fileEntry' (i : (is ++ pr))
>> +  is <- sortByMTime =<< dirContents "inventories"
>> +  writeFile (darcsdir </> "tmp-inventories") . unlines $ map takeFileName is
>> +  pr <- sortByMTime =<< dirContents "pristine.hashed"
>> +  writeFile (darcsdir </> "tmp-pristine") . unlines $ map takeFileName pr
>> +  BL.writeFile (basicTar <.> "part") . compress . write =<< mapM fileEntry' (
>> +    [ darcsdir </> "tmp-inventories"
>> +    , darcsdir </> "tmp-pristine"
>> +    , darcsdir </> "hashed_inventory"
>> +    ] ++ reverse pr ++ reverse is)
> This sorts the basic.tar.gz in a newest-first order (pristines first,
> then inventories). It did make me wonder though, whether we actually
> want the inventories in the basic.tar.gz... They are not fetched by
> ordinary lazy get, are they?

Having the inventories in basic.tar.gz doesn't mean they are going to
be fetched :) Stopping the download before the inventories (in the case
of get --lazy) is an optimisation waiting to be implemented.
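
What I mean is roughly the following sketch. It assumes the entry paths
inside basic.tar.gz look like "_darcs/inventories/<hash>" and that all
pristine entries come before the inventory ones, which is how the pack
is written now.

  -- Sketch: for get --lazy, cut the tar stream short once the inventory
  -- entries start, so only hashed_inventory and the pristine get
  -- unpacked.
  takeUntilInventories :: Tar.Entries -> Tar.Entries
  takeUntilInventories (Tar.Next e rest)
    | takeDirectory (Tar.entryPath e) == darcsdir </> "inventories" = Tar.Done
    | otherwise = Tar.Next e (takeUntilInventories rest)
  takeUntilInventories es = es

  -- and then, for the lazy case, something like:
  --   writeCompressed . takeUntilInventories . Tar.read $ decompress b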

> It seems that tmp-pristine and tmp-inventories come with a list of files
> that are packed in the tarball. I am not sure this is necessary, since
> the tar itself already is a list with filenames in it?

AFAICT the filenames in a tar are interleaved with the file contents,
and the list of files is needed to start parallel fetching from the
cache early.

>>    renameFile (basicTar <.> "part") basicTar
>> hunk ./src/Darcs/Commands/Optimize.lhs 393
>> +  removeFile $ darcsdir </> "tmp-inventories"
>> +  removeFile $ darcsdir </> "tmp-pristine"
> Get rid of the temporaries (could these be avoided by constructing these
> two lists in memory and feeding them to Tar?).

Well, yes, but that would mean allocating a lot of memory for large
repositories. That looks like a small tradeoff in favour of
scalability to me.
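
For reference, the in-memory variant I take you to mean would look
roughly like the sketch below; listEntry is my name, fileEntry' and
sortByMTime are the patch's helpers, and the tradeoff is that each list
is held in memory while the tarball is written.

  -- Sketch: build the two file lists as in-memory tar entries instead
  -- of writing _darcs/tmp-* to disk first.  fileEntry, toTarPath and
  -- the Entry type are from Codec.Archive.Tar.Entry; BL is assumed to
  -- be Data.ByteString.Lazy.Char8.
  listEntry :: FilePath -> [FilePath] -> IO Entry
  listEntry name files =
    case toTarPath False (darcsdir </> name) of
      Left err -> fail err
      Right p  -> return . fileEntry p . BL.pack . unlines $
                    map takeFileName files

  -- which would let doOptimizeHTTP do something like:
  --   lists <- sequence [ listEntry "tmp-inventories" is
  --                     , listEntry "tmp-pristine" pr ]
  --   rest  <- mapM fileEntry'
  --     (darcsdir </> "hashed_inventory" : reverse pr ++ reverse is)
  --   BL.writeFile (basicTar <.> "part") . compress . write $ lists ++ rest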

>> hunk ./src/Darcs/Repository.hs 287
>>    createDirectoryIfMissing False $ toDir </> darcsdir </> "pristine.hashed"
>>    createDirectoryIfMissing False $ toDir </> darcsdir </> "patches"
>>    copySources toRepo fromDir
>> +  Repo _ _ _ (DarcsRepository _ toCache3) <-
>> +    identifyRepositoryFor fromRepo "."
> I am not sure why is this here and what is the effect? What is the
> difference from toCache2?

In copySources the source file is rewritten, so I re-read the
repository to get a fresh Cache. The difference is that toCache2
doesn't work :) Of course, if there are helpers I'm missing that avoid
this, I'll rewrite it to make the code a bit cleaner.

>>    -- unpack inventory & pristine cache
>> hunk ./src/Darcs/Repository.hs 290
>> -  writeCompressed . Tar.read $ decompress b
>> +  writeBasic toCache3 . Tar.read $ decompress b
>>    createPristineDirectoryTree toRepo "."
>>    -- pull new patches
>>    us <- readRepo toRepo
>> hunk ./src/Darcs/Repository.hs 306
>>    -- get old patches
>>    unless (any (`elem` opts) [Partial, Lazy, Ephemeral]) $ do
>>      putInfo "Copying patches, to get lazy repository hit ctrl-C..."
>> +    _ <- forkIO . fetchFiles toCache3 HashedPatchesDir .
>> +      mapFL hashedPatchFileName $ newset2FL us
>>      writeCompressed . Tar.read . decompress =<< fetchFileLazyPS (fromPacksDir ++
>>        "patches.tar.gz") Uncachable
>>   where
>> hunk ./src/Darcs/Repository.hs 311
>> +  writeBasic c (Tar.Next is (Tar.Next pr (Tar.Next hi xs))) = do
>> +    case map Tar.entryContent [is, pr, hi] of
>> +      [Tar.NormalFile is' _, Tar.NormalFile pr' _, Tar.NormalFile hi' _] -> do
>> +        _ <- forkIO $ do
>> +          fetchFiles c HashedInventoriesDir . lines $ BL.unpack is'
>> +          fetchFiles c HashedPristineDir . lines $ BL.unpack pr'
>> +        BL.writeFile (darcsdir </> "hashed_inventory") hi'
>> +        writeCompressed xs
>> +      _ -> fail "Unexpected non-file tar entry"
>> +  writeBasic _ _ =  fail "Error in basic tar file"
> For all I can tell, this extracts the tmp-{pristine,inventories} lists
> from the tarball and starts to copy them from cache. We do this
> concurrently with unpacking the tarball. Seems that first wins. I am
> however wondering, if the cache we are using here only contains local
> sources? (I.e. caches and possibly repos...) In that case, it would make
> some sense, although I don't see any exception handling in fetchFiles
> (below). The overall idea is interesting, but I don't see how it saves
> any bandwidth (or time): we still need to download the complete
> basic.tar.gz, right?
>
> Oh. I see now that you stop the unpacking and the download once you have
> a *first* hit... I think that's wrong though. It is, if nothing else,
> prone to races: either of the threads could win. If the tarball loses,
> it is not used at all, which could happen even if you don't have
> anything cached. Nevertheless, it is still downloaded, at least until
> the process is terminated, slowing down the process of getting any
> pristine content and patches that may be missing in the cache. Moreover,
> it could also happen that even though you have almost everything in the
> cache, the tarball is fully downloaded.

Well, yes, I'm assuming here that the connection is fairly shared
between the threads. So from one end of the file list (tmp-*) the pack
is being downloaded, and from the other end files are fetched one by
one via the cache (if there is one); when the two meet somewhere in
between, the download is done. The good thing is that darcs doesn't
try to be too smart here: for example, the HTTP source might be on
localhost while the "local" cache is on a network share. I'm not sure
I understand the part about races; could you please explain how
exactly the threads may win or lose here?
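
To make that concrete, the interleaving I have in mind is roughly the
illustration below (not the patch code; fromTar and fromCache stand for
"unpack the next entry from the pack" and "fetch one file via the
cache"):

  -- Illustration: both workers share the set of files still to be
  -- fetched.  The cache worker walks the list from the back, the
  -- tarball worker from the front; whoever finds its next file already
  -- claimed stops, so they meet somewhere in the middle.
  downloadMeeting :: [FilePath]
                  -> (FilePath -> IO ()) -> (FilePath -> IO ()) -> IO ()
  downloadMeeting files fromTar fromCache = do
    todo <- newMVar (Set.fromList files)
    done <- newEmptyMVar
    _ <- forkIO $ worker todo (reverse files) fromCache >> putMVar done ()
    worker todo files fromTar
    takeMVar done
    where
      worker _ [] _ = return ()
      worker todo (f:fs) fetch = do
        claimed <- modifyMVar todo $ \s ->
          return (Set.delete f s, f `Set.member` s)
        when claimed $ fetch f >> worker todo fs fetch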

> So instead it would be good if we could stop downloading once we have
> all the files. Also, a solution like that wouldn't need the
> tmp-{pristine,inventories} lists nor the concurrency, since it would
> need to parse the directory entries and inventories anyway and could see
> whether anything is still missing. The tarball would come in the mtime
> order as it already does (but hashed_inventory needs to come first, in
> any case). Then, you could just maintain a list of hashes that are still
> missing as you are processing the tarball. Once that list is empty, you
> stop. The list starts with pristine root, which is stored in
> hashed_inventory, and every time an item is removed from the list,
> anything it mentions (this is where you need to parse things) and we
> don't have yet (you look in the cache for those) is added to it. There
> are utilities in Storage.Hashed.Darcs that should help with that.

A bit I'm not sure about here is the order in which the tarball and
the cache are used. Are we downloading the tarball up to the first
entry we already have in the cache? And what if there's a different
branch in the cache and files are missing from the middle of the list?
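
To check my reading of your proposal, is it roughly the sketch below?
(writeAndLink, referencedHashes and inCache are stand-ins for whatever
Storage.Hashed.Darcs and the Cache give us; `missing' starts as the
singleton set holding the pristine root read from hashed_inventory.)

  -- Sketch of the stop-when-nothing-is-missing idea.
  unpackUntilComplete :: Set.Set String -> Tar.Entries -> IO ()
  unpackUntilComplete missing entries
    | Set.null missing = return ()   -- nothing left to wait for: stop
    | otherwise = case entries of
        Tar.Done -> return ()
        Tar.Fail e -> fail e
        Tar.Next entry rest -> do
          let hash = takeFileName (Tar.entryPath entry)
          writeAndLink entry             -- unpack, hardlinking into the cache
          refs <- referencedHashes entry -- hashes this item mentions
          new  <- filterM (fmap not . inCache) refs
          unpackUntilComplete
            (Set.union (Set.fromList new) (Set.delete hash missing)) rest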

> Also, anything you unpack from the tarball should be linked into the
> cache if it's not there yet. If it is, the tarball copy should be
> probably ignored and you should hardlink from the cache instead (saves
> space).

OK.
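
(Something along these lines, I take it; a sketch that ignores the
Cache helpers darcs already has and just uses createLink from
System.Posix.Files plus doesFileExist from System.Directory:)

  -- Sketch: prefer a hardlink from the cache; otherwise write the copy
  -- from the tarball and link it back into the cache.
  linkOrWrite :: FilePath -> FilePath -> BL.ByteString -> IO ()
  linkOrWrite cachedCopy repoCopy content = do
    inCache <- doesFileExist cachedCopy
    if inCache
      then createLink cachedCopy repoCopy    -- reuse the cached file
      else do BL.writeFile repoCopy content  -- unpack from the tarball...
              createLink repoCopy cachedCopy -- ...and share it with the cache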

> PS: I haven't had the chance to look at the performance of the patch
> yet. I won't apply it at least until I hear from you regarding the above
> comments -- I will try to benchmark it and see if it works at least in
> the usual cases. Something will nevertheless have to be done about the
> download of the tarball in cases where it is never used.

Unfortunately, I didn't test it much either. I've tried getting a
packed darcs repository from localhost; the global cache was used, and
the threads were fairly interleaved.

