Note: although I hopefully fixed all the conflicts,
some tests are quite broken.
Conflicts:
borg/_chunker.c
borg/archive.py
borg/archiver.py
borg/cache.py
borg/helpers.py
borg/testsuite/archiver.py
refactorings:
- introduced concept of a default answer:
if the answer string is in the defaultish sequence, the return value of yes() will be the default,
e.g. when just pressing <enter> at the console prompt, or when the overriding environment
variable contains an empty string or "default".
if an environment var has an invalid value and no retries are enabled: return the default.
if retries are enabled, the next retry won't use the env var again, but rather ask via input()
(see the sketch after this list).
- simplify:
only one default - this should be a SAFE default, as it is used in some special conditions
like EOF or invalid input with retries disallowed.
no isatty() magic: the "yes" shell command exists, so we could receive input even if it does not come from a tty.
- clean:
separate the retry flag from the retry_msg
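A minimal sketch of these yes() semantics (parameter names and defaults here are assumptions for illustration, not the exact borg API):

    import os

    def yes(msg, default=False, retry=True, retry_msg=None, env_var_override=None,
            truish=('Y', 'YES'), falsish=('N', 'NO'), defaultish=('', 'DEFAULT')):
        while True:
            if env_var_override is not None and env_var_override in os.environ:
                answer = os.environ[env_var_override]
                env_var_override = None  # a retry must not consult the env var again
            else:
                try:
                    answer = input(msg + ' ')
                except EOFError:
                    return default  # EOF: fall back to the (SAFE) default
            answer = answer.strip().upper()
            if answer in defaultish:  # e.g. just pressing <enter>, '' or 'default'
                return default
            if answer in truish:
                return True
            if answer in falsish:
                return False
            if not retry:  # invalid input and retries disallowed: SAFE default
                return default
            if retry_msg is not None:  # retry flag and retry_msg are separate now
                msg = retry_msg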
also: make the check in Lock.close more precise: check for "is not None".
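For illustration, roughly like this (a sketch, not the exact code):

    class Lock:
        def __init__(self, lock):
            self._lock = lock  # legitimately becomes None after close()

        def close(self):
            # check precisely for "is not None": relying on truthiness could
            # skip releasing a lock object that happens to evaluate falsish
            if self._lock is not None:
                self._lock.release()
                self._lock = None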
note: a lot of blocks were just re-indented to be under the "with" statement;
in one case a block had to be moved into a function.
the same happened to the loop contents of hashindex_merge, but we need that callable from Cython/Python code anyway.
this saves some cycles, esp. if the key is already present in the index.
due to borg's architecture, breaking the repo lock first requires creating a repository object.
this would usually try to get a lock and then block if there already is one.
thus I added a flag to open the repository without trying to create a lock.
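A sketch of the calling side; the lock keyword and the break_lock() method name are assumptions about the interface:

    from borg.repository import Repository  # assumed import path

    def break_repo_lock(args):
        # open without trying to create a lock (the new flag), otherwise
        # we would block on the very lock we want to break
        repository = Repository(args.location.path, lock=False)
        repository.break_lock()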
this was making us require mock, which is really a test component and
shouldn't be part of the runtime dependencies. furthermore, it was
making the imports and the code more brittle: it may have been
possible that, through an environment variable, backups could be
corrupted because mock libraries would be configured instead of real
ones, which is a risk we shouldn't be taking.
finally, this was used only to build docs, which we will build and
commit to git by hand with a fully working borg when relevant.
see #384.
this is so that e.g. cron jobs do not hang indefinitely if yes() is called,
but just default to "no" if no tty is connected.
if you need to enforce a "yes" answer (which is not recommended for
security-critical questions), you can use the environment:
BORG_CHECK_I_KNOW_WHAT_I_AM_DOING=Y
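For example, a security-critical question could be wired up like this, using the yes() sketch from above (the prompt text is made up):

    if not yes('check --repair is an experimental feature. Continue? [yN]',
               default=False,
               env_var_override='BORG_CHECK_I_KNOW_WHAT_I_AM_DOING'):
        raise SystemExit(2)

With no tty connected, input() raises EOFError immediately and the call falls back to the safe default ("no") instead of hanging; setting the env var to Y forces "yes".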
subclasses of "Error": do not show traceback
(this is used when a failure is expected and has rather trivial reasons and usually
does not need debugging)
subclasses of "ErrorWithTraceback": show a traceback
(this is for severe and rather unexpected stuff, like consistency / corruption issues
or stuff that might need debugging)
I reviewed all the Error subclasses as to whether they fit into the one or the other class.
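Sketched, this amounts to something like the following (the attribute name is an assumption):

    class Error(Exception):
        """failures that are expected and have rather trivial reasons -
        no traceback is shown for these."""
        traceback = False

    class ErrorWithTraceback(Error):
        """severe and rather unexpected stuff, like consistency / corruption
        issues or things that might need debugging - show a traceback."""
        traceback = True

    # a top-level handler can then decide based on the class attribute:
    # full traceback if exc.traceback, otherwise just str(exc)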
Also: fixed docstring typo, docstring formatting
without this, there would be a solid 20 seconds here without any sort
of output on the console, regardless of the verbosity level. this
produces nice incremental messages telling the user that borg is not
stalled (or waiting for a lock, for that matter).
the "processing files" message is a little clunky, as we somewhat
abuse the cache to figure out if we are just starting... but it helps
if there are problems reading the actual files: it tells us that
initialization is basically complete and we're going ahead with
reading all the files.
this greatly simplifies the display of those objects, as the
__format__() parameter allows for arbitrary display of the internal
fields of both objects
this will allow us to display those summaries without having to pass a
label to the string representation. we can also print the objects
directly without formatting at all.
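A sketch of the idea (class and field names are made up here):

    class Summary:
        def __init__(self, osize=0, csize=0, nfiles=0):
            self.osize, self.csize, self.nfiles = osize, csize, nfiles

        def __format__(self, format_spec):
            # interpret the format spec as a str.format() template over the
            # internal fields, so callers choose the display freely
            return format_spec.format(**self.__dict__)

        def __str__(self):
            # sane default, so the object can be printed directly
            return format(self, '{nfiles} files, {osize} bytes')

    stats = Summary(osize=1234, csize=567, nfiles=3)
    print(format(stats, 'original {osize} B, compressed {csize} B'))
    print(stats)  # no explicit formatting, no label needed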
- issue #234: handle the exception raised when the config file is empty, i.e. really is not a borg cache config
- there was an unused %s in the exception string
- the error msg was wrong when the version check failed - this IS a borg cache, but not of the expected version
the heuristics I used are the following:
1. if we are prompting the user, use print on stderr (input() may
produce some stuff on stdout, but that's outside the scope of this
patch). we do not want those prompts to end up on the standard
output in case we are piping stuff around.
2. if the command is primarily producing output for the user on the
console (`list`, `info`, `help`), we simply print on the default
file descriptor.
3. everywhere else, we use the logging module with varying levels of
verbosity, as appropriate.
the logging level varies: most is logging.info(); in some places
logging.warning() or logging.error() are used when the condition is
clearly an error or a warning. in other cases, we keep using print, but
force writing to sys.stderr, unless we interact with the user.
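In code, the three cases look roughly like this (function names are made up for illustration):

    import logging
    import sys

    logger = logging.getLogger(__name__)

    def prompt(msg):
        # case 1: prompts go to stderr, so piped stdout stays clean
        print(msg, file=sys.stderr, end=' ', flush=True)
        return input()

    def show(line):
        # case 2: primary output of list/info/help stays on stdout
        print(line)

    def report(msg, severity=logging.INFO):
        # case 3: everything else goes through the logging module
        logger.log(severity, msg)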
there were 77 calls to print before this commit, now there are 7, most
of which are in the archiver module, which interacts directly with the
user. in one case there, we still use print() only because logging is
not set up properly yet during argument parsing.
it could be argued that commands like info or list should use print
directly, but we have converted them anyway, without ill effects on
the unit tests.
unit tests still use print() in some places
this switches all informational output to stderr, which should help
with, if not fix, jborg/attic#312 directly.
Note: there is a failing archiver test on py33-only now.
It is somehow related to __del__ method usage in Cache
and/or locking code. Could not find out the exact reason
why it behaves like that.
added a check that compares the size of the new chunk with the stored size of the
already existing chunk in storage that has the same id_hash value.
raise an exception if there is a size mismatch (sketched after the list below).
this could happen if:
- the stored size is somehow incorrect (corruption or software bug)
- we found a hash collision for the id_hash (for sha256, this is very unlikely)
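Roughly, the check looks like this (a method sketch; that the chunks index maps id -> (refcount, size, csize) is an assumption about the layout):

    def seen_chunk(self, id, size=None):
        refcount, stored_size, _ = self.chunks.get(id, (0, None, None))
        if size is not None and stored_size is not None and size != stored_size:
            # either the stored size is wrong (corruption / software bug)
            # or we found an id_hash collision (very unlikely for sha256)
            raise Exception('chunk has same id [%r], but different size '
                            '(stored: %d, new: %d)!' % (id, stored_size, size))
        return refcount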
the compression was quite cpu intensive and didn't work that great anyway.
now the disk space usage is a bit higher, but it is much faster and easier on the cpu.
disk space needs grow linearly with the number and size of the archives; this
is a problem esp. if one has many and/or big archives (but this problem also existed
before, because compression was not as effective as I believed).
the tar archive always needed a complete rebuild (and thus: decompression
and recompression) because deleting outdated archive indexes was not
possible in the tar file.
now we just have a directory chunks.archive.d and keep the archive index files
there for all archives we already know.
if an archive does not exist any more in the repo, we just delete its index file.
if an archive is still unknown, we fetch the infos and build a new index file.
when merging, we avoid growing the hash table from zero and just start
with the first archive's index as the basis for merging.
also removed the comment about how well xz compresses - while that was true for smaller
index files, it seems to be less effective with bigger ones. maybe just an issue with the compression dict size.
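A sketch of the sync logic; the helper fetch_and_build_index() is hypothetical, and ChunkIndex with read()/merge() is an assumption about the index API:

    import os

    from borg.hashindex import ChunkIndex  # assumed index class

    def sync_archive_indexes(repo_archive_ids, archive_dir, fetch_and_build_index):
        """keep one chunk index file per known archive in chunks.archive.d"""
        cached = set(os.listdir(archive_dir))
        current = set(repo_archive_ids)
        # archive is gone from the repo: just delete its index file
        for stale_id in cached - current:
            os.unlink(os.path.join(archive_dir, stale_id))
        # archive is still unknown: fetch the infos, build a new index file
        for new_id in current - cached:
            fetch_and_build_index(new_id, os.path.join(archive_dir, new_id))
        # merge: use the first archive's index as the basis, so the hash
        # table does not have to grow from zero
        paths = [os.path.join(archive_dir, id_) for id_ in sorted(current)]
        if not paths:
            return ChunkIndex()
        chunk_idx = ChunkIndex.read(paths[0])
        for path in paths[1:]:
            chunk_idx.merge(ChunkIndex.read(path))
        return chunk_idx

No tar rebuild, no recompression: deleting or adding one archive only touches that archive's index file.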
This fix is maybe not perfect yet, but it is better than nothing.
A comment by Ernest0x (see https://github.com/jborg/attic/issues/232 ):
@ThomasWaldmann your patch did the job.
attic check --repair did the repairing and attic delete deleted the archive.
Thanks.
That said, however, I am not sure if the best place to put the check is where
you put it in the patch. For example, the check operation uses a custom msgpack
unpacker class named "RobustUnpacker", which does try to check for correct
format (see the comment: "Abort early if the data does not look like a
serialized dict"), but it seems it does not catch my case. The relevant code
in 'cache.py', on the other hand, uses msgpack's Unpacker class.