-*- Text -*-

Content
=======

 * Context
 * Requirements
 * Nice-to-have's
 * Non-goals
 * Open items / discussion points
 * Problems in wc-1.0
 * Possible solutions
 * Prerequisites for a good wc implementation
 * Modularization
 * Implementation proposals for
   - metadata storage/access abstraction
   - BASE tree storage/access abstraction
   - WORKING tree storage/access abstraction
   - TARGET & MERGE-END tree storage/access abstraction
   - transactional manipulation API proposal
   - delta-application algorithm (in light of metadata, tree and
     textual conflicts)


Context
=======

The working copy library has traditionally been a complex piece of
machinery and libsvn_wc-1.0 (wc-1.0 hereafter) was more a result of
evolution than it was a result of design.  This is less anybody's
fault than a result of the developers' unfamiliarity at the time with
the problems inherent in versioning trees instead of files (as was the
usual context within CVS).  As a result, the WC has been one of the
most fragile areas of the Subversion versioning model.

The wc is where a large number of issues come together that can be
treated as separate issues in the remainder of the system, or that
don't affect the rest of the system at all.  The following things come
to mind:

 * Different behaviours required by different use-cases (users)
   For example: some users want mtimes at checkout time to be the
   checkout time, some want it to be the historical value at check-in
   time (and others want different variants).
 * Different filesystems behave differently, yet Subversion is a
   cross-platform tool and tries to behave the same on all filesystems
   (timestamp resolution may be an example of this).

When considering the wc-1.0 design, one finds that there are a lot of
situations where the exact state of the versioned tree isn't defined.

When explicitly considering which trees relate to the working copy at
one time or another, the following trees can be found:

 * BASE: The tree as it was in unmodified form
 * WORKING: The tree as it is in modified form, based on the
   administrative information recorded by the transforming
   'svn ..' commands
   Note: This tree will -as far as text bases go- generally overlap
   with BASE, but isn't required to; e.g. "add-with-history"
 * ACTUAL: The tree as it is in modified form on the local disk.
   This tree may differ from WORKING when having been modified with
   non-Subversion transforming commands (such as plain 'rm').

In the context of the 'svn update' command:

 * BASE-TARGET: The tree to which BASE is being updated and for which
   the changes w.r.t. BASE are integrated into WORKING and ACTUAL
 * WORKING-TARGET, ACTUAL-TARGET: Trees in which the above mentioned
   changes have been integrated, but which haven't "gone live" yet;
   these trees generally represent "in transition" or "intermediary"
   state with the intent to become the final tree.

Additionally, two more trees may be related to the working copy when
considering the 'svn merge' command:

 * START: The tree used as the base state for the 'merge' command
 * END: The tree used as the ending state for the 'merge' command

The difference between these trees will be merged into the WORKING
and ACTUAL trees.  In the following example 10 == START and 15 == END:

   $ svn merge -r10:15 http://svn.example.com/svn/ .

Please note that the WORKING-TARGET and ACTUAL-TARGET trees also apply
to 'svn merge' as they can result in 'add with history' schedules,
which will place text bases in the WORKING-TARGET tree.

Also note that -since merge is by definition an 'edit' operation- the
BASE and BASE-TARGET trees are not concerned with a merge.

###EHU: To which trees do BASE and TARGET refer when we're in a subdir
   of a replaced tree?  And which trees do they refer to in a subdir
   of a replaced tree which itself is replaced?
   (Preliminary answer: the base in a replaced subdir should probably
   be the base as defined by the parent which got copied in, not the
   base as was deleted, because otherwise it won't be possible to
   delete files from the replaced subdir: there would be no way to
   express a deletion against the new dir.)


Requirements
============

 * Developer sanity
   From this requirement, a number of additional ones follow:
   - Very explicit tree state management; clear difference between
     each of the 5 states we may be looking at
   - It must be "fun" to code wc-ng enhancements
 * Speed
   (Note: a trade-off may be required for 'checkout' vs 'status' speed)
 * Cross-node-type working copy changes
 * Flexibility
   The model should make it easy to support
   - central vs local metadata storage
   - Last modified timestamp behaviours
   - .svn-less working copy subtrees
   - different file-changed detection schemes
     (e.g. full tree scan as in wc-1.0 as well as 'p4 edit')
 * Graceful (defined) fallback for non-supported operations
   When a checkout tries to create a symlink on an OS which supports
   them, on a filesystem which doesn't, we should cope without
   canceling the complete checkout.  Same for marking metadata
   read-only.
 * Gracefully handle symlinks in relation to any special-handling of
   files (don't special-handle symlinks!)
 * Clear/repairable tree state
   Unlike our current loggy system, I mean here: "there is a command
   by which the user can restart the command he/she last issued and
   Subversion will help complete that command".  This differs from our
   loggy system in that the latter only returns the working copy to a
   defined (but to the user unknown) state.
 * Transactional/repairable tree state (with which I mean something
   which achieves the same as our loggy system, but better).
 * Case-sensitive filesystem aware / resilient
 * Working copy stability; a number of scenarios with switch and
   update obstructions used to leave the working copy unrecoverable
 * Client side 'true renames' support where one side can't be
   committed without the other (relates to issue #876)
 * Change detection should become entirely internal to libsvn_wc
   (referring to the fact that libsvn_client currently calls
   svn_wait_for_timestamps()), even though under
   'use-commit-times=yes', this waiting is completely useless.
 * Last-modified recording as a preparation for solving issue #1256
   and as defined in this mail, also linked from the issue:
     http://svn.haxx.se/dev/archive-2006-10/0193.shtml
 * Representing "this node is part of a replaced-with-history tree and
   I'm *not* in the replacement tree" as well as "... and I'm deleted
   from the replacement tree" [issues #1962 and #2690]


Would-be-very-nice-to-have's
============================

 * Multiple users with a single working copy (aka shared working copy)
 * Ending up with an implementation which can use current WCs
   (without conversion)
 * Working copies / metadata storage without local storage of
   text-bases (other than a few cached ones)


Non-goals
=========

 * Off-line commits
 * Distributed VC


Open items / discussion points
==============================

 * Files changed during the window "sent as part of commit" to
   "post commit wc processing"; these are currently explicitly
   supported.  Do we want to keep this support (at the cost of speed)?
 * Single working copy lock.  Should we have one lock which locks the
   entire working copy, disabling any parallel actions on disjoint
   parts of the working copy?
 * Meta data physical read-only marking (as in wc-1.0).  Is it still
   required, or should it become advisory (i.e. ignore errors on
   failure)?
 * Is issue #1599 a real use-case we need to address?
   (Losing and regaining authz access with updates in between)


Problems in wc-1.0
==================

 * There's no way to clear unused parts of the entries cache
 * The code is littered with path calculations in order to access
   different parts of the working copy (incl. admin areas)
 * The code is littered with direct accesses to both wc files and
   admin area files
 * It's not always clear at which time log files are being processed
   (i.e. transactions are being committed), meaning it's not always
   clear which version of a tree one is looking at: the pre or post
   transformation versions...
 * There's no support for nested transactions (even though some
   functions want to start a new transaction, regardless of whether
   one was already started)
 * It's very hard to determine when an action needs to be written to
   a transaction or needs to be executed directly
 * All code assumes local access to admin (meta)data
 * The transaction system contains non-runnable commands
 * It's possible to generate combinations of commands, each of which
   is runnable, but the series isn't
 * Long if() blocks to sort through all possible states of WORKING,
   ACTUAL and BASE, without calling it that.
 * Large if() blocks dealing with the difference between file and
   directory nodes
 * Many special-handling if()s for svn:special files
 * Manipulation of paths, URLs and base-text paths in 1 function
 * 'Switchedness' of subdirectories has to be derived from the URLs of
   the parent and the child, but copied nodes also have
   non-parent-child source URLs...
   (confusing)
 * Duplication of data: a 'copied' boolean and a 'copy_source' URL
   field
 * Checkouts fail when checking out files of different casing to a
   case-insensitive filesystem
 * Checkouts fail when marking working copy admin data as read-only is
   a non-supported FS operation (VFAT or Samba mounts on Linux have
   this behaviour)
 * Obstructed updates leave operations half done; in case of a switch,
   it's not always possible to switch back (because the switch itself
   may have left now-unversioned items behind)
 * Directories which have their own children merged into them (which
   happens when merging a directory-add) won't correctly fold the
   children into schedule==normal, but instead leave them as
   schedule==add, resulting in a double commit (through HTTP, other RA
   layers fold the double add, but that's not the point)
   [see issue #1962]
 * Transaction files (i.e. log files) are XML files, requiring correct
   encoding of characters and other values; given the short expected
   life-time of a log file and the fact that we're almost completely
   sure the log file is going to be read by the WC library anyway (no
   interchange problems), this is a waste of processing time
 * No strict separation between public and internal APIs: many public
   APIs also used internally, growing arguments which *should* only
   matter for internal use


Possible solutions
==================

Developer sanity
----------------

Strict separation between modules should help keep code focused on
one task.  Probably some of the required user-specific behaviours can
(and should) be hidden behind vtables; for example: setting the file
stamp to the commit time, last recorded time or leaving it at the
current time should be abstracted away.

Access to 'text bases' is another one of these areas: most routines
in wc-1.0 don't actually need access to a file (a stream would be fine
as well), but since the files are there, availability is assumed.
When abstracting all access into streams, the actual administration of
the BASE tree can be abstracted away: for all we know the 'tree
storage module' may be reading the stream directly off the repository
server.
[The only module in wc-1.0 which *requires* access to the files is the
diff/merge library, because it rewinds to the start of the file during
its processing; an operation not supported by streams... and even
then, if these routines are passed file handles, they'll be quite
happy, meaning they still don't need to know where the text base /
source file is...]

In order to keep developers sane, it should be extremely clear at any
one time - when operating on a tree - which tree is being operated
upon.

One way to prevent the lengthy 'if()' blocks currently in wc-1.0,
would be to design a dispatch mechanism based on the path-state in
WORKING/BASE and the required transformation, dispatching to (small)
functions which perform solely that specific task.

#####XBC Do please note that this suggests yet another instance of
         pure polymorphism coded in C.  This runs contrary to the
         developer sanity requirement.


Speed
-----

wc-1.0 assumes the WORKING tree and the ACTUAL tree match, but then
goes out of its way to ensure they actually do when deemed important.
The result is a library which calls stat() a lot more often than need
be.

One of the possible improvements would be to make wc-ng read all of
the ACTUAL state (concentrated in one place, using apr_stat()),
keeping it around as long as required, matching it with the WORKING
state before operating on either (not only when deemed important!).

Working from the ACTUAL tree will also prove to be a step toward
clarity regarding the exact tree which is being operated upon.

[This suggestion from wc-improvements also applies to wc-ng:]
Most operations are I/O bound and have CPU to spare.  Consider the
virtue of compressed text bases in order to reduce the amount of I/O
required.

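The "read all of the ACTUAL state in one place" idea from this section
can be sketched as follows.  This is only an illustration under stated
assumptions, written in Python rather than C for brevity: the names
NodeState, scan_actual() and compare() are hypothetical and are not
part of libsvn_wc, and a size/mtime comparison stands in for whichever
change detection scheme wc-ng ends up using.

```python
import os
import stat
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical per-node record; not a libsvn_wc structure."""
    kind: str          # "file" or "dir"
    size: int = 0      # only filled in for files
    mtime_ns: int = 0  # only filled in for files


def scan_actual(root):
    """Read the on-disk (ACTUAL) tree once, one stat call per node."""
    state = {}
    stack = [root]
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                rel = os.path.relpath(entry.path, root).replace(os.sep, "/")
                info = entry.stat(follow_symlinks=False)
                if stat.S_ISDIR(info.st_mode):
                    state[rel] = NodeState("dir")
                    stack.append(entry.path)
                else:
                    state[rel] = NodeState("file", info.st_size,
                                           info.st_mtime_ns)
    return state


def compare(working, actual):
    """Classify every path in WORKING or ACTUAL without touching the disk."""
    result = {}
    for path in sorted(working.keys() | actual.keys()):
        if path not in actual:
            result[path] = "missing"
        elif path not in working:
            result[path] = "unversioned"
        elif working[path].kind != actual[path].kind:
            result[path] = "obstructed"
        elif working[path] != actual[path]:
            result[path] = "maybe-modified"
        else:
            result[path] = "unchanged"
    return result


def demo():
    """Record a tree, change it behind Subversion's back, then compare."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for name, text in (("a.txt", "one"), ("sub/b.txt", "two")):
            with open(os.path.join(root, name), "w") as handle:
                handle.write(text)
        # Assumption: the recorded WORKING state equals a scan taken now.
        working = scan_actual(root)
        with open(os.path.join(root, "a.txt"), "w") as handle:
            handle.write("one, edited")              # local edit
        os.remove(os.path.join(root, "sub/b.txt"))   # plain 'rm'
        with open(os.path.join(root, "c.txt"), "w") as handle:
            handle.write("new")                      # never added
        return compare(working, scan_actual(root))


if __name__ == "__main__":
    for path, verdict in demo().items():
        print(path, verdict)
```

The point of the sketch is that compare() works purely on two
in-memory snapshots, so the number of stat calls is bounded by the
number of nodes, regardless of how often the result is consulted.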
Another idea to reduce I/O is to eliminate atomic-rename-into-place
for the metadata part of the working copy: if a file is completely
written, store the name of the base-text/prop-text in the entries
file, which gets rewritten on most wc-transformations anyway.


Cross node type change representation
-------------------------------------

####EHU To be done


Flexibility of metadata storage
-------------------------------

There are 3 known models for storing metadata as requested by
different groups of users:

 - in-subtree metadata storage (.svn subdir model, as in wc-1.0)
 - in-'tree root' metadata storage (working copy central)
 - detached metadata storage (user-central)

A solution to implementing each of these behaviours in order to
satisfy the wide range of use-cases they solve, would be to define a
module interface and implement this interface three times (possibly
using vtables).

Note that using within-module vtables should be less problematic than
our post-1.0 experiences with public vtables (such as the ra-layer
vtable): implementation details are allowed to differ between releases
(even patch releases).


Transaction duration / memory management
----------------------------------------

The current pool-based memory management system is very good at
managing memory in a transaction-based processing model.  In the wc
library, a 'transaction' often spans more than one call into the
library.  We either need a sane way to handle this kind of situation
using pools, or we may need a different memory management strategy in
wc-ng.


Working copy stability
----------------------

In light of obstructed updates it may not always be desirable to be
able to resume the current operation (as currently is the case): in
some cases the user may want to abort the operation, in other cases
the user may want to resolve the obstruction before re-executing the
operation.

The solution to this problem could be 'atomic updates': receiving the
full working copy transformation, verifying prerequisites, creating
replacement files and directories and when all that succeeds, update
the working copy.

Full working copy unit tests:
Exactly because the working copy is such an important part of the
Subversion experience *and* because of the 'reputation' of wc-1.0, we
need a way to ensure wc-ng completely performs according to our
expectations.  *The* way to ensure we're able to test the most
contrived edge-cases is to develop a full unit testing test-suite
while developing wc-ng.  This will both be a measure to ensure working
copy stability as well as developer sanity: in the early stages of
the wc-ng development process, we'll be able to assess how well the
design holds up under more difficult 'weather'.


Transactional updates
---------------------

.. where 'update' is meant as 'user command', not 'svn update' per se.

When applied to files, this can be summarized as:

 * Receive transformations (update, delete, add) from the server,


Prerequisites for a good wc implementation
==========================================

These prerequisites are to be addressed, either as definitions in this
document, or elsewhere in the Subversion (source) tree:

 * Well defined behaviour for cross-node type updates/merges/..
   (tree conflicts in particular)
 * Well defined behaviour for special file handling
 * Well defined behaviour for operations on locally missing items
   (see issue #1082)
 * Well defined change detection scheme for each of the different
   last-modified handling strategies
 * No special handling of symlinks: they are first class versioned
   objects
 * Well defined behaviour for property changes on updates/merges/...
   (this is a problem which may resemble tree conflicts!), including
   'svn:' special properties
 * File name manipulation routines (availability)
 * File name comparison routines (!)
   (availability; which compensate for the different ways Unicode
   characters can be represented [re: NFC/NFD Unicode issue])
 * URL manipulation routines (availability)
 * URL comparison routines (availability; which compensate for
   different ways the same URL can be encoded; see issue #2490)
 * Modularization
 * Agree on a UI to pull in other parts of the same repository
   (NOT svn:externals) [relates to issue #1167]

   #####XBC I submit this is a server-side feature that the client
            (i.e. the WC library) should not know about.

 * Agree on behaviour for update on moved items (relates to
   issue #1736)
 * Case-sensitivity detection code to probe working copy filesystem


Modularization
==============

Strict separation must be applied to a number of modules which can be
recognised.  This will help prevent spaghetti code as in wc-1.0 where
one piece of code manipulates paths to a working copy file, its URL
*and* the path to the base file.

For now, these APIs can be separated:

 - the public API (presumably not to be used by any internal
   processing, but presents functionality to working copy users)

   #####XBC This is really required of all our module public APIs.

 - tree administration API (required for BASE, TARGET and WORKING)
   Administers which files are part of the tree, which ones map to
   which repositories and which textbase / propbase files belong to
   which local files.
   [should provide checkpointing functionality for use with
   transactional tree modifications API]
 - tree access API (required for BASE, WORKING, TARGET and ACTUAL)
   Gives access to the content of the nodes in a tree
   - props
   - text bases (for files)
   - child nodes (for directories)
 - transactional tree modifications API (applicable to all trees,
   ###EHU do we provide the same interface to BASE/WORKING as for
   ACTUAL?)
 - tree transformation (required for update/switch/merge updating
   BASE, WORKING and ACTUAL), meaning all of tree changes, file
   changes and metadata changes
 - Working-copy changedness detection API
 - Metadata access API (used by tree administration module(s))
 - Event hooks (in order to be able to implement different
   timestamp-setting strategies and possibly more)

These APIs will be implemented by these (currently known) modules:

 - tree administration
   * wc_adm
 - tree access
   * wc_acc
 - transactional tree modifications
   * wc_log
 - tree transformation
   * wc_trans
 - working copy changedness detection
   wc_detect vtable-based API implemented by these modules:
   * tree crawler ('inspired' by wc-1.0)
   * tree marker (inspired by 'p4 edit')
 - metadata access API
   wc_macc vtable-based API implemented by these modules:
   * tree spread ('inspired' by wc-1.0)
   * tree root (storing all metadata in the tree root (think darcs))
   * central depot (storing 'somewhere' locally, possibly $HOME)
     this central store would open up the possibility to share
     text bases/prop bases across checkouts
   * non-local (retrieving all text and prop-bases from the server,
     except for a number of cached ones)
     ###EHU: maybe this is orthogonal to the question where metadata
     is stored: in all situations, you *could* choose not to keep
     local copies
 - Event hooks for the union of all paths in (BASE, WORKING)
   wc_hook event based single-callback API for e.g. these events:
   + props updated
   + base text updated
   + wc file updated
   + update completed
   + lock acquired
   + lock released
   (+ lock can't be acquired [in order to 'unprotect' svn:needs-lock
      protected files which have been removed from the repository?])
   to be implemented by these modules:
   * use-commit-times
   * versioned-mtimes
   * versioned-execute-perm
   * versioned-other-unix-perms
   (* versioned-windows-perms?)
   * needs-lock-updater

Justification for the large number of modules, with a modest number of
different APIs is that the problem is really quite complex as shown
earlier in this document.  Over the years, a large number of use cases
have developed around Subversion where different user groups have
shown very valid use cases for conflicting behaviours.  Presumably,
most of these we want to retain.  Some of the unimplemented ones have
open issues indicating there's at least an active interest.

In order to prevent locking out some of the current use cases when
adding support for the open issues, we need a flexible modularized
model.  This model will also prevent us from duplicating lots of code
to support the different use cases.

#####XBC Such flexibility will bring the WC to the kind of purgatory
         the RA layers are in.  We promise feature and semantics
         parity between them, and the result is that even a small
         change in that layer requires knowledge of three different
         protocols and four different implementations.

Given the assumption of 'little code duplication', the choice for
having several modules which implement the same API (vtable) is
justifiable.


Implementation proposals
========================

Classification of svn_wc_entry_t fields to BASE/WORKING
-------------------------------------------------------

[Note: This section is mainly to clarify the difference between the
BASE and WORKING trees, it's not here to mean that we actually need
all these fields in wc-ng!]

Here are the mappings of all fields from svn_wc_entry_t to the BASE
and WORKING trees:

+-------------------------------+------+---------+
| svn_wc_entry_t                | BASE | WORKING |
+-------------------------------+------+---------+
| name                          |  x   |  x (1)  |
| revision                      |  x   |  x (2)  |
| url                           |  x   |  x (2)  |
| repos                         |  x   |  x (3)  |
| uuid                          |  x   |  x (3)  |
| kind                          |  x   |  x      |
| absent                        |  x   |         |
| copyfrom_url                  |      |  x      |
| copyfrom_rev                  |      |  x      |
| conflict_old                  |      |  x      |
| conflict_new                  |      |  x      |
| conflict_wrk                  |      |  x      |
| prejfile                      |      |  x      |
| text_time                     |      |  =      |
| prop_time                     |      |  =      |
| checksum                      |  x   |  x (2)  |
| cmt_rev                       |  x   |  x (2)  |
| cmt_date                      |  x   |  x (2)  |
| cmt_author                    |  x   |  x (2)  |
| lock_token                    | x(6) |         |
| lock_owner                    |  x   |         |
| lock_comment                  |  x   |         |
| lock_creation_date            |  x   |         |
| has_props                     |  x   |  x (4)  |
| has_prop_mods                 |      |  =      |
| cachable_props                | x(5) |  x (4)  |
| present_props                 |  x   |  x (4)  |
| changelist                    |      |  x      |
| working_size                  |      |  =      |
| keep_local                    |      |  =      |
| depth                         |  x   |  x      |
| schedule                      |      |         |
| copied                        |      |         |
| deleted                       |      |         |
| incomplete                    |      |         |
+-------------------------------+------+---------+

(1) if this one differs from BASE, it must point to the source of a
    rename
(2) for an add-with-history
(3) or can we assume single-repository working copies?
(4) can differ from BASE for add-with-history
(5) why is this a field at all; can't the WC code know?
(6) locks apply to in-repository paths, hence BASE

The fields marked with '=' are implementation details of internal
detection mechanisms, which means they don't belong in the public
interface.

Fields with no check are to become obsolete.  'schedule', 'copied' and
'deleted' can be deduced from the difference between the BASE and
WORKING or WORKING and ACTUAL trees.  'incomplete' should become
obsolete when the goal of 'atomic updates' can be realised, in which
case the tree can't be in an incomplete yet locked state.  This would
also invalidate issue #1879.


Other sections
--------------

remain to be done
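
As a small illustration of the claim in the field classification that
'schedule' and 'copied' can be deduced from the difference between
BASE and WORKING, here is a minimal sketch.  It is not a proposal for
the actual wc-ng data model: the Node record and the deduce() function
are hypothetical, and the sketch assumes that a replacement shows up
either as a copyfrom source or as a node kind change.  A same-kind
delete-plus-add without history would need an extra marker that is
not modelled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical node record shared by the BASE and WORKING trees."""
    kind: str                           # "file" or "dir"
    copyfrom_url: Optional[str] = None  # WORKING only (add-with-history)


def deduce(base: Optional[Node], working: Optional[Node]) -> Tuple[str, bool]:
    """Return (schedule, copied) for one path.

    'base' and 'working' are the nodes found at that path in the BASE
    and WORKING trees, or None when the tree has no node there.
    """
    if base is None and working is None:
        raise ValueError("path is in neither BASE nor WORKING")
    if working is None:
        return ("delete", False)    # in BASE, gone from WORKING
    copied = working.copyfrom_url is not None
    if base is None:
        return ("add", copied)      # new in WORKING
    # Assumption: history or a kind change marks a replacement.
    if copied or working.kind != base.kind:
        return ("replace", copied)
    return ("normal", False)


if __name__ == "__main__":
    print(deduce(Node("file"), None))
    print(deduce(None, Node("file", "http://svn.example.com/svn/trunk/a")))
    print(deduce(Node("file"), Node("dir")))
    print(deduce(Node("file"), Node("file")))
```

The same comparison between WORKING and ACTUAL would cover the
"missing" and "obstructed" cases, which have no svn_wc_entry_t field
at all.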