2008-11-13 00:00 btw, I'll work for valgrind error of filemap.c before more endian work 2008-11-13 00:00 the lost data? 2008-11-13 00:01 I don't know yet, I just did "make tests" 2008-11-13 00:01 tests: balloctest dleaftest ileaftest btreetest dirtest iattrtest xattrtest filemaptest inodetest 2008-11-13 00:01 with above change 2008-11-13 00:02 filemap_extent_io: need 8 data and 8 index bytes 2008-11-13 00:02 filemap_extent_io: need 16 bytes, 248 bytes free 2008-11-13 00:02 filemap_extent_io: pack 0x0 => 0/1 2008-11-13 00:02 filemap_extent_io: extent 0x0/1 => 0 2008-11-13 00:02 filemap_extent_io: block 0x0 => 0 2008-11-13 00:02 ==17083== Use of uninitialised value of size 8 2008-11-13 00:02 ==17083== at 0x4E69909: (within /lib/libc-2.7.so) 2008-11-13 00:02 ==17083== by 0x4E6C7DB: vfprintf (in /lib/libc-2.7.so) 2008-11-13 00:02 ah, that could easily introduce a flaw if the change isn't make everywhere 2008-11-13 00:02 ==17083== by 0x404C6F: logline (trace.h:15) 2008-11-13 00:02 ==17083== by 0x40D28B: filemap_extent_io (filemap.c:226) 2008-11-13 00:02 ==17083== by 0x40D4FE: filemap_block_write (filemap.c:264) 2008-11-13 00:02 ==17083== by 0x40338A: write_buffer_to (buffer.c:179) 2008-11-13 00:02 ==17083== by 0x4033B2: write_buffer (buffer.c:186) 2008-11-13 00:02 ==17083== by 0x403C1F: flush_buffers (buffer.c:396) 2008-11-13 00:02 ==17083== by 0x40D8B9: main (filemap.c:304) 2008-11-13 00:02 ==17083== 2008-11-13 00:03 good luck, and I apologize for the low quality of that code 2008-11-13 00:04 it seems good basically, so no need to sorry 2008-11-13 00:05 and it's just trance_on() 2008-11-13 00:20 ah 2008-11-13 00:20 + return (from_be_u64(*(be_u64 *)&extent >> 48) & 0x3f) + 1; 2008-11-13 00:20 I think you alread changed it? 
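[Editor's sketch] The quoted `extent_count` line shifts the stored big-endian word before converting it. The C below is a minimal illustration of why the conversion has to come first. The helper names (`load_be64`, `store_be64`, `pack_extent`) are hypothetical and are not the tux3 `from_be_u64`/`to_be_u64` helpers, and the layout (block in the low 48 bits, count - 1 in bits 48..53) is an assumption taken only from the shift and mask in that line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not the tux3 ones: read/write a 64-bit big-endian
 * word one byte at a time, so the result does not depend on host order. */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Assumed layout: block in the low bits, (count - 1) in bits 48..53. */
static void pack_extent(unsigned char *disk, uint64_t block, unsigned count)
{
	store_be64(disk, ((uint64_t)(count - 1) << 48) | block);
}

/* Convert to host order first, then shift and mask. */
static unsigned count_convert_then_shift(const unsigned char *disk)
{
	return (unsigned)((load_be64(disk) >> 48) & 0x3f) + 1;
}

/* Shift the raw stored word first, then convert.
 * This only gives the right answer on a big-endian host. */
static unsigned count_shift_then_convert(const unsigned char *disk)
{
	uint64_t raw;
	unsigned char tmp[8];

	memcpy(&raw, disk, sizeof raw);   /* host-order view of big-endian bytes */
	raw >>= 48;
	memcpy(tmp, &raw, sizeof raw);
	return (unsigned)(load_be64(tmp) & 0x3f) + 1;
}
```

For block 0x1234 and count 5, `count_convert_then_shift` returns 5 on any host. On a little-endian host, `count_shift_then_convert` returns 1 for the same bytes, because the shift discards the byte that holds the count.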
2008-11-13 00:20 yes 2008-11-13 00:20 ok, I'd like to merge those 2008-11-13 00:21 I think it's already in the repo 2008-11-13 00:21 oh 2008-11-13 00:22 wait 2008-11-13 00:24 return ((from_be_u64(*(be_u64 *)&extent) >> 48) & 0x3f) + 1; <- current version 2008-11-13 00:24 ok 2008-11-13 00:24 sparse liked it 2008-11-13 00:31 static inline struct extent make_extent(block_t block, unsigned count) 2008-11-13 00:31 { 2008-11-13 00:31 return (struct extent){ to_be_u64(block << 16 | (u64)(count - 1) << 10) }; 2008-11-13 00:31 } 2008-11-13 00:31 static inline unsigned extent_block(struct extent extent) 2008-11-13 00:31 { 2008-11-13 00:31 return from_be_u64(*(be_u64 *)&extent) & ~(-1LL << 16); 2008-11-13 00:31 } 2008-11-13 00:31 static inline unsigned extent_count(struct extent extent) 2008-11-13 00:31 { 2008-11-13 00:31 return ((from_be_u64(*(be_u64 *)&extent) >> 10) & 0x3f) + 1; 2008-11-13 00:31 } 2008-11-13 00:31 static inline unsigned extent_version(struct extent extent) 2008-11-13 00:31 { 2008-11-13 00:31 return from_be_u64(*(be_u64 *)&extent) & 0x3ff; 2008-11-13 00:31 } 2008-11-13 00:31 current dleaf.c is like this? 2008-11-13 00:32 no, that looks like from you 2008-11-13 00:32 yes 2008-11-13 00:33 valgrind seems to like it 2008-11-13 00:33 you are working on a 64 bit system? 2008-11-13 00:33 well, I'd like to see current version of those 2008-11-13 00:33 yes 2008-11-13 00:34 (u64)(count - 1) << 10) <- small think, don't need (u64) here 2008-11-13 00:34 small thing 2008-11-13 00:34 ah, yes 2008-11-13 00:35 it should be for (u64)block 2008-11-13 00:35 I thought putting the block at the low end of the word would make it a little easier to debug when looking at hexdumps 2008-11-13 00:36 ah, i see. 
I just referenced old one 2008-11-13 00:37 probably, your version after sparse fix, I think it works for valgrind too 2008-11-13 00:37 good, I could see any 64 bit issues 2008-11-13 00:37 count not I mean 2008-11-13 00:39 ah, sparse one is already pushed 2008-11-13 00:40 ok, current hg is works fine 2008-11-13 00:40 good, time for me to sleep 2008-11-13 00:40 just email when you're ready for a pull 2008-11-13 00:40 ok, good night 2008-11-13 00:41 good night 2008-11-13 00:46 night 2008-11-13 00:46 good night 2008-11-13 01:03 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 01:14 -!- cydork(~cydoork@122.169.100.164) has joined #tux3 2008-11-13 02:34 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-13 02:58 hey flips 2008-11-13 02:59 how's progress on atomic commits ? 2008-11-13 03:05 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 03:20 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-13 03:24 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 03:32 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-13 03:55 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-13 05:57 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-11-13 06:37 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 07:12 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-11-13 07:25 -!- pranith(~Bobby@122.162.72.220) has joined #tux3 2008-11-13 08:15 -!- Bobby_(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 09:10 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-11-13 09:36 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 09:40 -!- pgquiles__(~pgquiles@18.Red-81-39-35.dynamicIP.rima-tde.net) has joined #tux3 2008-11-13 09:51 -!- 
mingming(~mingming@32.97.110.51) has joined #tux3 2008-11-13 10:37 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 10:42 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 10:50 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 10:55 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:02 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:04 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-13 11:09 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:17 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:20 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:24 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:32 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:38 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:49 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:54 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:57 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 11:59 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 12:02 -!- pranith(~Bobby@122.162.70.93) has joined #tux3 2008-11-13 12:40 Endian conversions are nearly all done 2008-11-13 12:42 ileaf.c not done yet 2008-11-13 12:44 hey 2008-11-13 14:01 ileaf done 2008-11-13 14:02 hmm, is that all the conversions? 2008-11-13 14:24 What's the native endianness for tux3? 
2008-11-13 14:43 big 2008-11-13 14:44 -!- ajonat(~ajonat@190.48.98.3) has joined #tux3 2008-11-13 14:44 Linux has traditionally used little endian for filesystems 2008-11-13 14:44 so this is a slight departure 2008-11-13 14:44 but there are advantages to big endian 2008-11-13 14:45 for one thing, the code actually gets tested on x86 2008-11-13 14:45 which nearly all developers use 2008-11-13 14:45 instead of being noops 2008-11-13 14:46 another advantage is, hexdumps are a lot easier to read in big endian 2008-11-13 14:46 ZFS went with the highly doubtful decision to use native endian 2008-11-13 14:47 which means that ZFS filesystem layout is different depending on whether you create the fs with a big or little endian machine 2008-11-13 14:47 I don't know what they were thinking 2008-11-13 14:47 s/thinking/smoking/ 2008-11-13 15:06 I thought a lot of filesystems were bigendian, ext3 is 2008-11-13 15:10 ext3 is little endian 2008-11-13 15:13 Weird 2008-11-13 15:15 little endian is weird period 2008-11-13 15:16 true 2008-11-13 15:17 in the stone age it made some kind of sense for processors that fetch data a byte at a time, the alu could start to work on the first byte before loading the second 2008-11-13 15:19 -!- samlh(~sam@67.129.121.145) has joined #tux3 2008-11-13 15:20 Yeah, but people are more interested with backwards compatibility than better instruction sets 2008-11-13 16:48 -!- rmull(~rmull@acsx01.bu.edu) has joined #tux3 2008-11-13 16:48 Hi guys 2008-11-13 16:49 I just attended a presentation by my school's Sun Microsystems campus ambassador and he reminded my how badly I want ZFS 2008-11-13 16:50 I have a 16 disk fileserver and RAIDZ2, automatic data correction, and the excellent snapshotting are just too cool to ignore 2008-11-13 16:51 I assume a RAIDZ-style implementation will never occur because that's supposed to be the job of md 2008-11-13 16:51 I recall reading on the tux3 announce on LKML that it was mentioned tux3 was supposed to be better than 
ZFS 2008-11-13 16:51 And my question is: How? 2008-11-13 18:13 rmull, the btrfs guys are putting raid into btrfs 2008-11-13 18:14 I believe that raid does not belong in a filesystem 2008-11-13 18:15 we are not at the benchmarking point yet, but indications are Tux3 will be faster than ZFS, and not suffer from excess memory use like ZFS does 2008-11-13 18:16 the ZFS snapshot vs clone model is clunky, tux3 will have no such restriction 2008-11-13 18:25 rmull, have you ever actually seen ZFS correct some data? 2008-11-13 18:27 he flips 2008-11-13 18:27 hey 2008-11-13 18:33 flips: I've seen demos and heard personal anecdotes from three different individuals 2008-11-13 18:34 The standard "dd urandom over a disk" test, then checksums match before and after 2008-11-13 18:38 But - I've never personally tried it. 2008-11-13 18:38 the ability to make remote incremental backups far outweighs the ability to construct data damaged by random dd's in my opinion 2008-11-13 18:39 Oh, that was another thing I wanted to ask - regarding zumastor 2008-11-13 18:39 the plan is to backport some of the tux3 mechanisms to zumastor 2008-11-13 18:40 to get a way more efficient volume level replicating snapshot 2008-11-13 18:40 Is it possible to donate some money to the tux3 project? 
2008-11-13 18:40 certainly 2008-11-13 18:42 you could warm up by donating beer :) 2008-11-13 18:45 The beer is an option, though I have not existed long enough on this celestial boulder to buy alchohol as decided by my democratically elected government 2008-11-13 18:45 -!- ajonat_(~ajonat@190.48.125.206) has joined #tux3 2008-11-13 18:47 ok, next thing to fix are permissions and owners I think 2008-11-13 18:47 ought to be pretty easy 2008-11-13 19:31 whoops, endian stuff broke btree probe 2008-11-13 19:31 maybe I have patch for it 2008-11-13 19:31 that would be nice 2008-11-13 19:32 looks like node->count got set to little endian 1 2008-11-13 19:32 yes, it was warned by sparse 2008-11-13 19:33 ok, just tell me when to pull 2008-11-13 19:33 I wonder how I missed it 2008-11-13 19:33 make clean was needed? 2008-11-13 19:34 maybe. I didn't compile everything under C=1 2008-11-13 19:34 http://userweb.kernel.org/~hirofumi/endian-conversion-fixes.patch 2008-11-13 19:34 for right now 2008-11-13 19:35 I'm reading xattr stuff I completely missed now 2008-11-13 19:36 I also broke tux3graph.c 2008-11-13 19:36 yes, the patch is including it too 2008-11-13 19:36 nice 2008-11-13 19:39 fuse does indeed run again 2008-11-13 19:40 yes, make tests was also passed 2008-11-13 19:42 what do you think about rename "inode" to "atable" for inode in xattr? 2008-11-13 19:43 good 2008-11-13 19:43 thanks, I'll convert it 2008-11-13 19:44 what do you think about changing C to CHECK? 2008-11-13 19:45 both is ok to me, I just used name what linux is using 2008-11-13 19:45 C=1 works for kernel compile? 
2008-11-13 19:45 yes 2008-11-13 19:45 gross ;) 2008-11-13 19:46 ;) C=1 and V=1 is available 2008-11-13 19:46 I understand that the price of bytes has gone up lately, but I think we can affort another 4 2008-11-13 19:47 yes 2008-11-13 19:48 oh and the target is .c.o 2008-11-13 19:48 somebody really likes to write obscure code 2008-11-13 19:49 I'm not sure, make may allow %.c as alternative 2008-11-13 19:50 it's ok 2008-11-13 19:53 why does tux3fuse need its own checker target? 2008-11-13 19:53 no target, build rule 2008-11-13 19:53 it needs $(pkg-config ...) addition 2008-11-13 19:54 IOW, it still depends on *.c 2008-11-13 19:55 makes sense 2008-11-13 19:56 -!- RazvanM(~RazvanM@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-13 19:56 Hi! 2008-11-13 19:56 hi 2008-11-13 19:56 hi 2008-11-13 19:56 I remembered this time :D 2008-11-13 19:57 ok, there is the patch for C -> CHECK 2008-11-13 19:58 I have two minutes to get ready 2008-11-13 20:00 can anybody think of a reason why we should keep tux3fs.c, the high level fuse interface? 2008-11-13 20:00 we haven't done anything with it for weeks 2008-11-13 20:00 and tux3fuse.c gets regularly maintained 2008-11-13 20:01 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-11-13 20:01 hi maze 2008-11-13 20:01 hey 2008-11-13 20:01 did I miss anything? 2008-11-13 20:01 can anybody think of a reason why we should keep tux3fs.c, the high level fuse interface? 
2008-11-13 20:01 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3 2008-11-13 20:01 that was all the activity in the last minute 2008-11-13 20:01 hi 2008-11-13 20:01 hi ralucam 2008-11-13 20:02 ACTION is late 2008-11-13 20:02 last week we ended on an exciting note, having looked at vfs->create pretty closely 2008-11-13 20:02 thinking about how defer is going to work 2008-11-13 20:03 so let's think keeping about deferred namespace ops, and go look at the delete side 2008-11-13 20:04 the goal of this exercise is to convince ourselves we can maintain the illusion that a file exists when it is not actually backed by the filesystem 2008-11-13 20:04 did that make sense? 2008-11-13 20:04 yup 2008-11-13 20:04 besides creating and deleting, we also need to worry about rename 2008-11-13 20:04 and the implicit delete that can happen in a rename 2008-11-13 20:05 ok, let's have a url work where the vfs calls the file delete method 2008-11-13 20:05 (this is my standard trick to grab a minute for myself) 2008-11-13 20:05 :-) 2008-11-13 20:07 ok, it was a trick 2008-11-13 20:07 the vfs doesn't call delete, it calls ->unlink 2008-11-13 20:08 inode_operations->unlink 2008-11-13 20:09 http://lxr.linux.no/linux/fs/namei.c#L2216 2008-11-13 20:09 as usual, the fastest way to find out a detail like that is look at fs/ext2 2008-11-13 20:09 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2234 2008-11-13 20:09 nearly a tie :) 2008-11-13 20:10 let's take a quick walk through it 2008-11-13 20:10 before calling unlink, the vfs will check for a negative dentry 2008-11-13 20:10 pretty simple what it does 2008-11-13 20:11 after locking the target directory, similar to create 2008-11-13 20:11 (I was referring to vfs_unlink) 2008-11-13 20:11 yes 2008-11-13 20:11 http://lxr.linux.no/linux+v2.6.27.6/fs/namei.c#L1394 2008-11-13 20:11 see the nfs wrinkle 2008-11-13 20:11 sillyrename 2008-11-13 20:12 this is needed because, unlike a local filesystem, and nfs server is expected 
to be able to reboot and not lose track of an unlinked file 2008-11-13 20:12 because the client may still have it open 2008-11-13 20:13 and NFS, being stateless (that is the _real_ sillyness) does not know that 2008-11-13 20:13 let's have a quick look at may_delete 2008-11-13 20:13 and verify that it checks for a negative dentry 2008-11-13 20:14 1398 if (!victim->d_inode) 2008-11-13 20:14 1399 return -ENOENT; <- maybe this is the check 2008-11-13 20:15 yes 2008-11-13 20:15 that's bad code, it should have a wrapper to show that it's the negative dentry condition 2008-11-13 20:16 ok, this is very important for us 2008-11-13 20:16 because tux3 will not actually remove the underlying dirent until the next delta transition arrives 2008-11-13 20:17 it will just clear the inode field, turning the dentry "negative", and save a pointer to the dentry 2008-11-13 20:17 i see. and we pin it? 2008-11-13 20:17 normally, the dentry gets a dput not too long after this, and tux3 needs to take its own reference count 2008-11-13 20:17 yes 2008-11-13 20:17 "pin it" for short 2008-11-13 20:18 I don't think there is a lot more to look at there 2008-11-13 20:19 Stephen Tweedie in his notes on Ext3, calls delete the hardest operation, by far 2008-11-13 20:19 so if that seemed simple, I guess we just don't understand ;) 2008-11-13 20:20 now let's take a look at rename 2008-11-13 20:20 if tux3 defers create and delete, then it must also defer any other namespace changes, like rename 2008-11-13 20:20 well, wait 2008-11-13 20:20 where did the delete actually happen? 2008-11-13 20:20 we haven't even gone into ext2 code... 
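[Editor's sketch] The deferred unlink idea described above (clear the inode field so the dentry goes negative, keep a pointer to the dentry, take an extra reference, and do the real dirent removal at the next delta transition) can be modeled in plain userspace C. Everything here is hypothetical: the struct and function names, the fixed-size list, and the `-EAGAIN` return when the list is full are illustration choices, not tux3 or kernel code. The small `dentry_is_negative` wrapper is the kind of named check the log asks for in place of a bare `!victim->d_inode`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_inode { int inum; };

struct demo_dentry {
	const char *name;
	struct demo_inode *inode;  /* NULL means the dentry is negative */
	int refcount;
};

/* Named wrapper for the negative dentry condition. */
static int dentry_is_negative(const struct demo_dentry *dentry)
{
	return dentry->inode == NULL;
}

#define MAX_DEFERRED 16

struct deferred_unlinks {
	struct demo_dentry *pinned[MAX_DEFERRED];
	size_t count;
};

/* Unlink without touching the on-disk directory yet. */
static int defer_unlink(struct deferred_unlinks *list, struct demo_dentry *dentry)
{
	if (dentry_is_negative(dentry))
		return -ENOENT;
	if (list->count == MAX_DEFERRED)
		return -EAGAIN;        /* assumed: caller forces a flush, then retries */
	dentry->inode = NULL;          /* the name now looks deleted */
	dentry->refcount++;            /* pin it until the flush */
	list->pinned[list->count++] = dentry;
	return 0;
}

/* At the (modeled) delta transition: remove the real dirents, drop the pins.
 * remove_dirent may be NULL when only the bookkeeping is of interest. */
static size_t flush_deferred(struct deferred_unlinks *list,
			     void (*remove_dirent)(const char *name))
{
	size_t flushed = list->count;

	for (size_t i = 0; i < flushed; i++) {
		if (remove_dirent)
			remove_dirent(list->pinned[i]->name);
		list->pinned[i]->refcount--;
	}
	list->count = 0;
	return flushed;
}
```

The model leaves out locking and the question of what the dcache does with a pinned dentry, which is the part of the plan that still needs checking against the real kernel code.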
2008-11-13 20:20 ah, let's do that 2008-11-13 20:21 we will find that the fs actually invalidates the dentry 2008-11-13 20:21 http://lxr.linux.no/linux+v2.6.27.6/fs/ext2/namei.c#L253 2008-11-13 20:22 thanks 2008-11-13 20:23 2.6.27.6 lxr seems broken 2008-11-13 20:23 http://lxr.linux.no/linux+v2.6.27/fs/ext2/dir.c#L566 2008-11-13 20:23 ext2_delete_entry 2008-11-13 20:24 oh wait 2008-11-13 20:24 no, the fs does not do this 2008-11-13 20:25 removed entry from dirents page, so entry was gone from readdir() 2008-11-13 20:26 2241 d_delete(dentry); 2008-11-13 20:26 http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2241 2008-11-13 20:27 we can still lookup entry via dcache..., it unhashed from dcache 2008-11-13 20:27 1513 * Turn the dentry into a negative dentry if possible, otherwise 2008-11-13 20:27 1514 * remove it from the hash queues so it can be deleted later 2008-11-13 20:27 1515 */ 2008-11-13 20:28 inode may still be opening 2008-11-13 20:28 be opened 2008-11-13 20:28 1527 dentry_iput(dentry); 2008-11-13 20:28 yes 2008-11-13 20:29 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L102 <- dentry_iput 2008-11-13 20:29 looking for the place where the dentry inode gets set to null 2008-11-13 20:30 108 dentry->d_inode = NULL; 2008-11-13 20:30 that "if possible" above worries me 2008-11-13 20:30 we need to be able to rely on this 2008-11-13 20:31 109 list_del_init(&dentry->d_alias); <- notice this happens outside the spinlocks 2008-11-13 20:31 it is protected by the directory i_mutex, I believe 2008-11-13 20:32 there is dentry->d_lock and dcache_lock? 
2008-11-13 20:32 let's see where those got taken 2008-11-13 20:33 in d_delete 2008-11-13 20:33 http://lxr.linux.no/linux+v2.6.27/fs/dcache.c#L1517 2008-11-13 20:33 it's not pretty 2008-11-13 20:33 let's answer the question "when is it not possible to turn the dentry into a negative dentry" 2008-11-13 20:34 that is, set the d_inode to null 2008-11-13 20:34 the answer is: when the d_count is not 1 2008-11-13 20:35 so that is a pretty scary little corner of the kernel for our plan 2008-11-13 20:35 do you need to care d_drop()? 2008-11-13 20:35 I don't know 2008-11-13 20:36 well if our dentry gets d_dropped and we have not done the deferred delete yet, we are in trouble 2008-11-13 20:36 so we must convince ourselves that this can never happen 2008-11-13 20:37 [there's also the what happens when you create delete create the same entry... problem] 2008-11-13 20:37 "_drop will just make the cache lookup fail." 2008-11-13 20:38 maze, the vfs should turn the negative dentry into a non-negative directory then call us 2008-11-13 20:39 not hard, but it has to be handled 2008-11-13 20:39 193 * d_drop() is used mainly for stuff that wants to invalidate a dentry for some 2008-11-13 20:39 194 * reason (NFS timeouts or autofs deletes). 2008-11-13 20:40 we are _probably_ ok, but this does have to be investigated pretty closely 2008-11-13 20:40 so, we have opened inode without dentry? 
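[Editor's sketch] The answer reached above, that `d_delete` only turns the dentry negative when the reference count is exactly 1 and otherwise removes it from the hash, can be written down as a tiny userspace model. This is not the kernel function: the struct, the field names and the result enum are hypothetical, and locking and notification are left out.

```c
#include <assert.h>
#include <stddef.h>

struct model_inode { int inum; };

struct model_dentry {
	struct model_inode *inode; /* NULL means the dentry is negative */
	int count;                 /* reference count, standing in for d_count */
	int hashed;                /* 1 while name lookups can still find it */
};

enum delete_result { WENT_NEGATIVE, WAS_UNHASHED };

/* Model of the decision discussed in the log, nothing more. */
static enum delete_result model_d_delete(struct model_dentry *dentry)
{
	if (dentry->count == 1) {
		/* Sole user: drop the inode pointer, the dentry goes negative. */
		dentry->inode = NULL;
		return WENT_NEGATIVE;
	}
	/* Someone else holds a reference: unhash it instead. */
	dentry->hashed = 0;
	return WAS_UNHASHED;
}
```

In this model a dentry that carries an extra reference takes the second branch, so it stays positive but can no longer be found by name.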
2008-11-13 20:40 after create delete create 2008-11-13 20:40 hirofumi, yes we can have that 2008-11-13 20:40 oh 2008-11-13 20:40 that is common already 2008-11-13 20:41 it's just an open, unliked file 2008-11-13 20:41 unlinked 2008-11-13 20:41 but same dentry is reused 2008-11-13 20:41 possilby even linking to the same inode 2008-11-13 20:42 but that is ok 2008-11-13 20:42 ah 2008-11-13 20:42 yes 2008-11-13 20:42 maybe there is same name unhashed dentry 2008-11-13 20:42 ok, well this d_drop code needs to be read very carefully 2008-11-13 20:43 I don't think it can be allowed to be the same inode 2008-11-13 20:43 I'm pretty sure a delete create has to get a new inode # 2008-11-13 20:43 yes, it should be different inode 2008-11-13 20:43 maze, it's not a delete, it's an unlink 2008-11-13 20:43 there's a difference? 2008-11-13 20:43 yes 2008-11-13 20:44 delete is what happens on the last unlink + last close 2008-11-13 20:44 isn't delete == unlink of something with 1 link? 2008-11-13 20:44 ah ok 2008-11-13 20:44 in that case you would indeed get a new inode in most cases 2008-11-13 20:44 I don't think there is any rule that requires that though 2008-11-13 20:44 well 2008-11-13 20:45 if you got the same inode number, the inode would be reinitialized, if that's what you meant 2008-11-13 20:46 unhashed and reused dentries should have different inodes 2008-11-13 20:47 hirorumi, what will break if that is not the case? 2008-11-13 20:47 possibly some sort of monitoring or backup programs? 2008-11-13 20:47 I don't think though that it can actually be guaranteed... 2008-11-13 20:48 reused is confusible 2008-11-13 20:48 if that is a posix requirement, I didn't know it 2008-11-13 20:48 reallocate and has same name 2008-11-13 20:48 maybe files opened over nfs, surviving deletion/creation of the file? 2008-11-13 20:49 so what guarantees that a different inode is used, in ext2 for example? 2008-11-13 20:49 inode is still live, so it shouldn't same inode? 
2008-11-13 20:49 oh, certainly, in that case a different inode will be used 2008-11-13 20:50 yes 2008-11-13 20:50 but if you unlink; fsync parent; create, I think you have a good chance of getting the same inode number again 2008-11-13 20:51 yes, if unlinked file is not open 2008-11-13 20:51 true 2008-11-13 20:51 sounds like a mis-feature 2008-11-13 20:51 I believe we all think the same thing now 2008-11-13 20:51 yes 2008-11-13 20:52 I'm still worried about that d_drop 2008-11-13 20:53 so if anything needs to be researched to determine the feasibility of deferred nameops, it is that 2008-11-13 20:53 yes, I think maybe it will also call by dcache shrinker 2008-11-13 20:53 a core patch might be needed to make it work 2008-11-13 20:54 well, I'm not sure where is call it 2008-11-13 20:54 now, things might still be ok 2008-11-13 20:54 because the vfs still has to call our fs to find out if an entry exists 2008-11-13 20:55 we should not be implementing part of the dentry cache in our fs, but if necessary we can 2008-11-13 20:55 keep a list of dentries that we are in process of deleting 2008-11-13 20:55 in fact, we will always trigger that code 2008-11-13 20:56 that bypasses the dentry_iput 2008-11-13 20:57 if it can care memory balance, it may be ok? 2008-11-13 20:58 can you rephrase that question? 2008-11-13 20:58 deffered entries list can be too big 2008-11-13 20:58 oh, we have control of that 2008-11-13 20:59 we can force a delta transition 2008-11-13 20:59 yes, I think we have to trigger to flush it by memory pressure 2008-11-13 21:00 it's easy to test 2008-11-13 21:00 just delete a million files 2008-11-13 21:01 I worry it may make complex the code 2008-11-13 21:01 s/code/deffered code/ 2008-11-13 21:02 that is a danger 2008-11-13 21:02 it would be worth trying an experiment with tux3/junkfs before commiting to the strategy 2008-11-13 21:03 make sure that the dentry cache does the right thing 2008-11-13 21:03 good idea 2008-11-13 21:04 ok, so for homework... 
let's know the dentry cache better by next tuesday 2008-11-13 21:04 :-) easy to say ;-) 2008-11-13 21:05 utlk and lxr are your friends 2008-11-13 21:05 on tuesday we can compare notes on this question again, and move on to rename 2008-11-13 21:06 yes 2008-11-13 21:06 I was thinking to myself, why does the vfs leave the instantiation of a dentry under control of the fs, but not the invalidation? 2008-11-13 21:06 I think the answer is: because the vfs is broken 2008-11-13 21:08 this would not be the first time this ever happened, however, the usual way to proceed is to find a disgusting workaround 2008-11-13 21:08 make the the fs work in spite of the awkward api 2008-11-13 21:08 maybe many fs depends on current behaviour, so to change it may become too hard 2008-11-13 21:09 then make a core patch and post it under "see how much nicer this makes this thing we are doing" 2008-11-13 21:09 hirofumi, we change stuff like that on a regular basis 2008-11-13 21:09 sometimes while keeping a legacy interface 2008-11-13 21:09 but more often, just edit everything 2008-11-13 21:10 Linus even sees that as a feature, he likes to break out of tree drivers 2008-11-13 21:10 yes, howver maybe dcache still have many legacy code 2008-11-13 21:10 in this case, a new hook would probably be enough 2008-11-13 21:11 for example, call the fs in the __d_drop 2008-11-13 21:11 and we make more complex it ;) 2008-11-13 21:12 well, probably yes 2008-11-13 21:12 if there is a good efficiency argument its ok 2008-11-13 21:12 yes 2008-11-13 21:13 the nfs-specific hack in vfs_unlink is pretty disgusting, if that could be moved entirely into nfs with the help of a hook that also gives us what we want, that would be an improvement 2008-11-13 21:13 and a simplification 2008-11-13 21:13 yes 2008-11-13 21:14 I have to make sure about dcache stuff though 2008-11-13 21:14 me too 2008-11-13 21:15 deferred nameops is an optimization I really want to do, but we have a way of avoiding it initially 2008-11-13 
21:15 ah, I think that's very good 2008-11-13 21:16 that makes it a lot easier to tie it to a core vfs patch 2008-11-13 21:16 we can compare how big win 2008-11-13 21:16 right 2008-11-13 21:16 so that settles that 2008-11-13 21:16 we need both, or we can't prove how cool it is 2008-11-13 21:16 yes 2008-11-13 21:17 ok, should we delete tux3fs.c now? 2008-11-13 21:17 hmm 2008-11-13 21:17 konrad? 2008-11-13 21:17 wasn't there an fs flag about ddrop? 2008-11-13 21:17 ? 2008-11-13 21:17 searching 2008-11-13 21:18 ok, I don't think we have a real problem with this anyway 2008-11-13 21:18 we just keep our own list of deferred deletes 2008-11-13 21:18 100#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() 2008-11-13 21:19 not quite d_drop ;-) 2008-11-13 21:19 and when the fs calls the ->lookup, we consult our deferred list to see if we should say "not there" 2008-11-13 21:19 not a lot of code 2008-11-13 21:19 right, d_move 2008-11-13 21:19 well it's nice to know that hook is there 2008-11-13 21:20 because when we look at rename, I am sure there will be issues 2008-11-13 21:20 all for it 2008-11-13 21:20 tux3fuse obsoletes tux3fs 2008-11-13 21:20 ok, done 2008-11-13 21:20 yes 2008-11-13 21:20 and we just wanted to be sure of that 2008-11-13 21:20 which we now are 2008-11-13 21:21 yes 2008-11-13 21:28 ok, the makefile tests that used to run tux3fs now run tux3fuse 2008-11-13 21:28 make testfs and make debug 2008-11-13 21:29 I suppose we could also do something like make testfs DEBUG=1 2008-11-13 21:29 to run the fusefs in the foreground 2008-11-13 21:30 now let's have proper permission handling 2008-11-13 21:34 um.. it seems xattr->size uses 2 for load/store... 
2008-11-13 21:39 xattr->atom is u16 and atom_t is u32, and compared those in xcache_lookup() if I'm missing somthing 2008-11-13 21:52 -!- konrad(~konrad@sfr.cs.washington.edu) has joined #tux3 2008-11-13 21:55 hirofumi, yes 2008-11-13 21:55 implying a 2**16 limit for atom numbers 2008-11-13 21:56 nothing is borken, but the limit might be a little small 2008-11-13 21:56 atom_t can simply be delared unsigned, because it is not used for on-disk structures 2008-11-13 21:56 my mistake 2008-11-13 21:58 we could possibly use a 16 bit on-disk form for system xattrs and 32 bits for user defined 2008-11-13 21:58 or just always use 32 form 2008-11-13 21:58 i see 2008-11-13 21:58 and add a size optimization later if it looks like it's worth it 2008-11-13 21:59 it seems xattr->atom should be u32 2008-11-13 21:59 ah, no. atom_t probably 2008-11-13 22:01 they should be PACKED 2008-11-13 22:01 xattr and xcache 2008-11-13 22:03 or use u32 for xattr->size too? 2008-11-13 22:03 oh crap, this is where you can see that mercurial needs a real rename 2008-11-13 22:04 no file has older annotation than the move from user/test to user 2008-11-13 22:04 hirofumi, my thinking is, if the xattr has size bigger than 2**16 we store it in an inode page cache, not in a kmalloc 2008-11-13 22:05 currently, xattrs bigger than 2**12 are not used on linux 2008-11-13 22:05 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-11-13 22:05 i see 2008-11-13 22:05 ACTION needs to see xattr.c more 2008-11-13 22:05 I think it is worth trying to keep these fields small 2008-11-13 22:06 the biggest use of xattrs is acls 2008-11-13 22:06 which can quickly bloat up the metadata if they are not compact 2008-11-13 22:07 atom number of 32 bits and size of 16 bits is probably right for now 2008-11-13 22:07 i see. also samba uses it? 
2008-11-13 22:07 for windows acls, yes 2008-11-13 22:07 but tridge has not been happy with xattr performance on linux 2008-11-13 22:08 so far, ext3 is the best and it is not very good 2008-11-13 22:08 I hope we will do better 2008-11-13 22:08 heh 2008-11-13 22:09 i see. I don't use xattr 2008-11-13 22:09 mercurial can't detect rename? 2008-11-13 22:10 xattr = selinux + extended acls + samba + potentially user stuff 2008-11-13 22:10 yes. I don't use all of those 2008-11-13 22:11 at least for now 2008-11-13 22:11 mercurial fakes the rename as a delete and an add 2008-11-13 22:12 sometimes that is not a very good imitation of a rename, it breaks annotate for one thing 2008-11-13 22:12 yes, but at least git can see history of it, iirc 2008-11-13 22:16 one point for git 2008-11-13 22:16 though I don't think git's metadata design actually supports taht 2008-11-13 22:16 it's a heuristic 2008-11-13 22:17 I'll get rid of this bogus PACKED's now 2008-11-13 22:18 probably 2008-11-13 22:18 it will removed from some places? 2008-11-13 22:19 should only be in tux3.h 2008-11-13 22:19 just getting rid of the attribute 2008-11-13 22:19 that will only cause slower code and has no benefit 2008-11-13 22:20 let's think about size of on-disk atoms for a day or so 2008-11-13 22:20 before changing 2008-11-13 22:20 xattr and xcache? yes 2008-11-13 22:21 I guess we are going to go with 32 bit on-disk atoms at first 2008-11-13 22:21 it's meaning dirent->inum? 2008-11-13 22:22 same size, yes 2008-11-13 22:23 inums are actually supposed to be 48 bits in tux3 2008-11-13 22:23 but ext2 dirents limit us to 32 bits 2008-11-13 22:23 we could change that if we care 2008-11-13 22:23 i see 2008-11-13 22:23 but the ext2 dirent code will be entirely replaced at some point 2008-11-13 22:24 maybe we will fix it after phtree? 
2008-11-13 22:24 yes, it will be fixed in phtree 2008-11-13 22:24 I don't think anybody will care about having more than 2**32 inodes for a long time 2008-11-13 22:25 yes 2008-11-13 22:28 I think we will keep the u16 halfwords in struct xattr 2008-11-13 22:28 with many cached xattrs is can make a difference to memory use 2008-11-13 22:29 we will return EINVAL for any xattr over some size, probably 2**12 2008-11-13 22:29 and allow bigger xattrs later 2008-11-13 22:30 I'm just not understanding the another u16 in dump_atoms is what for 2008-11-13 22:30 why is it 2**12? 2008-11-13 22:30 nobody uses xattrs bigger than that on linux 2008-11-13 22:30 ext3 has that limitation 2008-11-13 22:31 i see 2008-11-13 22:31 it packs xattrs into pages 2008-11-13 22:31 has to fit in a page 2008-11-13 22:31 ah, i see 2008-11-13 22:31 that's u16 * in dump_atoms 2008-11-13 22:32 that should be be_u16 2008-11-13 22:32 it's a disk block 2008-11-13 22:32 yes 2008-11-13 22:32 do you want to fix it, or me? 2008-11-13 22:32 and it has two part - hi and lo 2008-11-13 22:32 yes 2008-11-13 22:32 I thought that optimization would be worth it 2008-11-13 22:33 nearly always, the use inc/dec will not overlow into the other half 2008-11-13 22:33 I'm not understanding about those 2008-11-13 22:33 which means that when counts have to be flushed to disk, it will be half the data to transfer 2008-11-13 22:34 i see. I'm also not understanding format yet 2008-11-13 22:34 it's why tux3graph is dump those 2008-11-13 22:35 the count tables are stored at a high offeset in the atable 2008-11-13 22:35 I was missing it 2008-11-13 22:35 atable is a just normal file? 2008-11-13 22:36 yes 2008-11-13 22:36 the atom tables (counts and reverse map) are stored above i_size 2008-11-13 22:36 above i_size? 
2008-11-13 22:37 i_size is only important because ext2 dirops rely on it to know how many dirent blocks are in the file 2008-11-13 22:37 ah, it's directory 2008-11-13 22:37 yes 2008-11-13 22:37 I'm forgetting it 2008-11-13 22:37 directory code works very well for atom names 2008-11-13 22:38 incidentally, btrfs uses dirops to implement xattrs 2008-11-13 22:38 i see 2008-11-13 22:39 so both tux3 and btrfs will do a directory lookup on every xattr get or set 2008-11-13 22:39 pretty gross really 2008-11-13 22:39 no wonder xattrs aren't fast enough for tridge 2008-11-13 22:40 i see 2008-11-13 22:42 I see that the refcounts are be_u16 in use_atom 2008-11-13 22:42 it's just dump_atoms that needs fixing 2008-11-13 22:43 yes 2008-11-13 22:44 sb->freeatom has to be be_ too 2008-11-13 22:44 and I'm thinking I'd like to change, "be_u16 *good_name = buffer->data" or something 2008-11-13 22:44 oh sorry 2008-11-13 22:44 it's cache 2008-11-13 22:44 right 2008-11-13 22:44 disksuper is 2008-11-13 22:44 good_name? 
2008-11-13 22:45 currently - int low = from_be_u16(((be_u16 *)buffer->data)[offset]) + use 2008-11-13 22:45 oh sure 2008-11-13 22:45 I can't see what buffer->data is for 2008-11-13 22:46 buffer->data is char * I think 2008-11-13 22:46 void * 2008-11-13 22:46 in userspace, void * 2008-11-13 22:47 in kernel, right it's probably char * 2008-11-13 22:47 yes 2008-11-13 22:48 buffer_head was invented when there was no such thing as void * 2008-11-13 22:48 char * has to be explicitly cast, a pain 2008-11-13 22:49 we will probably just write (void *)buffer_head->data 2008-11-13 22:49 almost all people want void * 2008-11-13 22:50 (void *)buffer_head->b_data 2008-11-13 22:50 yes 2008-11-13 22:51 I think we should wrap all uses of buffer->data 2008-11-13 22:51 so we have buffer_data(buffer) 2008-11-13 22:51 then the change for kernel will be small 2008-11-13 22:52 maybe so, I'm not sure we will use buffer_head or not though 2008-11-13 22:54 we are stuck with it, for remembering block state in the page cache 2008-11-13 22:54 no practical alternative 2008-11-13 22:54 eventually, I will get time to work on replacing buffers by subpages, which will be much nicer 2008-11-13 22:55 it will make struct page smaller for one thing, no need for a list of buffers 2008-11-13 22:55 fsblock from nick? or I was thinking we may use something or it at past 2008-11-13 22:55 I don't know about fsblock 2008-11-13 22:55 but the name sounds like it could be similar to my plan 2008-11-13 22:56 oh 2008-11-13 22:56 looks very different 2008-11-13 22:56 much more heavyweight than what I had in mind 2008-11-13 22:56 got url?
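[editor's note] The split refcount and the buffer_data(buffer) wrapper discussed above can be sketched together. This is only a rough illustration, not the tux3 code: struct buffer, use_count() and read_count() are hypothetical stand-ins, the be_u16 helpers are written out by hand, and the count is assumed to change by a small amount (typically +1 or -1) per call. The point it shows is that the high halfword is only rewritten when the low halfword carries or borrows.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t be_u16; /* big-endian halfword as stored on disk (assumed typedef) */

/* Byte-order helpers written out so the sketch works on any host endianness. */
static inline uint16_t from_be_u16(be_u16 val)
{
	const uint8_t *p = (const uint8_t *)&val;
	return (uint16_t)((p[0] << 8) | p[1]);
}

static inline be_u16 to_be_u16(uint16_t val)
{
	be_u16 out;
	uint8_t *p = (uint8_t *)&out;
	p[0] = (uint8_t)(val >> 8);
	p[1] = (uint8_t)(val & 0xff);
	return out;
}

/* Hypothetical stand-in for the userspace buffer; only the data pointer matters here. */
struct buffer { void *data; };

/* Single accessor, so a later switch to (void *)buffer_head->b_data touches one place. */
static inline void *buffer_data(struct buffer *buffer)
{
	return buffer->data;
}

/*
 * Adjust a 32-bit refcount kept as separate low and high be_u16 tables.
 * Returns 1 if the high table was modified as well, 0 if only the low one was.
 * Assumes |use| <= 0xffff so at most one carry or borrow can occur.
 */
static int use_count(struct buffer *lobuf, struct buffer *hibuf, unsigned offset, int use)
{
	be_u16 *lo = buffer_data(lobuf), *hi = buffer_data(hibuf);
	int low = from_be_u16(lo[offset]) + use;
	int carry = low < 0 ? -1 : low > 0xffff ? 1 : 0;

	lo[offset] = to_be_u16((uint16_t)low); /* conversion wraps modulo 2**16 */
	if (!carry)
		return 0; /* common case: the high half stays clean */
	hi[offset] = to_be_u16((uint16_t)(from_be_u16(hi[offset]) + carry));
	return 1;
}

/* Recombine the two halves into the full count. */
static uint32_t read_count(struct buffer *lobuf, struct buffer *hibuf, unsigned offset)
{
	be_u16 *lo = buffer_data(lobuf), *hi = buffer_data(hibuf);
	return ((uint32_t)from_be_u16(hi[offset]) << 16) | from_be_u16(lo[offset]);
}
```

With this layout a caller would mark the high-half block dirty only when use_count() returns 1, which is where the "half the data to transfer" saving would come from.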
2008-11-13 22:56 http://lwn.net/Articles/239621/ 2008-11-13 22:57 it seems to support lageblock too 2008-11-13 22:57 s/lage/large/ 2008-11-13 22:57 http://lkml.org/lkml/2007/6/25/408 2008-11-13 22:57 that was always the plan 2008-11-13 22:57 http://lkml.org/lkml/2007/6/23/252 2008-11-13 22:58 the patch was posted for review recently 2008-11-13 22:58 [*] About the furthest we could go is use the struct page for the 2008-11-13 22:58 information otherwise stored in the buffer_head, but this would be 2008-11-13 22:58 tricky and suboptimal for filesystems with non page sized blocks and 2008-11-13 22:58 would probably bloat the struct page as well. 2008-11-13 22:58 <- nick is wrong 2008-11-13 22:58 I should review it 2008-11-13 22:59 http://www.spinics.net/lists/linux-fsdevel/msg17327.html 2008-11-13 22:59 most of the recent improvements to vm have resulted in large amounts of bloat 2008-11-13 22:59 it's getting out of hand 2008-11-13 23:00 me too 2008-11-13 23:01 time management too for me 2008-11-13 23:01 well, one thing at a time 2008-11-13 23:01 for now, it's buffer_head or nothing 2008-11-13 23:02 it's gross, but it's not the grossest thing in kernel, far from it 2008-11-13 23:02 "we suck, but we suck fast" :-) 2008-11-13 23:03 well, I'm happy with buffer_head at least for now 2008-11-13 23:03 ext3 completely relies on it 2008-11-13 23:03 partly because of me :) 2008-11-13 23:04 the index code 2008-11-13 23:04 the journal code is also heavily dependent 2008-11-13 23:04 yes, jbh is 2008-11-13 23:07 let's see, permission setting in tux3fuse... 2008-11-13 23:07 it can't read/write permission?
2008-11-13 23:08 there is a little work to do 2008-11-13 23:08 nothing big 2008-11-13 23:08 just enable loading/saving that attribute 2008-11-13 23:08 i see 2008-11-13 23:08 and assign it on create etc 2008-11-13 23:08 I'm not using it usually 2008-11-13 23:09 I'm going to try to make it so the fuse fs can be accessible to normal user 2008-11-13 23:09 then the test mount can be in the local directory instead of /tmp 2008-11-13 23:10 current fuse can do it? 2008-11-13 23:10 see tux3_create in tux3fuse.c 2008-11-13 23:11 ok 2008-11-13 23:12 oh, it doesn't store some attributes 2008-11-13 23:12 right, that looks like the main problem 2008-11-13 23:12 probably just need | XXX_BIT 2008-11-13 23:13 I'm just checking that mode_t matches properly 2008-11-13 23:14 it seems to just set iattr.isize, .uid, .gid 2008-11-13 23:14 line number? 2008-11-13 23:15 tux3_create -> tuxcreate -> make_inode -> inode.c:115 2008-11-13 23:16 yes, and iattr->mode isn't set in tux3fuse.c 2008-11-13 23:16 need to set it 2008-11-13 23:17 and "mode | 0666" looks odd 2008-11-13 23:17 that's broken 2008-11-13 23:18 iattr->mode is just unsigned 2008-11-13 23:18 need to get precise about whether it is supposed to be libc mode_t or not 2008-11-13 23:19 yes 2008-11-13 23:19 at least it wants setuid bit 2008-11-13 23:19 libc mode_t does not seem to be documented precisely 2008-11-13 23:20 I think mode_t is same with in kernel usually 2008-11-13 23:20 that would be nice 2008-11-13 23:20 ah, wait 2008-11-13 23:21 size of type? 
2008-11-13 23:22 static int xmp_create(const char *path, mode_t mode, struct fuse_file_info *fi) 2008-11-13 23:22 { 2008-11-13 23:22 int fd; 2008-11-13 23:22 fd = open(path, fi->flags, mode); 2008-11-13 23:23 typedef __kernel_mode_t mode_t; 2008-11-13 23:23 yes 2008-11-13 23:23 size varies by arch 2008-11-13 23:24 but what about bit layout 2008-11-13 23:24 yes 2008-11-13 23:24 I think it is same 2008-11-13 23:24 it looks that way 2008-11-13 23:25 I wonder why all the complexity then 2008-11-13 23:25 should just be unsigned mode_t in kernel 2008-11-13 23:26 now where do we get the uid/gid from 2008-11-13 23:26 on x86 uses unsigned short? 2008-11-13 23:26 in fuse 2008-11-13 23:26 a useless thing to do probably 2008-11-13 23:27 probably struct stat uses 2008-11-13 23:27 so, if changed, binary incompatible 2008-11-13 23:27 that should be declared with uintxx_t 2008-11-13 23:27 not mode_t etc 2008-11-13 23:27 if size has to be precisely defined 2008-11-13 23:28 obviously it does 2008-11-13 23:28 some archs standard may have difference? 2008-11-13 23:29 those apis are all defined per-arch 2008-11-13 23:29 yes 2008-11-13 23:30 fuse_get_context seems to be the way to get uid/gid 2008-11-13 23:31 i see 2008-11-13 23:31 I wonder why it isn't just passed with the create 2008-11-13 23:31 seems odd 2008-11-13 23:32 http://www.prism.uvsq.fr/users/ode/in115/references/fuse/fuse_8h.html#a20 2008-11-13 23:34 much odd 2008-11-13 23:34 usually it should be needed 2008-11-13 23:41 Is there a way to know the uid, gid or pid of the process performing 2008-11-13 23:41 -------------------------------------------------------------------- 2008-11-13 23:41 the operation? 2008-11-13 23:41 -------------- 2008-11-13 23:41 Yes: fuse_get_context()->uid, etc. 
2008-11-13 23:41 bleah, fuse_context()->pid returns 0 for a create, not a good sign 2008-11-13 23:44 well maybe it would be better to start with chmod 2008-11-13 23:45 req->ctx.uid = in->uid; 2008-11-13 23:45 req->ctx.gid = in->gid; 2008-11-13 23:45 req->ctx.pid = in->pid; 2008-11-13 23:45 in libfuse 2008-11-13 23:45 ah, that looks promising 2008-11-13 23:45 in is from kernel 2008-11-13 23:45 looks like 2008-11-13 23:52 I get dereferencing pointer to incomplete type when I try to use the fuse_req_t req in tux3_create 2008-11-13 23:52 even with fuse/fuse_lowlevel.h included 2008-11-13 23:52 um... 2008-11-13 23:56 that header just has typedef struct fuse_req *fuse_req_t 2008-11-13 23:56 and no definition of fuse_req 2008-11-13 23:56 this is lame 2008-11-13 23:56 fuse_get_context()->uid does not work? 2008-11-13 23:57 ->pid was just zero 2008-11-13 23:57 ->uid too? 2008-11-13 23:57 yes, because running as root 2008-11-13 23:57 ah 2008-11-13 23:57 that's why starting with chmod would be easier 2008-11-13 23:58 but fuse seems like it has some half ideas in its interface 2008-11-13 23:59 I'm not sure this is a good use of time