2008-11-04 06:15 -!- mlankhorst_(~m@fw1.astro.rug.nl) has joined #tux3
2008-11-04 06:49 -!- pgquiles(~pgquiles@228.Red-81-35-100.dynamicIP.rima-tde.net) has joined #tux3
2008-11-04 08:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-11-04 09:09 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3
2008-11-04 09:12 ACTION is going to be away for the rest of the week (due to a conference)
2008-11-04 10:14 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-11-04 12:02 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-04 14:31 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-04 19:56 -!- RalucaM(~ral@pool-151-196-126-202.balt.east.verizon.net) has joined #tux3
2008-11-04 19:57 hi
2008-11-04 19:57 yo
2008-11-04 20:08 hi
2008-11-04 20:08 now that the election thing's done, should we probe the kernel?
2008-11-04 20:08 :-)
2008-11-04 20:08 oh is it?
2008-11-04 20:08 just
2008-11-04 20:09 some whooping and hollering outside
2008-11-04 20:09 nice and quiet over here
2008-11-04 20:10 looks like a huge win too
2008-11-04 20:10 projections as of yesterday ran from 330 to 350
2008-11-04 20:11 looks like it will be towards the high side
2008-11-04 20:12 http://lxr.linux.no/linux+v2.6.27/
2008-11-04 20:12 let's take a look at iget
2008-11-04 20:13 if you're ready
2008-11-04 20:13 hi
2008-11-04 20:14 hi
2008-11-04 20:14 today is tux3 u?
2008-11-04 20:14 just starting
2008-11-04 20:14 now that the u.s. election is no longer in doubt
2008-11-04 20:15 iget, is that like wget but from apple?
2008-11-04 20:15 oh
2008-11-04 20:15 in that they both run a computer, yes
2008-11-04 20:15 lol
2008-11-04 20:16 hmm, I'm getting some indexing incorrectness from lxr
2008-11-04 20:16 it only finds one occurrence of ext2_iget
2008-11-04 20:18 search iget?
2008-11-04 20:18 grep in fs/ext2 finds a bunch
2008-11-04 20:18 I wonder if it is just this version that is messed up
2008-11-04 20:19 I noticed some other indexing errors with lxr a few days ago
2008-11-04 20:19 I get roughly (if not exactly) the same results with 2.6.26.7 2.6.27 and 2.6.27.4
2008-11-04 20:19 2 matches and 5 in freetext
2008-11-04 20:20 the matches are declaration and definition
2008-11-04 20:20 the freetext seem fine as well
2008-11-04 20:20 http://lxr.linux.no/linux+v2.6.27.4/fs/ext2/inode.c#L1184
2008-11-04 20:21 the original iget is long gone
2008-11-04 20:21 we now have iget_locked
2008-11-04 20:21 and iget5_locked
2008-11-04 20:22 the purpose is to return an inode given an inode number
2008-11-04 20:22 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L942
2008-11-04 20:23 this implies that the filesystem has inode numbers
2008-11-04 20:23 which is not a requirement in Linux
2008-11-04 20:23 vfat for example
2008-11-04 20:23 and ramfs
2008-11-04 20:24 the new interface is a little unfamiliar to me, it is broken into two parts
2008-11-04 20:25 iget(5)_locked, and the filesystem is actually called when the inode is unlocked
2008-11-04 20:25 as part of the unlock
2008-11-04 20:25 the two functions are almost identical
2008-11-04 20:26 iget5 takes a generic test function to be used in the hash search
2008-11-04 20:27 next stop is unlock_new_inode
2008-11-04 20:28 so basically more OO implemented in C...
2008-11-04 20:28 ersatz oo
2008-11-04 20:28 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L576
2008-11-04 20:29 I'm assuming the part in #ifdef is a no-op?
2008-11-04 20:30 it is
2008-11-04 20:30 so I lied
2008-11-04 20:31 just for lockdep
2008-11-04 20:31 call into the filesystem is part of an iget_locked ->iget unlock_new_inode sandwich
2008-11-04 20:31 I don't think we're looking at the func we should be looking at
2008-11-04 20:32 I think new_inode and unlock_new_inode are paired, iget_locked should probably be unlocked elsewhere
2008-11-04 20:33 hmm
2008-11-04 20:33 or maybe I'm seeing things
2008-11-04 20:34 for example
2008-11-04 20:34 http://lxr.linux.no/linux+v2.6.27/fs/ext2/inode.c#L1184
2008-11-04 20:34 ok, I think iget(5)_locked can potentially return a new_inode, but not always - only if the initial lookup fails
2008-11-04 20:34 iget is just a library function, called by the fs
2008-11-04 20:35 unlock_new_inode would appear to be a misnomer
2008-11-04 20:35 it always clears the I_NEW flag, which is perhaps the reason it's named that way
2008-11-04 20:35 it gets called at the very end of ext2_iget
2008-11-04 20:36 I think it requires some conditions that are true for new inodes
2008-11-04 20:36 for the locking to be correct
2008-11-04 20:36 http://lxr.linux.no/linux+v2.6.27/fs/inode.c#L595
2008-11-04 20:39 ok, is it really true that ext2_iget returns the inode locked if it is already in the hash and unlocked if it is new?
2008-11-04 20:39 yes I'm wondering about that
2008-11-04 20:39 seems ... weird ...
2008-11-04 20:41 does iget_locked really return a locked inode?
2008-11-04 20:41 if iget_locked found in cache, it's already unlocked
2008-11-04 20:42 if not found, inode has I_NEW|I_LOCK
2008-11-04 20:42 yes
2008-11-04 20:43 ok, so the inode hash is just a service the vfs provides to a filesystem
2008-11-04 20:44 it'll increase ref count though
2008-11-04 20:44 I think I get it
2008-11-04 20:44 if it's in cache, then it's a valid inode, that others can access
2008-11-04 20:44 if it's not, then we allocate a new one
2008-11-04 20:44 and it's invalid, and thus has to be locked, so others don't get junk, until we fill it in and unlock it
2008-11-04 20:44 yes
2008-11-04 20:45 there is one user of iget* that doesn't treat it as a mere library call
2008-11-04 20:45 which is nfs
2008-11-04 20:46 care to elaborate?
2008-11-04 20:47 well
2008-11-04 20:47 it used to ;)
2008-11-04 20:47 now we have nfs-specific methods to resolve filehandles
2008-11-04 20:47 ah, nfsd
2008-11-04 20:47 moment
2008-11-04 20:48 are we talking about the nfs server or the client?
2008-11-04 20:48 server
2008-11-04 20:48 e.g., ext2_export_ops->ext2_fh_to_dentry
2008-11-04 20:49 so the server is a client of whatever filesystem it's exporting... it shouldn't be mucking around in that filesystem's innards at all?
2008-11-04 20:49 right, the filesystem has to provide the export_operations interface and a couple other things
2008-11-04 20:50 and this makes iget* a proper library call
2008-11-04 20:50 not ever called by vfs
2008-11-04 20:51 the vfs way of getting an inode is to resolve a path
2008-11-04 20:51 so the iget* family of functions are just a library implementation of an inode cache?
2008-11-04 20:51 eventually calling ->lookup
2008-11-04 20:51 which filesystems may or may not use
2008-11-04 20:51 yes
2008-11-04 20:51 as far as I know, all inode using filesystems use it
2008-11-04 20:52 fatfs doesn't use it
2008-11-04 20:52 not inode - fs
2008-11-04 20:53 iget*_locked?
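The cache-hit versus cache-miss behaviour worked out above can be sketched as a small userspace model. This is not kernel code: the names iget_locked, unlock_new_inode and I_NEW mirror the kernel's, but the flag value, the Inode fields, the dict standing in for the inode hash, and toy_fs_iget are all assumptions made for illustration.

```python
# Toy model of the iget_locked / unlock_new_inode pairing discussed above.
I_NEW = 0x1  # hypothetical flag value, chosen only for this sketch


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.state = I_NEW   # a fresh inode holds junk until the fs fills it in
        self.count = 0       # reference count
        self.size = None     # stand-in for attributes read from disk


class InodeCache:
    def __init__(self):
        self.hash = {}       # inode number -> Inode

    def iget_locked(self, ino):
        inode = self.hash.get(ino)
        if inode is None:
            inode = Inode(ino)       # miss: allocate, still marked I_NEW
            self.hash[ino] = inode
        inode.count += 1             # hit or miss, the caller gets a reference
        return inode


def unlock_new_inode(inode):
    inode.state &= ~I_NEW            # filled in, now safe for other users


def toy_fs_iget(cache, disk, ino):
    """What a filesystem's own iget does around the two library calls."""
    inode = cache.iget_locked(ino)
    if not inode.state & I_NEW:
        return inode                 # cache hit: already valid
    inode.size = disk[ino]           # on-disk format -> in-memory inode
    unlock_new_inode(inode)
    return inode
```

Calling toy_fs_iget twice with the same number returns the same object with its reference count raised, and only the first call touches the simulated disk.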
2008-11-04 20:53 yes
2008-11-04 20:54 I wonder how far back the _locked variant goes
2008-11-04 20:54 I first knew these as iget() and iget4()
2008-11-04 20:55 what ext2 actually does between the iget and the unlock is important, but not that interesting
2008-11-04 20:55 fill in the cached inode
2008-11-04 20:56 by finding a block in the buffer cache, or reading it in if it's not there
2008-11-04 20:56 basically on-disk format to in-memory inode-cache conversion
2008-11-04 20:56 yes
2008-11-04 20:57 we have ext2_update_inode that writes out a changed inode
2008-11-04 20:57 create a new inode in cache, delete an inode, and those are the main inode operations
2008-11-04 20:58 now if I may, I'll talk about something specific to tux3
2008-11-04 20:59 I found that I have one big interaction between the caching layer and the disk update layer that I initially overlooked
2008-11-04 21:00 go on ...
2008-11-04 21:00 writing to a file is very nicely decoupled, all the inode can go into the page cache
2008-11-04 21:00 and then later, changes can be made to the on disk structures
2008-11-04 21:01 and the ondisk inode updated to reflect that, with pointers to the new data and updated attributes
2008-11-04 21:01 when we create a file, the change to the dirent block only needs to happen in cache
2008-11-04 21:02 but we need to have an inode number to make the directory link
2008-11-04 21:02 that requires accessing the on-disk filesystems
2008-11-04 21:02 looking for a free inode
2008-11-04 21:02 and updating the inode table block so that the same inode is not allocated again
2008-11-04 21:02 well theoretically inodes could just be the number of the operation on the fs
2008-11-04 21:03 you have to remember that number somehow
2008-11-04 21:03 because they have to be persistent
2008-11-04 21:03 superblock?
2008-11-04 21:03 the inode number is forever, at least if you are supporting nfs on your filesystem
2008-11-04 21:04 the superblock stores the last # we allocated, playing back the log may increase that
2008-11-04 21:04 allocating an inode involves taking the number, increasing it, and logging the increase
2008-11-04 21:04 and we use 64bits or something, so we don't have to worry about running out
2008-11-04 21:05 works if the value of the inode number doesn't matter
2008-11-04 21:05 why should it matter?
2008-11-04 21:05 do we have size limits?
2008-11-04 21:05 usually it does matter, because you need to store related fs objects near each other
2008-11-04 21:06 in practice, because more than one inode is stored on a block
2008-11-04 21:06 related in what sense?
2008-11-04 21:06 by being in the same directory for example
2008-11-04 21:06 can't you just use a hash tree for inode lookups though?
2008-11-04 21:07 meant b-tree ;-)
2008-11-04 21:07 hash keyed btree?
2008-11-04 21:07 no, hash was just a typo
2008-11-04 21:08 meant a b-tree indexed by inode #
2008-11-04 21:08 you could
2008-11-04 21:08 but physical proximity is pretty important
2008-11-04 21:08 you'll have physical proximity for files created close to each other timewise
2008-11-04 21:08 ACTION waits for Maze to suggest adding another layer of indirection
2008-11-04 21:09 not so, for files which are being updated etc
2008-11-04 21:09 ACTION thinks about the days of flash drives and physical location no longer mattering...
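The counter scheme proposed above (superblock holds the last number handed out, each allocation is logged, log replay may raise the counter) can be sketched as follows. This is a minimal model under stated assumptions, not tux3 code: Superblock, InumAllocator and the plain list used as the log are hypothetical names invented for the illustration.

```python
class Superblock:
    def __init__(self, last_inum=0):
        self.last_inum = last_inum       # last inode number known to be durable


class InumAllocator:
    def __init__(self, sb, log):
        self.log = log                   # list standing in for the on-disk log
        self.last = sb.last_inum
        for record in log:               # replay may push the counter higher
            self.last = max(self.last, record)

    def alloc(self):
        self.last += 1
        self.log.append(self.last)       # log the increase along with handing it out
        return self.last
```

Building a second allocator from the same superblock and log models a crash and remount: the replay step is what keeps a number from being handed out twice.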
2008-11-04 21:09 even for flash it matters, if you want to pack more than one inode onto a block
2008-11-04 21:10 yes, but not as much
2008-11-04 21:10 by a huge margin
2008-11-04 21:10 true
2008-11-04 21:10 but still enough to care I think
2008-11-04 21:10 and single inodes are probably going to be pretty beefy
2008-11-04 21:11 filesystems that have taken the simplifying assumption of making inodes block-granular have paid for it in performance
2008-11-04 21:11 ocfs2 is a good example
2008-11-04 21:11 they can't be block granular, that much is obvious
2008-11-04 21:11 besides
2008-11-04 21:11 with disks
2008-11-04 21:11 it's not just a matter of having many within one disk block
2008-11-04 21:12 sector granular and maybe you have a deal ;)
2008-11-04 21:12 but having them close to each other (say within the same 64k) also makes a big deal (readahead etc)
2008-11-04 21:12 you should still probably be capable of fitting more than 1 (maybe 2?) inodes in a sector
2008-11-04 21:13 although really depends
2008-11-04 21:13 how many can be packed on a block has a big effect on cache performance, mass delete is a good way to stress that
2008-11-04 21:13 performance of which cache?
2008-11-04 21:13 disk cache?
2008-11-04 21:13 page cache?
2008-11-04 21:13 bugfer cache?
2008-11-04 21:13 inode table block cache -> buffer cache
2008-11-04 21:14 the answer is "yes" to all three
2008-11-04 21:14 clearly - the bug cache is most important ;-)
2008-11-04 21:15 anyway, the question is similar to "why don't we just allocate data blocks sequentially"
2008-11-04 21:15 instead of going to the bother of having bitmaps etc
2008-11-04 21:15 yup
2008-11-04 21:15 well, no, not quite
2008-11-04 21:15 for one thing, you need to be able to find freed inodes
2008-11-04 21:15 because you've got deletes - that's why you need the bitmap
2008-11-04 21:16 both for blocks and inodes
2008-11-04 21:16 data and inodes I meant
2008-11-04 21:16 well...
2008-11-04 21:16 you need to be able to reuse blocks
2008-11-04 21:16 you don't need to be able to reuse inodes
2008-11-04 21:16 and inodes
2008-11-04 21:16 ah, ok
2008-11-04 21:16 not reusing inodes probably even solves some problems (nfs)
2008-11-04 21:17 so you make the inode number extra big and never wrap
2008-11-04 21:17 right
2008-11-04 21:17 store them in a hash on disk
2008-11-04 21:17 well, some structure
2008-11-04 21:17 ok, a btree
2008-11-04 21:17 and the inode attributes... right there in the btree?
2008-11-04 21:18 yup
2008-11-04 21:18 and always add new inodes on the right of the btree
2008-11-04 21:18 yup
2008-11-04 21:19 another solution would be to have a pool of pre-allocated inodes that you can pull from if you need to create an inode
2008-11-04 21:19 but that would not have good physical locality either
2008-11-04 21:20 anyway, this was all by way of trying to avoid having to store a new inode number in the inode table block at file create time, right?
2008-11-04 21:21 well, you don't have to store it per se
2008-11-04 21:21 only log the intent
2008-11-04 21:21 I thought the problem was not so much the need to store it
2008-11-04 21:21 but the need to find a free inode number
2008-11-04 21:21 that's correct
2008-11-04 21:22 well, this gets rid of the need to find a free inode number
2008-11-04 21:22 and lets the fs pick the inode number before creating a dirent
2008-11-04 21:23 yup
2008-11-04 21:23 I'll ponder that
2008-11-04 21:23 in the meantime, let me go on about the messy interaction we get otherwise
2008-11-04 21:24 so, _assuming_ we have to look at the inode table structure to choose an inode at file create time
2008-11-04 21:24 [[this can also be extended lock-less to smp or even multi-kernel fs]]
2008-11-04 21:25 right, so a couple of solid advantages
2008-11-04 21:26 still, the notion of endless inode numbers is a little uncomfortable
2008-11-04 21:26 maybe it's just me
2008-11-04 21:26 why?
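The "attributes right there in the btree, new inodes always on the right" idea can be sketched with a sorted array standing in for the b-tree. This is only a model under assumptions: InodeTable and its methods are hypothetical names, and a real b-tree would split and merge nodes rather than keep one flat array.

```python
import bisect


class InodeTable:
    """Sorted array standing in for a b-tree keyed by inode number."""

    def __init__(self):
        self.keys = []     # inode numbers, always ascending
        self.attrs = []    # attributes stored right beside the key

    def create(self, inum, attrs):
        # Numbers only grow, so a new inode always lands on the right edge.
        if self.keys and inum <= self.keys[-1]:
            raise ValueError("new inode must go on the right edge")
        self.keys.append(inum)
        self.attrs.append(attrs)

    def _find(self, inum):
        i = bisect.bisect_left(self.keys, inum)
        if i < len(self.keys) and self.keys[i] == inum:
            return i
        return None

    def lookup(self, inum):
        i = self._find(inum)
        return None if i is None else self.attrs[i]

    def delete(self, inum):
        i = self._find(inum)
        if i is not None:
            del self.keys[i]   # the gap is never refilled
            del self.attrs[i]
```

Create never searches for a free slot, which is the point made in the discussion; the cost raised there, that files created at different times end up far apart, is visible in the model too, since deleted numbers leave holes that stay empty.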
2008-11-04 21:27 don't know
2008-11-04 21:27 hmm
2008-11-04 21:27 maybe I'm comfortable with this because it lies at the core of my approach to netfs and solving the
2008-11-04 21:27 'make it stateless but sane with caching and reboots' problem
2008-11-04 21:28 I doubt there is any such thing as stateless and sane
2008-11-04 21:28 (two core ideas, not reuse anything that can be not reused, i.e. inodes, second idea, make every operation either a read or reversible)
2008-11-04 21:30 (ie. a delete has to have all the information needed to run it in reverse as a create - although this doesn't require file content)
2008-11-04 21:30 anyway, what happens is: vfs creates dentry for new inode; calls fs; fs allocates inum, stores initial attributes in inode table block; write data to page cache
2008-11-04 21:30 repeat a bunch of times, then flush the cache to stable storage
2008-11-04 21:31 flushing requires assigning blocks to the cached data and making the inode reference those blocks
2008-11-04 21:31 meanwhile... another file create happens
2008-11-04 21:32 the file create has to wait for the flush to complete, or it will change a block that has to be written to disk as part of a consistent fs image
2008-11-04 21:33 alternatively you could treat files/inodes with 1 hardlink to be directly part of the directory file, and only with the instance of a second hardlink would it become promoted into a true inode, although the number wouldn't change, but it would deal with locality on a dir-level
2008-11-04 21:34 ok, there's your problem
2008-11-04 21:34 Can't that be solved with copy-on-write techniques?
2008-11-04 21:35 not quite
2008-11-04 21:35 hmm
2008-11-04 21:35 we can "fork" a new inode table block to accommodate the create, but then the flush might change the original version
2008-11-04 21:35 I was actually thinking we never overwrite old blocks, always write out new ones, and only reuse old ones once they're fully freed
2008-11-04 21:36 (fully freed - not referenced from anywhere in the superblock + forward log chain)
2008-11-04 21:36 you have to distinguish between disk and cache when you say overwrite
2008-11-04 21:37 true
2008-11-04 21:37 well pages in flight to disk have to be treated as (or be) locked
2008-11-04 21:38 except the page isn't in flight yet, we're just setting up the cache to be transferred to disk
2008-11-04 21:38 yeah, that's why copy-on-write update semantics are so nice ;-)
2008-11-04 21:39 and don't work in this case
2008-11-04 21:39 because the flush changes the original copy, while the forked copy is also changed
2008-11-04 21:39 with no obvious way to merge the two sets of changes
2008-11-04 21:40 I think I see
2008-11-04 21:40 I think I need to write an email
2008-11-04 21:40 what's up with the gettogether?
2008-11-04 21:41 seems unlikely, I just spent the weekend out with stomach flu
2008-11-04 21:41 still pretty shaky
2008-11-04 21:42 in that case let's put it off
2008-11-04 21:43 I kind of need to know by wed if I'm going somewhere on thu ;-)
2008-11-04 21:43 wisest course of action
2008-11-04 21:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-04 22:11 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-11-04 22:46 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
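The fork-versus-flush conflict described in the discussion (the flush changes the original inode table block while the create changes the forked copy) can be made concrete with a toy model. Everything here is an assumption for illustration: the dict layout, the inode numbers 7 and 8, and the block numbers are invented, and a real inode table block is a packed on-disk structure, not a dictionary.

```python
# An inode table block in cache: inode number -> attributes.
original = {7: {"blocks": None}}     # inode 7 has dirty data but no blocks yet

# A file create arrives while the flush is being set up, so the block is forked.
forked = {ino: dict(attrs) for ino, attrs in original.items()}
forked[8] = {"blocks": None}         # the create touches only the fork

# The flush assigns blocks to inode 7 and records them in the original.
original[7]["blocks"] = [1000, 1001]

# Each copy now holds a change the other lacks.
missing_from_fork = forked[7]["blocks"] is None
missing_from_original = 8 not in original
```

Neither copy can simply replace the other afterwards: keeping the fork loses the block pointers for inode 7, and keeping the original loses inode 8, which is the "no obvious way to merge" problem raised in the conversation.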