pachyderm: Decouple block size and data chunking
Problem
Right now Pachyderm has a static block size which is used when writing files to pfs with PutFile. A single write is broken up based on the delimiter being used, in such a way that block boundaries correspond to delimiter boundaries. If a delimiter boundary doesn't fall exactly on a block boundary then pfs will create a block bigger than the block size, and if the total PutFile (or the fragment of data that remains after previous blocks have been filled) is smaller than a block it will create a block smaller than the block size. Blocks, and thus the block size, also come into play when processing data: map steps split files up based on blocks, so two datums that are in the same block will necessarily be processed by the same container.
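To make the current behaviour concrete, here is a minimal sketch (in Go, not the actual pfs code) of delimiter-aligned splitting and why it produces blocks both larger and smaller than the configured block size:

package sketch

import "bytes"

// splitAtDelimiters is a sketch of how a single PutFile write gets broken
// into blocks: each boundary is pushed out to the next delimiter, so blocks
// can exceed the nominal block size, and the trailing fragment can fall
// short of it.
func splitAtDelimiters(data []byte, blockSize int, delim byte) [][]byte {
	var blocks [][]byte
	for len(data) > 0 {
		if len(data) <= blockSize {
			// Remaining fragment is smaller than a block.
			blocks = append(blocks, data)
			break
		}
		end := blockSize
		if i := bytes.IndexByte(data[end:], delim); i >= 0 {
			end += i + 1 // extend past the delimiter
		} else {
			end = len(data) // no delimiter left; take the rest
		}
		blocks = append(blocks, data[:end])
		data = data[end:]
	}
	return blocks
}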
The issue here is that there are competing goals that require moving the block size in opposite directions. Right now we have a block size of 8 MB, which satisfies both goals badly. A larger block size makes our access pattern on object storage much nicer: 8 MB is a pretty small object for most object storage, and we've been seeing issues with sending too many requests to obj storage. However, a smaller block size lets Pachyderm parallelize over more data sets. Users have been running into problems where they want to parallelize over a file that's less than 8 MB, and the only way they can get this to work is to have one step in the pipeline which first splits the file up into separate files and then another step which does a Reduce on them, so that things get split up by file rather than by block. This is definitely a kludgy workflow that we would like to avoid. It also has the problem that each of these files becomes its own very small block, which is even worse for obj storage access patterns than 8 MB blocks. Although it's slightly orthogonal, I think this is also a good chance to make our system smarter about merging multiple files together into the same block.
Goals
- BlockSize should be a value that we set entirely based on the underlying object storage; different object stores have different sweet spots in terms of the file size that maximizes performance.
- BlockSize shouldn’t be hard-coded anymore; setting it as an env var will make it easier for people to tweak it (see the sketch after this list).
- Multiple small files within the same commit should be written to the same block. Having small files span commit boundaries seems like it’s going to be difficult, because when FinishCommit is called all of the data referenced in the commit needs to be safely written to obj storage before it can proceed.
- How data is chunked for processing should have nothing to do with blocks; instead there should be a separate indexing scheme for this that allows you to chunk the data in a delimited way by reading smartly from different locations in blocks.
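For the env-var goal above, a minimal sketch of the configuration; the variable name PFS_BLOCK_SIZE_BYTES and the fallback are assumptions for illustration, not anything Pachyderm reads today:

package sketch

import (
	"os"
	"strconv"
)

// blockSize returns the block size to use for the underlying object store.
// PFS_BLOCK_SIZE_BYTES is a hypothetical env var name; the 8 MB fallback
// matches the current hard-coded value.
func blockSize() int64 {
	const defaultBlockSize = 8 * 1024 * 1024
	if v := os.Getenv("PFS_BLOCK_SIZE_BYTES"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return defaultBlockSize
}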
Implementation
The biggest change we need is a separate index for the delimiters in a file. Currently pfs stores Diffs in rethinkdb, and each Diff contains an array of BlockRefs. A BlockRef looks like:
message BlockRef {
  string hash = 1;
  uint64 lower = 2;
  uint64 upper = 3;
}
I think we should add a field repeated int64 delimiters = 4; which will hold all of the byte offsets at which we can break the chunk up. This can be computed by the block server in readBlock pretty trivially. Once we have this information it should be pretty simple to use it in GetFile, which already figures out what data to send using BlockRefs.
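A sketch of what readBlock could compute to fill that field, assuming a single-byte delimiter such as a newline; the function name and shape are hypothetical:

package sketch

// delimiterOffsets computes the byte offsets (relative to the block) at
// which the chunk can legally be split, i.e. the values that would go into
// the proposed repeated int64 delimiters = 4; field.
func delimiterOffsets(block []byte, delim byte) []int64 {
	var offsets []int64
	for i, b := range block {
		if b == delim {
			// Record the position just past the delimiter: a legal cut point.
			offsets = append(offsets, int64(i+1))
		}
	}
	return offsets
}

GetFile could then serve a delimited sub-range of a block by slicing between two recorded offsets rather than being limited to whole BlockRefs.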
In order to get coalescing small files to work we may need to rethink how we name blocks a little bit. The issue is that normally a call to PutBlock returns a BlockRef which gives you the block’s hash. This isn’t going to work anymore, because if later calls to PutBlock add data to the same block then the block is going to end up with a different hash. Naming blocks by their hashes (content addressing) makes less sense once blocks can be composed of smaller files, since it only pays off if the file requests happen to arrive in the same order and thus create a truly identical block. Instead it might make more sense to do deduplication at the diff level: compute a hash for each file chunk that we write to a block and check whether that chunk’s hash is already present in any other block. If it is, we can just reference that chunk instead of putting it in a new block. Under this scheme blocks could just be named with a UUID, I think, since hashing the content wouldn’t correspond to anything that interesting. They could also still be hashed, and I don’t really see any downside to that either.
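To make the dedup idea concrete, here is a sketch under those assumptions (per-chunk hashing, randomly named blocks); all of the names below are hypothetical, not the actual implementation:

package sketch

import (
	"crypto/rand"
	"crypto/sha512"
	"encoding/hex"
)

// chunkRef points at a byte range inside a block, mirroring BlockRef.
type chunkRef struct {
	block string // block name, e.g. a random ID rather than a content hash
	lower uint64
	upper uint64
}

// chunkIndex maps the hash of a file chunk to the location where an
// identical chunk has already been stored.
type chunkIndex map[string]chunkRef

// put returns an existing reference if an identical chunk is already stored
// somewhere; otherwise it records where the chunk would land at the end of
// the given block (actually appending the bytes is left to the caller).
func (idx chunkIndex) put(block string, blockLen uint64, chunk []byte) (ref chunkRef, dedup bool) {
	sum := sha512.Sum512(chunk)
	key := hex.EncodeToString(sum[:])
	if existing, ok := idx[key]; ok {
		return existing, true
	}
	ref = chunkRef{block: block, lower: blockLen, upper: blockLen + uint64(len(chunk))}
	idx[key] = ref
	return ref, false
}

// newBlockName illustrates why content addressing is no longer required once
// dedup happens per chunk: blocks can simply get random names.
func newBlockName() string {
	var b [16]byte
	rand.Read(b[:])
	return hex.EncodeToString(b[:])
}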
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Fix #933; also add a few more tests — committed to pachyderm/pachyderm by derekchiang 8 years ago
@jdoliner so I think there is a very fundamental difference between Hadoop and pachyderm, and this hits it on the head. Hadoop is based on injecting user code as plugins into every aspect of it, including how it parses and breaks up files. Unless pachyderm starts to include a similar plugin-based API there is no way the core can hope to cover all possible ways of breaking up files. After all, there are as many ways to break up files as there are users. You can specify delimiters in the blocks, but then how do users influence that? Do we get to choose from a set of stock delimiters, or do I inject parsing code somehow? If it’s a set of stock delimiters, is it only one type per repo/pipeline or can I combine them? How can I inject my own delimiters, by modifying pachyderm?
In the current system there are two ways to inject behaviour: via various config options (so long as pachyderm supports them) and via a user-supplied docker container. If we want to support the ability for end users to inject delimiters without modifying pachyderm itself then there needs to be a way to do that by interacting only with files and a filesystem. There is a solution to this, and that’s writing out to individual files. The only difference between this and a single large file is coming up with a name.
I guess I kind of feel like, because of the way pachyderm works with containers and exposing things via a filesystem, large sharded files are almost an anti-pattern. So instead of asking “Given that I have a large quantity of coarsely broken data, how do I finely content-address it?” the question becomes “Given that I have a large quantity of finely addressed data, how do I store it coarsely?”. I suppose this is the same problem, but it is perhaps slightly easier to think about.
If I was to solve this problem, I would batch up writes to files in a given commit into one blob. The metadata for a file diff then contains the blob where the file is stored, the offset into the blob, and the length of the file. S3 supports partial reads so this is ok; I don’t know about other stores. This also means most of a diff is stored together, so when a pipeline needs to operate on it it will likely use most of the diff and we can read the blob in its entirety. Ultimately I just think batching is an easier problem to solve than slicing.
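As a sketch of that design (types and field names made up for illustration), the per-file metadata and a partial read could look roughly like this, using S3's ranged GetObject:

package sketch

import (
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// fileDiff is the metadata described above: the blob a commit's writes were
// batched into, plus the offset and length of this file within that blob.
type fileDiff struct {
	Blob   string // object key of the batched blob
	Offset int64
	Length int64
}

// readFile fetches just this file's bytes out of the blob with an S3 ranged
// GET (the partial read the comment relies on).
func readFile(svc *s3.S3, bucket string, d fileDiff) (io.ReadCloser, error) {
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(d.Blob),
		// HTTP Range header; the last byte index is inclusive.
		Range: aws.String(fmt.Sprintf("bytes=%d-%d", d.Offset, d.Offset+d.Length-1)),
	})
	if err != nil {
		return nil, err
	}
	return out.Body, nil
}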