
s-udhaya/deltatorch


deltatorch

Concept

deltatorch allows users to directly use DeltaLake tables as a data source for training using PyTorch. Using deltatorch, users can create a PyTorch DataLoader to load the training data. We support distributed training using PyTorch DDP as well.

Why yet another data-loading framework?

  • Many Deep Learning projects struggle with efficient data loading, especially with tabular datasets or datasets containing many small images.
  • Classical Big Data formats like Parquet can help with this issue, but are hard to operate:
    • Writers might block readers.
    • A failed write can make the whole dataset unreadable.
    • More complicated projects might ingest data all the time, even during training.

The Delta Lake storage format solves all these issues, but PyTorch has no direct support for DeltaLake datasets. deltatorch introduces such support and allows users to train Deep Learning models with PyTorch directly on DeltaLake tables.

Usage

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install git+https://github.com/delta-incubator/deltatorch

Create PyTorch DataLoader to read our DeltaLake table

To use deltatorch, we first need a DeltaLake table containing the training data for our PyTorch deep learning model. There is one requirement: this table must have an autoincrement ID field, which deltatorch uses for sharding and for parallelizing the loading. After that, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader, which can be used directly during training. Below you can find an example of creating a DataLoader for the following table schema:

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 
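
To make the role of the autoincrement ID field more concrete, here is a minimal sketch of how a contiguous ID range could be divided between DDP processes and reader workers. It is an illustration only and not deltatorch's actual implementation; the helper names split_id_range and shard_for are hypothetical, and the sketch assumes the IDs form a dense range from min_id to max_id.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive ID range [min_id, max_id] into contiguous shards.

    Returns (start, end) pairs with start inclusive and end exclusive.
    Earlier shards get one extra ID when the range does not divide evenly.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_shards)
    shards = []
    start = min_id
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


def shard_for(rank: int, world_size: int, worker: int, num_workers: int,
              min_id: int, max_id: int) -> Tuple[int, int]:
    """Pick the ID range for one reader from its process rank and worker index."""
    shards = split_id_range(min_id, max_id, world_size * num_workers)
    return shards[rank * num_workers + worker]


if __name__ == "__main__":
    # 10 rows with IDs 0..9, split across 2 processes x 2 workers.
    print(split_id_range(0, 9, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because every reader can compute its own range from the ID bounds alone, no coordination between readers is needed. This is one plausible reason a dense ID column is convenient for parallel loading.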

After the table is ready, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader:

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec

def create_data_loader(path: str, batch_size: int, transform):
    # `transform` was not defined in the original snippet; here the caller
    # passes it in (for example, a torchvision transform applied to each image).
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True,
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )
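
Once the DataLoader exists, it can be consumed like any other iterable of batches. The sketch below shows one way to do that. It assumes each batch is a dict keyed by the FieldSpec names ("image" and "label"); please verify this against the batches you actually receive. The run_epoch helper and the stand-in loader are hypothetical and only serve to keep the example runnable without a DeltaLake table.

```python
from typing import Any, Callable, Dict, Iterable


def run_epoch(loader: Iterable[Dict[str, Any]],
              step: Callable[[Any, Any], float]) -> float:
    """Iterate once over the loader and return the mean value reported by `step`.

    ASSUMPTION: each batch is a dict keyed by the field names
    ("image" and "label" here). Adjust the keys if your batches differ.
    """
    total, batches = 0.0, 0
    for batch in loader:
        total += step(batch["image"], batch["label"])
        batches += 1
    return total / batches if batches else 0.0


if __name__ == "__main__":
    # Stand-in for a real DataLoader: two hand-written batches.
    fake_loader = [
        {"image": [[0.0], [1.0]], "label": [0, 1]},
        {"image": [[2.0]], "label": [1]},
    ]
    # Stand-in for a real training step: report the batch size as the "loss".
    print(run_epoch(fake_loader, lambda images, labels: float(len(labels))))  # 1.5
```

In real training, `step` would run the forward pass, compute the loss, and update the model, and `loader` would be the object returned by create_data_loader.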
