WO2024156084A1

WO2024156084A1 - Variants of cpf1 (cas12a) with improved activity

Info

Publication number: WO2024156084A1
Application number: PCT/CN2023/073486
Authority: WO
Inventors: Jianping Xu; Wan SHI; Lizhao GENG
Original assignee: Syngenta Crop Protection Ag; Syngenta Group Co., Ltd.
Priority date: 2023-01-27
Filing date: 2023-01-27
Publication date: 2024-08-02

Abstract

Provided herein are variant Cas12a proteins comprising at least one human-induced mutation. Also provided are fusion proteins comprising the variant Cas12a proteins and one or more heterologous domains. Also provided are associated nucleic acids, DNA constructs, vectors, cells, and methods of editing nucleic acids using the variant Cas12a proteins and/or fusion proteins. Use of the provided proteins can increase the frequency of desired nucleic acid edits (e.g., SDN-1 edits in plant genomes).

Description

VARIANTS OF CPF1 (CAS12A) WITH IMPROVED ACTIVITY

FIELD

This disclosure relates to methods to increase site-directed nuclease editing.

REFERENCE TO A SEQUENCE LISTING SUBMITTED AS AN XML FILE

This application is accompanied by a sequence listing entitled 82447-SL.xml, created January 19, 2023, which is approximately 149 kilobytes in size. This sequence listing is incorporated herein by reference in its entirety.

BACKGROUND

Site-directed nucleases (SDNs) (e.g., zinc finger nucleases, transcription activator-like effector nucleases, CRISPR-associated nucleases) have gained increasing popularity in the gene editing space. These SDNs act as endonucleases and generally create double-stranded breaks (DSBs) in specific DNA sequences, activating intrinsic repair mechanisms of the cell (e.g., homologous recombination). During the repair process, site-directed modification of the specific DNA sequence can be achieved. The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas (CRISPR-associated) system evolved in bacteria and archaea as an adaptive immune system to defend against viral attack. In recent years, the CRISPR/Cas system has attracted particular interest as a tool for genome editing. CRISPR/Cas systems that generate site-specific DSBs can be used to edit DNA in eukaryotic cells, e.g., by producing deletions, insertions, and/or changes in nucleotide sequence.

BRIEF SUMMARY

The Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in limiting the scope of the claimed subject matter.

In one aspect, provided is a Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 1 and a human-induced mutation at position C965. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position D156. In some embodiments, the human-induced mutation at position D156 is an aspartic acid to arginine substitution. In some embodiments, the sequence of the Cas12a protein comprises any one of SEQ ID NOs: 5-11.

In another aspect, provided is a Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 2 and a human-induced mutation at position C70, C1116, and/or C1190. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position E184. In some embodiments, the human-induced mutation at position E184 is a glutamic acid to arginine substitution.

In another aspect, provided is a Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 3 and a human-induced mutation at position C334, C379, and/or C674. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position E174. In some embodiments, the human-induced mutation at position E174 is a glutamic acid to arginine substitution.

In another aspect, provided is a Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 4 and a human-induced mutation at position C270, C583, C1068, C1099, and/or C1149. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position D172. In some embodiments, the human-induced mutation at position D172 is an aspartic acid to arginine substitution. In some embodiments, the sequence of the Cas12a protein comprises any one of SEQ ID NOs: 12-19.
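By way of non-limiting illustration only, the position-specific substitutions recited in the aspects above (written here in the common shorthand, e.g., C965S or D156R) can be sketched in Python. The function name apply_substitution and the toy sequence are hypothetical and are not taken from the sequence listing; positions are assumed to be 1-based relative to the reference sequence.

```python
def apply_substitution(seq: str, mutation: str) -> str:
    """Apply a point substitution written as <wild-type><position><variant>.

    Example: 'C965S' replaces the cysteine at 1-based position 965 with serine.
    """
    wild_type = mutation[0]
    variant = mutation[-1]
    position = int(mutation[1:-1])  # 1-based position in the reference
    index = position - 1
    if index < 0 or index >= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if seq[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {seq[index]}"
        )
    return seq[:index] + variant + seq[index + 1:]


# Toy five-residue sequence for illustration; NOT any of SEQ ID NOs: 1-4.
toy = "MKCDE"
single = apply_substitution(toy, "C3S")     # cysteine to serine
double = apply_substitution(single, "D4R")  # aspartic acid to arginine
print(single, double)
```

The check on the wild-type residue guards against applying a mutation to a sequence whose numbering differs from the reference.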

In some embodiments of any of the Cas12a proteins described above, the Cas12a protein is a catalytically dead Cas12a (dCas12a) protein or a nickase Cas12a (nCas12a) protein.

In some embodiments of any of the Cas12a proteins described above, the Cas12a protein further comprises a nuclear localization signal.

In another aspect, provided is a fusion protein comprising any of the Cas12a proteins described above and a heterologous domain.

In some embodiments, the heterologous domain is a deaminase domain, a transcription factor domain, a nuclease domain, a reverse-transcriptase domain, a transposase domain, an integrase domain, a uracil DNA glycosylase inhibitor domain, a recombinase domain, a nickase domain, a methyltransferase domain, a methylase domain, an acetylase domain, an acetyltransferase domain, a transcriptional activator domain, or a transcriptional repressor domain.

In some embodiments of the fusion protein, the Cas12a protein is linked to the heterologous domain by a linker sequence.

In another aspect, provided is a nucleic acid encoding any of the Cas12a proteins or any of the fusion proteins described above. In some embodiments, the nucleic acid sequence is any one of SEQ ID NOs: 20-34.

In another aspect, provided is a DNA construct comprising a promoter operably linked to the nucleic acid encoding any of the Cas12a proteins or any of the fusion proteins described above.

In another aspect, provided is a vector comprising the nucleic acid or the DNA construct described above.

In another aspect, provided is a cell comprising the nucleic acid, the DNA construct, or the vector described above. In some embodiments, the cell is a plant cell. In some embodiments, the cell is a maize plant cell, a wheat plant cell, a rice plant cell, a soybean plant cell, a sunflower plant cell, or a tomato plant cell.

In another aspect, provided is a method of editing a nucleic acid, the method comprising contacting the nucleic acid with (i) any one of the Cas12a proteins described above or any one of the fusion proteins described above, and (ii) a guide RNA having a region complementary to a selected portion of the nucleic acid, thereby resulting in an edit to the nucleic acid.

BRIEF DESCRIPTION OF THE DRAWINGS

The present application includes the following figures. The figures are intended to illustrate certain embodiments and/or features of the compositions and methods, and to supplement any description(s) of the compositions and methods. The figures do not limit the scope of the compositions and methods, unless the written description expressly indicates that such is the case.

FIG. 1 shows that the cysteine residues in LbCas12a may potentially form inter- or intra-molecular interactions. Left: the PyMOL surface model of the LbCas12a-crRNA-DNA ternary complex (PDB entry 5XUS). The highlighted areas indicated by arrows are the thiol groups of C965 and C1090 that are potentially exposed to the surface. Right: four cysteine residues (C10, C805, C912, C965) that are scattered in the linear amino acid sequence form a cluster inside the 3D structure of LbCas12a.

FIG. 2 shows two of the cysteine residues in FnCas12a that were selected for substitution, according to aspects of this disclosure. The PyMOL stick models of C1190 and C1116 suggest that the thiol groups (in black) are close to each other in the FnCas12a 3D structure (PDB entry 5NFV) and may potentially form an intramolecular disulfide bond.

DETAILED DESCRIPTION

The following description recites various aspects and embodiments of the present compositions and methods. No particular embodiment is intended to define the scope of the compositions and methods. Rather, the embodiments merely provide non-limiting examples of various compositions and methods that are at least included within the scope of the disclosed compositions and methods. The description is to be read from the perspective of one of ordinary skill in the art; therefore, information well known to the skilled artisan is not necessarily included.

I. Terminology

All technical and scientific terms used herein, unless otherwise defined below, are intended to have the same meaning as commonly understood by one of ordinary skill in the art. References to techniques employed herein are intended to refer to the techniques as commonly understood in the art, including variations on those techniques and/or substitutions of equivalent techniques that would be apparent to one of skill in the art. While the following terms are believed to be well understood by one of ordinary skill in the art, the following definitions are set forth to facilitate explanation of the presently disclosed subject matter.

As used herein, the singular forms “a”, “an” and “the” include plural referents unless the content clearly dictates otherwise. Thus, for example, reference to “an enzyme” optionally includes a combination of two or more such molecules, and the like.

As used herein, “and/or” refers to and encompasses any and all possible combinations of one or more of the associated listed items.

The term “about” as used herein refers to the usual error range for the respective value readily known to the skilled person in this technical field; for example, variations of ± 20%, ± 10%, or ± 5% are within the intended meaning of the recited value.

As used herein, the term “comprising” or “comprise” is open-ended. When used in connection with a subject nucleic acid (or amino acid sequence), it refers to a nucleic acid sequence (or an amino acid sequence) that includes the subject sequence as a part or as its entire sequence.

As used herein, the transitional phrase “consisting essentially of” means that the scope of a claim is to be interpreted to encompass the specified materials or steps recited in the claim and those that do not materially affect the basic and novel characteristic(s) of the claimed matter. Thus, the term “consisting essentially of” when used in a claim of this disclosure is not intended to be interpreted to be equivalent to “comprising.”

The term “plurality” refers to more than one entity. Thus, a “plurality of individuals” refers to at least two individuals. In some embodiments, the term plurality refers to more than half of the whole. For example, in some embodiments a “plurality of a population” refers to more than half the members of that population.

The term “plant” as used herein refers to any plant at any stage of development, particularly a seed plant. The term “plant cell” as used herein refers to a structural and physiological unit of a plant, comprising a protoplast and a cell wall. The plant cell may be in the form of an isolated single cell or a cultured cell, or a part of a higher organized unit such as, for example, plant tissue, a plant organ, or a whole plant. The plant cell may be derived from or part of an angiosperm or gymnosperm. The plant cell may be a monocotyledonous plant cell (e.g., a maize cell, a rice cell, a sorghum cell, a sugarcane cell, a barley cell, a wheat cell, an oat cell, a turf grass cell, or an ornamental grass cell) or a dicotyledonous plant cell (e.g., a tobacco cell, a pepper cell, an eggplant cell, a sunflower cell, a crucifer cell, a flax cell, a potato cell, a cotton cell, a soybean cell, a sugar beet cell, or an oilseed rape cell). The term “plant cell culture” as used herein refers to cultures of plant units such as, for example, protoplasts, cell culture cells, cells in plant tissues, pollen, pollen tubes, ovules, embryo sacs, zygotes and embryos at various stages of development. The term “plant tissue” as used herein refers to a group of plant cells organized into a structural and functional unit. Any tissue of a plant in planta or in culture is included. This term includes, but is not limited to, whole plants, plant organs, plant seeds, tissue culture and any group of plant cells organized into structural and/or functional units. The use of this term in conjunction with, or in the absence of, any specific type of plant tissue as listed above or otherwise embraced by this definition is not intended to be exclusive of any other type of plant tissue. The term “plant part” as used herein refers to a part of a plant, including single cells and cell tissues such as plant cells that are intact in plants, cell clumps and tissue cultures from which plants can be regenerated. Examples of plant parts include, but are not limited to, single cells and tissues from pollen, ovules, zygotes, leaves, embryos, roots, root tips, anthers, flowers, flower parts, fruits, stems, shoots, cuttings, and seeds; as well as pollen, ovules, egg cells, zygotes, leaves, embryos, roots, root tips, anthers, flowers, flower parts, fruits, stems, shoots, cuttings, scions, rootstocks, seeds, protoplasts, calli, and the like.

The terms “polypeptide,” “peptide,” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues. As used herein, the terms encompass amino acid chains of any length, including full-length proteins, wherein the amino acid residues are linked by covalent peptide bonds.

The terms “nucleic acid” and “polynucleotide” are used interchangeably and as used herein refer to deoxyribonucleic acids (DNA) or ribonucleic acids (RNA) and polymers thereof in either single- or double-stranded form, as well as to both sense and anti-sense strands of RNA, cDNA, genomic DNA, mitochondrial DNA, and synthetic forms and mixed polymers of the above. In higher plants, DNA is the genetic material while RNA is involved in the transfer of information contained within DNA into proteins. A “genome” is the entire body of genetic material contained in each cell of an organism. It is understood that when an RNA is described, its corresponding cDNA is also described, wherein uridine is represented as thymidine. In particular embodiments, a nucleotide refers to a ribonucleotide, deoxynucleotide or a modified form of either type of nucleotide, and combinations thereof. In addition, a polynucleotide disclosed herein may include either or both naturally occurring and modified nucleotides linked together by naturally occurring and/or non-naturally occurring nucleotide linkages. The nucleic acid molecules may be modified chemically or biochemically or may contain non-natural or derivatized nucleotide bases, as will be readily appreciated by those of skill in the art. Such modifications include, for example, labels, methylation, substitution of one or more of the naturally occurring nucleotides with an analogue, internucleotide modifications such as uncharged linkages (e.g., methyl phosphonates, phosphotriesters, phosphoramidates, carbamates, and the like), charged linkages (e.g., phosphorothioates, phosphorodithioates, and the like), pendent moieties (e.g., polypeptides), intercalators (e.g., acridine, psoralen, and the like), chelators, alkylators, and modified linkages (e.g., alpha anomeric nucleic acids, and the like). The above term is also intended to include any topological conformation, including single-stranded, double-stranded, partially duplexed, triplex, hairpinned, circular and padlocked conformations. A reference to a nucleic acid sequence encompasses its complement unless otherwise specified. Thus, a reference to a nucleic acid molecule having a particular sequence should be understood to encompass its complementary strand, with its complementary sequence. Nucleotide sequences are “complementary” when they specifically hybridize in solution (e.g., according to Watson-Crick base pairing rules). The term also includes codon-optimized nucleic acids that encode the same polypeptide sequence. It is also understood that nucleic acids can be unpurified, purified, or attached, for example, to a synthetic material such as a bead or column matrix.
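As a non-limiting sketch of two conventions stated above (an RNA implies its cDNA with uridine represented as thymidine, and a sequence encompasses its Watson-Crick complement), the following Python functions may be used. The function names and toy sequences are hypothetical, and only the unmodified bases A, C, G, T, and U are assumed.

```python
# Translation table for Watson-Crick complements of unmodified DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_to_dna(rna: str) -> str:
    """Represent an RNA sequence as DNA by writing uridine (U) as thymidine (T)."""
    return rna.upper().replace("U", "T")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


# Toy sequences for illustration only.
print(rna_to_dna("AUGGCU"))
print(reverse_complement("ATGGCT"))
```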

The term “corresponding to” in the context of nucleic acid sequences means that when the nucleic acid sequences of certain sequences are aligned with each other, the nucleic acids that “correspond to” certain enumerated positions in the present invention are those that align with these positions in a reference sequence, but that are not necessarily in these exact numerical positions relative to a particular nucleic acid sequence of the invention. Optimal alignment of sequences for comparison can be conducted by computerized implementations of known algorithms, or by visual inspection. Readily available sequence comparison and multiple sequence alignment algorithms are, respectively, the Basic Local Alignment Search Tool (BLAST) and ClustalW/ClustalW2/Clustal Omega programs available on the Internet (e.g., the website of the EMBL-EBI). Other suitable programs include, but are not limited to, GAP, BestFit, Plot Similarity, and FASTA, which are part of the Accelrys GCG Package available from Accelrys, Inc. of San Diego, Calif., United States of America. See also Smith & Waterman, 1981; Needleman & Wunsch, 1970; Pearson & Lipman, 1988; Ausubel et al., 1988; and Sambrook & Russell, 2001.

Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, SNPs, and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues. See Batzer et al., Nucleic Acid Res. 19: 5081 (1991); Ohtsuka et al., J. Biol. Chem. 260: 2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8: 91-98 (1994).
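For illustration only, the notion of a position that “corresponds to” an enumerated reference position can be sketched in Python as a lookup through an existing pairwise alignment. The function name corresponding_position, the gap character, and the toy alignment are assumptions made for this sketch; the alignment itself is assumed to have been produced beforehand by any of the programs mentioned above.

```python
from typing import Optional


def corresponding_position(
    aligned_ref: str, aligned_query: str, ref_pos: int, gap: str = "-"
) -> Optional[int]:
    """Map a 1-based reference position to the 1-based query position
    in the same alignment column, or None if the query has a gap there."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    ref_count = 0
    query_count = 0
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char != gap:
            ref_count += 1
        if query_char != gap:
            query_count += 1
        if ref_char != gap and ref_count == ref_pos:
            return query_count if query_char != gap else None
    raise ValueError("ref_pos is beyond the end of the reference")


# Toy alignment: the query carries one extra base and lacks the last base.
ref = "AC-GTA"
qry = "ACTGT-"
print(corresponding_position(ref, qry, 3))  # reference G aligns with query position 4
print(corresponding_position(ref, qry, 5))  # reference A aligns with a gap: None
```

In this toy case the third reference base sits at the fourth query position, which shows why corresponding positions need not share the same numerical index.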

The term “identity” or “substantial identity,” as used in the context of a polynucleotide or polypeptide sequence described herein, refers to a sequence that has at least 60% sequence identity to a reference sequence. Alternatively, percent identity can be any integer from 60% to 100%. Exemplary embodiments include at least: 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%, as compared to a reference sequence using the programs described herein; preferably BLAST using standard parameters, as described below. One of skill will recognize that these values can be appropriately adjusted to determine corresponding identity of proteins encoded by two nucleotide sequences by taking into account codon degeneracy, amino acid similarity, reading frame positioning and the like.

For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Default program parameters can be used, or alternative parameters can be designated. The sequence comparison algorithm then calculates the percent sequence identities for the test sequences relative to the reference sequence, based on the program parameters.
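As a minimal, non-limiting sketch of the final step of such a comparison, percent identity can be computed from two already-aligned sequences in Python. The denominator used here (total number of alignment columns) is an assumption; alignment programs differ in how they treat gaps and overhangs. The function names and toy sequences are hypothetical.

```python
def percent_identity(aligned_test: str, aligned_ref: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_test:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for t, r in zip(aligned_test, aligned_ref) if t == r and t != gap
    )
    return 100.0 * matches / len(aligned_test)


def meets_identity_threshold(
    aligned_test: str, aligned_ref: str, threshold: float = 80.0
) -> bool:
    """Check an 'at least N% identical' condition under the same assumption."""
    return percent_identity(aligned_test, aligned_ref) >= threshold


# Toy alignment: 4 identical columns out of 6.
print(round(percent_identity("ACGT-A", "ACCTTA"), 2))
print(meets_identity_threshold("ACGT-A", "ACCTTA"))
```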

A “comparison window,” as used herein, includes reference to a segment of any one of the number of contiguous positions selected from the group consisting of from 20 to 600, usually about 50 to about 200, more usually about 100 to about 150 in which a sequence may be compared to a reference sequence of the same number of contiguous positions after the two sequences are optimally aligned. Methods of alignment of sequences for comparison are well-known in the art. Optimal alignment of sequences for comparison may be conducted by the local homology algorithm of Smith and Waterman, Adv. Appl. Math. 2: 482 (1981), by the homology alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48: 443 (1970), by the search for similarity method of Pearson and Lipman, Proc. Natl. Acad. Sci. (U.S.A.) 85: 2444 (1988), by computerized implementations of these algorithms (e.g., BLAST), or by manual alignment and visual inspection.

Algorithms that are suitable for determining percent sequence identity and sequence similarity are the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al. (1990) J. Mol. Biol. 215: 403-410 and Altschul et al. (1997) Nucleic Acids Res. 25: 3389-3402, respectively. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (NCBI) web site. The algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a word size (W) of 28, an expectation (E) of 10, M=1, N=-2, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a word size (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix. See Henikoff & Henikoff, Proc. Natl. Acad. Sci. USA 89: 10915 (1992).
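The three stopping conditions for word-hit extension described above can be illustrated with a simplified Python sketch. This is not the BLAST implementation: it extends only to the right, allows no gaps, and assumes an exact-match seed. The function name extend_right, the default x_drop value, and the toy sequences are hypothetical; M and N are taken from the nucleotide defaults stated above.

```python
from typing import Tuple


def extend_right(
    query: str,
    subject: str,
    q_start: int,
    s_start: int,
    seed_len: int,
    match: int = 1,      # M: reward for a matching pair (> 0)
    mismatch: int = -2,  # N: penalty for a mismatching pair (< 0)
    x_drop: int = 3,     # X: allowed fall-off from the best score (assumed value)
) -> Tuple[int, int]:
    """Extend an exact seed to the right; return (best_score, best_length)."""
    score = seed_len * match  # assumption: every seed position matches
    best_score = score
    best_len = seed_len
    i = q_start + seed_len
    j = s_start + seed_len
    # Stop condition 3: the end of either sequence is reached.
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        if score > best_score:
            best_score = score
            best_len = i - q_start + 1
        elif best_score - score >= x_drop or score <= 0:
            # Stop condition 1: score fell off by X from its maximum.
            # Stop condition 2: cumulative score went to zero or below.
            break
        i += 1
        j += 1
    return best_score, best_len


# Toy sequences: identical for 7 bases, then diverging.
print(extend_right("ACGTACGTTTTT", "ACGTACGAGGGG", 0, 0, 4))
```

A larger x_drop tolerates longer runs of mismatches before halting, which is one way the parameter X trades speed for sensitivity.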

The BLAST algorithm also performs a statistical analysis of the similarity between two sequences. See, e.g., Karlin & Altschul, Proc. Nat'l. Acad. Sci. USA 90: 5873-5877 (1993). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.01, more preferably less than about 10^-5, and most preferably less than about 10^-20.

“Recombination” is the exchange of DNA strands to produce new nucleotide sequence arrangements. The term may refer to the process of homologous recombination that occurs in double-strand DNA break repair, where a polynucleotide is used as a template to repair a homologous polynucleotide. The term may also refer to exchange of information between two homologous chromosomes during meiosis. The frequency of double recombination is the product of the frequencies of the single recombinants. For instance, a recombinant in a 10 cM area can be found with a frequency of 10%, and double recombinants are found with a frequency of 10% x 10% = 1% (1 centimorgan is defined as 1% recombinant progeny in a testcross).
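The arithmetic in the example above can be sketched in Python. The product rule is assumed to hold, i.e., the single recombination events are treated as independent (no crossover interference); the function name is hypothetical.

```python
def double_recombinant_frequency(*single_frequencies: float) -> float:
    """Multiply single-recombinant frequencies (as fractions between 0 and 1).

    Assumption: the recombination events are independent of one another.
    """
    product = 1.0
    for frequency in single_frequencies:
        if not 0.0 <= frequency <= 1.0:
            raise ValueError("frequencies must be fractions between 0 and 1")
        product *= frequency
    return product


# Two 10 cM intervals, each with a 10% single-recombinant frequency.
print(f"{double_recombinant_frequency(0.10, 0.10):.2%}")
```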

A “gene” is a defined region that is located within a genome and that, besides the aforementioned coding nucleic acid sequence, comprises other, primarily regulatory, nucleic acid sequences responsible for the control of the expression, that is to say the transcription and translation, of the coding portion. Genes can include both coding and non-coding regions (e.g., introns, regulatory elements, promoters, enhancers, termination sequences and 5' and 3' untranslated regions). A gene typically expresses mRNA, functional RNA, or specific protein, including regulatory sequences. Genes may or may not be capable of being used to produce a functional protein. In some embodiments, a gene refers to only the coding region. The term “native gene” refers to a gene as found in nature. The term “chimeric gene” refers to any gene that contains 1) DNA sequences, including regulatory and coding sequences that are not found together in nature, or 2) sequences encoding parts of proteins not naturally adjoined, or 3) parts of promoters that are not naturally adjoined. Accordingly, a chimeric gene may comprise regulatory sequences and coding sequences that are derived from different sources, or comprise regulatory sequences and coding sequences derived from the same source, but arranged in a manner different from that found in nature. A gene may be “isolated” by which is meant a nucleic acid molecule that is substantially or essentially free from components normally found in association with the nucleic acid molecule in its natural state. Such components include other cellular material, culture medium from recombinant production, and/or various chemicals used in chemically synthesizing the nucleic acid molecule.

A “gene of interest” or “nucleotide sequence of interest” refers to any gene which, when transferred to a plant, confers upon the plant a desired characteristic such as antibiotic resistance, virus resistance, insect resistance, disease resistance, or resistance to other pests, herbicide tolerance, improved nutritional value, improved performance in an industrial process or altered reproductive capability. The “gene of interest” may also be one that is transferred to plants for the production of commercially valuable enzymes or metabolites in the plant.

An “isolated” nucleic acid molecule or nucleotide sequence or an “isolated” polypeptide is a nucleic acid molecule, nucleotide sequence, or polypeptide that, by the hand of man, exists apart from its native environment and/or has a function that is different, modified, modulated and/or altered as compared to its function in its native environment and is therefore not a product of nature. An isolated nucleic acid molecule or isolated polypeptide may exist in a purified form or may exist in a non-native environment such as, for example, a recombinant host cell. Thus, for example, with respect to polynucleotides, the term isolated means that it is separated from the chromosome and/or cell in which it naturally occurs. A polynucleotide is also isolated if it is separated from the chromosome and/or cell in which it naturally occurs and is then inserted into a genetic context, a chromosome, a chromosome location, and/or a cell in which it does not naturally occur. The recombinant nucleic acid molecules and nucleotide sequences of the invention can be considered to be “isolated” as defined above.

Thus, an “isolated nucleic acid molecule” or “isolated nucleotide sequence” is a nucleic acid molecule or nucleotide sequence that is not immediately contiguous with nucleotide sequences with which it is immediately contiguous (one on the 5' end and one on the 3' end) in the naturally occurring genome of the organism from which it is derived. Accordingly, in one embodiment, an isolated nucleic acid includes some or all of the 5' non-coding (e.g., promoter) sequences that are immediately contiguous to a coding sequence. The term therefore includes, for example, a recombinant nucleic acid that is incorporated into a vector, into an autonomously replicating plasmid or virus, or into the genomic DNA of a prokaryote or eukaryote, or which exists as a separate molecule (e.g., a cDNA or a genomic DNA fragment produced by PCR or restriction endonuclease treatment), independent of other sequences. It also includes a recombinant nucleic acid that is part of a hybrid nucleic acid molecule encoding an additional polypeptide or peptide sequence. An “isolated nucleic acid molecule” or “isolated nucleotide sequence” can also include a nucleotide sequence derived from and inserted into the same natural, original cell type, but which is present in a non-natural state, e.g., present in a different copy number, and/or under the control of different regulatory sequences than that found in the native state of the nucleic acid molecule.

The term “isolated” can further refer to a nucleic acid molecule, nucleotide sequence, polypeptide, peptide or fragment that is substantially free of cellular material, viral material, and/or culture medium (e.g., when produced by recombinant DNA techniques), or chemical precursors or other chemicals (e.g., when chemically synthesized). Moreover, an “isolated fragment” is a fragment of a nucleic acid molecule, nucleotide sequence or polypeptide that is not naturally occurring as a fragment and would not be found as such in the natural state. “Isolated” does not necessarily mean that the preparation is technically pure (homogeneous), but it is sufficiently pure to provide the polypeptide or nucleic acid in a form in which it can be used for the intended purpose.

“Homology dependent repair” or “homology directed repair” or “HDR” refers to a mechanism for repairing single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) damage in cells. This repair mechanism can be used by the cell when there is an HDR template with a sequence with significant homology to the injury site. The term “perfect HDR” refers to a situation in which genomic-homology junctions in the replaced allele underwent complete HDR and “imperfect HDR” refers to a situation in which genomic-homology junctions in the replaced allele underwent partial or incomplete HDR. In HDR, a donor DNA molecule with homology to the cleaved target DNA sequence is used as a template for repair of the cleaved target DNA sequence, resulting in the transfer of genetic information from the donor polynucleotide to the target DNA. As such, new nucleic acid material may be inserted/copied into the site. In some cases, a target DNA is contacted with a donor molecule, for example a donor DNA molecule. In some cases, a donor DNA molecule is introduced into a cell. In some cases, at least a segment of a donor DNA molecule integrates into the genome of the cell.

“Microhomology-mediated end joining” or “MMEJ” or “alternative nonhomologous end-joining” (Alt-NHEJ) refers to a form of repairing double-stranded breaks in DNA. This repair mechanism utilizes microhomologous sequences to align the broken strands. “Non-homologous end joining” or “NHEJ” refers to a form of repairing double-stranded breaks in DNA. The double-strand breaks are repaired by direct ligation of the break ends to one another. Generally, no new nucleic acid material is inserted into the site, although some nucleic acid material may be lost or added, resulting in a small deletion or a small insertion.

As used herein, “heterologous” refers to a nucleic acid molecule, nucleotide sequence, polypeptide, or amino acid sequence not naturally associated with a host cell into which it is introduced, that either originates from another species or is from the same species or organism but is modified from either its original form or the form primarily expressed in the cell, including non-naturally occurring multiple copies of a naturally occurring sequence. Thus, an amino acid sequence derived from an organism or species different from that of the cell into which the amino acid sequence is introduced is heterologous with respect to that cell and the cell's descendants. In addition, a heterologous sequence includes a sequence derived from and inserted into the same natural, original cell type, but which is present in a non-natural state, e.g., present in a different copy number, and/or under the control of different regulatory sequences than that found in the native state of the polypeptide. A sequence can also be heterologous to other sequences with which it may be associated, for example in a nucleic acid construct such as an expression vector. As one non-limiting example, a promoter may be present in a nucleic acid construct in combination with one or more regulatory elements and/or coding sequences that do not naturally occur in association with that particular promoter, i.e., they are heterologous to the promoter.

II. Introduction

In some aspects, provided herein are variant Cas12a proteins having increased site-directed nuclease (SDN) genome editing activity. Site-directed nuclease technology has dramatically increased the speed and precision with which one can make genome edits in various organisms, including plants. Generally, the desired outcomes in SDN-mediated genome editing are 1) to target SDNs to cleave DNA at a specific genomic site in a host (e.g., a plant cell) and 2) to use the host’s natural repair mechanisms to introduce specific genomic changes at the cleavage site. The changes can include small deletions, substitutions, or the addition of a number of nucleotides. Such targeted edits can result in a new and desired characteristic (e.g., enhanced nutrient uptake, decreased allergen production) and/or a reduction in an undesirable characteristic (e.g., herbicide susceptibility). SDN applications have generally been divided into three categories: SDN-1, SDN-2, and SDN-3. SDN-1 produces a double-stranded break in a genome without the addition of foreign DNA. When such a break is repaired by the host (e.g., via NHEJ), mutations or deletions can be introduced. If these mutations or deletions are in a gene, the gene can be silenced or knocked out. SDN-2 uses template DNA to introduce a predicted modification at the target cleavage site (e.g., via HDR), but does not result in insertion of recombinant DNA. SDN-3 also uses template DNA, in this case to introduce recombinant or exogenous DNA (e.g., a transgene) at the target cleavage site.

Cas12a is a CRISPR-associated (Cas) SDN that functions in a CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas system. In bacteria, this system can provide adaptive immunity against foreign DNA (Barrangou, R., et al., “CRISPR provides acquired resistance against viruses in prokaryotes,” Science (2007) 315: 1709-1712; Makarova, K.S., et al., “Evolution and classification of the CRISPR-Cas systems,” Nat Rev Microbiol (2011) 9: 467-477; Garneau, J.E., et al., “The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA,” Nature (2010) 468: 67-71; Sapranauskas, R., et al., “The Streptococcus thermophilus CRISPR/Cas system provides immunity in Escherichia coli,” Nucleic Acids Res (2011) 39: 9275-9282). In a wide variety of organisms including diverse mammals, animals, plants, microbes, and yeast, a CRISPR/Cas system (e.g., modified and/or unmodified) can be utilized as a genome engineering tool. A CRISPR/Cas system can comprise a guide nucleic acid such as a guide RNA (gRNA) complexed with a Cas protein for targeted regulation of gene expression and/or activity or nucleic acid editing. An RNA-guided Cas protein (e.g., a Cas nuclease such as a Cas9 nuclease) can specifically bind a target polynucleotide (e.g., DNA) in a sequence-dependent manner. The Cas protein, if possessing nuclease activity, can cleave the DNA (Gasiunas, G., et al., “Cas9-crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria,” Proc Natl Acad Sci USA (2012) 109: E2579-E2586; Jinek, M., et al., “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity,” Science (2012) 337: 816-821; Sternberg, S.H., et al., “DNA interrogation by the CRISPR RNA-guided endonuclease Cas9,” Nature (2014) 507: 62; Deltcheva, E., et al., “CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III,” Nature (2011) 471: 602-607). DNA cleavage (e.g., double-strand breaks) can result in DNA break repair which allows for the introduction of gene modification(s) (e.g., nucleic acid editing).

Cysteine residues are highly reactive residues that are subject to posttranslational modifications. Formation of undesired disulfide bonds and/or modifications could affect proper folding, and/or localization, and/or enzymatic activity of a protein. There are 8-9 cysteine residues in most Cas12a orthologs; in comparison, there are only 2 cysteine residues in Cas9 from Streptococcus pyogenes (SpCas9). Most cysteine residues in Cas12a orthologs are not conserved. Therefore, those exposed on the surface are more likely to be involved in intermolecular disulfide bond formation and/or posttranslational modifications. Conserved cysteine residues in the LbCas12a, FnCas12a, AsCas12a, and Mb2Cas12a proteins are shown in Tables 1-4.

Table 1. The cysteine residues in LbCas12a and their aligned residues in a four-ortholog pairwise alignment.

Table 2. The cysteine residues in FnCas12a and their aligned residues in a four-ortholog pairwise alignment.

Table 3. The cysteine residues in AsCas12a and their aligned residues in a four-ortholog pairwise alignment.

Table 4. The cysteine residues in Mb2Cas12a and their aligned residues in a four-ortholog pairwise alignment.
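By way of non-limiting illustration only, the enumeration of cysteine residues in a given ortholog can be sketched computationally as follows. The function name `cysteine_positions` and the toy sequence are hypothetical and are not taken from the sequence listing. The sketch reports positions only; it does not assess surface exposure or conservation, which require structural or alignment data.

```python
def cysteine_positions(seq):
    """Return the 1-based positions of every cysteine (C) in a protein sequence."""
    return [i for i, aa in enumerate(seq.upper(), start=1) if aa == "C"]


# Hypothetical toy sequence for illustration; NOT a Cas12a sequence.
toy_sequence = "MSCKLEKFTNCYSLSKTLRFC"
positions = cysteine_positions(toy_sequence)
print(len(positions), positions)
```

Applied to a full-length ortholog sequence, the same function would give the residue numbers that could then be compared across an alignment.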

The present disclosure is based in part on the discovery by the inventors that mutating surface-exposed cysteine residues of Cas12a can improve the bioavailability of Cas12a proteins. Without being bound by any particular theory, it is likely that such mutations avoid the undesired modifications described above. Provided herein are variant Cas12a proteins comprising at least one human-induced mutation. Also provided are fusion proteins comprising the variant Cas12a proteins and one or more heterologous domains. Also provided are associated nucleic acids, DNA constructs, vectors, cells, and methods of editing nucleic acids using the variant Cas12a proteins and/or fusion proteins. In some embodiments, as demonstrated in the Examples herein, the provided methods result in an increased frequency of desired nucleic acid edits. In some embodiments, the edits are SDN-1 edits. In some embodiments, the increased frequency of desired nucleic acid edits is seen at genomic sites that are difficult to edit.

III. Variant Cas12a proteins and fusion proteins

In one aspect, provided herein are variant Cas12a proteins comprising at least one human-induced mutation that have enhanced function (i.e., when compared to unmodified Cas12a proteins) . Also provided are fusion proteins comprising said variant Cas12a proteins and at least one heterologous domain. In some embodiments, the enhanced function of Cas12a is increased SDN-1 genome editing activity. In some embodiments, the variant Cas12a proteins comprise substitutions of one or more surface-exposed cysteine residues. In some embodiments, the variant Cas12a proteins comprise cysteine to serine substitutions at one or more surface-exposed cysteine residues. In some embodiments, the variant Cas12a proteins provided herein further comprise a substitution of an aspartic acid residue and/or a glutamic acid residue to an arginine residue.

Cas12a (which is also referred to as Cpf1) is a Class II, Type V CRISPR/Cas. A variant Cas12a protein provided herein can be a modified form of Cas12a from any of a number of bacterial species including, but not limited to, Lachnospiraceae bacterium, Acidaminococcus sp., Moraxella bovoculi, Thiomicrospira sp., Moraxella lacunata, Methanomethylophilus alvus, Butyrivibrio sp., or Bacteroidetes oral sp. Unmodified Cas12a protein sequences include Lachnospiraceae bacterium Cas12a (LbCas12a; SEQ ID NO: 1), Francisella novicida U112 Cas12a (FnCas12a; SEQ ID NO: 2), Acidaminococcus sp. Cas12a (AsCas12a; SEQ ID NO: 3), and Moraxella bovoculi strain 57922 Cas12a (Mb2Cas12a; SEQ ID NO: 4).

In some embodiments, the variant Cas12a protein is a modified form of LbCas12a. In some embodiments, the Cas12a protein comprises a sequence that is at least 60% identical (e.g., at least 65%, at least 70%, at least 75%, at least 76%, at least 77%, at least 78%, at least 79%, at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) to the amino acid sequence of SEQ ID NO: 1 and at least one human-induced mutation. In some embodiments, the human-induced mutation is a substitution of a surface-exposed cysteine residue. Surface-exposed cysteine residues can be identified using methods known in the art, e.g., by the methods described in the Examples herein. In some embodiments, one or more surface-exposed cysteine residues are substituted with another residue (e.g., a serine residue). In some embodiments, the human-induced mutation is at position C965 (i.e., the cysteine residue at position 965 of SEQ ID NO: 1). In some embodiments, the human-induced mutation is a substitution of the cysteine residue. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position D156 (i.e., the aspartic acid residue at position 156 of SEQ ID NO: 1), as described for example in WO2018195545 and WO2017184768, which are incorporated herein by reference in their entirety. In some embodiments, the human-induced mutation is a substitution of the aspartic acid residue. In some embodiments, the human-induced mutation is an aspartic acid to arginine substitution. In some embodiments, the sequence of the Cas12a protein comprises any one of SEQ ID NOs: 5-11.

In some embodiments, the variant Cas12a protein is a modified form of FnCas12a. In some embodiments, the Cas12a protein comprises a sequence that is at least 60% identical (e.g., at least 65%, at least 70%, at least 75%, at least 76%, at least 77%, at least 78%, at least 79%, at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) to the amino acid sequence of SEQ ID NO: 2 and at least one human-induced mutation. In some embodiments, the human-induced mutation is a substitution of a surface-exposed cysteine residue. In some embodiments, one or more surface-exposed cysteine residues are substituted with another residue (e.g., a serine residue). In some embodiments, the human-induced mutation is at position C70, C1116, and/or C1190. In some embodiments, the human-induced mutation is a substitution of the cysteine residue. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position E184 (i.e., the glutamic acid residue at position 184 of SEQ ID NO: 2), as described for example in WO2018195545, which is incorporated herein by reference in its entirety. In some embodiments, the human-induced mutation is a substitution of the glutamic acid residue. In some embodiments, the human-induced mutation is a glutamic acid to arginine substitution.

In some embodiments, the variant Cas12a protein is a modified form of AsCas12a. In some embodiments, the Cas12a protein comprises a sequence that is at least 60% identical (e.g., at least 65%, at least 70%, at least 75%, at least 76%, at least 77%, at least 78%, at least 79%, at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) to the amino acid sequence of SEQ ID NO: 3 and at least one human-induced mutation. In some embodiments, the human-induced mutation is a substitution of a surface-exposed cysteine residue. In some embodiments, one or more surface-exposed cysteine residues are substituted with another residue (e.g., a serine residue). In some embodiments, the human-induced mutation is at position C334, C379, and/or C674. In some embodiments, the human-induced mutation is a substitution of the cysteine residue. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position E174, as described for example in WO2018195545, which is incorporated herein by reference in its entirety. In some embodiments, the human-induced mutation is a substitution of the glutamic acid residue. In some embodiments, the human-induced mutation is a glutamic acid to arginine substitution.

In some embodiments, the variant Cas12a protein is a modified form of Mb2Cas12a. In some embodiments, the Cas12a protein comprises a sequence that is at least 60% identical (e.g., at least 65%, at least 70%, at least 75%, at least 76%, at least 77%, at least 78%, at least 79%, at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) to the amino acid sequence of SEQ ID NO: 4 and at least one human-induced mutation. In some embodiments, the human-induced mutation is a substitution of a surface-exposed cysteine residue. In some embodiments, one or more surface-exposed cysteine residues are substituted with another residue (e.g., a serine residue). In some embodiments, the human-induced mutation is at position C270, C583, C1068, C1099, and/or C1149. In some embodiments, the human-induced mutation is a substitution of the cysteine residue. In some embodiments, the human-induced mutation is a cysteine to serine substitution. In some embodiments, the Cas12a protein further comprises a human-induced mutation at position D172. In some embodiments, the human-induced mutation is a substitution of the aspartic acid residue. In some embodiments, the human-induced mutation is an aspartic acid to arginine substitution. In some embodiments, the sequence of the Cas12a protein comprises any one of SEQ ID NOs: 12-19.
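By way of non-limiting illustration only, substitutions written in the notation used above (e.g., C965S or D156R) can be applied to a reference sequence in silico as sketched below. The helper name `apply_substitutions` and the toy sequence are hypothetical. The check that the reference residue matches the stated wild-type residue is an assumption of this sketch, included to guard against numbering errors.

```python
import re

# Matches substitutions such as "C965S": wild-type residue, 1-based position, new residue.
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_substitutions(seq, substitutions):
    """Apply point substitutions (e.g., ["C965S", "D156R"]) to a protein sequence."""
    residues = list(seq)
    for sub in substitutions:
        match = _SUBSTITUTION.match(sub)
        if match is None:
            raise ValueError(f"unrecognized substitution: {sub!r}")
        wild_type, position, replacement = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(seq):
            raise ValueError(f"position {position} is outside the sequence")
        # Compare against the unmodified reference so the order of substitutions does not matter.
        if seq[position - 1] != wild_type:
            raise ValueError(f"expected {wild_type} at position {position}, found {seq[position - 1]}")
        residues[position - 1] = replacement
    return "".join(residues)


# Hypothetical toy sequence; a real use would pass, e.g., SEQ ID NO: 1 with ["C965S", "D156R"].
variant = apply_substitutions("MDCKEC", ["C3S", "D2R"])
print(variant)
```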

A Cas protein (e.g., a Cas12a protein) can comprise one or more domains. Non-limiting examples of domains include guide nucleic acid recognition and/or binding domains, nuclease domains (e.g., DNase or RNase domains, RuvC, HNH) , DNA binding domains, RNA binding domains, helicase domains, protein-protein interaction domains, and dimerization domains. A guide nucleic acid recognition and/or binding domain can interact with a guide nucleic acid. A nuclease domain can comprise catalytic activity for nucleic acid cleavage. A nuclease domain can lack catalytic activity to prevent nucleic acid cleavage. A Cas protein can be a chimeric Cas protein that is fused to other proteins or polypeptides. A Cas protein can be a chimera of various Cas proteins, for example, comprising domains from different Cas proteins.

A Cas protein (e.g., a Cas12a protein) used herein can be an active variant, inactive variant, or fragment of a wild-type or modified Cas protein. A Cas protein can comprise an amino acid change such as a deletion, insertion, substitution, variant, mutation, fusion, chimera, or any combination thereof relative to a wild-type version of the Cas protein. A Cas protein can be a polypeptide with at least about 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% sequence identity or sequence similarity to a wild-type exemplary Cas protein. A Cas protein can be a polypeptide with at most about 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% sequence identity and/or sequence similarity to a wild-type exemplary Cas protein. Variants or fragments can comprise at least about 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% sequence identity or sequence similarity to a wild-type or modified Cas protein or a portion thereof. Variants or fragments can be targeted to a nucleic acid locus in complex with a guide nucleic acid while lacking nucleic acid cleavage activity.
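By way of non-limiting illustration only, a percent sequence identity of the kind recited above can be computed from two sequences that have already been aligned. The sketch below assumes one common convention: identical residues divided by the number of alignment columns in which at least one sequence has a residue, so that gaps count as mismatches. Other conventions exist and may give different values. The function name `percent_identity` and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; a gap opposite a
    residue is counted as a mismatch (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Hypothetical toy alignment: 4 identical columns out of 6 compared.
identity = percent_identity("MKC-LS", "MKSALS")
print(round(identity, 1), identity >= 60.0)
```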

In some embodiments, a modified Cas protein has decreased function relative to the unmodified form. In some embodiments, a modified Cas protein is deficient in a function of the unmodified form. For example, a nuclease deficient Cas protein retains the ability to bind DNA but lacks or has reduced nucleic acid cleavage activity. A Cas nuclease (e.g., retaining wild-type nuclease activity, having reduced nuclease activity, and/or lacking nuclease activity) can function in a CRISPR/Cas system to regulate the level and/or activity of a target gene or protein (e.g., decrease, increase, or elimination). The Cas protein can bind to a target polynucleotide and prevent transcription by physical obstruction or edit a nucleic acid sequence to yield non-functional gene products. In some embodiments, the modified Cas protein has no more than 90%, no more than 80%, no more than 70%, no more than 60%, no more than 50%, no more than 40%, no more than 30%, no more than 20%, no more than 10%, no more than 5%, or no more than 1% of the function (e.g., nuclease activity) of the wild-type Cas protein (e.g., Cas12a). In some embodiments, the modified Cas protein has no substantial function of the wild-type Cas protein. When a Cas protein is a modified form that has no substantial nucleic acid-cleaving activity, it can be referred to as enzymatically inactive and/or “dead” (abbreviated by “d”). A dead Cas protein (e.g., dCas, dCas12a) can bind to a target polynucleotide but may not cleave the target polynucleotide. In some embodiments, a Cas12a protein provided herein is a dCas12a protein.

In some embodiments, a modified Cas protein can be a modified Cas “base editor” . Base editing enables direct, irreversible conversion of one target DNA base into another in a programmable manner, without requiring DNA cleavage or a donor DNA molecule. For example, Komor et al (2016, Nature, 533: 420-424) , teach a Cas9-cytidine deaminase fusion, where the Cas9 has also been engineered to be inactivated and not induce double-stranded DNA breaks. Additionally, Gaudelli et al (2017, Nature, doi: 10.1038/nature24644) teach a catalytically impaired Cas9 fused to a tRNA adenosine deaminase, which can mediate conversion of an A/T to G/C in a target DNA sequence. In some embodiments, a Cas12a protein provided herein is a modified Cas12a base editor.

A Cas protein can be modified to optimize regulation of gene expression. A Cas protein can be modified to increase or decrease nucleic acid binding affinity, nucleic acid binding specificity, and/or enzymatic activity. Cas proteins can also be modified to change any other activity or property of the protein, such as stability. For example, one or more nuclease domains of the Cas protein can be modified, deleted, or inactivated, or a Cas protein can be truncated to remove domains that are not essential for the function of the protein or to optimize (e.g., enhance or reduce) the activity of the Cas protein for regulating gene expression.

One or a plurality of the nuclease domains (e.g., RuvC, HNH) of a Cas protein can be deleted or mutated so that they are no longer functional or comprise reduced nuclease activity. For example, in a Cas protein comprising at least two nuclease domains (e.g., Cas12a) , if one of the nuclease domains is deleted or mutated, the resulting Cas protein, known as a nickase, can generate a single-strand break at a CRISPR RNA (crRNA) recognition sequence within a double-stranded DNA but not a double-strand break. Such a nickase can cleave the complementary strand or the non-complementary strand, but may not cleave both. In some embodiments, double strand break targeting specificity is improved by targeting a nickase to opposite strands at two nearby loci. If a nickase cleaves the single strand at both loci, a double strand break is formed and can be repaired as described herein. If all of the nuclease domains of a Cas protein (e.g., RuvC nuclease domains in a Cas12a protein) are deleted or mutated, the resulting Cas protein can have a reduced or no ability to cleave both strands of a double-stranded DNA. In some embodiments, a Cas12a protein provided herein is a Cas12a nickase protein.

Also provided herein are fusion proteins comprising any of the proteins described above and a heterologous domain. As used throughout, a “fusion protein” is a protein comprising two different polypeptide sequences, i.e. a Cas12a protein sequence as described above and a heterologous polypeptide sequence, that are joined or linked to form a single polypeptide. In some embodiments, the two amino acid sequences are encoded by separate nucleic acid sequences that have been joined so that they are transcribed and translated to produce a single polypeptide. The Cas12a protein and the heterologous domain can be linked in any order and orientation relative to each other. For example, the C’ terminal end of the Cas12a protein may be linked to the N’ terminal end or the C’ terminal end of the heterologous domain. The Cas12a protein and the heterologous domain may also be separated by one or more additional fusion protein domains, as described below.

Exemplary heterologous domains include deaminase domains, transcription factor domains, nuclease domains, reverse-transcriptase domains, transposase domains, integrase domains, uracil DNA glycosylase inhibitor domains, recombinase domains, nickase domains, methyltransferase domains, methylase domains, acetylase domains, acetyltransferase domains, transcriptional activator domains, and transcriptional repressor domains. See, e.g., WO2021/061507, incorporated herein by reference in its entirety.

In some embodiments, the fusion proteins provided herein comprise one or more linkers. Linkers, also referred to as spacers, as used herein are flexible molecules or a flexible stretch of molecules that joins or connects two portions (e.g., domains) of a fusion protein or a variant Cas12a protein as provided herein. In some embodiments, the linker is a polypeptide. Proteins with domains joined by polypeptide linkers are referred to as fusion proteins. In some embodiments, the linker is a non-peptide linker. Proteins with domains joined by non-peptide linkers are referred to as modified proteins. It will be understood that, where fusion proteins are discussed throughout the present disclosure, modified proteins are generally also contemplated, where feasible.

The linker may increase the range of orientations that may be adopted by the domains of the fusion protein or variant protein. The linker may be optimized to produce desired effects in the fusion protein or variant protein. Aspects of linker design and considerations are described, for example, in Chen, X. et al., Adv Drug Deliv Rev. 2013 Oct 15; 65 (10) : 1357-1369, and Klein, J.S. et al. 2014 Protein Eng. Des. Sel. 27 (10) : 325-330. In some embodiments, the proteins provided herein comprise a peptide linker. In some embodiments, the proteins provided herein comprise a non-peptide linker. In some embodiments, the proteins provided herein comprise a peptide linker and a non-peptide linker. The proteins provided herein may also comprise a plurality of linkers, including at least one peptide linker, at least one non-peptide linker, or at least one peptide linker and at least one non-peptide linker.

Linkers may be short or long, flexible or rigid. See, e.g., WO2021/061507, WO 2020/168102, and US 2021/0017506, each of which is incorporated herein by reference in its entirety.

In some embodiments, the length of a linker may affect one or more functions of the fusion protein. Selection of linkers to achieve the desired length is within the ability of one skilled in the art. In some embodiments, a peptide linker may be, for example, 5 to 100 or more amino acids in length (e.g., 4 aa, 5 aa, 8 aa, 10 aa, 15 aa, 18 aa, 20 aa, 25 aa, 30 aa, 35 aa, 40 aa, 45 aa, 50 aa, 55 aa, 60 aa, 65 aa, 70 aa, 75 aa, 80 aa, 85 aa, 90 aa, 95 aa, or 100 aa) . In some embodiments, the linker is about 30 amino acids in length. In some embodiments, the linker is about 8 amino acids in length.

Depending on length, linker sequence may have various conformations in secondary structure, such as helical, β-strand, coil/bend, and turns. In some instances, a linker sequence may have an extended conformation and function as an independent domain that does not interact with the adjacent protein domains. Linker sequences may be flexible or rigid. Flexible linkers provide a certain degree of movement or interaction between the polypeptide domains and are generally rich in small or polar amino acids such as Gly and Ser (e.g., at least 90%, at least 95%, at least 98%, at least 99%, or all of the amino acid residues of the linker are either Gly or Ser) . A rigid linker can be used to keep a fixed distance between the domains and to help maintain their independent functions. Linker attachment can be through an amide linkage (e.g., a peptide bond) or other functionalities as discussed further below.

In some embodiments, a peptide linker described herein comprises one or more repeats (e.g., 2 repeats, 3 repeats, 4 repeats, 5 repeats, 6 repeats, or more) of GSSSS (SEQ ID NO: 43) and/or one or more repeats of GGGGS (SEQ ID NO: 44) and/or one or more repeats of GSSGSS (SEQ ID NO: 45) and/or one or more repeats of SGGS (SEQ ID NO: 77). In some embodiments, the linker comprises an amino acid sequence with at least 90% sequence identity to (GSSSS)₆ (SEQ ID NO: 46) or (SGGS)₂ (SEQ ID NO: 78). Additional exemplary peptide linkers include, but are not limited to, peptide linkers comprising SGSETPGTSESATPE (SEQ ID NO: 47), SGSETPGTSESATPES (SEQ ID NO: 48), (GGGGS)₃ (SEQ ID NO: 49), (GGGGS)₅ (SEQ ID NO: 50), (GGGGS)₁₀ (SEQ ID NO: 51), GGGGGGGG (SEQ ID NO: 52), GSAGSAAGSGEF (SEQ ID NO: 53), A(EAAAK)₃A (SEQ ID NO: 54), or A(EAAAK)₁₀A (SEQ ID NO: 55). Additional non-limiting exemplary linkers that can be used include those disclosed in PCT/US2020/051383, Chen et al., Adv. Drug. Deliv. Rev. 65 (10): 1357-1369 (2013), and Rosemalen et al., Biochemistry 2017, 56, 50, 6565-6574, the entire contents of each of which are herein incorporated by reference.
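By way of non-limiting illustration only, the repeat-based linkers named above can be written out and characterized as sketched below. The helper names `repeat_linker` and `gly_ser_fraction` are hypothetical. The repeat units and repeat counts are those recited in the text; the Gly/Ser fraction is offered only as a simple proxy for the flexible-linker composition described herein.

```python
def repeat_linker(unit, repeats):
    """Build a linker consisting of `repeats` tandem copies of `unit`."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return unit * repeats


def gly_ser_fraction(linker):
    """Fraction of residues in the linker that are Gly (G) or Ser (S)."""
    if not linker:
        raise ValueError("linker must not be empty")
    return sum(aa in "GS" for aa in linker) / len(linker)


flexible = repeat_linker("GSSSS", 6)            # (GSSSS)6
short = repeat_linker("SGGS", 2)                # (SGGS)2
rigid = "A" + repeat_linker("EAAAK", 3) + "A"   # A(EAAAK)3A

for name, linker in [("flexible", flexible), ("short", short), ("rigid", rigid)]:
    print(name, len(linker), gly_ser_fraction(linker))
```

Under these definitions, (GSSSS)₆ is 30 residues long and (SGGS)₂ is 8 residues long, consistent with the linker lengths of about 30 and about 8 amino acids mentioned above.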

In some embodiments, a non-peptide linker can comprise any of a number of known chemical linkers. Exemplary chemical linkers can include one or more units of beta-alanine, 4-aminobutyric acid (GABA), (2-aminoethoxy) acetic acid (AEA), 6-aminohexanoic acid (Ahx), PEG multimers, and trioxatridecan-succinamic acid (Ttds). In some embodiments, the non-peptide linker comprises one or more units of polyethylene glycol (PEG), which is commonly used as a linker for conjugation of polypeptide domains due to its water solubility, lack of toxicity, low immunogenicity, and well-defined chain lengths. See, e.g., Ramirez-Paz, J., et al., PLoS One 13 (7): e0197643 (2018). The number of PEG linkage units may be selected based on the desired length of the linker.

Modified proteins comprising a non-peptide linker can be produced in a variety of ways. For example, a Cas12a protein and a heterologous domain may be produced separately (e.g., in vitro or by expression in and purification from host cells) and chemically linked in vitro. In some embodiments, a Cas12a protein, a heterologous domain, and a linker can each be produced separately and chemically linked in vitro. Various chemical linkers may be used to cross link two amino acid residues.

Also contemplated herein are embodiments in which the Cas12a protein and the heterologous domain as described above are used separately (e.g., introduced into cells separately or applied to target nucleic acids separately) and brought into proximity to form a complex without using linkers as described above. Various methods of forming complexes between two or more polypeptides are known in the art and include, but are not limited to, using protein-protein interaction strategies (e.g., SunTag, coiled-coil, etc.), using RNA-aptamers and associated binding proteins (e.g., MS2, N22, etc.), and Tag:Catcher strategies. For example, a site-directed nuclease of the present disclosure may comprise an MS2 RNA aptamer, which would facilitate interaction with a nonspecific end-processing enzyme comprising an MS2 coat protein.

In some embodiments, the fusion protein provided herein comprises an amino acid sequence having at least 70% (e.g., at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100%) identity to any one of SEQ ID NOs: 1-4. In some embodiments, the fusion protein provided herein comprises an amino acid sequence as set forth in any one of SEQ ID NOs: 5-19.

Any of the proteins and fusion proteins described herein can further comprise a targeting sequence which mediates the localization (or retention) of the protein to a sub-cellular location, e.g., plasma membrane or membrane of a given organelle, nucleus, cytosol, mitochondria, endoplasmic reticulum (ER), Golgi, chloroplast, apoplast, peroxisome or other organelle. For example, a targeting sequence can direct a protein (e.g., a nuclease) to a nucleus utilizing a nuclear localization signal (NLS); outside of a nucleus of a cell, for example to the cytoplasm, utilizing a nuclear export signal (NES); mitochondria utilizing a mitochondrial targeting signal; the endoplasmic reticulum (ER) utilizing an ER-retention signal; a peroxisome utilizing a peroxisomal targeting signal; plasma membrane utilizing a membrane localization signal; or combinations thereof. In some embodiments, the protein comprises a nuclear localization signal. Non-limiting examples of NLSs include an NLS sequence derived from: the NLS of the SV40 virus large T-antigen, having the amino acid sequence PKKKRKV (SEQ ID NO: 56); the NLS from nucleoplasmin (e.g., the nucleoplasmin bipartite NLS with the sequence KRPAATKKAGQAKKKK (SEQ ID NO: 57)); the c-myc NLS having the amino acid sequence PAAKRVKLD (SEQ ID NO: 58) or RQRRNELKRSP (SEQ ID NO: 59); the hRNPA1 M9 NLS having the sequence NQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGY (SEQ ID NO: 60); the sequence RMRIZFKNKGKDTAELRRRRVEVSVELRKAKKDEQILKRRNV (SEQ ID NO: 61) of the IBB domain from importin-alpha; the sequences VSRKRPRP (SEQ ID NO: 62) and PPKKARED (SEQ ID NO: 63) of the myoma T protein; the sequence PQPKKKPL (SEQ ID NO: 64) of human p53; the sequence SALIKKKKKMAP (SEQ ID NO: 65) of mouse c-abl IV; the sequences DRLRR (SEQ ID NO: 66) and PKQKKRK (SEQ ID NO: 67) of the influenza virus NS1; the sequence RKLKKKIKKL (SEQ ID NO: 68) of the Hepatitis virus delta antigen; the sequence REKKKFLKRR (SEQ ID NO: 69) of the mouse Mx1 protein; the sequence KRKGDEVDGVDEVAKKKSKK (SEQ ID NO: 70) of the human poly (ADP-ribose) polymerase; the sequence RKCLQAGMNLEARKTKK (SEQ ID NO: 71) of the steroid hormone receptors (human) glucocorticoid; and the sequence KRPRDRHDGELGGRKRAR (SEQ ID NO: 72) of the Agrobacterium VirD2 protein.
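By way of non-limiting illustration only, the addition of a nuclear localization signal to a protein sequence can be sketched at the sequence level as follows. The function name `add_nls`, the choice of termini, the optional linker placement, and the toy protein sequence are all assumptions of this sketch; the text does not prescribe where an NLS is positioned. The SV40 NLS sequence is the one recited above (SEQ ID NO: 56). Handling of the initiator methionine is deliberately ignored.

```python
SV40_NLS = "PKKKRKV"  # SV40 large T-antigen NLS as recited in the text (SEQ ID NO: 56)


def add_nls(protein, nls=SV40_NLS, n_term=True, c_term=False, linker=""):
    """Return `protein` with `nls` appended at the selected terminus or termini.

    An optional `linker` is placed between the NLS and the protein
    (a hypothetical design choice for this sketch).
    """
    if not (n_term or c_term):
        raise ValueError("select at least one terminus")
    result = protein
    if n_term:
        result = nls + linker + result
    if c_term:
        result = result + linker + nls
    return result


# Hypothetical toy protein; NOT a Cas12a sequence.
tagged = add_nls("MKTAY", linker="SGGS")
print(tagged)
```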

Any of the proteins and fusion proteins described herein can further comprise a detectable moiety, for example, a fluorescent protein or fragment thereof. Examples of fluorescent proteins include, but are not limited to, yellow fluorescent protein (YFP, for example, Venus), green fluorescent protein (GFP), and red fluorescent protein (RFP) as well as derivatives, for example, mutant derivatives, of these proteins. See, for example, Chudakov et al., “Fluorescent Proteins and Their Applications in Imaging Living Cells and Tissues,” Physiological Reviews 90 (3): 1103-1163 (2010); and Specht et al., “A Critical and Comparative Review of Fluorescent Tools for Live-Cell Imaging,” Annual Review of Physiology 79: 93-117 (2017).

Any of the proteins and fusion proteins described herein can further comprise an affinity tag, for example, a polyhistidine tag (e.g., (His)₆ (SEQ ID NO: 73)), an HA tag (e.g., YPYDVPDYA (SEQ ID NO: 74)), albumin-binding protein, alkaline phosphatase, an AU1 epitope, an AU5 epitope, a biotin-carboxy carrier protein (BCCP), a FLAG epitope (e.g., DYKDDDDK (SEQ ID NO: 75)), or a MYC epitope (e.g., EQKLISEEDL (SEQ ID NO: 76)), to name a few. See Kimple et al., “Overview of Affinity Tags for Protein Purification,” Curr. Protoc. Protein Sci. 73: Unit-9.9 (2013).

Also provided herein are variants of the polypeptides (e.g., proteins and fusion proteins) of this disclosure. Polypeptide variants retain their respective biological activity, unless explicitly noted otherwise. For example, variants of a Cas12a polypeptide retain the biological function of the full length, native sequence site directed Cas12a protein. In another example, variants of the heterologous domain retain the biological function of the full length, native sequence heterologous domain.

Modifications to any of the polypeptides or proteins provided herein are made by known methods. By way of example, modifications are made by site specific mutagenesis of nucleotides in a nucleic acid encoding the polypeptide, thereby producing a DNA encoding the modification, and thereafter expressing the DNA in recombinant cell culture to produce the encoded polypeptide. Techniques for making substitution mutations at predetermined sites in DNA having a known sequence are well known. For example, M13 primer mutagenesis and PCR-based mutagenesis methods can be used to make one or more substitution mutations. Any of the nucleic acid sequences provided herein can be codon-optimized to alter, for example, maximize expression, in a host cell or organism.
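
By way of non-limiting illustration, the substitution step described above can be sketched in Python. The sketch assumes a coding sequence supplied in frame from its first codon, 1-based residue numbering, and the standard genetic code. The function names `translate` and `substitute_codon`, the example sequence, and the choice of replacement codon are hypothetical and are not taken from this disclosure.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame DNA coding sequence; '*' marks a stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def substitute_codon(cds: str, residue_number: int, new_codon: str) -> str:
    """Replace the codon for a 1-based residue number with `new_codon`."""
    cds = cds.upper()
    new_codon = new_codon.upper()
    if new_codon not in CODON_TABLE:
        raise ValueError("new_codon must be a three-letter DNA codon")
    start = (residue_number - 1) * 3
    if residue_number < 1 or start + 3 > len(cds):
        raise ValueError("residue_number lies outside the coding sequence")
    return cds[:start] + new_codon + cds[start + 3:]


# Hypothetical example: change the second residue from arginine to serine.
example_cds = "ATGCGTAAATAA"
mutant_cds = substitute_codon(example_cds, 2, "TCT")
print(translate(example_cds), "->", translate(mutant_cds))
```

The sketch operates only on sequence text. It does not model primer design, the mutagenesis reaction, or expression of the resulting DNA.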

The amino acids in the polypeptides described herein can be any of the 20 naturally occurring amino acids, D-stereoisomers of the naturally occurring amino acids, unnatural amino acids and chemically modified amino acids. Unnatural amino acids (that is, those that are not naturally found in proteins) are also known in the art, as set forth in, for example, Zhang et al. “Protein engineering with unnatural amino acids,” Curr. Opin. Struct. Biol. 23 (4): 581-587 (2013); Xie et al. “Adding amino acids to the genetic repertoire,” 9 (6): 548-54 (2005); and all references cited therein. β- and γ-amino acids are known in the art and are also contemplated herein as unnatural amino acids.

As used herein, a chemically modified amino acid refers to an amino acid whose side chain has been chemically modified. For example, a side chain can be modified to comprise a signaling moiety, such as a fluorophore or a radiolabel. A side chain can also be modified to comprise a new functional group, such as a thiol, carboxylic acid, or amino group. Post-translationally modified amino acids are also included in the definition of chemically modified amino acids.

Also contemplated are conservative amino acid substitutions. By way of example, conservative amino acid substitutions can be made in one or more of the amino acid residues, for example, in one or more lysine residues of any of the polypeptides provided herein. One of skill in the art would know that a conservative substitution is the replacement of one amino acid residue with another that is biologically and/or chemically similar. The following eight groups each contain amino acids that are conservative substitutions for one another:

1) Alanine (A) , Glycine (G) ;

2) Aspartic acid (D) , Glutamic acid (E) ;

3) Asparagine (N) , Glutamine (Q) ;

4) Arginine (R) , Lysine (K) ;

5) Isoleucine (I) , Leucine (L) , Methionine (M) , Valine (V) ;

6) Phenylalanine (F) , Tyrosine (Y) , Tryptophan (W) ;

7) Serine (S) , Threonine (T) ; and

8) Cysteine (C) , Methionine (M) .

By way of example, when an arginine to serine substitution is mentioned, also contemplated is a conservative substitution for the serine (e.g., threonine). Nonconservative substitutions, for example, substituting a lysine with an asparagine, are also contemplated.
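
As a minimal sketch, the eight groups listed above can be encoded as sets and queried programmatically. The names `CONSERVATIVE_GROUPS`, `is_conservative`, and `conservative_alternatives` are hypothetical. The sketch assumes one-letter residue codes and treats two residues as conservative substitutions for one another whenever they share at least one group.

```python
# Groups copied from the eight-group list above. Methionine (M) appears in
# both group 5 and group 8, so it has partners from each.
CONSERVATIVE_GROUPS = [
    {"A", "G"},
    {"D", "E"},
    {"N", "Q"},
    {"R", "K"},
    {"I", "L", "M", "V"},
    {"F", "Y", "W"},
    {"S", "T"},
    {"C", "M"},
]


def is_conservative(original: str, replacement: str) -> bool:
    """Return True if the two residues differ and share at least one group."""
    original, replacement = original.upper(), replacement.upper()
    if original == replacement:
        return False  # identical residues: no substitution took place
    return any(
        original in group and replacement in group
        for group in CONSERVATIVE_GROUPS
    )


def conservative_alternatives(residue: str) -> set:
    """Return every residue sharing a group with `residue`, excluding itself."""
    residue = residue.upper()
    partners = set()
    for group in CONSERVATIVE_GROUPS:
        if residue in group:
            partners |= group
    partners.discard(residue)
    return partners


print(is_conservative("S", "T"))          # serine and threonine share group 7
print(is_conservative("K", "N"))          # lysine and asparagine share no group
print(sorted(conservative_alternatives("M")))
```

Under these assumptions, the serine-to-threonine case from the paragraph above is reported as conservative and the lysine-to-asparagine case as nonconservative.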

IV. Recombinant nucleic acids, constructs, vectors, and host cells

Also provided herein are recombinant nucleic acids encoding any of the variant Cas12a proteins or fusion proteins described herein. For example, a recombinant nucleic acid encoding a polypeptide that has at least 70% (e.g., at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100%) identity to any of SEQ ID NOs: 20-34 is also provided. Also provided are recombinant nucleic acids having at least 70% identity to any of SEQ ID NOs: 20-34.
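
For illustration only, a percent identity threshold of the kind recited above can be checked with a short Python sketch. It assumes the two sequences have already been aligned to equal length with `-` as the gap character, and it counts identity as matching columns divided by all columns in which at least one sequence has a residue. Other conventions exist, and this portion of the disclosure does not fix one. The function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over two pre-aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # a column that is a gap in both sequences carries no information
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("no aligned positions to compare")
    return 100.0 * matches / columns


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 70.0) -> bool:
    """Return True if the identity is at least `threshold` percent."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical ten-residue fragments that differ at one position.
print(percent_identity("MKTAYIAKQR", "MKTAYLAKQR"))
print(meets_threshold("MKTAYIAKQR", "MKTAYLAKQR", threshold=95.0))
```

Producing the alignment itself is outside the scope of this sketch and would ordinarily be done with an established alignment tool.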

Also provided is a DNA construct comprising a promoter operably linked to a recombinant nucleic acid encoding a fusion protein or domains thereof as described herein. A nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. Numerous promoters can be used in the constructs described herein. A promoter is a region or a sequence located upstream and/or downstream from the start of transcription that is involved in recognition and binding of RNA polymerase and other proteins to initiate transcription.

The term “promoter” as used herein refers to a nucleotide sequence, usually upstream (5’ ) to its coding sequence, which controls the expression of the coding sequence by providing the recognition for RNA polymerase and other factors required for proper transcription. “Promoter regulatory sequences” consist of proximal and more distal upstream elements. Promoter regulatory sequences influence the transcription, RNA processing or stability, or translation of the associated coding sequence. Regulatory sequences include enhancers, promoters, untranslated leader sequences, introns, and polyadenylation signal sequences. They include natural and synthetic sequences as well as sequences that may be a combination of synthetic and natural sequences. An “enhancer” is a DNA sequence that can stimulate promoter activity and may be an innate element of the promoter or a heterologous element inserted to enhance the level or tissue specificity of a promoter. It is capable of operating in both orientations (normal or flipped) and is capable of functioning even when moved either upstream or downstream from the promoter. The meaning of the term “promoter” includes “promoter regulatory sequences. ”

The choice of promoters to be included depends upon several factors, including, but not limited to, efficiency, selectability, inducibility, desired expression level, and cell-or tissue-preferential expression. It is a routine matter for one of skill in the art to modulate the expression of a sequence by appropriately selecting and positioning promoters and other regulatory regions relative to that sequence.

It has been shown that certain promoters are able to direct RNA synthesis at a higher rate than others. These are called "strong promoters". Certain other promoters have been shown to direct RNA synthesis at higher levels only in particular types of cells or tissues and are often referred to as "tissue specific promoters", or "tissue-preferred promoters", if the promoters direct RNA synthesis preferentially in certain tissues (RNA synthesis may occur in other tissues at reduced levels) . Since patterns of expression of a chimeric gene (or genes) introduced into a plant are controlled using promoters, there is an ongoing interest in the isolation of novel promoters that are capable of controlling the expression of a chimeric gene (or genes) at certain levels in specific tissue types or at specific plant developmental stages.

Certain promoters are able to direct RNA synthesis at relatively similar levels across all tissues of a plant. These are called "constitutive promoters" or "tissue-independent" promoters. Constitutive promoters can be divided into strong, moderate, and weak categories according to their effectiveness to directing RNA synthesis. Since it is necessary in many cases to simultaneously express a chimeric gene (or genes) in different tissues of a plant to get the desired functions of the gene (or genes) , constitutive promoters are especially useful in this regard. Though many constitutive promoters have been discovered from plants and plant viruses and characterized, there is still an ongoing interest in the isolation of more novel constitutive promoters, synthetic or native, which are capable of controlling the expression of a chimeric gene (or genes) at different levels and the expression of multiple genes in the same transgenic plant for gene stacking.

Among the most commonly used promoters are the nopaline synthase (NOS) promoter (Ebert et al., Proc. Natl. Acad. Sci. USA 84: 5745-5749 (1987)); the octopine synthase (OCS) promoter; caulimovirus promoters such as the cauliflower mosaic virus (CaMV) 19S promoter (Lawton et al., Plant Mol. Biol. 9: 315-324 (1987)); the light inducible promoter from the small subunit of rubisco (Pellegrineschi et al., Biochem. Soc. Trans. 23 (2): 247-250 (1995)); the Adh promoter (Walker et al., Proc. Natl. Acad. Sci. USA 84: 6624-6628 (1987)); the sucrose synthase promoter (Yang et al., Proc. Natl. Acad. Sci. USA 87: 4144-4148 (1990)); the R gene complex promoter (Chandler et al., Plant Cell 1: 1175-1183 (1989)); the chlorophyll a/b binding protein gene promoter; and the like.

Furthermore, it is contemplated that promoters combining elements from more than one promoter may be useful. For example, U.S. Pat. No. 5,491,288 discloses combining a Cauliflower Mosaic Virus promoter with a histone promoter. Thus, the elements from the promoters disclosed herein may be combined with elements from other promoters. Promoters which are useful for plant transgene expression include those that are inducible, viral, synthetic, constitutive (Odell et al., Nature 313: 810–812 (1985)), temporally regulated, spatially regulated, tissue specific, and spatio-temporally regulated. Using the regulatory elements described herein, numerous agronomic genes can be expressed in transformed plants. More particularly, plants can be genetically engineered to express various phenotypes of agronomic interest.

In some embodiments of the DNA constructs provided herein, the promoter can be a eukaryotic or a prokaryotic promoter. In some embodiments, the promoter is an inducible promoter, a native inducible promoter (e.g., drought-inducible Rab17), a synthetic inducible promoter (e.g., auxin-inducible DR5, estradiol-inducible XVE/pLex, dexamethasone-inducible GVG/Gal4), a constitutive promoter (e.g., ZmUbq1, OsAct1, OsTub3, EF, EF1α), an egg cell-specific promoter (e.g., EC1, EC2, EC3, EC4, EC5), a pollen-specific promoter, an apical meristem tissue-specific promoter, or a promoter with enriched expression in the zygote. In some embodiments, the promoter is a floral mosaic promoter (e.g., ZmBde1, OsAP1). In some embodiments, the promoter is a ubiquitin 4 promoter (e.g., a sugarcane ubiquitin 4 promoter), an actin promoter, a tubulin promoter, a MADS box promoter, or a plant virus promoter. Suitable promoters are disclosed, e.g., in U.S. Pat. No. 10,519,456, the entire content of which is herein incorporated by reference, and PCT/US2022/020690, incorporated herein by reference.

The recombinant nucleic acids provided herein can be included in expression cassettes for expression in a host cell or an organism of interest. The cassette will include 5′ and 3′ regulatory sequences operably linked to a recombinant nucleic acid provided herein that allows for expression of a fusion protein. The cassette may additionally contain at least one additional gene or genetic element to be cotransformed into the cell or organism. Where additional genes or elements are included, the components are operably linked. Alternatively, the additional gene(s) or element(s) can be provided on multiple expression cassettes. Such an expression cassette is provided with a plurality of restriction sites and/or recombination sites for insertion of the polynucleotides to be under the transcriptional regulation of the regulatory regions. The expression cassette may additionally contain a selectable marker gene. The expression cassette will include in the 5′ to 3′ direction of transcription: a transcriptional and translational initiation region (i.e., a promoter), a polynucleotide of the invention, and a transcriptional and translational termination region (i.e., termination region) functional in the cell or organism of interest. The promoters of the invention are capable of directing or driving expression of a coding sequence (i.e., a nucleic acid sequence that is transcribed into RNA such as mRNA, rRNA, tRNA, snRNA, ncRNA, lncRNA, sense RNA, or antisense RNA, regardless of whether the RNA is then translated to produce a protein) in a host cell. The regulatory regions (i.e., promoters, transcriptional regulatory regions, and translational termination regions) may be endogenous or heterologous to the host cell or to each other.
As used herein, “heterologous” in reference to a sequence is a sequence that originates from a foreign species, or, if from the same species, is substantially modified from its native form in composition and/or genomic locus by deliberate human intervention.
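
The 5′ to 3′ ordering of an expression cassette described above can be sketched, for illustration only, as a small Python data model. The sketch covers only the minimal promoter, coding sequence, and termination region arrangement and omits optional elements such as selectable markers. The class `Element`, the function `assemble_cassette`, and the short placeholder sequences are hypothetical and do not correspond to any sequence of this disclosure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Element:
    name: str
    kind: str      # "promoter", "coding", or "terminator"
    sequence: str  # placeholder DNA, written 5' to 3'


# Minimal order in the 5' to 3' direction of transcription.
REQUIRED_ORDER = ("promoter", "coding", "terminator")


def assemble_cassette(elements: Sequence[Element]) -> str:
    """Concatenate elements after checking they follow the required order."""
    kinds = tuple(element.kind for element in elements)
    if kinds != REQUIRED_ORDER:
        raise ValueError(f"expected element order {REQUIRED_ORDER}, got {kinds}")
    return "".join(element.sequence.upper() for element in elements)


# Placeholder sequences chosen only to make the example runnable.
promoter = Element("example promoter", "promoter", "TTGACA")
coding = Element("example coding sequence", "coding", "ATGAAATAA")
terminator = Element("example terminator", "terminator", "AATAAA")

cassette = assemble_cassette([promoter, coding, terminator])
print(cassette)
```

Passing the elements in any other order raises a `ValueError`, which mirrors the requirement that the regulatory regions be positioned so that the components are operably linked.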

Additional regulatory signals include, but are not limited to, transcriptional initiation start sites, operators, activators, enhancers, other regulatory elements, ribosomal binding sites, an initiation codon, termination signals, and the like. See Sambrook et al. (1992) Molecular Cloning: A Laboratory Manual, ed. Maniatis et al. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. ) ; Davis et al., eds. (1980) Advanced Bacterial Genetics (Cold Spring Harbor Laboratory Press) , Cold Spring Harbor, N.Y., and the references cited therein.

The expression cassette can also comprise a selectable marker gene for the selection of transformed cells. Marker genes include genes conferring antibiotic resistance, such as those conferring hygromycin resistance, ampicillin resistance, gentamicin resistance, neomycin resistance, to name a few. Additional selectable markers are known and any can be used.

In preparing the expression cassette, the various DNA fragments may be manipulated, so as to provide for the DNA sequences in the proper orientation and, as appropriate, in the proper reading frame. Toward this end, adapters or linkers may be employed to join the DNA fragments or other manipulations may be involved to provide for convenient restriction sites, removal of superfluous DNA, removal of restriction sites, or the like. For this purpose, in vitro mutagenesis, primer repair, restriction, annealing, resubstitutions, e.g., transitions and transversions, may be involved.

Further provided is a vector comprising a recombinant nucleic acid or DNA construct set forth herein. The vector is contemplated to have the necessary functional elements that direct and regulate transcription of the inserted nucleic acid. These functional elements include, but are not limited to, a promoter, regions upstream or downstream of the promoter, such as enhancers and terminators, that may regulate the transcriptional activity of the promoter, an origin of replication, appropriate restriction sites to facilitate cloning of inserts adjacent to the promoter, antibiotic resistance genes or other markers which can serve to select for cells containing the vector or the vector containing the insert, RNA splice junctions, a transcription termination region, or any other region which may serve to facilitate the expression of the inserted gene or hybrid gene. See generally, Sambrook et al. Molecular Cloning: A Laboratory Manual, 4^th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 2012. The vector, for example, can be a plasmid. In some embodiments of the DNA constructs and vectors provided herein, the constructs and vectors comprise a nopaline synthase gene terminator sequence (e.g., an Agrobacterium tumefaciens nopaline synthase gene terminator sequence) .

There are numerous E. coli expression vectors known to one of ordinary skill in the art, which are useful for the expression of a nucleic acid. Other microbial hosts suitable for use include bacilli, such as Bacillus subtilis, and other Enterobacteriaceae, such as Salmonella, Serratia, and various Pseudomonas species. In these prokaryotic hosts, one can also make expression vectors, which will typically contain expression control sequences compatible with the host cell (e.g., an origin of replication). In addition, any number of a variety of well-known promoters will be present, such as the lactose promoter system, a tryptophan (Trp) promoter system, a beta-lactamase promoter system, or a promoter system from phage lambda. Additionally, yeast expression can be used. Provided herein is a nucleic acid encoding a polypeptide of the present invention, wherein the nucleic acid can be expressed by a yeast cell. More specifically, the nucleic acid can be expressed by Pichia pastoris or S. cerevisiae.

Mammalian cells also permit the expression of proteins in an environment that favors important post-translational modifications such as folding and cysteine pairing, addition of complex carbohydrate structures, and secretion of active protein. Vectors useful for the expression of active proteins in mammalian cells are known in the art and can contain genes conferring hygromycin resistance, geneticin or G418 resistance, or other genes or phenotypes suitable for use as selectable markers, or methotrexate resistance for gene amplification. A number of suitable host cell lines capable of secreting intact human proteins have been developed in the art, and include CHO cells, HeLa cells, HEK-293 cells, HEK-293T cells, U2OS cells, or any other primary or transformed cell line. Other suitable host cell lines include COS-7 cells, myeloma cell lines, Jurkat cells, etc. Expression vectors for these cells can include expression control sequences, such as an origin of replication, a promoter, an enhancer, and necessary information processing sites, such as ribosome binding sites, RNA splice sites, polyadenylation sites, and transcriptional terminator sequences. Preferred expression control sequences are promoters derived from immunoglobulin genes, SV40, Adenovirus, Bovine Papilloma Virus, etc.

The expression vectors described herein can also include the nucleic acids as described herein under the control of an inducible promoter such as the tetracycline inducible promoter or a glucocorticoid inducible promoter. The nucleic acids of the present invention can also be under the control of a tissue-specific promoter to promote expression of the nucleic acid in specific cells, tissues or organs. Any regulatable promoter, such as a metallothionein promoter, a heat-shock promoter, and other regulatable promoters, of which many examples are well known in the art are also contemplated. Furthermore, a Cre-loxP inducible system can also be used, as well as a Flp recombinase inducible promoter system, both of which are known in the art.

Insect cells also permit the expression of the polypeptides. Recombinant proteins produced in insect cells with baculovirus vectors undergo post-translational modifications similar to that of wild-type mammalian proteins.

Also provided herein are host cells comprising the recombinant nucleic acids, DNA constructs, and/or vectors described herein as well as methods of making such cells. In some embodiments, the cell is a plant cell. In some embodiments, the plant cell is a maize plant cell, a wheat plant cell, a rice plant cell, a soybean plant cell, a sunflower plant cell, or a tomato plant cell.

A host cell comprising a nucleic acid or a vector described herein is provided. The host cell can be an in vitro, ex vivo, or in vivo host cell. Host cells as provided herein are capable of expressing the fusion protein. Cell populations of any of the host cells described herein are also provided. In some embodiments, the cell population comprises a plurality of cells, wherein the plurality of cells comprise a recombinant nucleic acid encoding the fusion protein as described herein. In some embodiments, the cell population comprises a plurality of cells, wherein the plurality of cells comprises a DNA construct encoding the protein and/or fusion protein as described herein. In some embodiments, the cell population comprises a plurality of cells, wherein the plurality of cells comprises a vector comprising a recombinant nucleic acid or a DNA construct encoding the protein and/or fusion protein as described herein. In some embodiments, the cell population comprises a plurality of cells, wherein the plurality of cells comprise a plurality of any of the host cells described herein. In some embodiments, a plurality of cells of any of the cell populations described herein express a protein and/or fusion protein as described herein.

In some embodiments, the provided cells express the protein and/or fusion protein stably or transiently. Stable expression of the protein and/or fusion protein in a cell refers to integration of any of the nucleic acids, DNA constructs, or vectors described herein into the genome of the cell, thereby allowing the cell to express the protein and/or fusion protein. Transient expression refers to expression of the protein and/or fusion protein directly from any of the nucleic acids, DNA constructs, and/or vectors following introduction into the cell (i.e., the gene encoding the protein and/or fusion protein is not integrated into the genome of the cell) .

In some embodiments, the provided cells express the protein and/or fusion protein constitutively or inducibly. Constitutive expression refers to ongoing, continuous expression of a gene (i.e., of a protein) , whereas inducible expression refers to gene (protein) expression that is responsive to a stimulus. Inducible expression is generally regulated via an inducible promoter, a description of which is included above.

A cell culture comprising one or more host cells described herein is also provided. Methods for the culture and production of many cells, including cells of bacterial (for example E. coli and other bacterial strains), animal (especially mammalian), and archaebacterial origin are available in the art. See e.g., Sambrook, supra; Ausubel, ed. (1995) Current Protocols in Molecular Biology, John Wiley & Sons, as well as Freshney (1994) Culture of Animal Cells, a Manual of Basic Technique, 3rd Ed., Wiley-Liss, New York and the references cited therein; Doyle and Griffiths (1997) Mammalian Cell Culture: Essential Techniques, John Wiley and Sons, NY; Humason (1979) Animal Tissue Techniques, 4th Ed., W.H. Freeman and Company; and Ricciardelli, et al., (1989) In vitro Cell Dev. Biol. 25: 1016-1024.

The host cell can be a prokaryotic cell, including, for example, a bacterial cell. Alternatively, the cell can be a eukaryotic cell, for example, a mammalian cell. In some embodiments, the cell can be a HEK-293T cell, a HEK-293 cell, a Chinese hamster ovary (CHO) cell, a U2OS cell, or any other primary or transformed cell. In some embodiments, the cell can be a COS-7 cell, a HeLa cell, an avian cell, a myeloma cell, a Pichia cell, an insect cell or a plant cell. A number of other suitable host cell lines have been developed and include myeloma cell lines, fibroblast cell lines, and a variety of tumor cell lines such as melanoma cell lines. The vectors containing the nucleic acid segments of interest can be transferred or introduced into the host cell by well-known methods, which vary depending on the type of cellular host.

As used herein, the phrase “introducing” in the context of introducing a nucleic acid into a cell (e.g., a prokaryotic cell, a bacterial cell, a eukaryotic cell, a plant cell) refers to the translocation of the nucleic acid sequence from outside a cell to inside the cell. In some cases, introducing refers to translocation of the nucleic acid from outside the cell to inside the nucleus of the cell. Where more than one nucleic acid molecule is to be introduced, these nucleic acid molecules can be assembled as part of a single polynucleotide or nucleic acid construct, or as separate polynucleotide or nucleic acid constructs, and can be located on the same or different nucleic acid constructs. Accordingly, such polynucleotides can be introduced into cells (e.g., plant cells) in a single transformation event, in separate transformation events, or, e.g., as part of a breeding protocol. Various methods of introducing a nucleic acid into a cell are contemplated, including but not limited to, electroporation, nanoparticle delivery, biolistic transformation, viral delivery, contact with nanowires or nanotubes, receptor mediated internalization, translocation via cell penetrating peptides, liposome mediated translocation, DEAE dextran, lipofectamine, calcium phosphate or any method now known or identified in the future for introduction of nucleic acids into prokaryotic or eukaryotic cellular hosts. A targeted nuclease system (e.g., an RNA-guided nuclease, a transcription activator-like effector nuclease (TALEN), a zinc finger nuclease (ZFN), or a megaTAL (MT)) can also be used to introduce a nucleic acid, for example, a nucleic acid encoding a protein and/or fusion protein described herein, into a host cell. See Li et al. Signal Transduction and Targeted Therapy 5, Article No. 1 (2020).

Transformation of a cell may be stable or transient. Thus, a transgenic cell, plant cell, plant and/or plant part of the invention can be stably transformed or transiently transformed. Transformation can refer to the transfer of a nucleic acid molecule into the genome of a host cell, resulting in genetically stable inheritance. In some embodiments, the introduction into a plant, plant part and/or plant cell is via bacterial-mediated transformation, particle bombardment transformation, calcium-phosphate-mediated transformation, cyclodextrin-mediated transformation, electroporation, liposome-mediated transformation, nanoparticle-mediated transformation, polymer-mediated transformation, virus-mediated nucleic acid delivery, whisker-mediated nucleic acid delivery, microinjection, sonication, infiltration, polyethylene glycol-mediated transformation, protoplast transformation, or any other electrical, chemical, physical and/or biological mechanism that results in the introduction of nucleic acid into the plant, plant part and/or cell thereof, or any combination thereof.

Procedures for transforming plants are well known and routine in the art and are described throughout the literature. Non-limiting examples of methods for transformation of plants include transformation via bacterial-mediated nucleic acid delivery (e.g., via bacteria from the genus Agrobacterium), viral-mediated nucleic acid delivery, silicon carbide or nucleic acid whisker-mediated nucleic acid delivery, liposome mediated nucleic acid delivery, microinjection, microparticle bombardment, calcium-phosphate-mediated transformation, cyclodextrin-mediated transformation, electroporation, nanoparticle-mediated transformation, sonication, infiltration, PEG-mediated nucleic acid uptake, as well as any other electrical, chemical, physical (mechanical) and/or biological mechanism that results in the introduction of nucleic acid into the plant cell, including any combination thereof. General guides to various plant transformation methods known in the art include Miki et al. (“Procedures for Introducing Foreign DNA into Plants” in Methods in Plant Molecular Biology and Biotechnology, Glick, B.R. and Thompson, J.E., Eds. (CRC Press, Inc., Boca Raton, 1993), pages 67-88) and Rakoczy-Trojanowska (Cell Mol Biol Lett 7: 849-858 (2002)).

Agrobacterium-mediated transformation is a commonly used method for transforming plants because of its high efficiency of transformation and because of its broad utility with many different species. Agrobacterium-mediated transformation typically involves transfer of the binary vector carrying the foreign DNA of interest to an appropriate Agrobacterium strain that may depend on the complement of vir genes carried by the host Agrobacterium strain either on a co-resident Ti plasmid or chromosomally (Uknes et al. 1993, Plant Cell 5: 159-169). The transfer of the recombinant binary vector to Agrobacterium can be accomplished by a tri-parental mating procedure using Escherichia coli carrying the recombinant binary vector and a helper E. coli strain that carries a plasmid that is able to mobilize the recombinant binary vector to the target Agrobacterium strain. Alternatively, the recombinant binary vector can be transferred to Agrobacterium by nucleic acid transformation (Höfgen and Willmitzer 1988, Nucleic Acids Res 16: 9877).

Transformation of a plant by recombinant Agrobacterium usually involves co-cultivation of the Agrobacterium with explants from the plant and follows methods well known in the art. Transformed tissue is typically regenerated on selection medium carrying an antibiotic or herbicide resistance marker between the binary plasmid T-DNA borders.

Another method for transforming plants, plant parts and plant cells involves propelling inert or biologically active particles at plant tissues and cells. See, e.g., US Patent Nos. 4,945,050; 5,036,006 and 5,100,792. Generally, this method involves propelling inert or biologically active particles at the plant cells under conditions effective to penetrate the outer surface of the cell and afford incorporation within the interior thereof. When inert particles are utilized, the vector can be introduced into the cell by coating the particles with the vector containing the nucleic acid of interest. Alternatively, a cell or cells can be surrounded by the vector so that the vector is carried into the cell by the wake of the particle. Biologically active particles (e.g., dried yeast cells, dried bacteria or a bacteriophage, each containing one or more nucleic acids sought to be introduced) also can be propelled into plant tissue. As used herein, the phrase “biolistic transformation” refers to a method of introducing RNA or DNA into cells (e.g., plant cells) directly, in which RNA or DNA is mixed with heavy metal particles (e.g., tungsten or gold) and released into the cell (e.g., the plant cell) using high speed pressure to allow the RNA or DNA to penetrate the cell (e.g., to penetrate the plant cell wall) .

The CRISPR/Cas system can also be used to edit the genome of a host cell or organism. As detailed above, the “CRISPR/Cas” system refers to a widespread class of bacterial systems for defense against foreign nucleic acid. Any of the CRISPR/Cas system components described herein may be used to introduce proteins, fusion proteins, recombinant nucleic acids, or systems into the genome of a host cell or organism. Methods for CRISPR/Cas system mediated genome editing are known in the art. It will be understood that use of a CRISPR/Cas system for introduction of proteins, fusion proteins, recombinant nucleic acids, or systems described herein into the genome of a host cell or organism is different from the particular methods and systems provided herein.

Any of the proteins and/or fusion proteins described herein can be purified or isolated from a host cell or population of host cells. For example, a recombinant nucleic acid encoding any of the proteins and/or fusion proteins described herein can be introduced into a host cell under conditions that allow expression of the protein and/or fusion protein. In some embodiments, the recombinant nucleic acid is codon-optimized for expression. After expression in the host cell, the protein and/or fusion protein can be isolated or purified using purification methods known in the art.

V. Systems

In another aspect, provided herein are systems useful for editing one or more nucleic acids. The systems comprise one or more of the Cas12a proteins and/or fusion proteins (or recombinant nucleic acids, constructs, vectors, or host cells) described above. In some embodiments, the systems further comprise one or more additional elements that are useful for editing one or more nucleic acids. For example, a system comprising a fusion protein comprising a Cas nuclease may further comprise one or more guide nucleic acids, which are detailed below. The systems provided herein are useful for performing the methods described in Section VI of this disclosure.

In some cases, the systems and methods described herein comprise at least one guide nucleic acid polynucleotide. In some cases, the systems and methods described herein comprise a plurality of guide nucleic acids. In some embodiments, the polynucleotide can be deoxyribonucleic acid (DNA). In some cases, the DNA sequence can be single-stranded or double-stranded. In some embodiments, the at least one guide nucleic acid polynucleotide can be ribonucleic acid (guide RNA).

In some embodiments, the Cas12a protein can be complexed with the at least one guide RNA polynucleotide. The at least one guide RNA polynucleotide can comprise a nucleic acid-targeting region that comprises a sequence complementary to a nucleic acid sequence on the targeted polynucleotide, such as the targeted genomic loci or genes, to confer sequence specificity of nuclease targeting. In some embodiments, the at least one guide RNA polynucleotide can comprise two separate nucleic acid molecules, which can be referred to as a double guide nucleic acid, or a single nucleic acid molecule, which can be referred to as a single guide nucleic acid (e.g., single guide RNA or sgRNA).

The Cas protein-binding segment of a guide nucleic acid can comprise two stretches of nucleotides (e.g., crRNA and tracrRNA) that are complementary to one another. The two stretches of nucleotides (e.g., crRNA and tracrRNA) that are complementary to one another can be covalently linked by intervening nucleotides (e.g., a linker in the case of a single guide nucleic acid) . The two stretches of nucleotides (e.g., crRNA and tracrRNA) that are complementary to one another can hybridize to form a double stranded RNA duplex or hairpin of the Cas protein-binding segment, thus resulting in a stem-loop structure. The crRNA and the tracrRNA can be covalently linked via the 3′ end of the crRNA and the 5′ end of the tracrRNA. Alternatively, tracrRNA and crRNA can be covalently linked via the 5′ end of the tracrRNA and the 3′ end of the crRNA. A crRNA can comprise the nucleic acid-targeting segment (e.g., spacer region) of the guide nucleic acid and a stretch of nucleotides that can form one half of a double-stranded duplex of the Cas protein-binding segment of the guide nucleic acid. The crRNA can also provide a single-stranded nucleic acid targeting segment (e.g., a spacer region) that hybridizes to a target nucleic acid recognition sequence (e.g., protospacer) . Whether a nuclease requires a crRNA molecule only or whether it requires both a crRNA molecule and a tracrRNA molecule (whether covalently linked or not) depends on the CRISPR-associated nuclease used. Cas12 proteins typically do not require a tracrRNA.

In some embodiments, the nucleic acid-targeting region of a guide nucleic acid can be between 18 to 72 nucleotides in length. The nucleic acid-targeting region of a guide nucleic acid (e.g., spacer region) can have a length of from about 12 nucleotides to about 100 nucleotides. For example, the nucleic acid-targeting region of a guide nucleic acid (e.g., spacer region) can have a length of from about 12 nucleotides (nt) to about 80 nt, from about 12 nt to about 50 nt, from about 12 nt to about 40 nt, from about 12 nt to about 30 nt, from about 12 nt to about 25 nt, from about 12 nt to about 20 nt, from about 12 nt to about 19 nt, from about 12 nt to about 18 nt, from about 12 nt to about 17 nt, from about 12 nt to about 16 nt, or from about 12 nt to about 15 nt. Alternatively, the DNA-targeting segment can have a length of from about 18 nt to about 20 nt, from about 18 nt to about 25 nt, from about 18 nt to about 30 nt, from about 18 nt to about 35 nt, from about 18 nt to about 40 nt, from about 18 nt to about 45 nt, from about 18 nt to about 50 nt, from about 18 nt to about 60 nt, from about 18 nt to about 70 nt, from about 18 nt to about 80 nt, from about 18 nt to about 90 nt, from about 18 nt to about 100 nt, from about 20 nt to about 25 nt, from about 20 nt to about 30 nt, from about 20 nt to about 35 nt, from about 20 nt to about 40 nt, from about 20 nt to about 45 nt, from about 20 nt to about 50 nt, from about 20 nt to about 60 nt, from about 20 nt to about 70 nt, from about 20 nt to about 80 nt, from about 20 nt to about 90 nt, or from about 20 nt to about 100 nt. The length of the nucleic acid-targeting region can be at least 5, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30 or more nucleotides. The length of the nucleic acid-targeting region (e.g., spacer sequence) can be at most 5, 10, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30 or more nucleotides.

In some embodiments, the nucleic acid-targeting region of a guide nucleic acid (e.g., spacer) is 20 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 19 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 18 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 17 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 16 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 21 nucleotides in length. In some embodiments, the nucleic acid-targeting region of a guide nucleic acid is 22 nucleotides in length.

The nucleotide sequence of the guide nucleic acid that is complementary to a nucleotide sequence (target sequence) of the target nucleic acid can have a length of, for example, at least about 12 nt, at least about 15 nt, at least about 18 nt, at least about 19 nt, at least about 20 nt, at least about 25 nt, at least about 30 nt, at least about 35 nt or at least about 40 nt. The nucleotide sequence of the guide nucleic acid that is complementary to a nucleotide sequence (target sequence) of the target nucleic acid can have a length of from about 12 nucleotides (nt) to about 80 nt, from about 12 nt to about 50 nt, from about 12 nt to about 45 nt, from about 12 nt to about 40 nt, from about 12 nt to about 35 nt, from about 12 nt to about 30 nt, from about 12 nt to about 25 nt, from about 12 nt to about 20 nt, from about 12 nt to about 19 nt, from about 19 nt to about 20 nt, from about 19 nt to about 25 nt, from about 19 nt to about 30 nt, from about 19 nt to about 35 nt, from about 19 nt to about 40 nt, from about 19 nt to about 45 nt, from about 19 nt to about 50 nt, from about 19 nt to about 60 nt, from about 20 nt to about 25 nt, from about 20 nt to about 30 nt, from about 20 nt to about 35 nt, from about 20 nt to about 40 nt, from about 20 nt to about 45 nt, from about 20 nt to about 50 nt, or from about 20 nt to about 60 nt.

A protospacer sequence of a targeted polynucleotide can be identified by identifying a protospacer-adjacent motif (PAM) within a region of interest and selecting a region of a desired size upstream or downstream of the PAM as the protospacer. A corresponding spacer sequence can be designed by determining the complementary sequence of the protospacer region.
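The PAM-based selection described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used in the Examples. It assumes a TTTV PAM (V = A, C or G) located immediately 5′ of the protospacer, a 20-nt protospacer, and a search of the given strand only; the function names `find_protospacers` and `reverse_complement` are hypothetical.

```python
import re


def find_protospacers(seq, pam_regex="TTT[ACG]", length=20):
    """Return (pam_start, pam, protospacer) for every PAM hit on the given strand.

    Assumption: the protospacer lies immediately 3' of the PAM.
    """
    seq = seq.upper()
    hits = []
    # A lookahead is used so that overlapping PAM occurrences are all reported.
    for match in re.finditer(f"(?=({pam_regex}))", seq):
        pam = match.group(1)
        start = match.start() + len(pam)
        protospacer = seq[start:start + length]
        # Skip PAMs that sit too close to the end of the sequence.
        if len(protospacer) == length:
            hits.append((match.start(), pam, protospacer))
    return hits


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

To cover both strands, the same search can be run on `reverse_complement(seq)`. Whether the spacer is written as the protospacer or as its complement depends on which strand is taken as the protospacer.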

A spacer sequence can be identified using a computer program (e.g., machine readable code) . The computer program can use variables such as predicted melting temperature, secondary structure formation, and predicted annealing temperature, sequence identity, genomic context, chromatin accessibility, %GC, frequency of genomic occurrence, methylation status, presence of SNPs, and the like.
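As a hedged illustration of such a program, the sketch below filters candidate sites on two of the listed variables, %GC and frequency of genomic occurrence. The 30% to 70% GC window and the requirement of exactly one occurrence are arbitrary placeholder values, the function names are hypothetical, and only the given strand is searched.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def count_occurrences(genome, site):
    """Count exact matches of site in genome, including overlapping ones."""
    genome, site = genome.upper(), site.upper()
    last_start = len(genome) - len(site)
    return sum(genome.startswith(site, i) for i in range(last_start + 1))


def filter_candidates(candidates, genome, gc_min=30.0, gc_max=70.0):
    """Keep candidates inside the GC window that occur exactly once in genome."""
    return [
        site
        for site in candidates
        if gc_min <= gc_percent(site) <= gc_max
        and count_occurrences(genome, site) == 1
    ]
```

A fuller tool would also score the other variables named above (for example secondary structure or the presence of SNPs), which this sketch does not attempt.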

The percent complementarity between the nucleic acid-targeting sequence (e.g., a spacer sequence of the at least one guide polynucleotide as disclosed herein) and the target nucleic acid (e.g., a protospacer sequence of the one or more target loci as disclosed herein) can be at least 50%, at least 60%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98%, at least 99%, or 100%. The percent complementarity between the nucleic acid-targeting sequence and the target nucleic acid can be at least 60%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98%, at least 99%, or 100% over about 20 contiguous nucleotides.
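One way to compute percent complementarity is sketched below. It assumes an ungapped, antiparallel comparison of an RNA spacer with an equal-length DNA target strand, both written 5′ to 3′, and it counts only Watson-Crick pairs. The function name `percent_complementarity` is hypothetical.

```python
# Allowed (spacer RNA base, target DNA base) Watson-Crick pairs.
WATSON_CRICK = {("A", "T"), ("U", "A"), ("G", "C"), ("C", "G")}


def percent_complementarity(spacer_rna, target_dna):
    """Percent of spacer positions that pair with the target strand.

    Both sequences are given 5'->3'; the target is reversed so that
    position i of the spacer is compared with its antiparallel partner.
    """
    spacer = spacer_rna.upper().replace("T", "U")
    target = target_dna.upper()[::-1]
    if not spacer or len(spacer) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    paired = sum((s, t) in WATSON_CRICK for s, t in zip(spacer, target))
    return 100.0 * paired / len(spacer)
```

For a 20-nt spacer, each mismatch lowers the value by 5 percentage points, so a threshold of at least 90% corresponds to at most two mismatches.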

The Cas protein binding segment of a guide nucleic acid can have a length of from about 10 nucleotides to about 100 nucleotides, e.g., from about 10 nucleotides (nt) to about 20 nt, from about 20 nt to about 30 nt, from about 30 nt to about 40 nt, from about 40 nt to about 50 nt, from about 50 nt to about 60 nt, from about 60 nt to about 70 nt, from about 70 nt to about 80 nt, from about 80 nt to about 90 nt, or from about 90 nt to about 100 nt. For example, the Cas protein-binding segment of a guide nucleic acid can have a length of from about 15 nucleotides (nt) to about 80 nt, from about 15 nt to about 50 nt, from about 15 nt to about 40 nt, from about 15 nt to about 30 nt or from about 15 nt to about 25 nt.

The dsRNA duplex of the Cas protein-binding segment of the guide nucleic acid can have a length from about 6 base pairs (bp) to about 50 bp. For example, the dsRNA duplex of the protein-binding segment can have a length from about 6 bp to about 40 bp, from about 6 bp to about 30 bp, from about 6 bp to about 25 bp, from about 6 bp to about 20 bp, from about 6 bp to about 15 bp, from about 8 bp to about 40 bp, from about 8 bp to about 30 bp, from about 8 bp to about 25 bp, from about 8 bp to about 20 bp or from about 8 bp to about 15 bp. For example, the dsRNA duplex of the Cas protein-binding segment can have a length from about 8 bp to about 10 bp, from about 10 bp to about 15 bp, from about 15 bp to about 18 bp, from about 18 bp to about 20 bp, from about 20 bp to about 25 bp, from about 25 bp to about 30 bp, from about 30 bp to about 35 bp, from about 35 bp to about 40 bp, or from about 40 bp to about 50 bp.

In some embodiments, the dsRNA duplex of the Cas protein-binding segment can have a length of 36 base pairs. The percent complementarity between the nucleotide sequences that hybridize to form the dsRNA duplex of the protein-binding segment can be at least about 60%. For example, the percent complementarity between the nucleotide sequences that hybridize to form the dsRNA duplex of the protein-binding segment can be at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 98%, or at least about 99%. In some cases, the percent complementarity between the nucleotide sequences that hybridize to form the dsRNA duplex of the protein-binding segment is 100%.

Guide nucleic acids of the systems of the disclosure can include modifications or sequences that provide for additional desirable features (e.g., modified or regulated stability; subcellular targeting; tracking with a fluorescent label; a binding site for a protein or protein complex; and the like). Examples of such modifications include, for example, a 5′ cap (e.g., a 7-methylguanylate cap (m7G)); a 3′ polyadenylated tail (a 3′ poly(A) tail); a riboswitch sequence (e.g., to allow for regulated stability and/or regulated accessibility by proteins and/or protein complexes); a stability control sequence; a sequence that forms a dsRNA duplex (a hairpin); a modification or sequence that targets the RNA to a subcellular location (e.g., nucleus, mitochondria, chloroplasts, and the like); a modification or sequence that provides for tracking (e.g., direct conjugation to a fluorescent molecule, conjugation to a moiety that facilitates fluorescent detection, a sequence that allows for fluorescent detection, and so forth); a modification or sequence that provides a binding site for proteins (e.g., proteins that act on DNA, including transcriptional activators, transcriptional repressors, DNA methyl transferases, DNA demethylases, histone acetyltransferases, histone deacetylases, and the like); and combinations thereof.

A guide nucleic acid can comprise one or more modifications (e.g., a base modification, a backbone modification) , to provide the nucleic acid with a new or enhanced feature (e.g., improved stability) . A guide nucleic acid can comprise a nucleic acid affinity tag. A nucleoside can be a base-sugar combination. The base portion of the nucleotide can be a heterocyclic base. The two most common classes of such heterocyclic bases are the purines and the pyrimidines. Nucleotides can be nucleosides that further include a phosphate group covalently linked to the sugar portion of the nucleoside. For those nucleosides that include a pentofuranosyl sugar, the phosphate group can be linked to the 2′, the 3′, or the 5′ hydroxyl moiety of the sugar. In forming guide nucleic acids, the phosphate groups can covalently link adjacent nucleosides to one another to form a linear polymeric compound. In turn, the respective ends of this linear polymeric compound can be further joined to form a circular compound; however, linear compounds can be suitable. In addition, linear compounds can have internal nucleotide base complementarity and can therefore fold in a manner as to produce a fully or partially double-stranded compound. Further, within guide nucleic acids, the phosphate groups can commonly be referred to as forming the internucleoside backbone of the guide nucleic acid. The linkage or backbone of the guide nucleic acid can be a 3′ to 5′ phosphodiester linkage.

A guide nucleic acid can comprise a modified backbone and/or modified internucleoside linkages. Modified backbones can include those that retain a phosphorus atom in the backbone and those that do not have a phosphorus atom in the backbone.

Suitable modified guide nucleic acid backbones containing a phosphorus atom therein can include, for example, phosphorothioates, chiral phosphorothioates, phosphorodithioates, phosphotriesters, aminoalkylphosphotriesters, methyl and other alkyl phosphonates such as 3′-alkylene phosphonates, 5′-alkylene phosphonates, chiral phosphonates, phosphinates, phosphoramidates including 3′-amino phosphoramidate and aminoalkylphosphoramidates, phosphorodiamidates, thionophosphoramidates, thionoalkylphosphonates, thionoalkylphosphotriesters, selenophosphates, and boranophosphates having normal 3′-5′ linkages, 2′-5′ linked analogs, and those having inverted polarity wherein one or more internucleotide linkages is a 3′ to 3′, a 5′ to 5′ or a 2′ to 2′ linkage. Suitable guide nucleic acids having inverted polarity can comprise a single 3′ to 3′ linkage at the 3′-most internucleotide linkage (such as a single inverted nucleoside residue in which the nucleobase is missing or has a hydroxyl group in place thereof) . Various salts (e.g., potassium chloride or sodium chloride) , mixed salts, and free acid forms can also be included.

A guide nucleic acid can comprise one or more phosphorothioate and/or heteroatom internucleoside linkages, in particular -CH2-NH-O-CH2-, -CH2-N(CH3)-O-CH2- (a methylene (methylimino) or MMI backbone), -CH2-O-N(CH3)-CH2-, -CH2-N(CH3)-N(CH3)-CH2- and -O-N(CH3)-CH2-CH2- (wherein the native phosphodiester internucleotide linkage is represented as -O-P(=O)(OH)-O-CH2-).

A guide nucleic acid can comprise a morpholino backbone structure. For example, a nucleic acid can comprise a 6-membered morpholino ring in place of a ribose ring. In some of these embodiments, a phosphorodiamidate or other non-phosphodiester internucleoside linkage replaces a phosphodiester linkage.

A guide nucleic acid can comprise polynucleotide backbones that are formed by short chain alkyl or cycloalkyl internucleoside linkages, mixed heteroatom and alkyl or cycloalkyl internucleoside linkages, or one or more short chain heteroatomic or heterocyclic internucleoside linkages. These can include those having morpholino linkages (formed in part from the sugar portion of a nucleoside) ; siloxane backbones; sulfide, sulfoxide and sulfone backbones; formacetyl and thioformacetyl backbones; methylene formacetyl and thioformacetyl backbones; riboacetyl backbones; alkene containing backbones; sulfamate backbones; methyleneimino and methylenehydrazino backbones; sulfonate and sulfonamide backbones; amide backbones; and others having mixed N, O, S and CH2 component parts.

A guide nucleic acid can comprise a nucleic acid mimetic. The term “mimetic” can be intended to include polynucleotides wherein only the furanose ring or both the furanose ring and the internucleotide linkage are replaced with non-furanose groups; replacement of only the furanose ring can also be referred to as being a sugar surrogate. The heterocyclic base moiety or a modified heterocyclic base moiety can be maintained for hybridization with an appropriate target nucleic acid. One such nucleic acid can be a peptide nucleic acid (PNA). In a PNA, the sugar-backbone of a polynucleotide can be replaced with an amide containing backbone, in particular an aminoethylglycine backbone. The nucleobases can be retained and are bound directly or indirectly to aza nitrogen atoms of the amide portion of the backbone. The backbone in PNA compounds can comprise two or more linked aminoethylglycine units, which gives PNA an amide containing backbone. The heterocyclic base moieties can be bound directly or indirectly to aza nitrogen atoms of the amide portion of the backbone.

A guide nucleic acid can comprise linked morpholino units (morpholino nucleic acid) having heterocyclic bases attached to the morpholino ring. Linking groups can link the morpholino monomeric units in a morpholino nucleic acid. Non-ionic morpholino-based oligomeric compounds can have fewer undesired interactions with cellular proteins. Morpholino-based polynucleotides can be non-ionic mimics of guide nucleic acids. A variety of compounds within the morpholino class can be joined using different linking groups. A further class of polynucleotide mimetic can be referred to as cyclohexenyl nucleic acids (CeNA). The furanose ring normally present in a nucleic acid molecule can be replaced with a cyclohexenyl ring. CeNA DMT protected phosphoramidite monomers can be prepared and used for oligomeric compound synthesis using phosphoramidite chemistry. The incorporation of CeNA monomers into a nucleic acid chain can increase the stability of a DNA/RNA hybrid. CeNA oligoadenylates can form complexes with nucleic acid complements with similar stability to the native complexes. A further modification can include Locked Nucleic Acids (LNAs) in which the 2′-hydroxyl group is linked to the 4′ carbon atom of the sugar ring, thereby forming a 2′-C, 4′-C-oxymethylene linkage and a bicyclic sugar moiety. The linkage can be a methylene (-CH2-)n group bridging the 2′ oxygen atom and the 4′ carbon atom, wherein n is 1 or 2. LNA and LNA analogs can display very high duplex thermal stabilities with complementary nucleic acid (ΔTm = +3 to +10℃), stability towards 3′-exonucleolytic degradation and good solubility properties.

A guide nucleic acid can comprise one or more substituted sugar moieties. Suitable polynucleotides can comprise a sugar substituent group selected from: OH; F; O-, S-, or N-alkyl; O-, S-, or N-alkenyl; O-, S-, or N-alkynyl; or O-alkyl-O-alkyl, wherein the alkyl, alkenyl and alkynyl can be substituted or unsubstituted C₁ to C₁₀ alkyl or C₂ to C₁₀ alkenyl and alkynyl. Particularly suitable are O((CH₂)ₙO)ₘCH₃, O(CH₂)ₙOCH₃, O(CH₂)ₙNH₂, O(CH₂)ₙCH₃, O(CH₂)ₙONH₂, and O(CH₂)ₙON((CH₂)ₙCH₃)₂, where n and m are from 1 to about 10. A sugar substituent group can be selected from: C₁ to C₁₀ lower alkyl, substituted lower alkyl, alkenyl, alkynyl, alkaryl, aralkyl, O-alkaryl or O-aralkyl, SH, SCH₃, OCN, Cl, Br, CN, CF₃, OCF₃, SOCH₃, SO₂CH₃, ONO₂, NO₂, N₃, NH₂, heterocycloalkyl, heterocycloalkaryl, aminoalkylamino, polyalkylamino, substituted silyl, an RNA cleaving group, a reporter group, an intercalator, a group for improving the pharmacokinetic properties of a guide nucleic acid, or a group for improving the pharmacodynamic properties of a guide nucleic acid, and other substituents having similar properties. A suitable modification can include 2′-methoxyethoxy (2′-O-CH₂CH₂OCH₃, also known as 2′-O-(2-methoxyethyl) or 2′-MOE, an alkoxyalkoxy group). A further suitable modification can include 2′-dimethylaminooxyethoxy (an O(CH₂)₂ON(CH₃)₂ group, also known as 2′-DMAOE), 2′-dimethylaminoethoxyethoxy (also known as 2′-O-dimethyl-amino-ethoxy-ethyl or 2′-DMAEOE), or 2′-O-CH₂-O-CH₂-N(CH₃)₂.

Other suitable sugar substituent groups can include methoxy (-O-CH₃), aminopropoxy (-O-CH₂CH₂CH₂NH₂), allyl (-CH₂-CH=CH₂), -O-allyl (-O-CH₂-CH=CH₂) and fluoro (F). 2′-sugar substituent groups can be in the arabino (up) position or ribo (down) position. A suitable 2′-arabino modification is 2′-F. Similar modifications can also be made at other positions on the oligomeric compound, particularly the 3′ position of the sugar on the 3′ terminal nucleoside or in 2′-5′ linked nucleotides and the 5′ position of the 5′ terminal nucleotide. Oligomeric compounds can also have sugar mimetics such as cyclobutyl moieties in place of the pentofuranosyl sugar.

A guide nucleic acid can also include nucleobase (or “base” ) modifications or substitutions. As used herein, “unmodified” or “natural” nucleobases can include the purine bases, (e.g. adenine (A) and guanine (G) ) , and the pyrimidine bases, (e.g. thymine (T) , cytosine (C) and uracil (U) ) . Modified nucleobases can include other synthetic and natural nucleobases such as 5-methylcytosine (5-me-C) , 5-hydroxymethyl cytosine, xanthine, hypoxanthine, 2-aminoadenine, 6-methyl and other alkyl derivatives of adenine and guanine, 2-propyl and other alkyl derivatives of adenine and guanine, 2-thiouracil, 2-thiothymine and 2-thiocytosine, 5-halouracil and cytosine, 5-propynyl (-C≡C-CH₃) uracil and cytosine and other alkynyl derivatives of pyrimidine bases, 6-azo uracil, cytosine and thymine, 5-uracil (pseudouracil) , 4-thiouracil, 8-halo, 8-amino, 8-thiol, 8-thioalkyl, 8-hydroxyl and other 8-substituted adenines and guanines, 5-halo particularly 5-bromo, 5-trifluoromethyl and other 5-substituted uracils and cytosines, 7-methylguanine and 7-methyladenine, 2-F-adenine, 2-amino-adenine, 8-azaguanine and 8-azaadenine, 7-deazaguanine and 7-deazaadenine and 3-deazaguanine and 3-deazaadenine. Modified nucleobases can include tricyclic pyrimidines such as phenoxazine cytidine (1H-pyrimido (5, 4-b) (1, 4) benzoxazin-2 (3H) -one) , phenothiazine cytidine (1H-pyrimido (5, 4-b) (1, 4) benzothiazin-2 (3H) -one) , G-clamps such as a substituted phenoxazine cytidine (e.g. 9- (2-aminoethoxy) -H-pyrimido (5, 4- (b) (1, 4) benzoxazin-2 (3H) -one) , carbazole cytidine (2H-pyrimido (4, 5-b) indol-2-one) , pyridoindole cytidine ( (3′, 2′: 4, 5) pyrrolo (2, 3-d) pyrimidin-2-one) .

Heterocyclic base moieties can include those in which the purine or pyrimidine base is replaced with other heterocycles, for example 7-deaza-adenine, 7-deazaguanosine, 2-aminopyridine and 2-pyridone. Nucleobases can be useful for increasing the binding affinity of a polynucleotide compound. These can include 5-substituted pyrimidines, 6-azapyrimidines and N-2, N-6 and O-6 substituted purines, including 2-aminopropyladenine, 5-propynyluracil and 5-propynylcytosine. 5-methylcytosine substitutions can increase nucleic acid duplex stability by 0.6-1.2℃ and can be suitable base substitutions (e.g., when combined with 2′-O-methoxyethyl sugar modifications) .

A modification of a guide nucleic acid can comprise chemically linking to the guide nucleic acid one or more moieties or conjugates that can enhance the activity, cellular distribution or cellular uptake of the guide nucleic acid. These moieties or conjugates can include conjugate groups covalently bound to functional groups such as primary or secondary hydroxyl groups. Conjugate groups can include, but are not limited to, intercalators, reporter molecules, polyamines, polyamides, polyethylene glycols, polyethers, groups that enhance the pharmacodynamic properties of oligomers, and groups that can enhance the pharmacokinetic properties of oligomers. Conjugate groups can include, but are not limited to, cholesterols, lipids, phospholipids, biotin, phenazine, folate, phenanthridine, anthraquinone, acridine, fluoresceins, rhodamines, coumarins, and dyes. Groups that enhance the pharmacodynamic properties include groups that improve uptake, enhance resistance to degradation, and/or strengthen sequence-specific hybridization with the target nucleic acid. Groups that can enhance the pharmacokinetic properties include groups that improve uptake, distribution, metabolism or excretion of a nucleic acid. Conjugate moieties can include but are not limited to lipid moieties such as a cholesterol moiety, cholic acid, a thioether (e.g., hexyl-S-tritylthiol), a thiocholesterol, an aliphatic chain (e.g., dodecandiol or undecyl residues), a phospholipid (e.g., di-hexadecyl-rac-glycerol or triethylammonium 1,2-di-O-hexadecyl-rac-glycero-3-H-phosphonate), a polyamine or a polyethylene glycol chain, or adamantane acetic acid, a palmityl moiety, or an octadecylamine or hexylamino-carbonyl-oxycholesterol moiety.

In some embodiments, the at least one guide RNA polynucleotide of a system or method provided herein can bind to at least a portion of a genome (e.g., a plant genome) or a gene (e.g., a plant gene) . In some cases, the at least one guide RNA polynucleotide is capable of forming a complex with a Cas12a protein to direct the protein to target the portion of a target nucleic acid (e.g., a site in a genome or a gene) .

In some embodiments, the systems described herein comprise at least one guide RNA polynucleotide that is able to form a complex with a Cas12a protein or fusion protein of the system. In some embodiments, the systems described herein comprise at least two (e.g., at least three, at least four, at least five, or at least six) different guide RNA polynucleotides that are able to form a complex with a site-directed nuclease portion of a fusion protein of the system.

In some embodiments, the guide nucleic acid comprises a nucleotide sequence having at least 70% (e.g., at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100%) identity to any one of SEQ ID NOs: 27-34 as set forth in Table 5.

Table 5. Exemplary gRNA sequences

Also provided herein are kits that include the components of the systems described in this disclosure. In some embodiments, the kits include one or more of the fusion proteins and/or polynucleotides described herein.

VI. Methods

In another aspect, provided herein are methods for editing one or more nucleic acids using the Cas12a proteins, fusion proteins and/or systems described herein. In some embodiments, the methods comprise contacting a nucleic acid (i.e., the nucleic acid to be edited) with at least one Cas12a protein and/or fusion protein as described herein. In some embodiments, the methods further comprise contacting the nucleic acid with a guide RNA (e.g., as described in Section V above) having a region complementary to a selected portion of the nucleic acid. In some embodiments, contacting the nucleic acid with the Cas12a protein and/or fusion protein and the guide RNA results in an edit to the nucleic acid. The nucleic acid (i.e., the nucleic acid to be edited) can be any suitable nucleic acid. In some embodiments, the nucleic acid is a portion of a chromosome. In some embodiments, the nucleic acid is a portion of a genome (e.g., a plant genome) .

As described herein and demonstrated in the Examples below, the methods provided herein can result in increased frequency of one or more desired nucleic acid editing outcomes (e.g., SDN-1 editing) . In some embodiments, SDN-1 editing efficiency can be measured by dividing the number of plants with an insertion or deletion ( “indel” ) by the total number of transgenic plants. In some embodiments, use of a Cas12a protein or fusion protein provided herein results in an increase in SDN-1 editing efficiency relative to use of an unmodified (i.e., wild-type) Cas12a protein. In some embodiments, indel events can be further analyzed for the occurrence of homozygous edits (i.e., the same indel is present at both alleles of the target nucleic acid) and biallelic edits (i.e., different indels are present at each allele of the target nucleic acid) . In some embodiments, the rate of homozygous/biallelic edits can be measured by dividing the number of plants with homozygous/biallelic edits by the total number of plants with indels. In some embodiments, use of a Cas12a protein or fusion protein provided herein results in an increase in the rate of homozygous/biallelic edits.
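The two ratios defined above can be written out as a short calculation. The sketch below is illustrative only; the function names are hypothetical and the counts in the usage note are made-up placeholders, not data from the Examples.

```python
def sdn1_efficiency(plants_with_indel, total_transgenic_plants):
    """Plants carrying an indel divided by all transgenic plants."""
    if total_transgenic_plants <= 0:
        raise ValueError("total_transgenic_plants must be positive")
    return plants_with_indel / total_transgenic_plants


def homozygous_biallelic_rate(plants_hom_or_biallelic, plants_with_indel):
    """Plants with homozygous or biallelic edits divided by plants with indels."""
    if plants_with_indel <= 0:
        raise ValueError("plants_with_indel must be positive")
    return plants_hom_or_biallelic / plants_with_indel


def fold_change(variant_efficiency, wildtype_efficiency):
    """Efficiency of a variant relative to the wild-type control."""
    if wildtype_efficiency <= 0:
        raise ValueError("wildtype_efficiency must be positive")
    return variant_efficiency / wildtype_efficiency
```

For example, with 12 indel-bearing plants among 48 transgenic plants, `sdn1_efficiency(12, 48)` gives 0.25; if 6 of those 12 carry homozygous or biallelic edits, `homozygous_biallelic_rate(6, 12)` gives 0.5. Note that the two ratios use different denominators.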

The methods herein comprise providing a Cas12a protein and/or fusion protein and a nucleic acid to be edited and can also comprise providing at least one guide RNA. These various components can be provided using any suitable technique. For example, providing a Cas12a protein or fusion protein can comprise introducing the Cas12a protein or fusion protein into a cell or introducing a recombinant nucleic acid, construct, or vector encoding the Cas12a protein or fusion protein into a cell. Similarly, a gRNA can be provided by introducing the gRNA itself or a nucleic acid sequence encoding the gRNA. In some embodiments, a Cas12a protein and/or fusion protein and a gRNA can be encoded by the same DNA construct or vector.

EXAMPLES

Example 1. C965S acts synergistically with D156R to improve the efficiency of SDN1 editing at difficult target sites in maize

By analyzing the crystal structure of LbCas12a (PDB entry 5XUS), two surface-exposed cysteine residues (Cys965 and Cys1090) and another close to the N-terminus (Cys10) of the native LbCas12a protein (SEQ ID NO: 1) were selected for site-directed mutagenesis (FIG. 1). A total of five LbCas12a variants were generated: in the first variant, the Cys965 residue was mutated to a serine residue, named LbCas12a-C965S; in the second variant, both the Cys10 and Cys965 residues were mutated to serine residues, named LbCas12a-C10S-C965S; in the third variant, both the Cys965 and Cys1090 residues were mutated to serine residues, named LbCas12a-C965S-C1090S; in the fourth variant, the Cys965 residue was mutated to a serine residue, while the Asp156 residue was mutated to an arginine residue, named LbCas12a-D156R-C965S; in the fifth variant, only the Asp156 residue was mutated to an arginine residue, named LbCas12a-D156R. The coding sequence of LbCas12a was optimized based on maize-preferred codon usage; the codon triplet of selected cysteine residues, TGC, was mutated to serine-coding TCC by introducing the mutation in overlapping PCR primers. For all five variants plus the wildtype LbCas12a as a control, an SV40 NLS (SEQ ID NO: 56) was fused to the N-terminus via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker, while two SV40 NLSs separated by an 8-amino acid (SGGS)₂ (SEQ ID NO: 78) peptide linker were fused to the C-terminus, also via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker.
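The TGC-to-TCC codon change described above can be illustrated in silico as follows. This is a sketch only: the function name `mutate_codon` is hypothetical, the sequence in the usage note is a toy sequence rather than the LbCas12a coding sequence, and residue numbering is assumed to start at the first codon of the supplied coding sequence (i.e., without any N-terminal NLS or linker).

```python
def mutate_codon(cds, residue, expected_codon, new_codon):
    """Replace the codon for a 1-based residue number in a coding sequence.

    Raises ValueError if the codon found at that position is not the one
    expected, which guards against off-by-one or frame errors.
    """
    start = 3 * (residue - 1)
    found = cds[start:start + 3].upper()
    if found != expected_codon.upper():
        raise ValueError(
            f"expected {expected_codon} at residue {residue}, found {found!r}"
        )
    return cds[:start] + new_codon + cds[start + 3:]
```

On the toy sequence `ATGTGCAAA` (Met-Cys-Lys), `mutate_codon("ATGTGCAAA", 2, "TGC", "TCC")` returns `ATGTCCAAA` (Met-Ser-Lys). In the actual coding sequence the same call would use residue 965, 10 or 1090, provided the numbering assumption above holds.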

For each of the five variants plus the wildtype control, a binary vector was constructed to express one variant (or the control) in stable transgenic maize plants, in order to assess the SDN1-generation performance of the variant. In each construct, the coding sequence of one variant, fused with an NLS, was operably linked to a sugarcane ubiquitin 4 gene promoter and an Agrobacterium tumefaciens nopaline synthase gene terminator, for strong constitutive expression in maize cells. In all constructs, the same gRNA array, driven by an Oryza sativa U6 promoter, was designed to express a gRNA targeting the maize gene Starch Branching Enzyme IIb (ZmSBEIIb). The gRNA was based on the mature crRNA scaffold of LbCas12a. Transgenic maize plants were generated by infecting calli derived from immature maize embryos with an Agrobacterium tumefaciens strain harboring one of the binary vectors described above, followed by tissue culture procedures.

The leaf sheaths of regenerated plantlets were sampled for DNA extraction, and the transgenic plants were identified by TaqMan qPCR assays. The sequence spanning the target site was PCR-amplified and Sanger-sequenced, in order to determine the genotype and the SDN1 efficiency at the target site. As summarized in Tables 6 and 7, the SDN1 efficiencies at the SBEIIb target site and the Wx1 target site were compared. The SBEIIb target site is difficult to edit with Cas12a. In comparison with the wildtype, C965S alone slightly improved the overall SDN1 efficiency at SBEIIb but not the rate of homozygous/biallelic edits. Neither C10S nor C1090S exhibited a positive effect on top of C965S. In contrast, when paired with D156R, C965S increased the overall SDN1 efficiency by 4-fold over the wildtype, with more than half being homozygous or biallelic edits; in comparison, D156R alone increased the SDN1 efficiency at the ZmSBEIIb target site by 3-fold, with about half being homozygous or biallelic edits. With respect to Wx1, SDN1 editing efficiencies were similar to, or only modestly improved over, the wildtype.
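The quantities compared above (overall SDN1 efficiency, rate of homozygous/biallelic edits, and fold change over a control) can be computed from per-plant genotype calls roughly as follows. This is a sketch under stated assumptions: the genotype labels, the function names, and especially the plant counts are hypothetical placeholders and do not reproduce the data of Tables 6 and 7.

```python
from collections import Counter

# Assumed genotype labels for Sanger-based calls at one target site.
EDITED_LABELS = ("heterozygous", "homozygous", "biallelic")


def summarize_sdn1(genotypes):
    """Return overall and homozygous/biallelic edit rates for a list of calls."""
    total = len(genotypes)
    if total == 0:
        raise ValueError("no transgenic plants were genotyped")
    counts = Counter(genotypes)
    edited = sum(counts[label] for label in EDITED_LABELS)
    full = counts["homozygous"] + counts["biallelic"]
    return {"overall": edited / total, "homo_biallelic": full / total}


def fold_change(variant_rate, control_rate):
    """Ratio of a variant's rate to the control's rate."""
    if control_rate == 0:
        raise ValueError("control rate is zero; fold change is undefined")
    return variant_rate / control_rate


# Hypothetical placeholder calls, for illustration only.
control_calls = ["wildtype"] * 18 + ["heterozygous"] * 2
variant_calls = ["wildtype"] * 12 + ["heterozygous"] * 3 + ["biallelic"] * 5

control_summary = summarize_sdn1(control_calls)
variant_summary = summarize_sdn1(variant_calls)
overall_fold = fold_change(variant_summary["overall"], control_summary["overall"])
```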

Table 6. SDN1 editing efficiencies of LbCas12a variants in maize.

Table 7. SDN1 editing efficiencies of LbCas12a variants in maize.

Example 2. C965S acts synergistically with D156R to improve the efficiency of SDN1 editing in soybean

In order to assess the efficacy of C965S in improving SDN1-inducing efficiency of LbCas12a in soybean, two LbCas12a variants, LbCas12a-D156R and LbCas12a-D156R-C965S, will be compared for their SDN1-generating performance. The two variants are identical to those tested in maize as described in Example 1, except that the coding sequences were optimized based on Arabidopsis-preferred codon usage.

For each variant, two binary vectors were constructed to test the SDN1 efficiency at different target loci. In each construct, the coding sequence of one variant, fused with an NLS, is operably linked to a promoter, such as an Arabidopsis elongation factor 1 alpha (EF1α) promoter, and a terminator, such as an Agrobacterium tumefaciens nopaline synthase gene terminator, for strong constitutive expression in soybean cells. In all constructs, a gRNA or a gRNA array, driven by a soybean ubiquitin 1 promoter, is designed to express gRNA(s) targeting a site in the soybean genome, such as FAD2 (SEQ ID NO: 38 provides the LbCas12a gRNA targeting the soybean FAD2-1A gene). The gRNA(s) are based on the mature crRNA scaffold of LbCas12a, and are processed by self-cleaving ribozymes on the flanks. Transgenic soybean plants are generated by infecting mature soybean seeds with an Agrobacterium tumefaciens strain harboring one of the binary vectors described above, followed by tissue culture procedures.

The leaves of regenerated plantlets will be sampled for DNA extraction, and the transgenic plants will be identified by TaqMan assays. The sequences spanning each of the three target sites will be PCR-amplified and Sanger-sequenced, in order to determine the genotypes and the SDN1 efficiencies at the target sites.

Example 3. Generation and identification of FnCas12a cysteine-substituted variants with enhanced in planta SDN1 editing efficiency.

A total of 9 cysteine residues (Cys70, Cys473, Cys568, Cys717, Cys882, Cys1086, Cys1116, Cys1190 and Cys1196) exist in the FnCas12a primary sequence (SEQ ID NO: 2). The crystal structure of FnCas12a (PDB entries 5NFV and 6I1K) suggested four cysteine residues (Cys70, Cys473, Cys1116, and Cys1190) are most likely surface-exposed, and thus might be prone to undesired interactions and/or modifications. The surface topography around Cys473 suggests it is difficult for an interacting protein or a modification enzyme to access, while our PyMOL analysis suggests that Cys1116 and Cys1190 are likely to form an intramolecular disulfide bond (FIG. 2). Therefore Cys70, Cys1116, and Cys1190 were selected for substitution.
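The first step of the selection described above, enumerating cysteine positions in a primary sequence and then narrowing them to a shortlist, might be sketched as follows. The helper names and the toy sequence are hypothetical; the surface-exposure call itself came from structural analysis and is simply passed in here as a list of positions.

```python
def cysteine_positions(protein):
    """Return the 1-based positions of every cysteine in a protein sequence."""
    return [i for i, residue in enumerate(protein, start=1) if residue == "C"]


def shortlist(exposed, excluded):
    """Keep surface-exposed positions that were not ruled out for other reasons."""
    return [pos for pos in exposed if pos not in excluded]


# Hypothetical toy sequence, NOT SEQ ID NO: 2.
toy_positions = cysteine_positions("MCKLCAAC")

# Positions taken from the paragraph above: four exposed cysteines,
# with Cys473 set aside because of the surrounding surface topography.
fn_selected = shortlist([70, 473, 1116, 1190], excluded={473})
```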

All cysteine-substituted variants were generated on the basis of the FnCas12a-E184R variant. Three variants carrying a single Cys-to-Ser substitution (FnCas12a-E184R-C70S, FnCas12a-E184R-C1116S, FnCas12a-E184R-C1190S) were generated, as well as three variants carrying double Cys-to-Ser substitutions (FnCas12a-E184R-C70S-C1116S, FnCas12a-E184R-C70S-C1190S, FnCas12a-E184R-C1116S-C1190S). The coding sequence of FnCas12a was optimized based on maize-preferred codon usage; the codon triplet of selected cysteine residues, TGC, was mutated to serine-coding TCC for C1116 and AGC for C70 and C1190 by introducing the mutation in overlapping PCR primers. For all six variants plus the FnCas12a-E184R as a control, an SV40 NLS (SEQ ID NO: 56) was fused to the N-terminus via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker, while two SV40 NLSs separated by an 8-amino acid (SGGS)₂ (SEQ ID NO: 78) peptide linker were fused to the C-terminus, also via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker.
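The set of single and double substitution variants listed above follows a simple combinatorial pattern, which can be enumerated as in the sketch below. The function name `variant_names` is an assumption for illustration; the base name and positions are those stated in the paragraph above.

```python
from itertools import combinations


def variant_names(base, cys_positions, sizes=(1, 2), replacement="S"):
    """Build variant names such as 'FnCas12a-E184R-C70S-C1116S'.

    `sizes` lists how many cysteines are substituted at once.
    """
    names = []
    for size in sizes:
        for combo in combinations(cys_positions, size):
            suffix = "-".join(f"C{pos}{replacement}" for pos in combo)
            names.append(f"{base}-{suffix}")
    return names


fn_variants = variant_names("FnCas12a-E184R", [70, 1116, 1190])
for name in fn_variants:
    print(name)
```

With three selected positions this yields three single and three double variants, matching the six variants named in the text.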

For each of the six variants plus the FnCas12a-E184R control, a binary vector was constructed to express one variant (or the control) in stable transgenic maize plants, in order to assess the SDN1-generation performance of the variant. In each construct, the coding sequence of one variant, fused with a NLS, was operably linked to a sugarcane ubiquitin 4 gene promoter and an Agrobacterium tumefaciens nopaline synthase gene terminator, for strong constitutive expression in maize cells. In all constructs, a same gRNA array, driven by an Oryza sativa U6 promoter, was designed to express three gRNAs targeting three different maize genes: Waxy1 (ZmWx1) , Glossy2 (ZmGL2) , and Starch Branching Enzyme IIb (ZmSBEIIb) . The gRNAs were based on the mature crRNA scaffold of FnCas12a. Transgenic maize plants will be generated by infecting calli derived from immature maize embryos with Agrobacterium tumefaciens strain harboring one of the binary vectors described above, followed by tissue culture procedures.

The leaf sheaths of regenerated plantlets will be sampled for DNA extraction, and the transgenic plants will be identified by TaqMan assays. The sequences spanning each of the three target sites will be PCR-amplified and Sanger-sequenced, in order to determine the genotypes and the SDN1 efficiencies at the target sites. Both the overall SDN1 editing efficiency and the rate of homozygous/biallelic mutants of each variant will be compared to those of the FnCas12a-E184R control, to assess the efficacy of the cysteine substitutions.

Example 4. Generation and identification of AsCas12a cysteine-substituted variants with enhanced in planta SDN1 editing efficiency.

A total of 8 cysteine residues (Cys65, Cys205, Cys334, Cys379, Cys608, Cys674, Cys1025, and Cys1248) exist in the AsCas12a primary sequence (SEQ ID NO: 3) . The crystal structure of AsCas12a (PDB entry 5KK5) suggests three cysteine residues: Cys334, Cys379, and Cys674 are most likely surface-exposed and thus prone to undesired interactions and/or modifications. These three residues were selected for substitution.

All cysteine-substituted variants were generated on the basis of AsCas12a-E174R variant. Three variants carrying single Cys-to-Ser substitution (AsCas12a-E174R-C334S, AsCas12a-E174R-C379S, AsCas12a-E174R-C674S) were generated, as well as three variants carrying double Cys-to-Ser substitutions (AsCas12a-E174R-C334S-C379S, AsCas12a-E174R-C334S-C674S, AsCas12a-E174R-C379S-C674S) . The coding sequence of AsCas12a was optimized based on maize-preferred codon usage; the codon triplet of selected cysteine residues, TGC, was mutated to serine-coding TCC by introducing the mutation in overlapping PCR primers. For all six variants plus the AsCas12a-E174R as a control, an SV40 NLS (SEQ ID NO: 56) was fused to the N-terminus via a flexible, 30-amino acid (GSSSS) ₆ (SEQ ID NO: 46) peptide linker, while two SV40 NLS’s separated by an 8-amino acid (SGGS) ₂ (SEQ ID NO: 78) peptide linker were fused to the C-terminus via a flexible 30-amino acid (GSSSS) ₆ (SEQ ID NO: 46) peptide linker as well.
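The N-to-C-terminal layout of the fusion described above, and the stated linker lengths, can be checked with a short sketch. Only the linker repeat units and lengths come from the text; the labels "NLS" and the Cas12a label are placeholders, because the sequences of SEQ ID NO: 56 and the variants are not reproduced here.

```python
# Linker sequences built from the repeat units named in the text.
LINKER_30 = "GSSSS" * 6  # (GSSSS)6, SEQ ID NO: 46
LINKER_8 = "SGGS" * 2    # (SGGS)2, SEQ ID NO: 78


def fusion_layout(cas_label):
    """Return the ordered modules of the fusion, N-terminus first.

    "NLS" and `cas_label` are placeholder labels, not real sequences.
    """
    return ["NLS", LINKER_30, cas_label, LINKER_30, "NLS", LINKER_8, "NLS"]


layout = fusion_layout("AsCas12a-E174R-C334S")
print(" - ".join(layout))
```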

For each of the six variants plus the AsCas12a-E174R control, a binary vector was constructed to express one variant (or the control) in stable transgenic maize plants, in order to assess the SDN1-generation performance of the variant. In each construct, the coding sequence of one variant, fused with an NLS, was operably linked to a sugarcane ubiquitin 4 gene promoter and an Agrobacterium tumefaciens nopaline synthase gene terminator, for strong constitutive expression in maize cells. In all constructs, the same gRNA array, driven by an Oryza sativa U6 promoter, was designed to express three gRNAs targeting three different maize genes: Waxy1 (ZmWx1), Glossy2 (ZmGL2), and Starch Branching Enzyme IIb (ZmSBEIIb). The gRNAs were based on the mature crRNA scaffold of AsCas12a. Transgenic maize plants will be generated by infecting calli derived from immature maize embryos with an Agrobacterium tumefaciens strain harboring one of the binary vectors described above, followed by tissue culture procedures.

The leaf sheaths of regenerated plantlets will be sampled for DNA extraction, and the transgenic plants will be identified by TaqMan assays. The sequences spanning each of the three target sites will be PCR-amplified and Sanger-sequenced, in order to determine the genotypes and the SDN1 efficiencies at the target sites. Both the overall SDN1 editing efficiency and the rate of homozygous/biallelic mutants of each variant will be compared to those of the AsCas12a-E174R control, to assess the efficacy of the cysteine substitutions.

Example 5. Generation and identification of Mb2Cas12a cysteine-substituted variants with enhanced in planta SDN1 editing efficiency.

Because there is no published crystal structure of Mb2Cas12a (from Moraxella bovoculi strain 57922) to date, the crystal structure of MbCas12a (PDB entry 6IV6) from M. bovoculi strain 22581, the closest ortholog sharing 94.7% amino acid identity with Mb2Cas12a, was used as a reference structure to estimate the location of the cysteine residues in Mb2Cas12a. A total of 8 cysteine residues (Cys270, Cys307, Cys583, Cys662, Cys1068, Cys1099, Cys1149, and Cys1162) exist in the primary sequence of the Mb2Cas12a from strain 57922 (SEQ ID NO: 4), which correspond to Cys283, Cys320, Cys593, Cys672, Cys1078, Cys1109, Cys1159, and Tyr1172 in MbCas12a from strain 22581, respectively. This estimation suggests Cys270, Cys307, Cys583, Cys1068, Cys1099, Cys1149 and Cys1162 are likely exposed on the surface of Mb2Cas12a. Since Cys1162 of Mb2Cas12a aligns to Tyr1172 in MbCas12a, Tyr1172 was mutated in 6IV6 and the structure was remodeled with PyMOL. The resulting structure model suggests Cys1162 is also likely surface-exposed in Mb2Cas12a. However, the surface topology suggested Cys1162 is difficult for an interacting protein or a modification enzyme to access. Therefore Cys270, Cys583, Cys1068, Cys1099, and Cys1149 were selected for site-directed mutagenesis.
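The residue correspondence used above (for example, a cysteine in one ortholog aligning to a tyrosine in the other) can be derived from a pairwise alignment by walking both gapped sequences in parallel. The sketch below assumes the alignment has already been computed by some external tool; the function name and the toy aligned strings are hypothetical and are not the Mb2Cas12a or MbCas12a sequences.

```python
from typing import Dict, Optional, Tuple


def map_positions(aligned_query: str, aligned_ref: str) -> Dict[int, Optional[Tuple[int, str]]]:
    """Map 1-based query positions to (reference position, reference residue).

    Both inputs are rows of one pairwise alignment with '-' for gaps.
    A query residue aligned to a gap maps to None.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    mapping: Dict[int, Optional[Tuple[int, str]]] = {}
    q_pos = 0
    r_pos = 0
    for q_res, r_res in zip(aligned_query, aligned_ref):
        if r_res != "-":
            r_pos += 1
        if q_res != "-":
            q_pos += 1
            mapping[q_pos] = (r_pos, r_res) if r_res != "-" else None
    return mapping


# Hypothetical toy alignment: the reference has one extra residue, and the
# last query cysteine aligns to a tyrosine in the reference.
toy_map = map_positions("MK-CLC", "MKACLY")
```

Numbering offsets between orthologs, such as the differing position numbers quoted in the text, arise in the same way from insertions or deletions upstream of the residue of interest.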

All cysteine-substituted variants were generated on the basis of the Mb2Cas12a-D172R variant, which was the control for the new variants. Five variants carrying a single Cys-to-Ser substitution (Mb2Cas12a-D172R-C270S, Mb2Cas12a-D172R-C583S, Mb2Cas12a-D172R-C1068S, Mb2Cas12a-D172R-C1099S, Mb2Cas12a-D172R-C1149S) were generated, as well as one variant carrying quintuple Cys-to-Ser substitutions (Mb2Cas12a-D172R-C270S-C583S-C1068S-C1099S-C1149S) and one carrying quintuple Cys-to-Ala substitutions (Mb2Cas12a-D172R-C270A-C583A-C1068A-C1099A-C1149A). The coding sequence of Mb2Cas12a was optimized based on maize-preferred codon usage; for the five single-mutation variants, the codon triplet of selected cysteine residues, TGC, was mutated to serine-coding TCC by introducing the mutation in overlapping PCR primers. For the variants with quintuple mutations, the Mb2Cas12a coding sequence was synthesized with each selected TGC codon replaced by serine-coding TCC or alanine-coding GCC. For all seven variants plus the Mb2Cas12a-D172R as a control, an SV40 NLS (SEQ ID NO: 56) was fused to the N-terminus via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker, while two SV40 NLSs separated by an 8-amino acid (SGGS)₂ (SEQ ID NO: 78) peptide linker were fused to the C-terminus, also via a flexible 30-amino acid (GSSSS)₆ (SEQ ID NO: 46) peptide linker.

For each of the six variants plus the Mb2Cas12a-D172R control, a binary vector was constructed to express one variant (or the control) in stable transgenic maize plants, in order to assess the SDN1-generation performance of the variant. In each construct, the coding sequence of one variant, fused with NLS, was operably linked to a sugarcane ubiquitin 4 gene promoter and an Agrobacterium tumefaciens nopaline synthase gene terminator, for strong constitutive expression in maize cells. In all constructs, a same gRNA array, driven by sugarcane ubiquitin 4 gene promoter and an Agrobacterium tumefaciens nopaline synthase gene terminator, was designed to express four gRNAs targeting four different maize genes: Waxy1 (ZmWx1) , Benzoxazinone synthesis 9 (ZmBx9) , Glossy2 (ZmGL2) , and ZmBINa. The gRNAs were based on the mature crRNA scaffold of LbCas12a and processed by self-cleaving ribozymes on the flanks. Transgenic maize plants were generated by infecting calli derived from immature maize embryos with Agrobacterium tumefaciens strain harboring one of the binary vectors described above, followed by tissue culture procedures.

The leaf sheaths of regenerated plantlets were sampled for DNA extraction, and the transgenic plants were identified by TaqMan assays. The sequences spanning each of the three target sites were PCR-amplified and Sanger-sequenced, in order to determine the genotypes and the SDN1 efficiencies at the target sites. As summarized in Table 8, in comparison with the Mb2Cas12a-D172R control, all variants with a single Cys-to-Ser mutation increased the rate of homozygous/biallelic mutants. The efficacy of stacking five cysteine mutations will be determined similarly.

Table 8. SDN1 editing efficiencies of Mb2Cas12a variants in maize.

LIST OF REFERENCED SEQUENCES

SEQ ID NO: 1 -Lachnospiraceae bacterium Cas12a protein (LbCas12a)

SEQ ID NO: 2 -Francisella novicida U112 Cas12a protein (FnCas12a)

SEQ ID NO: 3 -Acidaminococcus sp. Cas12a protein (AsCas12a)

SEQ ID NO: 4 -Moraxella bovoculi strain 57922 Cas12a protein (Mb2Cas12a)

SEQ ID NO: 5 --amino acid sequence of LbCas12a + linker:

SEQ ID NO: 6 --amino acid sequence of LbCas12a D156R:

SEQ ID NO: 7 --amino acid sequence of LbCas12a + D156R + C965S:

SEQ ID NO: 8 --amino acid sequence of LbCas12a + C10S + C965S:

SEQ ID NO: 9 --amino acid sequence of LbCas12a + C965S + C1090S:

SEQ ID NO: 10 --amino acid sequence of LbCas12a + linker + D156R:

SEQ ID NO: 11 --amino acid sequence of LbCas12a + linker + D156R + C965S:

SEQ ID NO: 12 --amino acid sequence of Mb2Cas12a + linker + D172R:

SEQ ID NO: 13 --amino acid sequence of Mb2Cas12a + linker + D172R + C270S:

SEQ ID NO: 14 --amino acid sequence of Mb2Cas12a + linker + D172R + C583S:

SEQ ID NO: 15 --amino acid sequence of Mb2Cas12a + linker + D172R + C1068S:

SEQ ID NO: 16 --amino acid sequence of Mb2Cas12a + linker + D172R + C1099S:

SEQ ID NO: 17 --amino acid sequence of Mb2Cas12a + linker + D172R + C1149S:

SEQ ID NO: 18 --amino acid sequence of Mb2Cas12a + linker + D172R + C270S + C583S + C1068S + C1099S + C1149S:

SEQ ID NO: 19 --amino acid sequence of Mb2Cas12a + linker + D172R + C270A + C583A + C1068A + C1099A + C1149A:

SEQ ID NO: 20 --nucleic acid sequence encoding LbCas12a + linker, maize codon-optimized:

SEQ ID NO: 21 --nucleic acid sequence encoding LbCas12a + linker + D156R, maize codon-optimized:

SEQ ID NO: 22 --nucleic acid sequence encoding LbCas12a + linker + D156R + C965S, maize codon-optimized:

SEQ ID NO: 23 --nucleic acid sequence encoding LbCas12a + linker + C10S + C965S, maize codon-optimized:

SEQ ID NO: 24 --nucleic acid sequence encoding LbCas12a + linker + C965S + C1090S, maize codon-optimized:

SEQ ID NO: 25 --nucleic acid sequence encoding LbCas12a + linker + D156R, Arabidopsis codon-optimized:

SEQ ID NO: 26 --nucleic acid sequence encoding LbCas12a + linker + D156R + C965S, Arabidopsis codon-optimized:

SEQ ID NO: 27 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R, maize codon-optimized:

SEQ ID NO: 28 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C270S, maize codon-optimized:

SEQ ID NO: 29 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C583S, maize codon-optimized:

SEQ ID NO: 30 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C1068S, maize codon-optimized:

SEQ ID NO: 31 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C1099S, maize codon-optimized:

SEQ ID NO: 32 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C1149S, maize codon-optimized:

SEQ ID NO: 33 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C270S + C583S + C1068S + C1099S + C1149S, maize codon-optimized:

SEQ ID NO: 34 --nucleic acid sequence encoding Mb2Cas12a + linker + D172R + C270A + C583A + C1068A + C1099A + C1149A, maize codon-optimized:

All patents, patent publications, patent applications, journal articles, books, technical references, and the like discussed in the instant disclosure are incorporated herein by reference in their entirety for all purposes.

It is to be understood that the figures and descriptions of the disclosure have been simplified to illustrate elements that are relevant for a clear understanding of the disclosure. It should be appreciated that the figures are presented for illustrative purposes and not as construction drawings. Omitted details and modifications or alternative embodiments are within the purview of persons of ordinary skill in the art.

It can be appreciated that, in certain aspects of the disclosure, a single component may be replaced by multiple components, and multiple components may be replaced by a single component, to provide an element or structure or to perform a given function or functions. Except where such substitution would not be operative to practice certain embodiments of the disclosure, such substitution is considered within the scope of the disclosure.

The examples presented herein are intended to illustrate potential and specific implementations of the disclosure. It can be appreciated that the examples are intended primarily for purposes of illustration of the disclosure for those skilled in the art. There may be variations to these diagrams or the operations described herein without departing from the spirit of the disclosure. For instance, in certain cases, method steps or operations may be performed or executed in differing order, or operations may be added, deleted or modified.

Where a range of values is provided, it is understood that each intervening value, to the smallest fraction of the unit of the lower limit, unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Any narrower range between any stated values or unstated intervening values in a stated range and any other stated or intervening value in that stated range is encompassed. The upper and lower limits of those smaller ranges may independently be included or excluded in the range, and each range where either, neither, or both limits are included in the smaller ranges is also encompassed within the technology, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included.

In the foregoing description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the invention described in this disclosure may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention. Embodiments of the disclosure have been described for illustrative and not restrictive purposes. Although the present invention is described primarily with reference to specific embodiments, it is also envisioned that other embodiments will become apparent to those skilled in the art upon reading the present disclosure, and it is intended that such embodiments be contained within the present inventive methods. Accordingly, the present disclosure is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications can be made without departing from the scope of the claims below.

Claims

A Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 1 and a human-induced mutation at position C965.
The Cas12a protein of claim 1, wherein the human-induced mutation is a cysteine to serine substitution.
The Cas12a protein of claim 1 or 2, further comprising a human-induced mutation at position D156.
The Cas12a protein of claim 3, wherein the human-induced mutation at position D156 is an aspartic acid to arginine substitution.
The Cas12a protein of any one of claims 1 to 4, wherein the sequence comprises any one of SEQ ID NOs: 5-11.
A Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 2 and a human-induced mutation at position C70, C1116, and/or C1190.
The Cas12a protein of claim 6, wherein the human-induced mutation is a cysteine to serine substitution.
The Cas12a protein of claim 6 or 7, further comprising a human-induced mutation at position E184.
The Cas12a protein of claim 8, wherein the human-induced mutation at position E184 is a glutamic acid to arginine substitution.
A Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 3 and a human-induced mutation at position C334, C379, and/or C674.
The Cas12a protein of claim 10, wherein the human-induced mutation is a cysteine to serine substitution.
The Cas12a protein of claim 10 or 11, further comprising a human-induced mutation at position E174.
The Cas12a protein of claim 12, wherein the human-induced mutation at position E174 is a glutamic acid to arginine substitution.
A Cas12a protein comprising a sequence that is at least 80% identical to the amino acid sequence of SEQ ID NO: 4 and a human-induced mutation at position C270, C583, C1068, C1099, and/or C1149.
The Cas12a protein of claim 14, wherein the human-induced mutation is a cysteine to serine substitution.
The Cas12a protein of claim 14 or 15, further comprising a human-induced mutation at position D172.
The Cas12a protein of claim 16, wherein the human-induced mutation at position D172 is an aspartic acid to arginine substitution.
The Cas12a protein of any one of claims 14 to 17, wherein the sequence comprises any one of SEQ ID NOs: 12-19.
The Cas12a protein of any one of claims 1 to 18, wherein the Cas12a protein is a catalytically dead Cas12a (dCas12a) protein or a nickase Cas12a (nCas12a) protein.
The Cas12a protein of any one of claims 1 to 19, further comprising a nuclear localization signal.
A fusion protein comprising the Cas12a protein of any one of claims 1 to 20 and a heterologous domain.
The fusion protein of claim 21, wherein the heterologous domain is a deaminase domain, a transcription factor domain, a nuclease domain, a reverse-transcriptase domain, a transposase domain, an integrase domain, a uracil DNA glycosylase inhibitor domain, a recombinase domain, a nickase domain, a methyltransferase domain, a methylase domain, an acetylase domain, an acetyltransferase domain, a transcriptional activator domain, or a transcriptional repressor domain.
The fusion protein of claim 21 or 22, wherein the Cas12a protein is linked to the heterologous domain by a linker sequence.
A nucleic acid encoding the Cas12a protein of any one of claims 1 to 20 or the fusion protein of any one of claims 21 to 23.
The nucleic acid of claim 24, wherein the nucleic acid sequence is any one of SEQ ID NOs: 20-34.
A DNA construct comprising a promoter operably linked to the nucleic acid of claim 24 or 25.
A vector comprising the nucleic acid of claim 24 or 25 or the DNA construct of claim 26.
A cell comprising the nucleic acid of claim 24, the DNA construct of claim 26, or the vector of claim 27.
The cell of claim 28, wherein the cell is a plant cell.
The cell of claim 29, wherein the cell is a maize plant cell, a wheat plant cell, a rice plant cell, a soybean plant cell, a sunflower plant cell, or a tomato plant cell.
A method of editing a nucleic acid, the method comprising:

contacting the nucleic acid with (i) the Cas12a protein of any one of claims 1 to 20 or the fusion protein of any one of claims 21 to 23 and (ii) a guide RNA having a region complementary to a selected portion of the nucleic acid, thereby resulting in an edit to the nucleic acid.