I apologize for writing a lengthy answer, but I get the feeling that discussions about foundations for formalized mathematics are often hindered by a lack of information.
I have used proof assistants for a while now, and also worked on their design and implementation. While I will be quick to tell jokes about set theory, I am bitterly aware of the shortcomings of type theory, very likely more so than the typical set theorist. (Ha, ha, "typical set theorist"!) If anyone can show me how to improve proof assistants with set theory, I will be absolutely delighted! But it is not enough to just have good ideas – you need to test them in practice on large projects, as many phenomena related to formalized mathematics only appear once we reach a certain level of complexity.
The components of a proof assistant
The architecture of modern proof assistants is the result of several decades of experimentation, development and practical experience. A proof assistant incorporates not one, but several formal systems.
The central component of a proof assistant is the kernel, which validates every inference step and makes sure that proofs are correct. It does so by implementing a formal system $F$ (the foundation) which is expressive enough to allow formalization of a large amount of mathematics, but also simple enough to allow an efficient and correct implementation.
The foundational system implemented in the kernel is too rudimentary to be directly usable for sophisticated mathematics. Instead, the user writes their input in a more expressive formal language $V$ (the vernacular) that is designed to be practical and useful. Typically $V$ is quite complex so that it can accommodate various notational conventions and other accepted forms of mathematical expression. A second component of the proof assistant, the elaborator, translates $V$ to $F$ and passes the translations to the kernel for verification.
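To make the gap between $V$ and $F$ concrete, here is a minimal sketch in Lean 4 (my choice of system for illustration; the name `double` is hypothetical). The user writes a short definition in the vernacular and then asks Lean to display the elaborated term with implicit arguments made explicit. What gets printed is still a pretty-printed rendering rather than the kernel's raw data structure, but it gives a fair impression of how much the elaborator fills in.

```lean
-- Vernacular: what the user writes.
def double (n : Nat) : Nat := n + n

-- Display the elaborated definition, showing the implicit arguments
-- (the types and the instance selected for `+`) that the elaborator inserted.
set_option pp.explicit true in
#print double
```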
A proof assistant may incorporate a third formal language $M$ (the meta-language), which is used to implement proof search, decision procedures, and other automation techniques. Because the purpose of $M$ is to implement algorithms, it typically resembles a programming language. The distinction between $M$ and $V$ may not be very sharp, and sometimes they are combined into a single formalism. From a mathematical point of view, $M$ is less interesting than $F$ and $V$, so we shall ignore it.
Suitability of foundation $F$
The correctness of the entire system depends on the correctness of the kernel. A bug in the kernel allows invalid proofs to be accepted, whereas a bug in any other component is just an annoyance. Therefore, the foundation $F$ should be simple so that we can implement it reliably. It should not be so exotic that logicians cannot tell how it relates to the accepted foundations of mathematics. Computers are fast, so it does not matter (too much) if the translation from $V$ to $F$ creates verbose statements. Also, $F$ need not be directly usable by humans.
A suitable variant of set theory or type theory fits these criteria. Indeed, Mizar is based on set theory, while HOL, Lean, Coq, and Agda use type theory in the kernel. Since both set theory and type theory are mathematically very well understood, and more or less equally expressive, the choice will hinge on technical criteria, such as the availability and efficiency of proof-checking algorithms.
Suitability of vernacular $V$
A much more interesting question is what makes the vernacular $V$ suitable.
For the vernacular to be useful, it has to reflect mathematical practice as much as possible. It should allow expression of mathematical ideas and concepts directly in familiar terms, and without unnecessary formalistic hassle. On the other hand, $V$ should be a formal language so that the elaborator can translate it to the foundation $F$.
To learn more about what makes $V$ good, we need to carefully observe how mathematicians actually write mathematics. They produce complex webs of definitions, theorems, and constructions; therefore $V$ should support the management of large collections of formalized mathematics. In this regard we can learn a great deal by looking at how programmers organize software. For instance, saying that a body of mathematics is "just a series of definitions, theorems and proofs" is a naive idealization that works in certain contexts, but certainly not in practical formalization of mathematics.
Mathematicians omit a great deal of information in their writings, and are quite willing to sacrifice formal correctness for succinctness. The reader is expected to fill in the missing details, and to rectify the imprecisions. The proof assistant is expected to do the same. To illustrate this point, consider the following snippet of mathematical text:
Let $U$ and $V$ be vector spaces and $f : U \to V$ a linear map. Then $f(2 \cdot x + y) = 2 \cdot f(x) + f(y)$ for all $x$ and $y$.
Did you understand it? Of course. But you might be quite surprised to learn how much guesswork and correction your brain carried out:
The field of scalars is not specified, but this does not prevent you from understanding the text. You simply assumed that there is some underlying field of scalars $K$. You might find out more about $K$ in subsequent text. ($K$ is an existential variable.)
Strictly speaking "$f : U \to V$" does not make sense because $U$ and $V$ are not sets, but structures $U = (|U|, 0_U, {+}_U, {-}_U, {\cdot}_U)$ and $V = (|V|, 0_V, {+}_V, {-}_V, {\cdot}_V)$. Of course, you correctly surmised that $f$ is a map between the carriers, i.e., $f : |U| \to |V|$. (You inserted an implicit coercion from a vector space to its carrier.)
What do $x$ and $y$ range over? For $f(x)$ and $f(y)$ to make sense, it must be the case that $x \in |U|$ and $y \in |U|$. (You inferred the domain of $x$ and $y$.)
In the equation, $+$ on the left-hand side means ${+}_U$, and $+$ on the right-hand side means ${+}_V$, and similarly for scalar multiplication. (You reconstructed the implicit arguments of $+$.)
The symbol $2$ normally denotes a natural number, as every child knows, but clearly it is meant to denote the scalar $1_K +_K 1_K$. (You interpreted "$2$" in the notation scope appropriate for the situation at hand.)
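For comparison, here is how the snippet might be rendered in Lean 4 with Mathlib. This is a sketch under assumptions: the answer does not single out any particular system, the name $K$ for the field of scalars is the one guessed above, and the one-line proof is not checked against a specific Mathlib version. Notice which of the items above become explicit. The field $K$ has to be introduced, the variables $x$ and $y$ are given their domain, and the ascription `(2 : K)` makes the intended reading of the numeral explicit. The two different additions and scalar multiplications are still left for the elaborator to reconstruct. In Mathlib the vector-space structure is attached to the carrier through type classes, so no carrier coercion is written, while `f x` relies on a coercion from the bundled linear map to a function.

```lean
import Mathlib

-- K is the field of scalars that the informal text never mentions.
example {K U V : Type*} [Field K]
    [AddCommGroup U] [Module K U]
    [AddCommGroup V] [Module K V]
    (f : U →ₗ[K] V) (x y : U) :
    -- `+` and `•` on the left live in U, those on the right live in V;
    -- the elaborator infers this from the types of x, y and f.
    f ((2 : K) • x + y) = (2 : K) • f x + f y := by
  simp
```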
The vernacular $V$ must support these techniques, and many more, so that they can be implemented in the elaborator. It cannot be anything as simple as ZFC with first-order logic and definitional extensions, or bare Martin-Löf type theory. You may consider the development of $V$ to be outside the scope of mathematics and logic, but then do not complain when computer scientists fashion it after their technology.
I have never seen any serious proposals for a vernacular based on set theory. Or to put it another way, as soon as we start expanding and transforming set theory to fit the requirements for $V$, we end up with a theoretical framework that looks a lot like type theory. (You may entertain yourself by thinking how set theory could be used to detect that $f : U \to V$ above does not make sense unless we insert coercions – for if everything is a set then so are $U$ and $V$, in which case $f : U \to V$ does make sense.)
Detecting mistakes
An important aspect of the suitability of a foundation is its ability to detect mistakes. Of course, its purpose is to prevent logical errors, but there is more to mistakes than just violation of logic. There are formally meaningful statements which, with very high probability, are mistakes. Consider the following snippet, and read it carefully:
Definition: A set $X$ is jaberwocky when for every $x \in X$ there exists a bryllyg $U \subseteq X$ and an uffish $K \subseteq X$ such that $x \in U$ and $U \in K$.
Even if you have never read Lewis Carroll's works, you should wonder about "$U \in K$". It looks like "$U \subseteq K$" would make more sense, since $U$ and $K$ are both subsets of $X$. Nevertheless, a proof assistant whose foundation $F$ is based on ZFC will accept the above definition as valid, even though it is very unlikely that the human intended it.
A proof assistant based on type theory would reject the definition by stating that "$U \in K$" is a type error.
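A minimal sketch of this behaviour in Lean 4 with Mathlib (again an illustrative choice of system, with $X$ modelled as a type and its subsets as terms of `Set X`): the statement $U \subseteq K$ elaborates, whereas $U \in K$ is rejected as long as $K$ is merely a subset of $X$. The rejected line is left commented out so that the file as a whole still checks.

```lean
import Mathlib

variable {X : Type*}

-- Accepted: U and K are both subsets of X, so inclusion makes sense.
example (U K : Set X) : Prop := U ⊆ K

-- Rejected during elaboration: the elements of K have type X,
-- whereas U has type Set X.
-- example (U K : Set X) : Prop := U ∈ K

-- Accepted once K is declared to be a family of subsets of X.
example (U : Set X) (K : Set (Set X)) : Prop := U ∈ K
```

If membership really was intended, the third form forces the author to say so by changing the declared type of $K$.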
So suppose we use a set-theoretic foundation $F$ that accepts any syntactically valid formula as meaningful. In such a system writing "$U \in K$" is meaningful and therefore the above definition will be accepted by the kernel. If we want the proof assistant to actually assist the human, it has to contain an additional mechanism that will flag "$U \in K$" as suspect, despite the kernel being happy with it. But what is this additional mechanism, if not just a second kernel based on type theory?
I am not saying that it is impossible to design a proof assistant based on set theory. After all, Mizar, the most venerable of them all, is designed precisely in this way – set theory with a layer of type-theoretic mechanisms on top. Still, I cannot help but wonder: why bother with a set-theoretic kernel that requires a type-theoretic fence to insulate the user from the unintended permissiveness of set theory?