There are no good code-specific metrics in the space so far. For example, for text generation we could use the BLEU metric, but it measures n-gram overlap with a reference text rather than whether the code actually runs correctly, so it does not work well for code generation. One technique for evaluating code models is to run unit tests against the generations. That's what HumanEval is! It contains 164 Python programming problems, with about 8 unit tests each on average. The model being evaluated generates k different solutions for each prompt. If any of the k solutions passes the unit tests, the problem is counted as solved. So when we talk about pass@1, we're evaluating a model that generates just one solution per problem.
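To make the pass@k idea concrete, here is a minimal sketch in Python. The function names and the input layout (one list of pass/fail booleans per problem) are assumptions made for illustration, not part of any HumanEval tooling. The first function counts a problem as solved if any of its k generations passed. The second is the estimator commonly used in practice, which is not described in the text above: generate n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k).

```python
from math import comb


def pass_at_k_naive(results):
    """Fraction of problems where at least one generated solution passed.

    `results` is a hypothetical layout: one list of booleans per problem,
    one boolean per generated solution (True = all unit tests passed).
    """
    if not results:
        raise ValueError("need at least one problem")
    solved = sum(1 for samples in results if any(samples))
    return solved / len(results)


def pass_at_k_estimate(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no passing solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields 1.0 as expected.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Three problems, two generations each; the second problem is never solved.
    print(pass_at_k_naive([[True, False], [False, False], [False, True]]))
    # One problem, 10 samples, 5 passing, evaluated at k = 1.
    print(pass_at_k_estimate(n=10, c=5, k=1))
```

Averaging `pass_at_k_estimate` over all problems gives the benchmark score. With k = 1 it reduces to the fraction of samples that pass.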
However, solving 164 programming problems in Python is not everything you would expect from a code model. There are translations of HumanEval to other programming languages, but that's still not enough. For example, code explanation, docstring generation, code infilling, answering Stack Overflow questions, writing tests, and so on are not captured by HumanEval. Real-world usage of code models is not captured by a single number based on 164 programs!