What's “broken” about cpuset cgroup inheritance semantics in the Linux kernel?


To quote the 2013 systemd announcement of the new control group interface (emphasis added):

Note that the number of cgroup attributes currently exposed as unit properties is limited. This will be extended later on, as their kernel interfaces are cleaned up. For example cpuset or freezer are currently not exposed at all due to the broken inheritance semantics of the kernel logic. Also, migrating units to a different slice at runtime is not supported (i.e. altering the Slice= property for running units) as the kernel currently lacks atomic cgroup subtree moves.

So, what's broken about the inheritance semantics of the kernel logic for cpuset (and how does this brokenness not apply to other cgroup controllers such as cpu)?

There is an article on the Red Hat site offering an unverified workaround for using cpuset cgroups in RHEL 7 despite their lack of support as easy-to-manage systemd unit properties... but is that even a good idea? The bolded quotation above is concerning.

In other words, what are the "gotchas" (pitfalls) that could apply to using the cgroup v1 cpuset which are being referenced here?

I'm starting a bounty on this question.

Possible sources of information to answer this question (in no particular order) include:

  1. cgroup v1 documentation;
  2. kernel source code;
  3. test results;
  4. real-world experience.

One possible meaning of the bolded line in the quotation above would be that when a new process is forked, it does not remain in the same cgroup as its parent, or that it is in the same cgroup but in some kind of "unconfirmed" status whereby it may actually be run on a different CPU than the cgroup allows. However, this is pure speculation on my part and I need a definitive answer.

    
by Wildcard 01.09.2018 / 04:07

3 answers


At least one definite, unresolved problem with cpusets is documented in the kernel bug tracker here:

Bug 42789 - cpuset cgroup: when a CPU goes offline, it is removed from all cgroup's cpuset.cpus, but when it comes online, it is only restored to the root cpuset.cpus

To quote a comment from that ticket (I am adding the hyperlinks to the actual commits and removing the IBM email address in case of spambots):

This was independently reported by Prashanth Nageshappa...and fixed in commit 8f2f748b0656257153bcf0941df8d6060acc5ca6, but subsequently reverted by Linus as commit 4293f20c19f44ca66e5ac836b411d25e14b9f185. According to his commit, the fix caused regressions elsewhere.

The fix commit (which was subsequently reverted) describes the problem well:

Currently, during CPU hotplug, the cpuset callbacks modify the cpusets to reflect the state of the system, and this handling is asymmetric. That is, upon CPU offline, that CPU is removed from all cpusets. However when it comes back online, it is put back only to the root cpuset.

This gives rise to a significant problem during suspend/resume. During suspend, we offline all non-boot cpus and during resume we online them back. Which means, after a resume, all cpusets (except the root cpuset) will be restricted to just one single CPU (the boot cpu). But the whole point of suspend/resume is to restore the system to a state which is as close as possible to how it was before suspend.

The same asymmetric hotplug problem is described in more detail, as it relates to inheritance, in:

Bug 188101 - process scheduling in cpuset of cgroup is not working properly.

Quoting that ticket:

When cpuset of a container (docker/lxc both uses underlying cgroup) becomes empty (due to hotplug/hotunplug) then processes running in that container can be scheduled on any cpus in cpuset of its nearest non-empty ancestor.

But, when cpuset of a running container (docker/lxc) becomes non-empty from an empty state (adding cpu to the empty cpuset) by updating the cpuset of the running container (by using echo method), the processes running in that container still uses the same cpuset as that of its nearest non-empty ancestor.

While there may be other problems with cpuset, the above is enough to understand the statement that systemd does not expose or make use of cpuset "due to the broken inheritance semantics of the kernel logic."

From these two bug reports: not only are CPUs not added back into a cpuset after a resume, but even when they are (manually) added back, the processes in that cgroup can keep running on CPUs that the cpuset should not actually allow.
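
To make the reported behaviour concrete, here is a minimal shell sketch of how it could be observed on an affected kernel. It assumes the v1 cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset, a root shell, and a made-up group named "test":

# Create a cpuset cgroup and give it CPUs 0-1 and memory node 0.
mkdir /sys/fs/cgroup/cpuset/test
echo 0-1 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/test/cpuset.mems

# Take CPU 1 offline and bring it back online (a suspend/resume cycle does
# the same thing to every non-boot CPU).
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

# On an affected kernel, CPU 1 is restored only to the root cpuset:
cat /sys/fs/cgroup/cpuset/cpuset.cpus       # lists CPU 1 again
cat /sys/fs/cgroup/cpuset/test/cpuset.cpus  # still missing CPU 1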

I found a message from Lennart Poettering that directly confirms this as the reason (emphasis added):

On Wed, 2016-08-03 at 16:56 +0200, Lennart Poettering wrote:

On Wed, 03.08.16 14:46, Dr. Werner Fink (werner at suse.de) wrote:

problem with v228 (and I guess this is also later AFAICS from logs of current git) that repeating CPU hotplug events (offline/online). The root cause is that cpuset.cpus become not restored by machined. Please note that libvirt can not do this as it is not allowed to do so.

     

This is a limitation of the kernel's cpuset interface, and it is one of the reasons why we do not expose cpusets at all in systemd right now. Thankfully, there is an alternative to cpusets, namely the CPU affinity controls exposed via CPUAffinity= in systemd, which do much of the same but have less borked semantics.

     

We'd like to support cpusets directly in systemd, but we don't do so as long as the kernel interfaces are as borked as they are. For example, cpusets are flushed out entirely the moment the system goes through a suspend/resume cycle.
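
For reference, the CPUAffinity= alternative mentioned above is an ordinary systemd unit setting. A minimal sketch, assuming a hypothetical service named myservice.service, of pinning it to CPUs 0 and 2 with a drop-in:

# Create a drop-in that sets the CPU affinity for the (made-up) service.
mkdir -p /etc/systemd/system/myservice.service.d
cat > /etc/systemd/system/myservice.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0 2
EOF

# Reload unit files and restart the service so the affinity takes effect.
systemctl daemon-reload
systemctl restart myservice.service

Unlike a cpuset, this uses plain sched_setaffinity()-style pinning and no cgroup controller, which is what Poettering describes as having "less borked semantics"; it also does not manage memory nodes the way cpuset.mems does.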

    
by 06.09.2018 / 01:41

I'm not well-versed enough in cgroups to give a definitive answer (and I certainly don't have experience with cgroups going back to 2013!), but on a vanilla Ubuntu 16.04, cgroups v1 seem to have their act together:

I devised a small test that forces a fork as a different user, using a sudo /bin/bash child detached with &. The -H flag is extra paranoia to force sudo to run with root's home environment.

# Record this shell's user name and cgroup memberships, do the same from a
# detached root shell, then compare the two files once both have had time to run.
cat <(whoami) /proc/self/cgroup >me.cgroup && \
sudo -H /bin/bash -c 'cat <(whoami) /proc/self/cgroup >you.cgroup' & \
sleep 2 && diff me.cgroup you.cgroup

This produces:

1c1
< admlocal
---
> root

For reference, this is the layout of the cgroup mounts on my system:

$ mount | grep group
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
$
    
by 05.09.2018 / 10:11

What's “broken” about cpuset cgroup inheritance semantics in the Linux kernel?

"Note that the number of cgroup attributes currently exposed as unit properties is limited. This will be extended later on, as their kernel interfaces are cleaned up. For example cpuset or freezer are currently not exposed at all due to the broken inheritance semantics of the kernel logic. Also, migrating units to a different slice at runtime is not supported (i.e. altering the Slice= property for running units) as the kernel currently lacks atomic cgroup subtree moves."

So, what's broken about the inheritance semantics of the kernel logic for cpuset (and how does this brokenness not apply to other cgroup controllers such as cpu)?

The bolded quotation above is concerning. To put it another way, what are the "gotchas" (pitfalls) that could apply to using cgroup v1 cpuset which are being referenced here?

Really short answer: the code does not handle multiple processes well; processes use and free up PIDs, returning them to the pool before the children of those PIDs have terminated. This leaves the upstream side believing that a PID's children are still active and therefore skipping that PID, even though the PID should never have been re-issued before its children finished. In short, poor locking.
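
A rough, hypothetical sketch of the scan-and-signal loop being described (an illustration of the race, not systemd's actual code; the cgroup path is made up):

# Repeatedly read the cgroup's PID list and signal any PID not seen before.
# Between reading the list and sending the signal, a PID can exit and be
# reused by an unrelated process (the signal hits the wrong target), or a
# reused PID can be mistaken for one already signalled (never signalled).
cgroup=/sys/fs/cgroup/systemd/system.slice/example.service
declare -A seen
while pids=$(cat "$cgroup/cgroup.procs" 2>/dev/null) && [ -n "$pids" ]; do
    for pid in $pids; do
        if [ -z "${seen[$pid]}" ]; then
            seen[$pid]=1
            kill -TERM "$pid" 2>/dev/null
        fi
    done
    sleep 0.1
done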

Services, scopes, and slices may be created freely by the administrator or dynamically by programs. This can interfere with the default slices set up by the OS during boot.

With cgroups, a process and all of its children draw resources from the group that contains them.

And much more... all leading to a long answer...

Several people have expressed their concerns:

  1. " Grupos de controle do Linux não são trabalhos " (2016) de Jonathan de Boyne Pollard:

    An operating system kernel that provides a "job" abstraction provides a way of cancelling/killing an entire "job". Witness the Win32 TerminateJobObject() mechanism, for example.

    When systemd terminates all of the processes in a cgroup, it doesn't issue a single "terminate job" system call. There isn't such a thing. It instead sits in a loop in application-mode code repeatedly scanning all of the process IDs in the cgroup (by re-reading a file full of PID numbers) and sending signals to new processes that it hasn't seen before. There are several problems with this.

    • systemd can be slower than whatever is reaping child processes within the process group, leading to the termination signals being sent to completely the wrong process: one that just happened to re-use the same process ID in between systemd reading the cgroup's process list file and it actually getting around to sending signals to the list of processes. ...

    ...

    • A program that forks new processes quickly enough within the cgroup can keep systemd spinning for a long time, in theory indefinitely as long as suitable "weather" prevails, as at each loop iteration there will be one more process to kill. Note that this does not have to be a fork bomb. It only has to fork enough that systemd sees at least one more new process ID in the cgroup every time that it runs its loop.

    • systemd keeps process IDs of processes that it has already signalled in a set, to know which ones it won't try to send signals to again. It's possible that a process with ID N could be signalled, terminate, and be cleaned out of the process table by a reaper/parent; and then something within the cgroup fork a new process that is allocated the same process ID N once again. systemd will re-read the cgroup's process ID list, think that it has already signalled the new process, and not signal it at all.

     

    These are addressed by a true "job" mechanism. But cgroups are not such. cgroups were intended as an improvement upon the traditional Unix resource limit mechanisms, addressing some of their long-standing and well-known design flaws. They weren't designed to be the equivalent of a VMS or a Windows NT Job Object.

    No, the freezer is not the answer. Not only does systemd not use the freezer, but the systemd people explicitly describe it as having "broken inheritance semantics of the kernel logic". You'll have to ask them what they mean by that, but the freezer does not, for them, magically turn cgroups into a job mechanism either.

    Moreover: This is not to mention that Docker and others will manipulate the freeze status of control groups for their own purposes, and there is no real race-free mechanism for sharing this setting amongst multiple owners, such as an atomic read-and-update for it.

    • The TerminateJobObject() function

      Terminates all processes currently associated with the job. If the  
      job is nested, this function terminates all processes currently  
      associated with the job and all of its child jobs in the hierarchy. 
      
    • Windows NT job objects

      A job object allows groups of processes to be managed as a unit.  
      Job objects are namable, securable, sharable objects that control  
      attributes of the processes associated with them. Operations  
      performed on a job object affect all processes associated with the  
      job object. Examples include enforcing limits such as working set   
      size and process priority or terminating all processes associated 
      with a job.
      

    The answer offered in Jonathan's explanation is:

    systemd's Resource Control Concepts

    ...

    Service, scope and slice units directly map to objects in the cgroup tree. When these units are activated they each map to directly (modulo some character escaping) to cgroup paths built from the unit names. For example, a service quux.service in a slice foobar-waldo.slice is found in the cgroup foobar.slice/foobar-waldo.slice/quux.service/.

    Services, scopes and slices may be created freely by the administrator or dynamically by programs. However by default the OS defines a number of built-in services that are necessary to start-up the system. Also, there are four slices defined by default: first of all the root slice -.slice (as mentioned above), but also system.slice, machine.slice, user.slice. By default all system services are placed in the first slice, all virtual machines and containers in the second, and user sessions in the third. However, this is just a default, and the administrator my freely define new slices and assign services and scopes to them. Also note that all login sessions automatically are placed in an individual scope unit, as are VM and container processes. Finally, all users logging in will also get an implicit slice of their own where all the session scopes are placed.

    ...

    As you can see, services and scopes contain process and are placed in slices, and slices do not contain processes of their own. Also note that the special "-.slice" is not shown as it is implicitly identified with the root of the entire tree.

    Resource limits may be set on services, scopes and slices the same way. ...

Follow the links above for the full explanation.

  1. " Cgroups v2: gerenciamento de recursos feito ainda pior na segunda vez "(14 de outubro de 2016), por davmac:

    ...

    You can create nested hierarchy such that there are groups within other groups, and the nested groups share the resources of their parent group (and may be further limited). You move a process into a group by writing its PID into one of the group’s control files. A group therefore potentially contains both processes and subgroups.

    The two obvious resources you might want to limit are memory and CPU time, and each of these has a “controller”, but there are potentially others (such as I/O bandwidth), and some Cgroup controllers don’t really manage resource utilisation as such (eg the “freezer” controller/subsystem). The Cgroups v1 interface allowed creating multiple hierarchies with different controllers attached to them (the value of this is dubious, but the possibility is there).

    Importantly, processes inherit their cgroup membership from their parent process, and cannot move themselves out of (or into) a cgroup unless they have appropriate privileges, which means that a process cannot escape its any limitations that have been imposed on it by forking. Compare this with the use of setrlimit, where a process’s use of memory (for example) can be limited using an RLIMIT_AS (address space) limitation, but the process can fork and its children can consume additional memory without drawing from the resources of the original process. With Cgroups on the other hand, a process and all its children draw resources from the containing group.

    ...

    cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details.

    These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.

    ...

    cgroup v1 allowed threads to be in any cgroups which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.

  3. See also the cgroup v2 documentation, "Issues with v1 and Rationales for v2":

    Multiple Hierarchies

    cgroup v1 allowed an arbitrary number of hierarchies and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, it wasn’t useful in practice.

    For example, as there is only one instance of each controller, utility type controllers such as freezer which can be useful in all hierarchies could only be used in one. The issue is exacerbated by the fact that controllers couldn’t be moved to another hierarchy once hierarchies were populated. Another issue was that all controllers bound to a hierarchy were forced to have exactly the same view of the hierarchy. It wasn’t possible to vary the granularity depending on the specific controller.

    In practice, these issues heavily limited which controllers could be put on the same hierarchy and most configurations resorted to putting each controller on its own hierarchy. Only closely related ones, such as the cpu and cpuacct controllers, made sense to be put on the same hierarchy. This often meant that userland ended up managing multiple similar hierarchies repeating the same steps on each hierarchy whenever a hierarchy management operation was necessary.

    Furthermore, support for multiple hierarchies came at a steep cost. It greatly complicated cgroup core implementation but more importantly the support for multiple hierarchies restricted how cgroup could be used in general and what controllers was able to do.

    There was no limit on how many hierarchies there might be, which meant that a thread’s cgroup membership couldn’t be described in finite length. The key might contain any number of entries and was unlimited in length, which made it highly awkward to manipulate and led to addition of controllers which existed only to identify membership, which in turn exacerbated the original problem of proliferating number of hierarchies.

    Also, as a controller couldn’t have any expectation regarding the topologies of hierarchies other controllers might be on, each controller had to assume that all other controllers were attached to completely orthogonal hierarchies. This made it impossible, or at least very cumbersome, for controllers to cooperate with each other.

    In most use cases, putting controllers on hierarchies which are completely orthogonal to each other isn’t necessary. What usually is called for is the ability to have differing levels of granularity depending on the specific controller. In other words, hierarchy may be collapsed from leaf towards root when viewed from specific controllers. For example, a given configuration might not care about how memory is distributed beyond a certain level while still wanting to control how CPU cycles are distributed.

Please see the link in item 3 above for more information.

  4. An exchange between Lennart Poettering (systemd developer) and Daniel P. Berrange (Red Hat) on Wed, 20.07.16 12:53, retrieved from the systemd-devel archives, titled "[systemd-devel] Confining ALL processes to CPUs / RAM via cpuset controller" (a minimal sketch of this kind of manual cpuset confinement appears after this list):

    On Wed, 20.07.16 12:53, Daniel P. Berrange (berrange at redhat.com) wrote:

    For virtualized hosts it is quite common to want to confine all host OS processes to a subset of CPUs/RAM nodes, leaving the rest available for exclusive use by QEMU/KVM. Historically people have used the "isolcpus" kernel arg todo this, but last year that had its semantics changed, so that any CPUs listed there also get excluded from load balancing by the schedular making it quite useless in general non-real-time use cases where you still want QEMU threads load-balanced across CPUs.

    So the only option is to use the cpuset cgroup controller to confine procosses. AFAIK, systemd does not have an explicit support for the cpuset controller at this time, so I'm trying to work out the "optimal" way to achieve this behind systemd's back while minimising the risk that future systemd releases will break things.

         

    On Wed, 20 Jul 2016 at 03:29:30 PM +0200, Lennart Poettering replied:

         

    Yes, we don't support this right now, but we would like to. The problem is that the kernel interface for it is pretty broken, and until that is fixed it is unlikely that we can support it in systemd. (And as I understood Tejun, the mem vs. cpu thing in cpuset is probably not going to stay the way it is anyway.)

         

    Next message

         

    On Wed, 20.07.16 14:49, Daniel P. Berrange (berrange at redhat.com) wrote:

    cgroupsv2 is likely to break many things once distros switch over, so I assume that wouldn't be done in a minor update - only a major new distro release so, not so concerning.
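
As referenced in item 4 above, the following is a minimal sketch of the kind of "behind systemd's back" cpuset confinement being discussed: a made-up group called "system" restricted to CPUs 0-3, run as root on a v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset.

# Create a cpuset cgroup for "everything else" and restrict it to CPUs 0-3
# and memory node 0.  In cpuset v1, cpuset.cpus and cpuset.mems must both be
# set before any task can be attached to the group.
mkdir -p /sys/fs/cgroup/cpuset/system
echo 0-3 > /sys/fs/cgroup/cpuset/system/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/system/cpuset.mems

# Move every task currently in the root cpuset into the new group, leaving
# the remaining CPUs free for exclusive use (e.g. by QEMU/KVM vCPU threads).
# Writes fail harmlessly for kernel threads that are bound to specific CPUs.
for pid in $(cat /sys/fs/cgroup/cpuset/tasks); do
    echo "$pid" > /sys/fs/cgroup/cpuset/system/tasks 2>/dev/null
done

Note that an arrangement like this is exactly what the hotplug and suspend/resume issues described in the accepted answer can silently undo, which is the practical "gotcha" the question asks about.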

I hope this clears things up.

    
by 07.09.2018 / 09:26