Multimodal human-computer interaction requires intelligent architectures to enhance the flexibility and naturalness of the user interface. Such architectures must manage several multithreaded input signals from different input media and fuse them into intelligent commands. In this paper, a generic, comprehensive agent-based architecture for a multimodal fusion engine is proposed. The architecture is sketched in terms of its relevant components, each of which is modeled using timed colored Petri nets. The generic components of the fusion engine are then integrated into a pipelined, agent-based global architecture, for which the architectural quality attributes are outlined.