Weakly-supervised temporal action localization aims to detect the temporal boundaries of action instances in untrimmed videos using only video-level action labels. The main challenge is to accurately separate actions from the background in the absence of frame-level labels. Previous methods regard the action-related context in the background as the main factor limiting segmentation performance. Most of them take action labels as pseudo-labels for the context and suppress context frames in the class activation sequences via an attention mechanism. However, this strategy applies only to fixed shots or videos with a single theme. For videos with frequent scene switching and complicated themes, such as casual shots of unexpected events and covert recordings, the strong randomness and weak continuity of the actions invalidate this assumption. Moreover, incorrect pseudo-labels increase the weights of context frames, which further degrades segmentation performance. To address these issues, we define a new standard for dividing video frames into three classes (action instance, action-related context, and no-action background) and propose an Action-aware network with an Upper and Lower loss (AUL-Net). AUL-Net limits the activation of context to a reasonable range through a two-branch weight-sharing framework with a three-branch attention mechanism, giving the model wider applicability while accurately suppressing context and background. Extensive experiments on our self-built food safety video dataset FS-VA show that our method outperforms the state-of-the-art model.
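To illustrate the idea of bounding context activation from both sides, the sketch below implements a hinge-style penalty that keeps per-frame context attention scores inside a band [lower, upper]. This is a minimal toy illustration under our own assumptions (the function name, the band endpoints, and the hinge form are hypothetical), not the exact loss formulation of AUL-Net:

```python
def upper_lower_loss(context_scores, lower=0.1, upper=0.5):
    """Hinge-style band penalty on context attention scores.

    Each score is penalized linearly for the amount by which it
    exceeds `upper` or falls below `lower`; scores inside the band
    contribute zero. The band endpoints here are illustrative,
    not values taken from the paper.
    """
    total = 0.0
    for a in context_scores:
        total += max(0.0, a - upper)   # penalize over-activated context
        total += max(0.0, lower - a)   # penalize fully suppressed context
    return total / len(context_scores)

# Scores inside the band incur no loss; a score of 0.9 with
# upper=0.5 incurs a penalty of 0.4.
```

The lower bound prevents the attention branch from collapsing context frames to zero (which would discard pseudo-label signal), while the upper bound stops wrongly labeled context frames from dominating the class activation sequence.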