Network traffic classification (TC) assigns network traffic to specific classes and plays a fundamental role in network measurement, network management, and related tasks. In this work, we focus on packet-grained traffic classification. We find that previous packet-grained methods rest on an analogy between a traffic packet and an image or a text, and that this analogy is not well justified, which leaves both accuracy and efficiency sub-optimal with considerable room for improvement. We therefore devise a new method, called BLJAN, that jointly learns from byte sequences and labels for packet-grained traffic classification. BLJAN embeds a packet’s bytes and all labels into a joint embedding space and captures their implicit correlations with a dual attention mechanism. It then builds a more powerful packet representation, enhanced by the label embeddings, to achieve high classification accuracy and interpretability. Extensive experiments on two benchmark traffic classification tasks, application identification and traffic characterization, with three real-world datasets demonstrate that BLJAN achieves high performance (Macro F1-scores of 96.2%, 96.7%, and 99.7% on the three datasets), outperforming six representative state-of-the-art baselines in both accuracy and detection speed.
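To make the idea of a joint byte-label embedding space more concrete, the following is a minimal illustrative sketch and not the BLJAN implementation. It assumes a generic label-guided attention: each byte value and each label has a vector in the same space, each byte is scored by its cosine similarity to its best-matching label, and the softmax-normalized scores weight the byte vectors into a packet representation. It shows only one attention direction (labels attending over bytes), so it does not reproduce the dual mechanism. All names (`byte_emb`, `label_emb`, `label_guided_representation`, `classify`), the embedding size, the random untrained embeddings, and the sample bytes are assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_guided_representation(packet, byte_emb, label_emb):
    """Weight byte embeddings by their affinity to the label embeddings.

    Assumption: a byte's attention score is its highest cosine similarity
    to any label. This is a simplification chosen for illustration.
    """
    vectors = [byte_emb[b] for b in packet]
    # Compatibility matrix: one row per byte, one column per label.
    compat = [[cosine(v, lab) for lab in label_emb] for v in vectors]
    scores = [max(row) for row in compat]
    weights = softmax(scores)
    dim = len(vectors[0])
    rep = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return rep, weights


def classify(rep, label_emb):
    """Predict the label whose embedding is closest to the representation."""
    sims = [cosine(rep, lab) for lab in label_emb]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical setup: random (untrained) embeddings in a shared 8-d space.
rng = random.Random(0)
DIM, NUM_LABELS = 8, 3
byte_emb = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
label_emb = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LABELS)]

# Made-up sample bytes standing in for the start of a packet.
packet = bytes([0x45, 0x00, 0x00, 0x3C, 0x1C, 0x46, 0x40, 0x00])
rep, weights = label_guided_representation(packet, byte_emb, label_emb)
pred = classify(rep, label_emb)
```

In a trained model the embeddings would be learned jointly with the classifier. Per-byte attention weights of this kind are one way such a model could indicate which bytes drove a prediction, which relates to the interpretability the abstract mentions.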