  1. 16 Apr, 2018 2 commits
    • tcp: avoid extra wakeups for SO_RCVLOWAT users · 03f45c88
      Eric Dumazet authored
      Since David's change (commit c7004482 "tcp: Respect SO_RCVLOWAT in
      tcp_poll()."), SO_RCVLOWAT is properly handled in tcp_poll(): POLLIN is
      only generated when enough bytes are available in the receive queue.

      But TCP still calls sk->sk_data_ready() for each chunk added to the
      receive queue, meaning the thread is awakened and goes back to sleep
      shortly after.
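For readers unfamiliar with the option, here is a minimal user-space sketch of how an application might request a receive low-water mark and then wait for readability. The helper names (set_rcvlowat, get_rcvlowat, wait_readable) are made up for this illustration; only setsockopt(), getsockopt() and poll() are real APIs.

```c
#include <poll.h>
#include <sys/socket.h>

/* Ask the kernel to report readability only once `bytes` are queued.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_rcvlowat(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, sizeof(bytes));
}

/* Read back the low-water mark currently in effect, or -1 on error. */
int get_rcvlowat(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, &len) < 0)
        return -1;
    return bytes;
}

/* Wait until the socket is readable or timeout_ms elapses.
 * Returns 1 if POLLIN was reported, 0 on timeout (or if only
 * error/hangup events were reported), -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n <= 0)
        return n;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

With a TCP socket, calling set_rcvlowat(fd, 512 * 1024) before wait_readable() corresponds to the setup used in the test described in this commit.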
      
      Tested:
      
      tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB
      
      -> Should get ~2 wakeups (c-switches) per MB, regardless of how many
      (tiny or big) packets were received.
      
      High speed (mostly full size GRO packets)
      
      received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
        cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches
      
      received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
        cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches
      
      Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)
      
      received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
        cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches
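The "~2 wakeups per MB" expectation can be checked with a little arithmetic. The sketch below assumes that 1 MB means 2^20 bytes and that the reader is woken exactly once per SO_RCVLOWAT bytes; expected_wakeups is a hypothetical helper, not part of tcp_mmap.

```c
#include <stdint.h>

/* Expected number of wakeups if the reader is woken once per
 * `lowat_bytes` of queued data (integer division, rounds down). */
uint64_t expected_wakeups(uint64_t total_bytes, uint64_t lowat_bytes)
{
    if (lowat_bytes == 0)
        return 0; /* avoid dividing by zero; not a meaningful setting */
    return total_bytes / lowat_bytes;
}
```

Under those assumptions, 32768 MB with a 512 KB low-water mark gives 65536 wakeups, close to the 65497 and 65485 context switches reported above, and 22474.5 MB gives 44949, close to the reported 44950.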
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix SO_RCVLOWAT and RCVBUF autotuning · d1361840
      Eric Dumazet authored
      
      
      Applications might use SO_RCVLOWAT on a TCP socket hoping to receive
      one [E]POLLIN event only when a given number of bytes is ready in the
      socket receive queue.

      The problem is that receive autotuning is not aware of this constraint,
      meaning sk_rcvbuf might be too small to allow all bytes to be stored.
      
      Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
      can override the default setsockopt(SO_RCVLOWAT) behavior.
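The override mechanism can be pictured with a small user-space model. Everything here (sock_model, proto_ops_model, the tcp_like_* functions) is a hypothetical stand-in for the kernel structures, and doubling the buffer is an illustrative policy rather than a statement of what the patch does.

```c
#include <stddef.h>

struct proto_ops_model;

/* Hypothetical stand-in for a socket. */
struct sock_model {
    int rcvlowat;
    int rcvbuf;
    const struct proto_ops_model *ops;
};

struct proto_ops_model {
    /* Optional hook; NULL means "use the generic behavior". */
    int (*set_rcvlowat)(struct sock_model *sk, int val);
};

/* Generic path: store the value, never letting it drop below 1. */
static int generic_set_rcvlowat(struct sock_model *sk, int val)
{
    sk->rcvlowat = val > 0 ? val : 1;
    return 0;
}

/* Protocol override: also grow the receive buffer so that `val` bytes
 * can actually be queued. Doubling is an illustrative choice. */
static int tcp_like_set_rcvlowat(struct sock_model *sk, int val)
{
    generic_set_rcvlowat(sk, val);
    if (sk->rcvbuf < 2 * sk->rcvlowat)
        sk->rcvbuf = 2 * sk->rcvlowat;
    return 0;
}

/* Dispatch as a setsockopt(SO_RCVLOWAT) handler might:
 * prefer the protocol hook when one is provided. */
int sock_set_rcvlowat(struct sock_model *sk, int val)
{
    if (sk->ops && sk->ops->set_rcvlowat)
        return sk->ops->set_rcvlowat(sk, val);
    return generic_set_rcvlowat(sk, val);
}

const struct proto_ops_model tcp_like_ops = { .set_rcvlowat = tcp_like_set_rcvlowat };
const struct proto_ops_model plain_ops = { .set_rcvlowat = NULL };
```

The point of the optional function pointer is that protocols without special needs keep the generic behavior at no cost, while a protocol with autotuned buffers can react to the new low-water mark.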
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Mar, 2018 1 commit
    • tcp_bbr: better deal with suboptimal GSO (II) · dcb8c9b4
      Eric Dumazet authored
      This is the second part of dealing with suboptimal device GSO parameters.
      In the first patch (350c9f48 "tcp_bbr: better deal with suboptimal GSO")
      we dealt with devices having low gso_max_segs.

      Some devices lower gso_max_size from 64 KB to 16 KB (r8152 is an example).

      In order to probe an optimal cwnd, we want BBR not to be sensitive
      to whatever GSO constraint a device may have.
      
      This patch removes tso_segs_goal() CC callback in favor of
      min_tso_segs() for CC wanting to override sysctl_tcp_min_tso_segs
      
      Next patch will remove bbr->tso_segs_goal since it does not have
      to be persistent.
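As a rough illustration of an optional congestion-control callback that overrides a sysctl default, consider the model below. The struct, the two example modules and the value returned by bbr_like_min_tso_segs are assumptions made for this sketch; only the callback name min_tso_segs comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified congestion-control ops table. */
struct cc_ops_model {
    const char *name;
    /* Optional: minimum number of TSO segments this CC wants. */
    uint32_t (*min_tso_segs)(const void *sk);
};

static uint32_t bbr_like_min_tso_segs(const void *sk)
{
    (void)sk;
    return 1; /* illustrative value only */
}

/* Use the CC module's answer when it provides the callback,
 * otherwise fall back to the system-wide sysctl value. */
uint32_t effective_min_tso_segs(const struct cc_ops_model *ops,
                                const void *sk,
                                uint32_t sysctl_min_tso_segs)
{
    if (ops && ops->min_tso_segs)
        return ops->min_tso_segs(sk);
    return sysctl_min_tso_segs;
}

const struct cc_ops_model bbr_like = { "bbr_like", bbr_like_min_tso_segs };
const struct cc_ops_model reno_like = { "reno_like", NULL };
```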
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 14 Feb, 2018 1 commit
    • tcp: try to keep packet if SYN_RCV race is lost · e0f9759f
      Eric Dumazet authored
      
      
      배석진 reported that in some situations, packets for a given 5-tuple
      end up being processed by different CPUs.

      This involves RPS, and fragmentation.

      배석진 is seeing packet drops when a SYN_RECV request socket is
      moved into ESTABLISHED state. Other states are protected by socket lock.
      
      This is caused by a CPU losing the race, and simply not caring enough.
      
      Since this seems to occur frequently, we can do better and perform
      a second lookup.
      
      Note that all needed memory barriers are already in the existing code,
      thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
      and reqsk_put(). The second lookup must find the new socket,
      unless it has already been accepted and closed by another cpu.
      
      Note that the fragmentation could be avoided in the first place by
      use of a correct TCP MSS option in the SYN{ACK} packet, but this
      does not mean we can not be more robust.
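The "second lookup" idea can be sketched with a toy single-threaded model: a packet handler holds the result of an earlier lookup, notices that the request socket has since been replaced, and looks again instead of dropping the packet. All types and names here are hypothetical; the real code deals with locking and reference counts that this sketch ignores.

```c
#include <stddef.h>

enum sk_kind { SK_REQUEST, SK_FULL };
enum verdict { DROPPED, TO_REQUEST, TO_FULL };

struct sk_model {
    enum sk_kind kind;
    int packets;               /* packets delivered to this socket */
};

/* One hash-table slot per flow; `current` is what a lookup returns. */
struct flow_slot {
    struct sk_model *current;
};

struct sk_model *lookup(struct flow_slot *slot)
{
    return slot->current;
}

/* Deliver a packet. `found` is the result of an earlier lookup; by now
 * another CPU may have replaced the request socket with a full socket. */
enum verdict deliver(struct flow_slot *slot, struct sk_model *found)
{
    if (found && found->kind == SK_REQUEST && slot->current != found) {
        /* Lost the race: look up again rather than dropping. */
        found = lookup(slot);
    }
    if (!found)
        return DROPPED;        /* e.g. already accepted and closed */
    found->packets++;
    return found->kind == SK_FULL ? TO_FULL : TO_REQUEST;
}
```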
      
      Many thanks to 배석진 for a very detailed analysis.
      Reported-by: 배석진 <soukjin.bae@samsung.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 06 Feb, 2018 2 commits
    • bpf: sockmap, add sock close() hook to remove socks · 1aa12bdf
      John Fastabend authored
      The selftests test_maps program was leaving dangling BPF sockmap
      programs around because not all psock elements were removed from
      the map. The elements in turn hold a reference on the BPF program
      they are attached to causing BPF programs to stay open even after
      test_maps has completed.
      
      The original intent was that sk_state_change() would be called
      when TCP socks went through TCP_CLOSE state. However, because
      socks may be in SOCK_DEAD state or the sock may be a listening
      socket the event is not always triggered.
      
      To resolve this use the ULP infrastructure and register our own
      proto close() handler. This fixes the above case.
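A close() hook of this kind boils down to swapping a socket's handler table for one whose close() does extra cleanup and then chains to the original. The following is a minimal sketch under that assumption; sock_model, proto_model and the in_map flag are invented for illustration and do not mirror the kernel's psock layout.

```c
#include <stddef.h>

struct sock_model;

struct proto_model {
    void (*close)(struct sock_model *sk);
};

struct sock_model {
    const struct proto_model *prot;   /* handlers currently in effect */
    const struct proto_model *saved;  /* original handlers, for chaining */
    int in_map;                       /* 1 while the map references us */
    int closed;
};

static void base_close(struct sock_model *sk)
{
    sk->closed = 1;
}

/* Hooked close: drop the map entry first (which would release the
 * reference on the attached program), then run the original close. */
static void hooked_close(struct sock_model *sk)
{
    sk->in_map = 0;
    sk->prot = sk->saved;
    sk->prot->close(sk);
}

const struct proto_model base_proto = { base_close };
const struct proto_model hooked_proto = { hooked_close };

/* Called at map insert time: remember the old handlers, install ours. */
void attach_hook(struct sock_model *sk)
{
    sk->saved = sk->prot;
    sk->prot = &hooked_proto;
    sk->in_map = 1;
}

void sock_close(struct sock_model *sk)
{
    sk->prot->close(sk);
}
```

Because the cleanup runs from close() itself, it does not depend on a state-change notification being delivered, which is the failure mode described above.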
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: add a UID to use for ULP socket assignment · b11a632c
      John Fastabend authored
      
      
      Create a UID field and enum that can be used to assign ULPs to
      sockets. This saves a set of string comparisons if the ULP id
      is known.
      
      For sockmap, which is added in the next patches, a ULP is used to
      hook into the close path of TCP sockets. In this case the ULP is
      attached at map insert time, on the kernel side, and is already known,
      so the named lookup is not needed. Because we don't want to expose
      psock internals to user space socket options, a user visible flag is
      also added. For TLS this flag is set; for BPF it will be cleared.

      Also remove pr_notice; the user gets an error code back and should
      check that rather than rely on logs.
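The difference between the two lookup paths can be sketched as follows. The enum values, table contents and function names are hypothetical; the sketch only illustrates that a by-name lookup (the user space path) honours the user-visible flag and needs string comparisons, while a by-id lookup (the kernel path) needs neither.

```c
#include <stddef.h>
#include <string.h>

enum ulp_id_model { ULP_ID_UNSPEC, ULP_ID_TLS, ULP_ID_BPF };

struct ulp_ops_model {
    const char *name;
    enum ulp_id_model uid;
    int user_visible;   /* may user space select this ULP by name? */
};

static const struct ulp_ops_model ulp_table[] = {
    { "tls",     ULP_ID_TLS, 1 },
    { "bpf_tcp", ULP_ID_BPF, 0 },
};

#define ULP_TABLE_LEN (sizeof(ulp_table) / sizeof(ulp_table[0]))

/* User space path: string comparison, hidden entries are skipped. */
const struct ulp_ops_model *ulp_find_by_name(const char *name)
{
    for (size_t i = 0; i < ULP_TABLE_LEN; i++) {
        if (ulp_table[i].user_visible &&
            strcmp(ulp_table[i].name, name) == 0)
            return &ulp_table[i];
    }
    return NULL;
}

/* Kernel path: the id is already known, so compare integers only. */
const struct ulp_ops_model *ulp_find_by_id(enum ulp_id_model uid)
{
    for (size_t i = 0; i < ULP_TABLE_LEN; i++) {
        if (ulp_table[i].uid == uid)
            return &ulp_table[i];
    }
    return NULL;
}
```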
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. 26 Jan, 2018 2 commits
  6. 19 Jan, 2018 1 commit
  7. 13 Dec, 2017 1 commit
  8. 08 Dec, 2017 1 commit
  9. 05 Dec, 2017 1 commit
    • bpf: Add access to snd_cwnd and others in sock_ops · f19397a5
      Lawrence Brakmo authored
      
      
      Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these
      fields are only valid if the socket associated with the sock_ops program
      call is a full socket, the field is_fullsock is also added to the
      bpf_sock_ops struct. If the socket is not a full socket, reading these
      fields returns 0.
      
      Note that in most cases it will not be necessary to check is_fullsock to
      know if there is a full socket. The context of the call, as specified by
      the 'op' field, can sometimes determine whether there is a full socket.
      
      The struct bpf_sock_ops has the following fields added:
      
        __u32 is_fullsock;      /* Some TCP fields are only valid if
                                 * there is a full socket. If not, the
                                 * fields read as zero.
                                 */
        __u32 snd_cwnd;
        __u32 srtt_us;          /* Averaged RTT << 3 in usecs */
      
      There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add
      read access to more 32 bit tcp_sock fields.
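Since srtt_us is stored shifted left by 3, a consumer has to shift it back to get microseconds. The plain C sketch below shows that conversion together with the is_fullsock guard; sock_ops_view is a hypothetical struct holding only the three fields listed above, not the real bpf_sock_ops, and this is ordinary user-space C rather than a BPF program.

```c
#include <stdint.h>

/* Hypothetical view of the three fields added by this commit. */
struct sock_ops_view {
    uint32_t is_fullsock;
    uint32_t snd_cwnd;
    uint32_t srtt_us;    /* averaged RTT << 3, in usecs */
};

/* Smoothed RTT in microseconds, or 0 when there is no full socket
 * (in which case the field would read as zero anyway). */
uint32_t srtt_usecs(const struct sock_ops_view *ctx)
{
    if (!ctx->is_fullsock)
        return 0;
    return ctx->srtt_us >> 3;
}
```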
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  10. 03 Dec, 2017 1 commit
  11. 27 Nov, 2017 1 commit
  12. 19 Nov, 2017 1 commit
    • tcp: when scheduling TLP, time of RTO should account for current ACK · ed66dfaf
      Neal Cardwell authored
      Fix the TLP scheduling logic so that when scheduling a TLP probe, we
      ensure that the estimated time at which an RTO would fire accounts for
      the fact that ACKs indicating forward progress should push back RTO
      times.
      
      After the following fix:
      
      df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      
      we had an unintentional behavior change in the following kind of
      scenario: suppose the RTT variance has been very low recently. Then
      suppose we send out a flight of N packets and our RTT is 100ms:
      
      t=0: send a flight of N packets
      t=100ms: receive an ACK for N-1 packets
      
      The response before df92c839 was:
        -> schedule a TLP for now + RTO_interval
      
      The response after df92c839 is:
        -> schedule a TLP for t=0 + RTO_interval
      
      Since RTO_interval = srtt + RTT_variance, this means that we have
      scheduled a TLP timer at a point in the future that only accounts for
      RTT_variance. If the RTT_variance term is small, this means that the
      timer fires soon.
      
      Before df92c839 this would not happen, because in that code, when
      we receive an ACK for a prefix of flight, we did:
      
          1) Near the top of tcp_ack(), switch from TLP timer to RTO
             at write_queue_head->packet_tx_time + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                         tcp_rearm_rto(sk);
      
          2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                  if (flag & FLAG_ACKED) {
                         tcp_rearm_rto(sk);
      
          3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
             to TLP at now + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                         tcp_schedule_loss_probe(sk);
      
      In df92c839 we removed that 3-phase dance, and instead directly
      set the TLP timer once: we set the TLP timer in cases like this to
      write_queue_head->packet_tx_time + RTO_interval. So if the RTT
      variance is small, then this means that this is setting the TLP timer
      to fire quite soon. This means if the ACK for the tail of the flight
      takes longer than an RTT to arrive (often due to delayed ACKs), then
      the TLP timer fires too quickly.
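The timing argument can be made concrete with a few lines of arithmetic. The sketch uses the commit's simplified RTO_interval = srtt + RTT_variance; the function names and the 5 ms variance used in the note below are assumptions chosen for illustration.

```c
#include <stdint.h>

/* All times are in milliseconds. */
int64_t rto_interval(int64_t srtt, int64_t rtt_var)
{
    return srtt + rtt_var;
}

/* Delay from `now` until a timer armed at `base + interval` fires.
 * A timer whose deadline has already passed fires immediately (0). */
int64_t delay_until_fire(int64_t base, int64_t now, int64_t interval)
{
    int64_t d = base + interval - now;

    return d > 0 ? d : 0;
}
```

With srtt = 100 and a variance of 5, an ACK arriving at t=100 gives delay_until_fire(100, 100, 105) = 105 ms when the timer is based on now, but delay_until_fire(0, 100, 105) = 5 ms when it is based on the send time at t=0, which is the "fires soon" case described above.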
      
      Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 15 Nov, 2017 2 commits
    • tcp: highest_sack fix · 50895b9d
      Eric Dumazet authored
      syzbot easily found a regression added in our latest patches [1]
      
      No longer set tp->highest_sack to the head of the send queue since
      this is not logical and error prone.
      
      Only sack processing should maintain the pointer to an skb from rtx queue.
      
      We might in the future only remember the sequence instead of a pointer to skb,
      since rb-tree should allow a fast lookup.
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
      BUG: KASAN: use-after-free in tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
      Read of size 4 at addr ffff8801c154faa8 by task syz-executor4/12860
      
      CPU: 0 PID: 12860 Comm: syz-executor4 Not tainted 4.14.0-next-20171113+ #41
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
       tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
       tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
       tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
       tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
       sk_backlog_rcv include/net/sock.h:909 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2264
       release_sock+0xa4/0x2a0 net/core/sock.c:2778
       tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
       __sys_sendmsg+0xe5/0x210 net/socket.c:2082
       SYSC_sendmsg net/socket.c:2093 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2089
       entry_SYSCALL_64_fastpath+0x1f/0x96
      RIP: 0033:0x452879
      RSP: 002b:00007fc9761bfbe8 EFLAGS: 00000212 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000758020 RCX: 0000000000452879
      RDX: 0000000000000000 RSI: 0000000020917fc8 RDI: 0000000000000015
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000212 R12: 00000000006ee3a0
      R13: 00000000ffffffff R14: 00007fc9761c06d4 R15: 0000000000000000
      
      Allocated by task 12860:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
       kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3638
       __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
       alloc_skb_fclone include/linux/skbuff.h:1023 [inline]
       sk_stream_alloc_skb+0x11d/0x900 net/ipv4/tcp.c:870
       tcp_sendmsg_locked+0x1341/0x3b80 net/ipv4/tcp.c:1299
       tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1461
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       SYSC_sendto+0x358/0x5a0 net/socket.c:1749
       SyS_sendto+0x40/0x50 net/socket.c:1717
       entry_SYSCALL_64_fastpath+0x1f/0x96
      
      Freed by task 12860:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3492 [inline]
       kmem_cache_free+0x77/0x280 mm/slab.c:3750
       kfree_skbmem+0xdd/0x1d0 net/core/skbuff.c:603
       __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
       sk_wmem_free_skb include/net/sock.h:1419 [inline]
       tcp_rtx_queue_unlink_and_free include/net/tcp.h:1682 [inline]
       tcp_clean_rtx_queue net/ipv4/tcp_input.c:3111 [inline]
       tcp_ack+0x1b17/0x4fd0 net/ipv4/tcp_input.c:3593
       tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
       tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
       sk_backlog_rcv include/net/sock.h:909 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2264
       release_sock+0xa4/0x2a0 net/core/sock.c:2778
       tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
       inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
       sock_sendmsg_nosec net/socket.c:632 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:642
       ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
       __sys_sendmsg+0xe5/0x210 net/socket.c:2082
       SYSC_sendmsg net/socket.c:2093 [inline]
       SyS_sendmsg+0x2d/0x50 net/socket.c:2089
       entry_SYSCALL_64_fastpath+0x1f/0x96
      
      The buggy address belongs to the object at ffff8801c154fa80
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8801c154fa80, ffff8801c154fc48)
      The buggy address belongs to the page:
      page:ffffea00070553c0 count:1 mapcount:0 mapping:ffff8801c154f080 index:0x0
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffff8801c154f080 0000000000000000 0000000100000006
      raw: ffffea00070a5a20 ffffea0006a18360 ffff8801d9ca0500 0000000000000000
      page dumped because: kasan: bad access detected
      
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Namespace-ify sysctl_tcp_default_congestion_control · 6670e152
      Stephen Hemminger authored
      
      
      Make the default TCP congestion control a per-namespace value.
      This changes the default congestion control to a pointer to congestion
      ops (rather than it being implicit as the first element of the
      available list).
      
      The congestion control setting of new namespaces is inherited
      from the current setting of the root namespace.
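The inheritance rule can be modelled in a few lines: each namespace holds its own pointer to the default congestion-control ops, and a new namespace copies the root's pointer at creation time. The types and function names below are hypothetical and ignore module reference counting.

```c
#include <stddef.h>

struct cc_ops_model {
    const char *name;
};

/* Per-namespace state: an explicit pointer to the default CC ops. */
struct netns_model {
    const struct cc_ops_model *default_cc;
};

const struct cc_ops_model cubic_like = { "cubic" };
const struct cc_ops_model bbr_like = { "bbr" };

/* A new namespace starts with whatever the root namespace uses now. */
void netns_init(struct netns_model *ns, const struct netns_model *root)
{
    ns->default_cc = root->default_cc;
}

/* Changing one namespace's default does not affect any other. */
void netns_set_default_cc(struct netns_model *ns,
                          const struct cc_ops_model *cc)
{
    ns->default_cc = cc;
}
```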
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 11 Nov, 2017 1 commit
  15. 10 Nov, 2017 1 commit
  16. 05 Nov, 2017 1 commit
    • tcp: higher throughput under reordering with adaptive RACK reordering wnd · 1f255691
      Priyaranjan Jha authored
      
      
      Currently TCP RACK loss detection does not work well if packets are
      being reordered beyond its static reordering window (min_rtt/4). Under
      such reordering it may falsely trigger loss recoveries and reduce TCP
      throughput significantly.

      This patch improves that by increasing and reducing the reordering
      window based on DSACK, which is now supported in major TCP implementations.
      It makes RACK's reo_wnd adaptive based on DSACK and the number of recoveries.
      
      - If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
        by srtt), since there is a possibility that the spurious retransmission
        was due to a reordering delay longer than reo_wnd.

      - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
        successful recoveries (this accounts for full DSACK-based loss
        recovery undo). After that, reset it to the default (min_rtt/4).

      - reo_wnd is incremented at most once per RTT, so that the new
        DSACK we react to is (approximately) due to a spurious retransmission
        sent after reo_wnd was last updated.
      
      - reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
        absolute value to account for change in rtt.
      
      In our internal testing, we observed significant increase in throughput,
      in scenarios where reordering exceeds min_rtt/4 (previous static value).
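The four rules listed above can be put together in a small model. This is a sketch, not the kernel code: the struct and function names are invented, and "once per RTT" is approximated here with wall-clock timestamps, which is an assumption of this illustration.

```c
#include <stdint.h>

#define RACK_RECOVERY_THRESH 16  /* value quoted in the commit message */

struct rack_model {
    uint32_t reo_wnd_steps;   /* window size in units of min_rtt/4 */
    uint32_t persist;         /* recoveries left before reset */
    int      has_update;      /* has the window ever been increased? */
    uint64_t last_update_us;  /* time of the last increase */
};

void rack_init(struct rack_model *r)
{
    r->reo_wnd_steps = 1;
    r->persist = 0;
    r->has_update = 0;
    r->last_update_us = 0;
}

/* A DSACK arrived at now_us: grow the window by one step,
 * but at most once per RTT (approximated by srtt_us). */
void rack_on_dsack(struct rack_model *r, uint64_t now_us, uint32_t srtt_us)
{
    if (r->has_update && now_us - r->last_update_us < srtt_us)
        return;
    r->reo_wnd_steps++;
    r->persist = RACK_RECOVERY_THRESH;
    r->has_update = 1;
    r->last_update_us = now_us;
}

/* A loss recovery completed: count it, and fall back to the
 * default window once the persistence budget is used up. */
void rack_on_recovery_done(struct rack_model *r)
{
    if (r->persist > 0)
        r->persist--;
    if (r->persist == 0)
        r->reo_wnd_steps = 1;
}

/* Current reordering window: steps * min_rtt/4, capped at srtt. */
uint32_t rack_reo_wnd_us(const struct rack_model *r,
                         uint32_t min_rtt_us, uint32_t srtt_us)
{
    uint64_t wnd = (uint64_t)(min_rtt_us / 4) * r->reo_wnd_steps;

    return wnd < srtt_us ? (uint32_t)wnd : srtt_us;
}
```

Tracking steps instead of an absolute duration means the window follows min_rtt automatically: if min_rtt changes, the same number of steps yields a proportionally different window.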
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 01 Nov, 2017 1 commit
  18. 29 Oct, 2017 1 commit
  19. 28 Oct, 2017 13 commits
  20. 27 Oct, 2017 5 commits