Implementing Wireshark's Follow TCP Stream Feature in Python

Source: Internet
Author: User

In short: Wireshark's Follow TCP Stream feature is handy, but the stream data it extracts carries no timestamps or other per-packet information, which makes it inadequate for analyzing latency and packet loss. So I implemented a simple Follow TCP Stream function in Python that preserves the TCP information.


The principle is simple and still builds on Wireshark, which can export packet dissections as an XML "PDML" file. The exported file looks like this:

<proto name="tcp" showname="Transmission Control Protocol, Src Port: 59203 (59203), Dst Port: 80 (80), Seq: 1, Ack: 1, Len: 381" size="..." pos="...">
  <field name="tcp.srcport" showname="Source Port: 59203 (59203)" size="2" pos="..." show="59203" value="e743"/>
  <field name="tcp.dstport" showname="Destination Port: 80" size="2" pos="..." show="80" value="0050"/>
  <field name="tcp.port" showname="Source or Destination Port: 59203" hide="yes" size="2" pos="..." show="59203" value="e743"/>
  <field name="tcp.port" showname="Source or Destination Port: 80" hide="yes" size="2" pos="..." show="80" value="0050"/>
  <field name="tcp.stream" showname="Stream index: 4" size="0" pos="..." show="4"/>
  <field name="tcp.len" showname="TCP Segment Len: 381" size="1" pos="..." show="381" value="..."/>
  <field name="tcp.seq" showname="Sequence number: 1 (relative sequence number)" size="4" pos="..." show="1" value="3b0ac4bd"/>
  <field name="tcp.nxtseq" showname="Next sequence number: 382 (relative sequence number)" size="0" pos="..." show="382"/>
  <field name="tcp.ack" showname="Acknowledgment number: 1 (relative ack number)" size="4" pos="..." show="1" value="397d75"/>
  <field name="tcp.hdr_len" showname="Header Length: 20 bytes" size="1" pos="..." show="20" value="..."/>
  ...
</proto>


With that in front of you, little more needs saying: use Python to parse the XML file and pull out the data.
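As a rough illustration of that parsing step, here is a minimal sketch using the standard library's xml.etree.ElementTree. The function name tcp_fields and the tiny hand-written PDML fragment are assumptions made for the example; they are not taken from the original script.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written PDML fragment; real exports contain many more protos and fields.
PDML = """<pdml>
  <packet>
    <proto name="tcp">
      <field name="tcp.srcport" show="59203" value="e743"/>
      <field name="tcp.seq" show="1"/>
      <field name="tcp.nxtseq" show="382"/>
    </proto>
  </packet>
</pdml>"""


def tcp_fields(pdml_text):
    """Return one {field name: show value} dict per packet that has a TCP layer."""
    root = ET.fromstring(pdml_text)
    rows = []
    for packet in root.iter("packet"):
        tcp = packet.find("proto[@name='tcp']")
        if tcp is None:
            continue  # not a TCP packet
        rows.append({f.get("name"): f.get("show", "") for f in tcp.iter("field")})
    return rows


rows = tcp_fields(PDML)
print(rows)
```

Note that fields repeated under the same name (such as tcp.port in the export above) overwrite each other in this dict, so the sketch only suits fields that occur once per packet.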

The remaining question is how the Follow TCP Stream algorithm itself works. In essence it is TCP data reassembly; for background, see the blog post "TCP packet reorganization implementation analysis".

To keep things simple, I imposed two constraints:

    1. Only data in a single direction, such as A-->B, can be extracted. To extract the B-->A data, re-filter the capture and run the script again.
    2. The initial SYN packet and the FIN packet at disconnect are ignored.

With these two simplifications, the algorithm reduces to sorting the TCP frames by seq in ascending order. A simple example: three TCP packets, sorted by seq, as follows

(seq=1, nxtseq=5, data='1234'), (seq=4, nxtseq=6, data='45'), (seq=7, nxtseq=8, data='7')

The nxtseq of the first packet is greater than the seq of the second, which means the two packets overlap. Indeed they do: the digit '4' is repeated.

The nxtseq of the second packet is less than the seq of the third, which means a segment was dropped between them. Indeed, the digit '6' is missing.
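The two comparisons can be checked with a short sketch. The classify function and its tuple layout (seq, nxtseq, data) are hypothetical, chosen only to mirror the example; it compares each packet with its immediate predecessor, which is a simplification of tracking the expected sequence number.

```python
def classify(frames):
    """Sort (seq, nxtseq, data) tuples by seq and label each packet boundary."""
    ordered = sorted(frames, key=lambda f: f[0])
    notes = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[1] > cur[0]:
            # Sequence numbers cur.seq .. prev.nxtseq-1 were sent twice.
            notes.append(("overlap", cur[0], prev[1]))
        elif prev[1] < cur[0]:
            # Sequence numbers prev.nxtseq .. cur.seq-1 never arrived.
            notes.append(("gap", prev[1], cur[0]))
        else:
            notes.append(("contiguous", cur[0], cur[0]))
    return notes


# The three packets from the example, deliberately out of order.
frames = [(7, 8, "7"), (1, 5, "1234"), (4, 6, "45")]
notes = classify(frames)
print(notes)
```

Each note gives a half-open range of sequence numbers: the overlap covers [4, 5), the digit '4', and the gap covers [6, 7), the digit '6'.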

That is the whole principle.


A few notes on Wireshark filter expressions and on the limitations of this algorithm:

    1. To filter one direction by IP address, first run Wireshark's Follow TCP Stream; the filter bar will then usually contain an expression like tcp.stream eq xxx. Append the IP conditions to it: tcp.stream eq xxx and ip.src==xxx and ip.dst==xxx
    2. To filter one direction by TCP port, first pin down a single TCP connection (for example with an IP filter) to get tcp.stream eq xxx, then add the port conditions: tcp.stream eq xxx and tcp.srcport==xxx and tcp.dstport==xxx
    3. The tool's limitation is that it parses the XML with Python's ElementTree. Even for a pcap file of about 100 MB, the generated PDML file balloons to several hundred megabytes, and that file is then read into memory in full. Overall, generating the PDML file is slow (a few minutes) and memory consumption is large, on the order of hundreds of megabytes.
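One possible way to ease the memory problem, not used by the original script, is to stream the PDML file with ElementTree's iterparse and discard each packet once it has been handled. The sketch below assumes the usual pdml/packet/proto/field nesting; the function name iter_tcp_seq is made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_tcp_seq(source):
    """Yield (seq, nxtseq) show values per TCP packet, freeing each packet after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "packet":
            continue  # wait until the whole packet has been read
        tcp = elem.find("proto[@name='tcp']")
        if tcp is not None:
            shown = {f.get("name"): f.get("show") for f in tcp.iter("field")}
            yield shown.get("tcp.seq"), shown.get("tcp.nxtseq")
        elem.clear()  # drop the packet's children so they can be garbage-collected


# source may be a file name or a binary file object; a small in-memory sample is used here.
sample = io.BytesIO(b"""<pdml>
<packet><proto name="tcp"><field name="tcp.seq" show="1"/><field name="tcp.nxtseq" show="382"/></proto></packet>
<packet><proto name="udp"><field name="udp.port" show="53"/></proto></packet>
</pdml>""")
pairs = list(iter_tcp_seq(sample))
print(pairs)
```

The emptied packet elements stay attached to the root, so memory still grows slowly with the packet count, but far less than when the whole tree is held at once.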


Finally, here is some of the key code. The complete script can be downloaded for free here: Tcpparser--follow TCP stream by Python

This method extracts the required field information from one proto element of the PDML file.

def extract_element(self, proto, elem):
    # elem maps a field name to the attribute to read from that field
    result = dict()
    for key in elem.keys():
        result[key] = ""

    fieldname = ""
    attribname = ""
    for field in proto.findall("field"):
        fieldname = field.get('name')
        if fieldname in elem:
            attribname = elem[fieldname]
            result[fieldname] = field.get(attribname, '')

    return result
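To show how the elem mapping might be used, here is a standalone sketch of the same logic written as a plain function. The mapping named wanted and the sample proto element are illustrative assumptions; the original script's actual mapping is not shown in the post.

```python
import xml.etree.ElementTree as ET


def extract_element(proto, elem):
    """For each field name in elem, read the attribute that elem names for it."""
    result = {key: "" for key in elem}  # fields absent from proto stay ""
    for field in proto.findall("field"):
        name = field.get("name")
        if name in elem:
            result[name] = field.get(elem[name], "")
    return result


proto = ET.fromstring(
    '<proto name="tcp">'
    '<field name="tcp.seq" show="1" value="3b0ac4bd"/>'
    '<field name="tcp.nxtseq" show="382"/>'
    '</proto>'
)
# Hypothetical request: the decoded "show" value of three fields.
wanted = {"tcp.seq": "show", "tcp.nxtseq": "show", "tcp.ack": "show"}
info = extract_element(proto, wanted)
print(info)
```

Because findall("field") only looks at direct children of the proto element, fields nested inside other fields are not picked up by this approach.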


def regularize_stream(self, frame_list):
    '''
    Normalize the data of the TCP stream, mainly by using seq and nxtseq
    to detect missing segments and to remove duplicate data.
    For a missing segment, frame.number = 'lost' and data is empty.
    When removing duplicate data, keep the data that was received earlier
    (the previous packet's data) as far as possible.
    '''
    self.reporter.title("Tcpparser regularize timestamp")
    timer = Timer("regularize_stream_data").start()

    reg_frame_list = []
    expectseq = -1
    first = True
    for frame in frame_list:
        if first:
            # initial packet
            first = False
            expectseq = frame["tcp.nxtseq"]
            reg_frame_list.append(frame)
            continue

        # from the second packet onward
        seq = frame["tcp.seq"]
        nxtseq = frame["tcp.nxtseq"]
        if seq == expectseq:
            # data is exactly contiguous: nothing missing, nothing repeated
            if nxtseq == 0:
                continue  # pure ACK packet, carries no data
            expectseq = nxtseq
            reg_frame_list.append(frame)
        elif seq > expectseq:
            # data is missing, so a packet was dropped
            self.reporter.error("Previous TCP segment is lost: " + str(frame[TCPFrame.KEY_FRAMENo]))
            # newpacket = self.new_lost_packet(frame, str(expectseq), str(seq))
            # reg_frame_list.append(newpacket)
            reg_frame_list.append(frame)
            expectseq = nxtseq
        elif seq < expectseq:
            # data overlaps, so some data was retransmitted
            self.reporter.warning("TCP segment retransmission: " + str(frame[TCPFrame.KEY_FRAMENo]))
            if expectseq < nxtseq:
                # the current packet must discard its leading, already-seen part
                # pre_packet[-(expectseq-seq):] == frame[0:expectseq-seq]
                frame["tcp.seq"] = expectseq
                # negative index: keep only the last (nxtseq - expectseq) items
                frame["data"] = frame["data"][expectseq-nxtseq:]
                frame["datalen"] = len(frame["data"])
                expectseq = nxtseq
                reg_frame_list.append(frame)
            else:
                # the current packet can be discarded entirely
                # expectseq stays unchanged
                # pre_packet[-(nxtseq-seq):] == frame[:nxtseq-seq]
                pass

    timer.stop()
    return reg_frame_list
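The trimming logic can be tried in isolation with the sketch below. It is a simplified stand-in, not the original method: frames are plain dicts with hypothetical keys seq, nxtseq and data, one character of data stands for one sequence number, and the reporter and timer are left out.

```python
def regularize(frames):
    """Trim overlaps from seq-sorted frames and collect gaps as (start, end) ranges."""
    kept, gaps = [], []
    expect = None
    for frame in sorted(frames, key=lambda f: f["seq"]):
        seq, nxt, data = frame["seq"], frame["nxtseq"], frame["data"]
        if expect is None or seq == expect:
            kept.append(dict(frame))          # first packet, or exactly contiguous
            expect = nxt
        elif seq > expect:
            gaps.append((expect, seq))        # sequence numbers expect..seq-1 are missing
            kept.append(dict(frame))
            expect = nxt
        elif nxt > expect:
            # Partial overlap: drop the leading bytes that were already received.
            kept.append({"seq": expect, "nxtseq": nxt, "data": data[expect - seq:]})
            expect = nxt
        # else: the packet is a complete duplicate and is dropped
    return kept, gaps


frames = [
    {"seq": 1, "nxtseq": 5, "data": "1234"},
    {"seq": 4, "nxtseq": 6, "data": "45"},
    {"seq": 7, "nxtseq": 8, "data": "7"},
]
kept, gaps = regularize(frames)
stream = "".join(f["data"] for f in kept)
print(stream, gaps)
```

For the three-packet example this keeps "1234", the trimmed "5" and "7", giving the stream "123457" with one gap at [6, 7). The slice data[expect - seq:] selects the same bytes as the negative index in the method above whenever the data length equals nxtseq - seq.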


