A social engineering database application built with Whoosh
While tidying up my computer, I found a social engineering database tool I had written in Python. It uses Whoosh to build an index over the database files, segmenting each line into words. Queries run directly against the index, and the results are returned in JSON format.
Over time I had collected many leaked databases in txt, SQL, csv, html, xlsx, xls, and other file types, and their contents are just as varied. I wanted to turn this data into a social engineering database for future use. Most such databases found online are built with PHP + MySQL: the data is classified and normalized by format, and the formatted records are saved to the database for searching.
I planned to build one the same way, but after thinking it over I gave up, for two reasons. First, sorting the data and handling every format is time-consuming. Second, once all the data is loaded into the database, query efficiency may suffer as the volume grows.
Later I came across an introduction to Whoosh, a full-text search engine implemented entirely in Python. Its architecture is similar to Lucene's, its indexing algorithm comes from KinoSearch, and part of its scoring algorithm comes from Terrier. Its performance still falls well short of Xapian's, but because it is pure Python it is easier to integrate and extend.
Whoosh also provides a number of predefined field types that make it easier to create an index:
ID: the value is indexed as a single unit and is not split into words. Suitable for file paths, URLs, dates, and categories.
STORED: the value is stored with the document but is not indexed and cannot be searched.
KEYWORD: keywords separated by spaces or commas. They are indexed and searchable, but phrase search is not supported.
TEXT: body text. The text is indexed together with term positions, so phrase search is supported.
NUMERIC: numeric values.
BOOLEAN: boolean values.
DATETIME: datetime objects.
Creating an index schema and then building an index with Whoosh is very simple.
So I decided to build a social engineering database application with Whoosh at its core. It is only for my own use, so query efficiency is acceptable even if it is a little slow. I sketched out the features I care about:
1. Resumable indexing, so that an interrupted run continues from where it stopped;
2. Automatic detection of the encoding of txt, html, SQL, csv, and other database files;
3. Automatic removal of duplicate files;
4. A web interface for running queries;
5. A folder that holds the database files, with new files in that folder indexed automatically.
The general idea is as follows:
1. Create a MongoDB collection to store each file's path and MD5 value;
2. Create a configuration file that specifies the folder holding the leaked database files, the folder holding the index, and the breakpoint location;
3. Create two folders, one for the collected leaked database files and one for the index files;
4. Every few seconds the program walks the database folder and writes the file paths to MongoDB in order. The indexing program then reads the files from MongoDB in order, computes each file's MD5 value, and checks whether that MD5 already exists in MongoDB. If it does, the file is skipped;
5. Otherwise, the program stores the file's MD5 value, detects the file's encoding, and calls the index module that matches the file format to build the index. Word segmentation is done with jieba;
6. If an exception occurs during indexing, or indexing is interrupted manually, record the ObjectId and line number of the file being indexed in MongoDB;
7. Create a web interface with Bottle to expose the search service. To improve query efficiency, replace separators such as @ . - _ with spaces so that fuzzy searches can match the individual parts.
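Steps 4 to 7 can be sketched with the standard library alone. This is an illustration under several assumptions, not the original implementation: a set and a dict stand in for the MongoDB collection, index_line is a hypothetical callback in place of the Whoosh writer, files are read as UTF-8 instead of having their encoding detected, and jieba segmentation is left out.

```python
import hashlib
import re

# The separators named in step 7 (@ . - _) are replaced with spaces.
SEPARATORS = re.compile(r"[@.\-_]+")


def normalize(text):
    """Replace each run of separators with one space and trim whitespace."""
    return SEPARATORS.sub(" ", text).strip()


def file_md5(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_file(path, seen_md5, checkpoints, index_line):
    """Index one file line by line; return False if it is a duplicate.

    seen_md5 (a set) and checkpoints (a dict mapping path to the next line
    to index) are in-memory stand-ins for the MongoDB collection.
    index_line is a hypothetical callback that receives each normalized line.
    """
    md5 = file_md5(path)
    resuming = path in checkpoints
    if md5 in seen_md5 and not resuming:
        return False  # same content was already indexed: skip the file
    seen_md5.add(md5)

    start = checkpoints.get(path, 0)
    line_no = start
    try:
        # Assumption: UTF-8 with replacement instead of encoding detection.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle):
                if line_no < start:
                    continue  # already indexed before the interruption
                index_line(normalize(line))
    except (Exception, KeyboardInterrupt):
        # Remember the line that was not indexed, then re-raise.
        checkpoints[path] = line_no
        raise
    checkpoints.pop(path, None)  # finished cleanly: clear the breakpoint
    return True
```

Calling index_file again for a path that has a recorded breakpoint continues from the stored line number instead of starting over, which is the behaviour described in step 6.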