Commit 8360c21

committed
Update README.md
Updated What's New; added a section about clustering
1 parent 8035cb6 commit 8360c21

1 file changed

+48
-32
lines changed

‎README.md

Lines changed: 48 additions & 32 deletions
Original file line number / Diff line number / Diff line change
@@ -10,46 +10,49 @@ the detection of similar objects by comparing their hash values. Note that
1010
the byte stream should have a sufficient amount of complexity. For example,
1111
a byte stream of identical bytes will not generate a hash value.
1212

13-
## What's New in TLSH 4.x.x
14-
02/July/2021
15-
Release version 4.7.1
16-
Updated Python release with additional functions:
17-
* lvalue
18-
* q1ratio
19-
* q2ratio
20-
* checksum
21-
* bucket_value
22-
* is_valid
23-
24-
23/April/2021
25-
Release version 4.6.0
26-
Issue 99 raised issues about what to do when evaluating the TLSH for files over 4GB
27-
We decided to define that TLSH is the TLSH of the first 4GB of a file
28-
29-
14/April/2021
30-
We have written technical material that focuses on 2 topics at [https://tlsh.org](https://tlsh.org/)
31-
- fast nearest neighbour search and scalable clustering
32-
- robustness to attack
13+
## What's New in TLSH 4.10.x
14+
22/09/2021
15+
16+
**4.9.3**
17+
<PRE>
18+
13/09/2021
19+
added options -thread and -private
20+
-thread the TLSH is evaluated with 2 threads (faster calculation)
21+
Only done for files / bytestreams >= 10000 bytes
22+
But this means that it is impossible to calculate the checksum
23+
So the checksum is set to zero
24+
-private
25+
Does not evaluate the checksum
26+
Useful if you do not want to leak information
27+
Slightly faster than default TLSH (code was written to optimize this)
28+
<PRE>
29+
added Python tools for clustering files
30+
using DBSCAN
31+
using HAC-T
32+
we provide scripts to show people how to cluster the Malware Bazaar dataset using TLSH
33+
</PRE>
34+
22/July/2021
35+
36+
Release version 4.8.x - merged in pull requests for more stable installation
37+
Release version 4.9.x - added -thread and -private options
38+
Both versions are faster than previous versions, but they set the checksum to 00
39+
This loses a very small part of the functionality
40+
See 4.9.3 in the Change_History to see timing comparisons.
41+
Release version 4.10.x - a Python clustering tool
42+
See the directory tlshCluster
3343

3444
2020
3545
- adopted by [Virus Total](https://developers.virustotal.com/v3.0/reference#files-tlsh)
3646
- adopted by [Malware Bazaar](https://bazaar.abuse.ch/api/#tlsh)
3747

38-
26/March/2020
39-
- adding version identifier to the digest
40-
- added output options (-o)
41-
- added json object output (-ojson)
42-
- added null digest (TNULL)
43-
4448
TLSH has gained some traction. It has been included in STIX 2.1 and has been ported to a number of languages.
4549

46-
We are adding a version identifier ("T1") to the start of the digest so that we can
50+
We have added a version identifier ("T1") to the start of the digest so that we can
4751
clearly distinguish between different variants of the digest (such as non-standard choices of 3 byte checksum).
4852
This means that we do not rely on the length of the hex string to determine if a hex string is a TLSH digest
4953
(this is a brittle method for identifying TLSH digests).
5054
We are doing this to enable compatibility, especially backwards compatibility of the TLSH approach.
5155

52-
This release will add "T1" to the start of TLSH digests.
5356
The code is backwards compatible: it can still read and interpret 70 hex character strings as TLSH digests.
5457
And data sets can include mixes of the old and new digests.
5558
If you need old-style TLSH digests as output, then use the command line option '-old'
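
The digest formats described above can be told apart without relying on string length alone. The following is a minimal sketch, assuming only what this section states: a new-style digest is "T1" followed by 70 hex characters, and an old-style digest is 70 hex characters. The function names `digest_style` and `strip_version` are hypothetical and are not part of the TLSH library.

```python
import string
from typing import Optional

_HEX = set(string.hexdigits)
_BODY_LEN = 70  # hex characters in an old-style digest, per the text above


def digest_style(s: str) -> Optional[str]:
    """Return "T1", "old", or None if s does not look like a TLSH digest.

    Hypothetical helper for illustration only.
    """
    has_prefix = s.startswith("T1")
    body = s[2:] if has_prefix else s
    if len(body) != _BODY_LEN or not all(c in _HEX for c in body):
        return None
    return "T1" if has_prefix else "old"


def strip_version(s: str) -> str:
    """Return the 70 hex character body of an old- or new-style digest."""
    style = digest_style(s)
    if style is None:
        raise ValueError("not a recognised TLSH digest")
    return s[2:] if style == "T1" else s
```

Because "T" is not a hex character, an old-style digest can never be mistaken for a new-style one, which is why a data set can safely mix both.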
@@ -292,12 +295,25 @@ TLSH similarity is expressed as a difference score:
292295
- For the 72-character hash, there is a detailed table of experimental Detection rates and False Positive rates
293296
based on the threshold. See [Table II on page 5](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf)
294297

298+
# Clustering
299+
- See the Python code and Jupyter notebooks in tlshCluster.
300+
- We provide Python code for the HAC-T method.
301+
We also provide code so that users can use DBSCAN.
302+
- We show users how to create dendrograms for files, which are useful diagrams showing relationships between files and groups.
303+
- We provide tools for clustering the Malware Bazaar dataset, which contains a few hundred thousand samples.
304+
- The HAC-T method is described in [HAC-T and fast search for similarity in security](https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf)
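
To illustrate how a density-based method such as DBSCAN can be driven by a pairwise distance score, here is a small dependency-free sketch. It is not the code shipped in tlshCluster, and the names `dbscan`, `toy_distance`, `eps` and `min_samples` are assumptions made for this example. With the TLSH Python package installed, the `distance` argument would typically be a TLSH difference function such as `tlsh.diff`; the toy distance below only stands in for it so the example runs on its own.

```python
from typing import Callable, List, Optional, Sequence

NOISE = -1  # label for items that belong to no cluster


def dbscan(items: Sequence[str],
           distance: Callable[[str, str], int],
           eps: int,
           min_samples: int) -> List[Optional[int]]:
    """Label each item with a cluster id (0, 1, ...) or NOISE.

    Illustrative O(n^2) sketch; not the tlshCluster implementation.
    """
    n = len(items)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbours(i: int) -> List[int]:
        # Every item within eps of item i (the item itself included).
        return [j for j in range(n) if distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def toy_distance(a: str, b: str) -> int:
    """Stand-in distance: number of differing positions in equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


labels = dbscan(["aaaa", "aaab", "aabb", "zzzz", "zzzy", "qqqq"],
                toy_distance, eps=1, min_samples=2)
```

In this toy run the first three strings chain into one cluster, the two "z" strings form a second, and "qqqq" is left as noise. For a data set the size of Malware Bazaar, a quadratic neighbour search like this one would be too slow, which is the problem the fast-search techniques in the HAC-T paper address.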
305+
295306
# Publications
296307

297-
- Jonathan Oliver, Chun Cheng, and Yanggui Chen, [TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf).
298-
4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
299-
- Jonathan Oliver, Scott Forman, and Chun Cheng, [Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf).
300-
ATIS 2014, November, 2014, pages 199-210
308+
- Jonathan Oliver, Chun Cheng, and Yanggui Chen,
309+
[TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf).
310+
4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
311+
- Jonathan Oliver, Scott Forman, and Chun Cheng,
312+
[Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf).
313+
ATIS 2014, November, 2014, pages 199-210
314+
- Jonathan Oliver, Muqeet Ali, and Josiah Hagen,
315+
[HAC-T and fast search for similarity in security](https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf).
316+
2020 International Conference on Omni-layer Intelligent Systems (COINS). IEEE, 2020.
301317

302318
# Current Version
303319
