README.md: 48 additions & 32 deletions (+48 -32)
Original file line number
Diff line number
Diff line change
@@ -10,46 +10,49 @@ the detection of similar objects by comparing their hash values. Note that
10
10
the byte stream should have a sufficient amount of complexity. For example,
11
11
a byte stream of identical bytes will not generate a hash value.
12
12
13
-
## What's New in TLSH 4.x.x
14
-
02/July/2021
15
-
Release version 4.7.1
16
-
Updated Python release with additional functions:
17
-
* lvalue
18
-
* q1ratio
19
-
* q2ratio
20
-
* checksum
21
-
* bucket_value
22
-
* is_valid
23
-
24
-
23/April/2021
25
-
Release version 4.6.0
26
-
Issue 99 raised issues about what to do when evaluating the TLSH for files over 4GB
27
-
We decided to define that TLSH is the TLSH of the first 4GB of a file
28
-
29
-
14/April/2021
30
-
We have written technical material that focuses on 2 topics at [https://tlsh.org](https://tlsh.org/)
31
-
- fast nearest neighbour search and scalable clustering
32
-
- robustness to attack
13
+
## What's New in TLSH 4.10.x
14
+
22/09/2021
15
+
16
+
**4.9.3**
17
+
<PRE>
18
+
13/09/2021
19
+
added options -thread and -private
20
+
-thread the TLSH is evaluated with 2 threads (faster calculation)
21
+
Only done for files / bytestreams >= 10000 bytes
22
+
But this means that it is impossible to calculate the checksum
23
+
So the checksum is set to zero
24
+
-private
25
+
Does not evaluate the checksum
26
+
Useful if you do not want to leak information
27
+
Slightly faster than default TLSH (code was written to optimize this)
28
+
<PRE>
29
+
added Python tools for clustering files
30
+
using DBSCAN
31
+
using HAC-T
32
+
we provide scripts to show people how to cluster the Malware Bazaar dataset using TLSH
33
+
</PRE>
34
+
22/July/2021
35
+
36
+
Release version 4.8.x - merged in pull requests for more stable installation
37
+
Release version 4.9.x - added -thread and -private options
38
+
Both versions are faster than previous versions, but they set the checksum to 00
39
+
This loses a very small part of the functionality
40
+
See 4.9.3 in the Change_History to see timing comparisons.
41
+
Release version 4.10.x - a Python clustering tool
42
+
See the directory tlshCluster
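
Since the releases noted above set the checksum to 00, it can be useful to tell whether a given digest carries a checksum at all. The following is a minimal sketch, not part of the TLSH API: `has_zero_checksum` is a hypothetical helper, and it assumes the standard layout in which the checksum is the first byte (two hex characters) of the digest, after the optional "T1" prefix.

```python
def checksum_hex(digest: str) -> str:
    """Return the two hex characters assumed to hold the checksum byte."""
    # Assumption: a "T1" version prefix, when present, precedes the hex body.
    body = digest[2:] if digest.startswith("T1") else digest
    # Assumption: the checksum is the first byte of the hex body.
    return body[:2]


def has_zero_checksum(digest: str) -> bool:
    """True if the checksum byte is 00 (e.g. computed without a checksum)."""
    return checksum_hex(digest) == "00"
```

A digest whose checksum is 00 can still be compared with other digests; only the checksum field stops contributing information.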
33
43
34
44
2020
35
45
- adopted by [Virus Total](https://developers.virustotal.com/v3.0/reference#files-tlsh)
36
46
- adopted by [Malware Bazaar](https://bazaar.abuse.ch/api/#tlsh)
37
47
38
-
26/March/2020
39
-
- adding version identifier to the digest
40
-
- added output options (-o)
41
-
- added json object output (-ojson)
42
-
- added null digest (TNULL)
43
-
44
48
TLSH has gained some traction. It has been included in STIX 2.1 and been ported to a number of languages.
45
49
46
-
We are adding a version identifier ("T1") to the start of the digest so that we can
50
+
We have added a version identifier ("T1") to the start of the digest so that we can
47
51
clearly distinguish between different variants of the digest (such as non-standard choices of 3 byte checksum).
48
52
This means that we do not rely on the length of the hex string to determine if a hex string is a TLSH digest
49
53
(this is a brittle method for identifying TLSH digests).
50
54
We are doing this to enable compatibility, especially backwards compatibility of the TLSH approach.
51
55
52
-
This release will add "T1" to the start of TLSH digests.
53
56
The code is backwards compatible: it can still read and interpret 70 hex character strings as TLSH digests.
54
57
And data sets can include mixes of the old and new digests.
55
58
If you need old style TLSH digests to be outputted, then use the command line option '-old'
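
The distinction between old and new digests described above can be sketched as follows. This is an illustrative helper only (`classify_digest` is a hypothetical name, not a function of the TLSH library); it assumes a new-style digest is "T1" followed by 70 hex characters and an old-style digest is 70 hex characters with no prefix.

```python
import re

# 70 hex characters: the digest body shared by old and new style digests.
_HEX70 = re.compile(r"[0-9A-Fa-f]{70}")


def classify_digest(digest: str) -> str:
    """Return "T1", "old", or "unknown" for a candidate digest string."""
    if digest.startswith("T1"):
        # New style: version identifier followed by the 70 hex characters.
        return "T1" if _HEX70.fullmatch(digest[2:]) else "unknown"
    if _HEX70.fullmatch(digest):
        # Old style: recognised only by its length and hex alphabet.
        return "old"
    return "unknown"
```

Because "T" is not a hex character, the prefix check cannot misclassify an old-style digest, which is why the version identifier is less brittle than relying on string length alone.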
@@ -292,12 +295,25 @@ TLSH similarity is expressed as a difference score:
292
295
- For the 72 characters hash, there is a detailed table of experimental Detection rates and False Positive rates
293
296
based on the threshold. See [Table II on page 5](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf)
294
297
298
+
# Clustering
299
+
- See the Python code and Jupyter notebooks in tlshCluster.
300
+
- We provide Python code for the HAC-T method.
301
+
We also provide code so that users can use DBSCAN.
302
+
- We show users how to create dendrograms for files, which are useful diagrams showing relationships between files and groups.
303
+
- We provide tools for clustering the Malware Bazaar dataset, which contains a few hundred thousand samples.
304
+
- The HAC-T method is described in [HAC-T and fast search for similarity in security](https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf)
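
To illustrate the idea behind density-based clustering of digests, here is a minimal pure-Python DBSCAN sketch. It is not the code in tlshCluster. The `hamming` function is a toy stand-in for a real TLSH difference score (in practice the distance would come from something like `tlsh.diff` in the Python module), and the `eps` and `min_samples` values are arbitrary assumptions.

```python
from typing import Callable, List, Optional, Sequence

NOISE = -1  # label for items that belong to no cluster


def hamming(a: str, b: str) -> int:
    """Toy distance: count differing positions (stand-in for a TLSH score)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def dbscan(items: Sequence[str],
           distance: Callable[[str, str], int],
           eps: int,
           min_samples: int) -> List[int]:
    """Return one cluster label per item; NOISE marks outliers."""
    n = len(items)
    # Neighbourhood of i: every item within eps of it (including itself).
    neighbours = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return [NOISE if label is None else label for label in labels]
```

Computing all pairwise distances is quadratic in the number of items, so this form only suits small collections; the fast search described in the HAC-T paper addresses larger data sets.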
305
+
295
306
# Publications
296
307
297
-
- Jonathan Oliver, Chun Cheng, and Yanggui Chen, [TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf).
298
-
4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
299
-
- Jonathan Oliver, Scott Forman, and Chun Cheng, [Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf).
300
-
ATIS 2014, November, 2014, pages 199-210
308
+
- Jonathan Oliver, Chun Cheng, and Yanggui Chen,
309
+
[TLSH - A Locality Sensitive Hash](https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf).
310
+
4th Cybercrime and Trustworthy Computing Workshop, Sydney, November 2013
311
+
- Jonathan Oliver, Scott Forman, and Chun Cheng,
312
+
[Using Randomization to Attack Similarity Digests](https://github.com/trendmicro/tlsh/blob/master/Attacking_LSH_and_Sim_Dig.pdf).
313
+
ATIS 2014, November, 2014, pages 199-210
314
+
- Jonathan Oliver, Muqeet Ali, and Josiah Hagen,
315
+
[HAC-T and fast search for similarity in security](https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf)
316
+
2020 International Conference on Omni-layer Intelligent Systems (COINS). IEEE, 2020.